Languages with Mismatches and an Application to Approximate Indexing
In this paper we describe a factorial language, denoted by L(S,k,r), that contains all words that occur in a string S up to k mismatches every r symbols. We then give some combinatorial properties of a parameter called the repetition index, denoted by R(S,k,r) and defined as the smallest integer h ≥ 1 such that every string of length h occurs in at most one position of the text S up to k mismatches every r symbols. We prove that R(S,k,r) is a non-increasing function of r and a non-decreasing function of k, and that the equation r = R(S,k,r) admits a unique solution.
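The definition of the repetition index can be illustrated with a brute-force sketch. It assumes that "up to k mismatches every r symbols" means that every window of r consecutive aligned positions contains at most k mismatches (and that a word shorter than r has at most k mismatches in total); the abstract does not spell this out. The function names `kr_match`, `occurrences` and `repetition_index` are hypothetical, and the search is exponential in h, so it is only suitable for very short strings.

```python
from itertools import product


def kr_match(u, v, k, r):
    """True if equal-length words u and v have at most k mismatches
    in every window of r consecutive positions (assumed definition)."""
    assert len(u) == len(v)
    mism = [a != b for a, b in zip(u, v)]
    if len(mism) <= r:
        return sum(mism) <= k
    return all(sum(mism[i:i + r]) <= k for i in range(len(mism) - r + 1))


def occurrences(x, s, k, r):
    """Start positions of s at which x occurs up to k mismatches every r symbols."""
    return [i for i in range(len(s) - len(x) + 1)
            if kr_match(x, s[i:i + len(x)], k, r)]


def repetition_index(s, k, r):
    """Smallest h >= 1 such that every word of length h occurs at most once in s.

    Only words over the letters of s are tried: replacing a foreign letter
    by a letter of s can never remove an occurrence.
    """
    if not s:
        raise ValueError("s must be non-empty")
    alphabet = sorted(set(s))
    for h in range(1, len(s) + 1):
        if all(len(occurrences("".join(w), s, k, r)) <= 1
               for w in product(alphabet, repeat=h)):
            return h
    # Not reached: for h = len(s) there is only one candidate position.
    return len(s)
```

With k = 0 the function reduces to the exact case: for the string abab the factor ab occurs twice, while every word of length 3 occurs at most once. Scanning small values of r and comparing r with `repetition_index(s, k, r)` is one way to experiment with the equation r = R(S,k,r) on toy inputs.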
The repetition index plays an important role in the construction of an indexing data structure based on a trie that represents the set of all factors of L(S,k,r) of length R(S,k,r). For each word x ∈ L(S,k,r), this data structure allows us to find the list occ(x) of all occurrences of x in the text S, up to k mismatches every r symbols, in time proportional to |x| + |occ(x)|.
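A toy version of such a trie can be sketched as follows. This is not the data structure of the paper: it is a naive illustration that stores, at every trie node, the start positions of the approximate occurrences of the word spelled by that node, and it only answers queries of length at most h written with letters of S. It assumes the same window-based reading of "k mismatches every r symbols" as above, and the names `neighbours`, `build_index` and `find` are hypothetical. The parameter h is meant to be R(S,k,r), computed separately.

```python
def neighbours(factor, alphabet, k, r):
    """Yield every word that matches `factor` with at most k mismatches
    in each window of r consecutive symbols (assumed definition)."""
    def extend(prefix, flags):
        if len(prefix) == len(factor):
            yield "".join(prefix)
            return
        for c in alphabet:
            new_flags = flags + [c != factor[len(prefix)]]
            # Checking the last r flags at every step covers every window.
            if sum(new_flags[-r:]) <= k:
                yield from extend(prefix + [c], new_flags)
    yield from extend([], [])


def build_index(s, h, k, r):
    """Trie of all words within the mismatch bound of each factor s[i:i+h].

    Factors near the end of s are shorter than h; they are inserted too,
    so that occurrences close to the end of the text are not lost.
    """
    alphabet = sorted(set(s))
    root = {"pos": set(), "next": {}}
    for i in range(len(s)):
        for w in neighbours(s[i:i + h], alphabet, k, r):
            node = root
            for c in w:
                node = node["next"].setdefault(c, {"pos": set(), "next": {}})
                node["pos"].add(i)
    return root


def find(root, x):
    """Sorted start positions of the approximate occurrences of x (1 <= |x| <= h)."""
    node = root
    for c in x:
        node = node["next"].get(c)
        if node is None:
            return []
    return sorted(node["pos"])
```

For example, `find(build_index("abab", 4, 1, 2), "aaa")` walks three edges and then reads the positions stored at the reached node. Storing position sets at every node trades a large amount of space for a simple query; it is a design choice made only to keep the sketch short.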
Keywords: Combinatorics on words, formal languages, approximate string matching, indexing