Abstract
Given two independent sequences of letters, we seek the probability distribution of the length of the longest matching word. The word may occupy different positions in the two sequences, and we consider both perfect and nearly perfect matching. We derive bounds and approximations for this probability and compare them with other bounds and approximations. The results can be applied to DNA sequences in molecular biology and to generalized matching between two independent random sequences.
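To make the statistic concrete, the following is a minimal sketch of how the length of the longest matching word could be computed and its distribution estimated by simulation. It is not the bounds or approximations derived in the paper. The function names, the reading of "nearly perfect" as "at most `max_mismatches` mismatched letters", and the choice of i.i.d. uniform letters over a DNA alphabet are all assumptions made for illustration.

```python
import random
from collections import Counter


def longest_matching_word(a, b, max_mismatches=0):
    """Length of the longest word shared by a and b at any pair of positions.

    Assumption: a "nearly perfect" match is one with at most
    max_mismatches mismatched letters; 0 means a perfect match.
    """
    n, m = len(a), len(b)
    best = 0
    # Each diagonal d pairs position i in a with position i - d in b.
    for d in range(-(m - 1), n):
        i0 = max(d, 0)
        j0 = i0 - d
        length = min(n - i0, m - j0)
        left = 0
        mismatches = 0
        # Sliding window along the diagonal with a bounded mismatch count.
        for right in range(length):
            if a[i0 + right] != b[j0 + right]:
                mismatches += 1
            while mismatches > max_mismatches:
                if a[i0 + left] != b[j0 + left]:
                    mismatches -= 1
                left += 1
            best = max(best, right - left + 1)
    return best


def empirical_distribution(n, m, alphabet="ACGT", max_mismatches=0,
                           trials=200, seed=0):
    """Monte Carlo estimate of P(longest matching word length = k).

    Assumption: letters are i.i.d. and uniform over the alphabet.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        a = [rng.choice(alphabet) for _ in range(n)]
        b = [rng.choice(alphabet) for _ in range(m)]
        counts[longest_matching_word(a, b, max_mismatches)] += 1
    return {k: c / trials for k, c in sorted(counts.items())}


if __name__ == "__main__":
    print(longest_matching_word("ACGTAC", "TTGTAG"))
    print(longest_matching_word("ACGTAC", "TTGTAG", max_mismatches=1))
    print(empirical_distribution(50, 50))
```

The scan costs time proportional to the product of the two sequence lengths, which is adequate for small simulations; an estimate like this could serve as a rough check on an analytical bound or approximation, not as a replacement for one.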
Cite this article
Sheng, KN., Naus, J.I. Pattern matching between two non-aligned random sequences. Bulletin of Mathematical Biology 56, 1143–1162 (1994). https://doi.org/10.1007/BF02460290