Abstract
DNA and protein sequence comparisons are performed by a number of computational algorithms. Most of these algorithms search for the alignment of two sequences that optimizes some alignment score. It is an important problem to assess the statistical significance of a given score. In this paper we use newly developed methods for Poisson approximation to derive estimates of the statistical significance of k-word matches on a diagonal of a sequence comparison. We require at least q of the k letters of the words to match, where 0<q≤k. The distribution of the number of matches on a diagonal is approximated, as well as the distribution of the order statistics of the sizes of clumps of matches on the diagonal. These methods provide an easily computed approximation of the distribution of the longest exact matching word between sequences. The methods are validated using comparisons of vertebrate and E. coli protein sequences. In addition, we compare two HLA class II transplantation antigens by this method and contrast the results with a dynamic programming approach. Several open problems are outlined in the last section.
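To make the idea of a Poisson approximation on a diagonal concrete, the following is a minimal sketch of the simplest case only: exact matches (q = k) on a single diagonal of length n, assuming that positions match independently with a common probability p. The function names and the parameters `n`, `k`, `p` are illustrative assumptions, not notation taken from the paper. The sketch counts "declumped" k-word matches (a run of k matches preceded by a mismatch or by the start of the diagonal), uses the mean of that count as the Poisson parameter, and compares the resulting approximation with an exact calculation.

```python
import math


def clump_mean(n: int, k: int, p: float) -> float:
    """Expected number of declumped k-word matches on a diagonal of length n.

    Assumption: positions match independently with probability p.
    A clump starts at position 1 with probability p**k, and at each of the
    other n - k admissible positions with probability (1 - p) * p**k.
    """
    if n < k:
        return 0.0
    return p ** k * ((n - k) * (1.0 - p) + 1.0)


def poisson_prob_no_kword(n: int, k: int, p: float) -> float:
    """Poisson approximation to P(longest exact match on the diagonal < k)."""
    return math.exp(-clump_mean(n, k, p))


def exact_prob_no_kword(n: int, k: int, p: float) -> float:
    """Exact P(no run of k consecutive matches in n independent positions).

    state[r] is the probability that the current run of matches has length r
    and that no run of length k has occurred so far.
    """
    state = [0.0] * k
    state[0] = 1.0
    for _ in range(n):
        new_state = [0.0] * k
        new_state[0] = sum(state) * (1.0 - p)   # a mismatch resets the run
        for r in range(1, k):
            new_state[r] = state[r - 1] * p     # a match extends the run
        state = new_state                       # runs reaching k are dropped
    return sum(state)


if __name__ == "__main__":
    n, k, p = 1000, 10, 0.5   # hypothetical values chosen for illustration
    print("Poisson mean:", clump_mean(n, k, p))
    print("Poisson approximation:", poisson_prob_no_kword(n, k, p))
    print("Exact probability:", exact_prob_no_kword(n, k, p))
```

Counting clumps rather than every k-word match is the design choice that matters here: overlapping matches within one long run are strongly dependent, whereas clump starts are rare and nearly independent, which is the setting in which a Poisson approximation is reasonable. The case q < k and the compound Poisson and process approximations named in the abstract are not covered by this sketch.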
Additional information
This work was supported by grants DMS 90-05833 from NSF and GM 36230 from NIH.
About this article
Cite this article
Goldstein, L., Waterman, M.S. Poisson, compound Poisson and process approximations for testing statistical significance in sequence comparisons. Bulletin of Mathematical Biology 54, 785–812 (1992). https://doi.org/10.1007/BF02459930