Poisson, compound Poisson and process approximations for testing statistical significance in sequence comparisons

Bulletin of Mathematical Biology

Abstract

DNA and protein sequence comparisons are performed by a number of computational algorithms. Most of these algorithms search for the alignment of two sequences that optimizes some alignment score. Assessing the statistical significance of a given score is an important problem. In this paper we use newly developed methods for Poisson approximation to derive estimates of the statistical significance of k-word matches on a diagonal of a sequence comparison. We require at least q of the k letters of the words to match, where 0 < q ≤ k. The distribution of the number of matches on a diagonal is approximated, as well as the distribution of the order statistics of the sizes of clumps of matches on the diagonal. These methods provide an easily computed approximation of the distribution of the longest exact matching word between sequences. The methods are validated using comparisons of vertebrate and E. coli protein sequences. In addition, we compare two HLA class II transplantation antigens by this method and contrast the results with a dynamic programming approach. Several open problems are outlined in the last section.
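The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's procedure: the function names are hypothetical, letters are assumed independent with a single match probability p, and the tail formula is the standard first-order declumped Poisson approximation for a run of t exact matches, which may differ from the estimates derived in the article. The sketch computes match indicators on one diagonal, counts k-letter windows with at least q matches, and approximates the probability that the longest exact match has length at least t.

```python
import math


def diagonal_matches(a, b, offset):
    """Match indicators along one diagonal: a[i] compared with b[i + offset]."""
    start = max(0, -offset)
    stop = min(len(a), len(b) - offset)
    return [1 if a[i] == b[i + offset] else 0 for i in range(start, stop)]


def count_word_matches(indicators, k, q):
    """Count windows of k consecutive positions containing at least q matches."""
    if not 0 < q <= k:
        raise ValueError("need 0 < q <= k")
    if len(indicators) < k:
        return 0
    window = sum(indicators[:k])          # matches in the first window
    count = 1 if window >= q else 0
    for i in range(k, len(indicators)):   # slide the window one position at a time
        window += indicators[i] - indicators[i - k]
        if window >= q:
            count += 1
    return count


def longest_run(indicators):
    """Length of the longest run of consecutive matches (longest exact word)."""
    best = cur = 0
    for c in indicators:
        cur = cur + 1 if c else 0
        best = max(best, cur)
    return best


def longest_run_tail(n, p, t):
    """Approximate P(longest run >= t) on a diagonal of length n.

    Assumption (not taken from the article): positions match independently
    with probability p. Runs are declumped by counting only runs that start
    at position 1 or directly after a mismatch, which gives the Poisson mean
    lam = p**t * (1 + (n - t) * (1 - p)).
    """
    if t < 1:
        raise ValueError("need t >= 1")
    if t > n:
        return 0.0
    lam = p ** t * (1 + (n - t) * (1 - p))
    return 1.0 - math.exp(-lam)
```

For example, `longest_run_tail(n, p, longest_run(c))` with `c = diagonal_matches(a, b, 0)` gives a rough significance estimate for the longest exact match on the main diagonal, under the independence assumption stated above.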




This work was supported by grants DMS 90-05833 from NSF and GM 36230 from NIH.


About this article

Cite this article

Goldstein, L., Waterman, M.S. Poisson, compound Poisson and process approximations for testing statistical significance in sequence comparisons. Bulletin of Mathematical Biology 54, 785–812 (1992). https://doi.org/10.1007/BF02459930
