Poisson, compound Poisson and process approximations for testing statistical significance in sequence comparisons

Goldstein, Larry; Waterman, Michael S.

doi:10.1007/BF02459930

Published: September 1992

Volume 54, pages 785–812, (1992)

Bulletin of Mathematical Biology

Larry Goldstein¹ &
Michael S. Waterman¹

Abstract

DNA and protein sequence comparisons are performed by a number of computational algorithms. Most of these algorithms search for the alignment of two sequences that optimizes some alignment score. It is an important problem to assess the statistical significance of a given score. In this paper we use newly developed methods for Poisson approximation to derive estimates of the statistical significance of k-word matches on a diagonal of a sequence comparison. We require at least q of the k letters of the words to match, where 0 < q ≤ k. The distribution of the number of matches on a diagonal is approximated, as well as the distribution of the order statistics of the sizes of clumps of matches on the diagonal. These methods provide an easily computed approximation of the distribution of the longest exact matching word between sequences. The methods are validated using comparisons of vertebrate and E. coli protein sequences. In addition, we compare two HLA class II transplantation antigens by this method and contrast the results with a dynamic programming approach. Several open problems are outlined in the last section.
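To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It marks letter matches along one diagonal, finds k-letter windows with at least q matches, and groups overlapping windows into clumps. All function names are hypothetical, and the clump definition (a maximal run of consecutive qualifying window starts) is a simplifying assumption that may differ from the definition used in the paper. The last function uses the standard declumped Poisson heuristic for exact matches (q = k) under an assumed i.i.d. model with per-letter match probability p, with rate λ = n(1 − p)pᵏ; the paper's own approximations and error bounds may take a different form.

```python
import math


def match_indicators(a, b, offset=0):
    """Return 1/0 match indicators along one diagonal: a[i] versus b[i + offset]."""
    start = max(0, -offset)
    stop = min(len(a), len(b) - offset)
    return [1 if a[i] == b[i + offset] else 0 for i in range(start, stop)]


def qualifying_windows(indicators, k, q):
    """Start positions of k-letter windows that contain at least q matches."""
    if not 0 < q <= k:
        raise ValueError("need 0 < q <= k")
    return [i for i in range(len(indicators) - k + 1)
            if sum(indicators[i:i + k]) >= q]


def count_clumps(starts):
    """Count maximal runs of consecutive window starts (assumed clump definition)."""
    clumps = 0
    previous = None
    for s in starts:
        if previous is None or s != previous + 1:
            clumps += 1
        previous = s
    return clumps


def poisson_tail_exact_match(n, p, k):
    """Heuristic P(at least one run of >= k exact matches) on a diagonal of length n.

    Assumes i.i.d. letters with match probability p (an illustrative model only).
    A clump starts where a mismatch is followed by k matches, so lambda = n(1-p)p^k.
    """
    lam = n * (1 - p) * p ** k
    return 1 - math.exp(-lam)


if __name__ == "__main__":
    ind = match_indicators("ACGTACGT", "ACGTTCGT")
    print(ind)
    print(count_clumps(qualifying_windows(ind, k=3, q=3)))
    print(poisson_tail_exact_match(n=1000, p=0.25, k=8))
```

Counting clumps rather than individual windows matters because overlapping windows are strongly dependent: one long match produces many qualifying windows, while the number of clumps is much closer to Poisson.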

Author information

Authors and Affiliations

Department of Mathematics, University of Southern California, 90089-1113, Los Angeles, CA, USA
Larry Goldstein & Michael S. Waterman

Additional information

This work was supported by grants DMS 90-05833 from NSF and GM 36230 from NIH.

About this article

Cite this article

Goldstein, L., Waterman, M.S. Poisson, compound Poisson and process approximations for testing statistical significance in sequence comparisons. Bltn Mathcal Biology 54, 785–812 (1992). https://doi.org/10.1007/BF02459930

Received: 03 March 1991
Issue Date: September 1992
DOI: https://doi.org/10.1007/BF02459930
