Abstract
A “semi-probabilistic” alignment algorithm that combines ideas from Smith-Waterman and probabilistic alignment is proposed and studied in detail. The score statistics of this “hybrid” algorithm are predicted to be of the universal Gumbel form, with the key Gumbel parameter λ taking on a fixed asymptotic value for a wide variety of scoring parameters. We have also characterized the “extremal ensemble”, i.e., the collection of sequence pairs exhibiting the similarities to which a given scoring system is most sensitive. Based on this extremal ensemble, a simple recipe is given for computing the “relative entropy” and, from it, the correction to λ due to finite sequence length. This allows us to assign p-values to alignment results for arbitrary scoring parameters and gap costs. The predictions compare well with direct numerical simulations for a broad range of sequence lengths and various choices of the substitution scores and affine gap parameters.
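As an illustration of how a Gumbel score distribution translates into a p-value, the following is a minimal sketch, not the chapter's own recipe. It assumes the standard Karlin-Altschul form P(S ≥ x) = 1 − exp(−K·m·n·e^(−λx)) for sequences of lengths m and n. The function name `gumbel_p_value` and all numerical values are hypothetical, and λ and K are treated as inputs. The finite-length correction to λ mentioned above is not reproduced here.

```python
import math


def gumbel_p_value(score, lam, k, m, n):
    """Probability of seeing a score >= `score` by chance.

    Assumes a Gumbel (extreme value) score distribution in the
    standard Karlin-Altschul form:
        P(S >= x) = 1 - exp(-K * m * n * exp(-lam * x))
    where m and n are the sequence lengths. `lam` and `k` are
    supplied by the caller (hypothetical inputs in this sketch).
    """
    # Expected number of chance alignments scoring at least `score`.
    expected_hits = k * m * n * math.exp(-lam * score)
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for tiny x.
    return -math.expm1(-expected_hits)


if __name__ == "__main__":
    # Hypothetical values chosen only to show the call.
    p = gumbel_p_value(score=25.0, lam=1.0, k=0.1, m=300, n=300)
    print(f"p-value: {p:.3e}")
```

For large scores the expected number of chance hits is small and the p-value is approximately equal to it, which is why the tail of the distribution decays as e^(−λx).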
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Yu, YK., Bundschuh, R., Hwa, T. (2002). Statistical significance and extremal ensemble of gapped local hybrid alignment. In: Lässig, M., Valleriani, A. (eds) Biological Evolution and Statistical Physics. Lecture Notes in Physics, vol 585. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45692-9_1
DOI: https://doi.org/10.1007/3-540-45692-9_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43188-6
Online ISBN: 978-3-540-45692-6
eBook Packages: Springer Book Archive