Abstract
Sequence alignment has been an invaluable tool for finding homologous sequences. The significance of the homology found is often quantified statistically by p-values. Theory for computing p-values exists for gapless alignments [Karlin, S., Altschul, S.F., 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264–2268; Karlin, S., Dembo, A., 1992. Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. Appl. Probab. 24, 13–140], but a full generalization to alignments with gaps is not yet complete. We present a unified statistical analysis of two common sequence comparison algorithms: maximum-score (Smith-Waterman) alignments and their generalized probabilistic counterparts, including maximum-likelihood alignments and hidden Markov models. The most important statistical characteristic of these algorithms is the distribution function of the maximum score S_max or, for probabilistic alignment, the maximum free energy F_max, for mutually uncorrelated random sequences. This distribution is known empirically to be of the Gumbel form, with an exponential tail P(S_max > x) ∼ exp(−λx) for maximum-score alignment and P(F_max > x) ∼ exp(−λx) for some classes of probabilistic alignment. We derive an exact expression for λ for particular probabilistic alignments. This result is then used to obtain accurate λ values for generic probabilistic and maximum-score alignments. Although the result is demonstrated for a simple match-mismatch scoring system, it is expected to be a good starting point for more general scoring functions.
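As a rough illustration of the quantities named in the abstract, the following Python sketch computes the maximum local alignment score S_max with a match-mismatch scoring system and evaluates a Gumbel-form p-value. It is a minimal sketch under stated assumptions, not the method derived in the article: the linear gap cost, the default score values, the prefactor K, and the function names are all hypothetical choices made for the example, and λ must be supplied by the caller.

```python
import math


def smith_waterman_max_score(a, b, match=1.0, mismatch=-1.0, gap=2.0):
    """Return S_max, the best local alignment score of strings a and b.

    Assumptions for illustration: match-mismatch scoring and a linear
    gap cost `gap` per inserted or deleted letter.
    """
    prev = [0.0] * (len(b) + 1)
    best = 0.0
    for i in range(1, len(a) + 1):
        curr = [0.0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a score may restart from zero at any cell.
            curr[j] = max(0.0, prev[j - 1] + s, prev[j] - gap, curr[j - 1] - gap)
            best = max(best, curr[j])
        prev = curr
    return best


def gumbel_p_value(x, lam, K, m, n):
    """Gumbel-form estimate P(S_max > x) = 1 - exp(-K m n exp(-lam x)).

    For large x this decays as exp(-lam x). K is an assumed prefactor and
    m, n are the sequence lengths; none of these values come from the article.
    """
    return -math.expm1(-K * m * n * math.exp(-lam * x))
```

In practice λ and K for gapped alignment are often estimated by scoring many pairs of random sequences and fitting the resulting S_max histogram; the sketch only shows where such values would be used.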
References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
Arratia, R., Waterman, M.S., 1994. A phase transition for the score in matching random sequences allowing deletions. Ann. Appl. Probab. 4, 200–225.
Bundschuh, R., 2002. Asymmetric exclusion process and extremal statistics of random sequences. Phys. Rev. E 65, 031911.
Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C., 1978. A model of evolutionary change in proteins. In: Dayhoff, M.O., Eck, R.V. (Eds.), Atlas of Protein Sequence and Structure, vol. 5, suppl. 3. Natl. Biomed. Res. Found., pp. 345–358.
Drasdo, D., Hwa, T., Lässig, M., 1998. A scaling theory of sequence alignment with gaps. In: Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology. ISMB98, pp. 52–58.
Drasdo, D., Hwa, T., Lässig, M., 2000. Scaling laws and similarity detection in sequence alignment with gaps. J. Comput. Biol. 7, 115–141.
Durbin, R., Eddy, S., Krogh, A., Mitchison, G., 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, U.K.
Eddy, S.R., 1998. Profile hidden Markov models. Bioinformatics 14, 755–763.
Friedberg, R., Yu, Y.-K., 1994. Directed waves in random media: an analytical calculation. Phys. Rev. E 49, 5755–5762.
Gumbel, E.J., 1958. Statistics of Extremes. Columbia University Press, New York, NY.
Halpin-Healy, T., Zhang, Y.C., 1995. Kinetic roughening phenomena, stochastic growth, directed polymers and all that. Phys. Rep. 254, 215–414.
Hwa, T., Lässig, M., 1998. Optimal detection of sequence similarity by local alignment. In: RECOMB98. pp. 109–116.
Karlin, S., Altschul, S.F., 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264–2268.
Karlin, S., Dembo, A., 1992. Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. Appl. Probab. 24, 13–140.
Karplus, K., Barrett, C., Hughey, R., 1998. Hidden Markov models for detecting remote protein homologies. Bioinformatics 14, 846–856.
Kschischo, M., Lässig, M., 2000. Finite-temperature sequence alignment. Pac. Symp. Biocomput. 5, 621–632.
Metzler, D., 2002. A Poisson model for gapped local alignments. Stat. Prob. Lett. 60, 91–100.
Miyazawa, S., 1996. A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng. 8, 999–1009.
Mott, R., Tribe, R., 1999. Approximate statistics of gapped alignment. J. Comput. Biol. 6, 91–112.
Olsen, R., Bundschuh, R., Hwa, T., 1999. Rapid assessment of extremal statistics for gapped local alignment. In: Lengauer, T. et al. (Eds.), Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. ISMB99, AAAI Press, Menlo Park, pp. 211–222.
Pearson, W.R., 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444–2448.
Siegmund, D., Yakir, B., 2000. Approximate p-value for local sequence alignments. Ann. Statist. 28, 657–680.
Smith, T.F., Waterman, M.S., 1981. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.
Spang, R., Vingron, M., 2000. Limits of homology detection by pairwise sequence comparison. Bioinformatics 17, 338–342.
Yu, Y.-K., 1999. Calculation of wave center deflection and multifractal analysis of directed waves through the study of su(1,1) ferromagnets. In: Batchelor, M.T., Wille, L.T. (Eds.), Statistical Physics on the Eve of the 21st Century. World Scientific, NJ.
Yu, Y.-K., 2004. Replica model for an unusual directed polymer in 1+1 dimensions and prediction of the extremal parameter of gapped sequence alignment statistics. Phys. Rev. E 69, 061904.
Yu, Y.-K., Bundschuh, R., Hwa, T., 2002. Hybrid alignment: high performance with universal statistics. Bioinformatics 18, 864–872.
Yu, Y.-K., Hwa, T., 2001. Statistical significance of probabilistic sequence alignment and related local hidden Markov models. J. Comput. Biol. 8, 249–282.
Zhang, M.Q., Marr, T.G., 1995. Alignment of molecular sequences seen as random path analysis. J. Theor. Biol. 174, 119–129.
Cite this article
Kschischo, M., Lässig, M. & Yu, YK. Toward an accurate statistics of gapped alignments. Bull. Math. Biol. 67, 169–191 (2005). https://doi.org/10.1016/j.bulm.2004.07.001