Toward an accurate statistics of gapped alignments

Bulletin of Mathematical Biology

Abstract

Sequence alignment has been an invaluable tool for finding homologous sequences. The significance of the homology found is often quantified statistically by p-values. Theory for computing p-values exists for gapless alignments [Karlin, S., Altschul, S.F., 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264–2268; Karlin, S., Dembo, A., 1992. Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. Appl. Probab. 24, 13–140], but a full generalization to alignments with gaps is not yet complete. We present a unified statistical analysis of two common sequence comparison algorithms: maximum-score (Smith-Waterman) alignments and their generalized probabilistic counterparts, including maximum-likelihood alignments and hidden Markov models. The most important statistical characteristic of these algorithms is the distribution function of the maximum score S_max or, for probabilistic alignments, the maximum free energy F_max, for mutually uncorrelated random sequences. This distribution is known empirically to be of the Gumbel form with an exponential tail P(S_max > x) ∼ exp(−λx) for maximum-score alignment and P(F_max > x) ∼ exp(−λx) for some classes of probabilistic alignment. We derive an exact expression for λ for particular probabilistic alignments. This result is then used to obtain accurate λ values for generic probabilistic and maximum-score alignments. Although the result is demonstrated for a simple match-mismatch scoring system, it is expected to be a good starting point for more general scoring functions.
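The empirical Gumbel behaviour described in the abstract can be illustrated with a brute-force simulation. The sketch below is not the analytical derivation of the paper: it scores pairs of uncorrelated random sequences with a Smith-Waterman recursion and estimates λ from the sample variance of S_max, using the Gumbel relation Var(S_max) = π²/(6λ²). The scoring values (match +1, mismatch −1, linear gap cost 2), the four-letter alphabet, and all function names are assumptions chosen for illustration only.

```python
import math
import random


def smith_waterman_max(a, b, match=1.0, mismatch=-1.0, gap=2.0):
    """Return the maximum local alignment score S_max (linear gap cost)."""
    prev = [0.0] * (len(b) + 1)
    best = 0.0
    for x in a:
        curr = [0.0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # Local alignment: scores are never allowed to drop below zero.
            score = max(0.0, diag, prev[j] - gap, curr[j - 1] - gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def estimate_lambda(scores):
    """Method-of-moments Gumbel fit: Var = pi^2 / (6 * lambda^2)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    if var == 0.0:
        raise ValueError("all scores are identical; cannot estimate lambda")
    return math.pi / math.sqrt(6.0 * var)


def random_sequence(length, rng, alphabet="ACGT"):
    """Draw an i.i.d. uniform sequence (assumed null model)."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def simulate_lambda(n_pairs=200, length=80, seed=0):
    """Estimate lambda from S_max of n_pairs random sequence pairs."""
    rng = random.Random(seed)
    scores = [
        smith_waterman_max(random_sequence(length, rng),
                           random_sequence(length, rng))
        for _ in range(n_pairs)
    ]
    return estimate_lambda(scores)


if __name__ == "__main__":
    print("estimated lambda:", simulate_lambda())
```

With short sequences and integer-valued scores, such an estimate is noisy and subject to finite-size bias, which is part of the motivation for deriving accurate λ values analytically rather than by simulation.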

References

  • Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410.

  • Arratia, R., Waterman, M.S., 1994. A phase transition for the score in matching random sequences allowing deletions. Ann. Appl. Probab. 4, 200–225.

  • Bundschuh, R., 2002. Asymmetric exclusion process and extremal statistics of random sequences. Phys. Rev. E 65, 031911.

  • Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C., 1978. A model of evolutionary change in proteins. In: Dayhoff, M.O., Eck, R.V. (Eds.), Atlas of Protein Sequence and Structure, vol. 5, suppl. 3. Natl. Biomed. Res. Found., pp. 345–358.

  • Drasdo, D., Hwa, T., Lässig, M., 1998. A scaling theory of sequence alignment with gaps. In: Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology. ISMB98, pp. 52–58.

  • Drasdo, D., Hwa, T., Lässig, M., 2000. Scaling laws and similarity detection in sequence alignment with gaps. J. Comput. Biol. 7, 115–141.

  • Durbin, R., Eddy, S., Krogh, A., Mitchison, G., 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, U.K.

  • Eddy, S.R., 1998. Profile hidden Markov models. Bioinformatics 14, 755–763.

  • Friedberg, R., Yu, Y.-K., 1994. Directed waves in random media: an analytical calculation. Phys. Rev. E 49, 5755–5762.

  • Gumbel, E.J., 1958. Statistics of Extremes. Columbia University Press, New York, NY.

  • Halpin-Healy, T., Zhang, Y.C., 1995. Kinetic roughening phenomena, stochastic growth, directed polymers and all that. Phys. Rep. 254, 215–414.

  • Hwa, T., Lässig, M., 1998. Optimal detection of sequence similarity by local alignment. In: RECOMB98. pp. 109–116.

  • Karlin, S., Altschul, S.F., 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264–2268.

  • Karlin, S., Dembo, A., 1992. Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. Appl. Probab. 24, 13–140.

  • Karplus, K., Barrett, C., Hughey, R., 1998. Hidden Markov models for detecting remote protein homologies. Bioinformatics 14, 846–856.

  • Kschischo, M., Lässig, M., 2000. Finite-temperature sequence alignment. Pac. Symp. Biocomput. 5, 621–632.

  • Metzler, D., 2002. A Poisson model for gapped local alignments. Stat. Prob. Lett. 60, 91–100.

  • Miyazawa, S., 1996. A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng. 8, 999–1009.

  • Mott, R., Tribe, R., 1999. Approximate statistics of gapped alignment. J. Comput. Biol. 6, 91–112.

  • Olsen, R., Bundschuh, R., Hwa, T., 1999. Rapid assessment of extremal statistics for gapped local alignment. In: Lengauer, T. et al. (Eds.), Proceedings of The Seventh International Conference on Intelligent Systems for Molecular Biology. ISMB99, AAAI Press, Menlo Park, pp. 211–222.

  • Pearson, W.R., 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444–2448.

  • Siegmund, D., Yakir, B., 2000. Approximate p-value for local sequence alignments. Ann. Statist. 28, 657–680.

  • Smith, T.F., Waterman, M.S., 1981. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.

  • Spang, R., Vingron, M., 2000. Limits of homology detection by pairwise sequence comparison. Bioinformatics 17, 338–342.

  • Yu, Y.-K., 1999. Calculation of wave center deflection and multifractal analysis of directed waves through the study of su(1,1) ferromagnets. In: Batchelor, M.T., Wille, L.T. (Eds.), Statistical Physics on the Eve of the 21st Century. World Scientific, NJ.

  • Yu, Y.-K., 2004. Replica model for an unusual directed polymer in 1+1 dimensions and prediction of the extremal parameter of gapped sequence alignment statistics. Phys. Rev. E 69, 061904.

  • Yu, Y.-K., Bundschuh, R., Hwa, T., 2002. Hybrid alignment: high performance with universal statistics. Bioinformatics 18, 864–872.

  • Yu, Y.-K., Hwa, T., 2001. Statistical significance of probabilistic sequence alignment and related local hidden Markov models. J. Comput. Biol. 8, 249–282.

  • Zhang, M.Q., Marr, T.G., 1995. Alignment of molecular sequences seen as random path analysis. J. Theor. Biol. 174, 119–129.

Author information

Corresponding author

Correspondence to Yi-Kuo Yu.

About this article

Cite this article

Kschischo, M., Lässig, M. & Yu, YK. Toward an accurate statistics of gapped alignments. Bull. Math. Biol. 67, 169–191 (2005). https://doi.org/10.1016/j.bulm.2004.07.001

  • DOI: https://doi.org/10.1016/j.bulm.2004.07.001