Skip to main content
Log in

A protein alignment scoring system sensitive at all evolutionary distances

  • Published:
Journal of Molecular Evolution Aims and scope Submit manuscript

Summary

Protein sequence alignments generally are constructed with the aid of a “substitution matrix” that specifies a score for aligning each pair of amino acids. Assuming a simple random protein model, it can be shown that any such matrix, when used for evaluating variable-length local alignments, is implicitly a “log-odds” matrix, with a specific probability distribution for amino acid pairs to which it is uniquely tailored. Given a model of protein evolution from which such distributions may be derived, a substitution matrix adapted to detecting relationships at any chosen evolutionary distance can be constructed. Because in a database search it generally is not known a priori what evolutionary distances will characterize the similarities found, it is necessary to employ an appropriate range of matrices in order not to overlook potential homologies. This paper formalizes this concept by defining a scoring system that is sensitive at all detectable evolutionary distances. The statistical behavior of this scoring system is analyzed, and it is shown that for a typical protein database search, estimating the originally unknown evolutionary distance appropriate to each alignment costs slightly over two bits of information, or somewhat less than a factor of five in statistical significance. A much greater cost may be incurred, however, if only a single substitution matrix, corresponding to the wrong evolutionary distance, is employed.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  • Altschul SF (1991) Amino acid substitution matrices from an information theoretic perspective. J Mol Biol 219:555–565

    Google Scholar 

  • Altschul SF, Erickson BW (1986) A nonlinear measure of sub-alignment similarity and its significance levels. Bull Math Biol 48:617–632

    Google Scholar 

  • Altschul SF, Erickson BW (1988) Significance levels for biological sequence comparison using non-linear similarity functions. Bull Math Biol 50:77–92

    Google Scholar 

  • Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410

    Google Scholar 

  • Argos P (1987) A sensitive procedure to compare amino acid sequences. J Mol Biol 193:385–396

    Google Scholar 

  • Arratia R, Gordon L, Waterman MS (1986) An extreme value theory for sequence matching. Ann Star 14:971–993

    Google Scholar 

  • Arratia R, Morris P, Waterman MS (1988) Stochastic scrabble: large deviations for sequences with scores. J Appl Prob 25: 106–119

    Google Scholar 

  • Arratia R, Waterman MS (1989) The Erdos-Renyi strong law for pattern matching with a given proportion of mismatches. Ann Prob 17:1152–1169

    Google Scholar 

  • Barker WC, George DG, Hunt LT (1990) Protein sequence database. Methods Enzymol 183:31–49

    Google Scholar 

  • Chow ET, Hunkapiller T, Peterson JC, Zimmerman BA, Waterman MS (1991) A systolic array processor for biological information signal processing. In: Proceedings of the 1991 international conference on supercomputing. ACM Press, New York, pp 216–223

    Google Scholar 

  • Collins JF, Coulson AFW, Lyall A (1988) The significance of protein sequence similarities. Comput Appl Biosci 4:67–71

    Google Scholar 

  • Coulson AFW, Collins JF, Lyall A (1987) Protein and nucleic acid database searching: a suitable case for parallel processing. Computer J 30:420–424

    Google Scholar 

  • Cover TM, Thomas JA (1991) Elements of information theory. Wiley, New York, pp 326–329

    Google Scholar 

  • Dembo A, Karlin S (1991) Strong limit theorems of empirical functionals for large exceedances of partial sums of i.i.d. variables. Ann Prob 19:1737–1755

    Google Scholar 

  • Dayhoff MO, Schwartz RM, Orcutt BC (1978) A model of evolutionary change in proteins. In: Dayhoff MO (ed) Atlas of protein sequence and structure, vol 5, suppl 3. Natl Biomed Res Found, Washington, pp 345–352

    Google Scholar 

  • Feng DF, Johnson MS, Doolittle RF (1985) Aligning amino acid sequences: comparison of commonly used methods. J Mol Evol 21:112–125

    Google Scholar 

  • Fisher RA (1925) Theory of statistical estimation. Proc Cambridge Phil Soc 22:700–725

    Google Scholar 

  • Goad WB, Kanehisa MI (1982) Pattern recognition in nucleic acid sequences. I. A general method for finding local homologies and symmetries. Nucl Acids Res 10:247–263

    Google Scholar 

  • Gonnet GH (1993) A tutorial introduction to computational biochemistry using Darwin. Manuscript in preparation

  • Gonnet GH, Cohen MA, Benner SA (1992) Exhaustive matching of the entire protein sequence database. Science 256:1443–1445

    Google Scholar 

  • Gumbel EJ (1958) Statistics of extremes. Columbia University Press, New York

    Google Scholar 

  • Gribskov M, McLachlan AD, Eisenberg D (1987) Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci USA 84:4355–4358

    Google Scholar 

  • Hamming RW (1986) Coding and information theory. Prentice-Hall, Englewood Cliffs, p 106

    Google Scholar 

  • Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89:10915–10919

    Google Scholar 

  • Hughey RP (1991) Programmable systolic arrays. PhD Thesis, Brown University, Providence

    Google Scholar 

  • Hyldig-Nielsen JJ, Jensen EO, Paludan K, Wiborg O, Garrett R, Jorgensen P, Marcker KA (1982) The primary structures of two leghemoglobin genes from soybean. Nucl Acids Res 10: 689–701

    Google Scholar 

  • Jones DT, Taylor WR, Thornton JM (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8:275–282

    Google Scholar 

  • Karlin S, Altschul SF (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA 87:2264–2268

    Google Scholar 

  • Karlin S, Bucher P, Brendel V, Altschul SF (1991) Statistical methods and insights for protein and DNA sequences. Annu Rev Biophys Biophys Chem 20:175–203

    Google Scholar 

  • Karlin S, Dembo A, Kawabata T (1990) Statistical composition of high-scoring segments from molecular sequences. Ann Stat 18:571–581

    Google Scholar 

  • Karlin S, Ost F (1988) Maximum length of common words among random letter sequences. Ann Prob 16:535–563

    Google Scholar 

  • Lipman DJ, Pearson WR (1985) Rapid and sensitive protein similarity searches. Science 227:1435–1441

    Google Scholar 

  • Mauri F, Omnaas J, DavidsonL, Whitfill C, Kitto GB (1991) Amino acid sequence of a globin from the sea cucumber Caudina (Molpadia) arenicola. Biochim Biophys Acta 1078:63–67

    Google Scholar 

  • McLachlan AD (1971) Tests for comparing related amino-acid sequences. Cytochrome c and cytochrome c551. J Mol Biol 61:409–424

    Google Scholar 

  • Mott R (1992) Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bull Math Biol 54:59–75

    Google Scholar 

  • Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequences of two proteins. J Mol Biol 48:443–453

    Google Scholar 

  • Patthy L (1987) Detecting homology of distantly related proteins with consensus sequences. J Mol Biol 198:567–577

    Google Scholar 

  • Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85:2444–2448

    Google Scholar 

  • Rao JKM (1987) New scoring matrix for amino acid residue exchanges based on residue characteristic physical parameters. Int J Peptide Protein Res 29:276–281

    Google Scholar 

  • Risler JL, Delorme MO, Delacroix H, Henaut A (1988) Amino acid substitutions in structurally related proteins. A pattern recognition approach. Determination of a new and efficient scoring matrix. J Mol Biol 204:1019–1029

    Google Scholar 

  • Sankoff D, Kruskal JB (1983) Time warps, string edits and macromolecules: the theory and practice of sequence comparison. Addison-Wesley, Reading

    Google Scholar 

  • Schwartz RM, Dayhoff MO (1978) Matrices for detecting distant relationships. In: Dayhoff MO (ed) Atlas of protein sequence and structure, vol 5, suppl 3. Natl Biomed Res Found, Washington, pp 353–358

    Google Scholar 

  • Sellers PH (1974) On the theory and computation of evolutionary distances. SIAM J Appl Math 26:787–793

    Google Scholar 

  • Sellers PH (1984) Pattern recognition in genetic sequences by mismatch density. Bull Math Biol 46:501–514

    Google Scholar 

  • Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197

    Google Scholar 

  • Smith TF, Waterman MS, Burks C (1985) The statistical distribution of nucleic acid similarities. Nucl Acids Res 13:645–656

    Google Scholar 

  • States DJ, Gish W, Altschul SF (1991) Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods 3:66–70

    Google Scholar 

  • Stougaard J, Petersen TE, Marcker KA (1987) Expression of a complete soybean leghemoglobin gene in root nodules of transgenic Lotus corniculatus. Proc Natl Acad Sci USA 84: 5754–5757

    Google Scholar 

  • Taylor WR (1986) Identification of protein sequence homology by consensus template alignment. J Mol Biol 188:233–258

    Google Scholar 

  • Vogt G, Argos P (1992) Searching for distantly related protein sequences in large databases by parallel processing on a transputer machine. Comput Appl Biosci 8:49–55

    Google Scholar 

  • Wakabayashi S, Matsubara H, Webster DA (1986) Primary sequence of a dimeric bacterial haemoglobin from Vitreoscilla. Nature 322:481–483

    Google Scholar 

  • Waterman MS, Gordon L (1990) Multiple hypothesis testing for sequence comparisons. In: Bell GI, Marr TG (eds) Computers and DNA. Addison-Wesley, Reading, pp 127–135

    Google Scholar 

  • Waterman MS, Gordon L, Arratia R (1987) Phase transitions in sequence matches and nucleic acid structure. Proc Natl Acad Sci USA 84:1239–1243

    Google Scholar 

  • White C, Singh RK, Reintjes PB, Lampe J, Erickson BW, Dettloff WD, Chi VL, Altschul SF (1991) BioSCAN: A VLSI-based system for biosequence analysis. In: Proceedings of the 1991 IEEE international conference on computer design: VLSI in computers and processors. IEEE Comp Soc Press, Los Alamitos, pp 504–509

    Google Scholar 

  • Wilbur WJ (1985) On the PAM matrix model of protein evolution. Mol Biol Evol 2:434–447

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Rights and permissions

Reprints and permissions

About this article

Cite this article

Altschul, S.F. A protein alignment scoring system sensitive at all evolutionary distances. J Mol Evol 36, 290–300 (1993). https://doi.org/10.1007/BF00160485

Download citation

  • Received:

  • Accepted:

  • Issue Date:

  • DOI: https://doi.org/10.1007/BF00160485

Key words

Navigation