Abstract
A class of non-linear similarity functions s₁ has been proposed for comparing subalignments of biological sequences. The distribution of maximal s₁-similarities is well approximated by the extreme value distribution. The significance levels of s₁ are studied for a variety of nucleotide frequency distributions as well as for several matrices of amino acid substitution costs. Also, the significance levels of s₁ are explored for comparing three biological sequences. Several previously described subalignments of bovine proenkephalin and porcine prodynorphin are shown to be highly significant.
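To illustrate how an extreme value approximation can turn a maximal similarity score into a significance level, here is a minimal Python sketch. It assumes a Gumbel (maximum) form of the extreme value distribution and a method-of-moments fit. The function names and the example scores are hypothetical, and this is not the authors' estimation procedure or their similarity function.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def gumbel_fit_moments(samples):
    """Method-of-moments estimates (location u, scale beta) of a Gumbel
    distribution for maxima. Assumed fitting method, for illustration only."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    beta = math.sqrt(6.0 * var) / math.pi   # Var = (pi^2 / 6) * beta^2
    u = mean - EULER_GAMMA * beta           # Mean = u + gamma * beta
    return u, beta


def gumbel_tail(x, u, beta):
    """P(S >= x) = 1 - exp(-exp(-(x - u) / beta)) for a Gumbel variable S."""
    z = (x - u) / beta
    if z < -700.0:                          # exp(-z) would overflow; tail is 1
        return 1.0
    return -math.expm1(-math.exp(-z))       # accurate when the tail is tiny


if __name__ == "__main__":
    # Hypothetical maximal scores from comparisons of randomized sequences.
    random_maxima = [11.2, 12.5, 10.8, 13.1, 11.9, 12.2, 11.5, 12.8, 11.1, 12.0]
    u, beta = gumbel_fit_moments(random_maxima)
    observed = 16.0                         # hypothetical observed score
    print(f"u={u:.3f} beta={beta:.3f} P(S>={observed})={gumbel_tail(observed, u, beta):.3g}")
```

The tail probability is written with `math.expm1` so that very small significance levels are not lost to rounding, which matters when the observed score lies far beyond the fitted location.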
Cite this article
Altschul, S.F., Erickson, B.W. Significance levels for biological sequence comparison using non-linear similarity functions. Bulletin of Mathematical Biology 50, 77–92 (1988). https://doi.org/10.1007/BF02459979