Abstract
A new measure of subalignment similarity is introduced. Specifically, the similarity s(l, c) is defined as the logarithm to the base p of the probability of finding c or fewer mismatches in a subalignment of length l, where p is the probability of a match. Previous algorithms cannot use this measure to find locally optimal subalignments because, unlike the Needleman-Wunsch and Sellers similarities, this measure is nonlinear. A new pattern recognition algorithm, the DD algorithm, is described for finding all locally optimal subalignments of two nucleotide sequences. The DD algorithm can use s(l, c) or any other reasonable similarity function to assess the relative interest of subalignments. It searches only the diagonal graph, which lacks insertions and deletions. This search strategy greatly decreases the computation time and does not require an arbitrary choice of gap cost. The paths of the resulting DD graph usually draw attention to likely locations for insertions and deletions. A heuristic formula is derived for estimating significance levels for s(l, c) in the context of the lengths of the two aligned sequences. The DD algorithm has been used to find interesting subalignments between the nucleotide sequences for human and murine interleukin 2.
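The definition of s(l, c) can be made concrete with a short sketch. It assumes, which the abstract does not state, that positions are independent and each matches with probability p, so the number of mismatches in a subalignment of length l is binomial. The function names `similarity` and `diagonal_mismatches` and the default p = 0.25 (four equally frequent bases) are hypothetical choices for illustration. The sketch only scores a single gap-free segment; it is not the DD algorithm.

```python
import math


def similarity(l: int, c: int, p: float = 0.25) -> float:
    """Return s(l, c): log base p of P(at most c mismatches in length l).

    Assumption (not stated in the abstract): positions are independent and
    each one matches with probability p, so the mismatch count is binomial.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if not 0 <= c <= l:
        raise ValueError("c must satisfy 0 <= c <= l")
    q = 1.0 - p  # probability of a mismatch at one position
    # Probability of exactly k mismatches, summed for k = 0..c.
    prob = math.fsum(
        math.comb(l, k) * q**k * p ** (l - k) for k in range(c + 1)
    )
    # Change of base: log_p(x) = ln(x) / ln(p).
    # Note: for very long subalignments the terms underflow; a log-space
    # sum would be needed there.
    return math.log(prob) / math.log(p)


def diagonal_mismatches(a: str, b: str, i: int, j: int, l: int) -> int:
    """Count mismatches between a[i:i+l] and b[j:j+l], with no gaps."""
    if i < 0 or j < 0 or l < 0 or i + l > len(a) or j + l > len(b):
        raise ValueError("segment falls outside one of the sequences")
    return sum(x != y for x, y in zip(a[i:i + l], b[j:j + l]))


if __name__ == "__main__":
    # Hypothetical toy sequences, not data from the paper.
    a, b = "ACGTACGTAC", "ACGTTCGTAC"
    c = diagonal_mismatches(a, b, 0, 0, len(a))
    print(c, similarity(len(a), c))
```

Under this model a perfect match of length l has probability p^l, so s(l, 0) = l, and allowing every position to mismatch gives probability 1, so s(l, l) = 0. Because the score depends on l and c through a cumulative probability rather than a weighted sum, it is not additive along a path, which is the nonlinearity the abstract refers to.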
Literature
Altschul, S. F. and B. W. Erickson. 1985. “Significance of Nucleotide Sequence Alignments: A Method for Random Sequence Permutation that Preserves Dinucleotide and Codon Usage.” Molec. Biol. Evol. 2, 526–538.
— and —. 1986a. “Optimal Sequence Alignment Using Affine Gap Costs.” Bull. math. Biol. 48, 603–616.
— and —. 1986b. “Locally Optimal Subalignments Using Nonlinear Similarity Functions.” Bull. math. Biol. 48, 633–660.
Arratia, R. and M. S. Waterman. 1985. “Critical Phenomena in Sequence Matching.” Ann. Prob. 13, 1236–1249.
—, L. Gordon and M. S. Waterman. 1986. “An Extreme Value Theory for Sequence Matching.” Ann. Stat. 14, 971–993.
Erickson, B. W. and P. H. Sellers. 1983. “Recognition of Patterns in Genetic Sequences.” In Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, D. Sankoff and J. B. Kruskal (Eds), pp. 55–91. Reading, MA: Addison-Wesley.
—, L. T. May and P. B. Sehgal. 1984. “Internal Duplication in Human Alpha-1 and Beta-1 Interferon.” Proc. natn. Acad. Sci. U.S.A. 81, 7171–7175.
Fitch, W. M. 1983. “Calculating the Expected Frequencies of Potential Secondary Structure in Nucleic Acids as a Function of Stem Length, Loop Size, Base Composition and Nearest-neighbor Frequencies.” Nucl. Acids Res. 11, 4655–4663.
Fujita, T., C. Takaoka, H. Matsui and T. Taniguchi. 1983. “Structure of the Human Interleukin 2 Gene.” Proc. natn. Acad. Sci. U.S.A. 80, 7437–7441.
Gordon, L., M. F. Schilling and M. S. Waterman. 1986. “An Extreme Value Theory for Long Head Runs.” Prob. theor. Rel. 72, 279–287.
Gotoh, O. 1982. “An Improved Algorithm for Matching Biological Sequences.” J. molec. Biol. 162, 705–708.
Gumbel, E. J. 1962. “Statistical Theory of Extreme Values (Main Results).” In Contributions to Order Statistics, A. E. Sarhan and B. G. Greenberg (Eds), pp. 56–93. New York: Wiley.
Karlin, S. and G. Ghandour. 1985. “Comparative Statistics for DNA and Protein Sequences: Single Sequence Analysis.” Proc. natn. Acad. Sci. U.S.A. 82, 5800–5804.
Kruskal, J. B. and D. Sankoff. 1983. “An Anthology of Algorithms and Concepts for Sequence Comparison.” In Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, D. Sankoff and J. B. Kruskal (Eds), pp. 265–310. Reading, MA: Addison-Wesley.
Lewis, R. V. and B. W. Erickson. 1986. “Evolution of Proenkephalin and Prodynorphin.” Am. Zool., in press.
Lipman, D. J., W. J. Wilbur, T. F. Smith and M. S. Waterman. 1984. “On the Statistical Significance of Nucleic-acid Similarities.” Nucl. Acids Res. 12, 215–226.
Litman, G. W., L. Berger, K. Murphy, R. Litman, K. Hinds, C. L. Jahn and B. W. Erickson. 1983. “Complete Nucleotide Sequence of an Immunoglobulin VH Gene Homologue from Caiman, a Phylogenetically Ancient Reptile.” Nature 303, 349–352.
————, F. Podlaski, K. Hinds, C. L. Jahn, G. Dingerkus and B. W. Erickson. 1984. “Phylogenetic Diversification of Immunoglobulin VH Genes.” Dev. Comp. Immunol. 8, 499–514.
—, K. Murphy, L. Berger, R. Litman, K. Hinds and B. W. Erickson. 1985a. “Complete Nucleotide Sequence of Three VH Genes in Caiman, a Phylogenetically Ancient Reptile: Evolutionary Diversification in Coding Segments and Variation in the Structure and Organization of Recombination Elements.” Proc. natn. Acad. Sci. U.S.A. 82, 844–848.
————— and —. 1985b. “Immunoglobulin VH Gene Structure and Diversity in Heterodontus, a Phylogenetically Primitive Shark.” Proc. natn. Acad. Sci. U.S.A. 82, 2082–2086.
Needleman, S. B. and C. D. Wunsch. 1970. “A General Method Applicable to the Search for Similarities in the Amino Acid Sequences of Two Proteins.” J. molec. Biol. 48, 443–453.
Sellers, P. H. 1974. “On the Theory and Computation of Evolutionary Distances.” SIAM J. appl. Math. 26, 787–793.
—. 1980. “The Theory and Computation of Evolutionary Distances: Pattern Recognition.” J. Algorithms 1, 359–373.
—. 1984. “Pattern Recognition in Genetic Sequences by Mismatch Density.” Bull. math. Biol. 46, 501–514.
Shaw, M. W., R. A. Lamb, D. J. Briedis, B. W. Erickson and P. W. Choppin. 1982. “Complete Nucleotide Sequence of the Neuraminidase Gene of Influenza B Virus.” Proc. natn. Acad. Sci. U.S.A. 79, 6817–6821.
Smith, T. F. and M. S. Waterman. 1981. “Comparison of Biosequences.” Adv. appl. Math. 2, 482–489.
—— and C. Burks. 1985. “The Statistical Distribution of Nucleic Acid Similarities.” Nucl. Acids Res. 13, 645–656.
—— and J. R. Sadler. 1983. “Statistical Characterization of Nucleic Acid Sequence Functional Domains.” Nucl. Acids Res. 11, 2205–2220.
Swartz, M. N., T. A. Trautner and A. Kornberg. 1962. “Enzymatic Synthesis of Deoxyribonucleic Acid. XI. Further Studies on Nearest Neighbor Base Sequences in Deoxyribonucleic Acids.” J. biol. Chem. 237, 1961–1967.
Taniguchi, T., H. Matsui, T. Fujita, C. Takaoka, N. Kashima, R. Yoshimoto and J. Hamuro. 1983. “Structure and Expression of a Cloned cDNA for Human Interleukin-2.” Nature 302, 305–310.
Waterman, M. S. 1984. “Efficient Sequence Alignment Algorithms.” J. theor. Biol. 108, 333–337.
—, T. F. Smith and W. A. Beyer. 1976. “Some Biological Sequence Metrics.” Adv. Math. 20, 367–387.
Yokota, T., N. Arai, F. Lee, D. Rennick, T. Mosmann and K. Arai. 1985. “Use of a cDNA Expression Vector for Isolation of Mouse Interleukin 2 cDNA Clones: Expression of T-cell Growth-factor Activity After Transfection of Monkey Cells.” Proc. natn. Acad. Sci. U.S.A. 82, 68–72.
Cite this article
Altschul, S.F., Erickson, B.W. A nonlinear measure of subalignment similarity and its significance levels. Bulletin of Mathematical Biology 48, 617–632 (1986). https://doi.org/10.1007/BF02462327
DOI: https://doi.org/10.1007/BF02462327