Abstract
Multiple sequence comparison refers to the search for similarity in three or more sequences. This article presents a survey of the exhaustive (optimal) and heuristic (possibly sub-optimal) methods developed for the comparison of multiple macromolecular sequences. Emphasis is given to the different approaches taken by the heuristic methods. Four distance measures derived from information engineering and genetic studies are introduced for the comparison between two alignments of sequences. The use of entropy, which plays a central role in information theory as a measure of information, choice and uncertainty, is proposed as a simple measure for evaluating the optimality of an alignment in the absence of any a priori knowledge about the structures of the sequences being compared. This article also gives two examples of comparison between alternative alignments of the same set of 5S RNAs as obtained by several different heuristic methods.
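As a rough illustration of how entropy could score an alignment, the sketch below sums the Shannon entropy of each alignment column. This is an assumption made for illustration, not the article's formulation, which the abstract does not state. The function names `column_entropy` and `alignment_entropy` are hypothetical, and treating the gap character `-` as an ordinary symbol is a further assumption.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropy(rows):
    """Sum of per-column entropies for an alignment given as equal-length strings.

    Assumption: gaps ('-') are counted like any other symbol.
    """
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must be non-empty in number and equal in length")
    # zip(*rows) walks the alignment column by column.
    return sum(column_entropy(col) for col in zip(*rows))


if __name__ == "__main__":
    # Two hypothetical alignments of the same three toy sequences.
    first = ["AC-T", "ACGT", "ACGA"]
    second = ["ACT-", "ACGT", "ACGA"]
    print(alignment_entropy(first), alignment_entropy(second))
```

Under this sketch, a lower total means the columns are more uniform, so two alternative alignments of the same sequences can be ranked without any structural knowledge.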
Cite this article
Chan, S.C., Wong, A.K.C. & Chiu, D.K.Y. A survey of multiple sequence comparison methods. Bltn Mathcal Biology 54, 563–598 (1992). https://doi.org/10.1007/BF02459635