Journal of Molecular Evolution, Volume 35, Issue 1, pp 77–89

Finite-state models in the alignment of macromolecules

  • L. Allison
  • C. S. Wallace
  • C. N. Yee


Minimum message length encoding is a technique of inductive inference with theoretical and practical advantages. It allows the posterior odds-ratio of two theories or hypotheses to be calculated. Here it is applied to problems of aligning or relating two strings, in particular two biological macromolecules. We compare the r-theory, that the strings are related, with the null-theory, that they are not related. If they are related, the probabilities of the various alignments can be calculated. This is done for one-, three-, and five-state models of relation or mutation. These correspond to linear and piecewise linear cost functions on runs of insertions and deletions. We describe how to estimate parameters of a model. The validity of a model is itself an hypothesis and can be objectively tested. This is done on real DNA strings and on artificial data. The tests on artificial data indicate limits on what can be inferred in various situations. The tests on real DNA support either the three- or five-state models over the one-state model. Finally, a fast, approximate minimum message length string comparison algorithm is described.
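The comparison of the r-theory with the null-theory can be illustrated with a minimal sketch of a one-state model. This is not the authors' implementation. The parameter values (`p_match`, `p_change`, `p_indel`), the function names, and the uniform 2-bits-per-base null encoding are all assumptions made for illustration, and the cost of stating the string lengths and the model parameters is left out. The r-theory probability is summed over all alignments by dynamic programming, and the difference between the two message lengths gives the log odds-ratio in bits.

```python
import math

ALPHABET = 4  # DNA bases; an assumption of this sketch


def log2_sum(*terms):
    """Return log2 of the sum of 2**t over terms, computed stably."""
    top = max(terms)
    if top == -math.inf:
        return -math.inf
    return top + math.log2(sum(2.0 ** (t - top) for t in terms))


def r_theory_bits(a, b, p_match=0.7, p_change=0.1, p_indel=0.1):
    """Message length in bits for strings a and b under a one-state model.

    Hypothetical parameters: p_match + p_change + 2 * p_indel should equal 1.
    Probabilities are summed over all alignments, not just the best one.
    """
    log_match = math.log2(p_match / ALPHABET)
    log_change = math.log2(p_change / (ALPHABET * (ALPHABET - 1)))
    log_indel = math.log2(p_indel / ALPHABET)

    # d[i][j] is the log2 probability of generating a[:i] and b[:j].
    d = [[-math.inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 and j == 0:
                continue
            terms = []
            if i > 0 and j > 0:
                step = log_match if a[i - 1] == b[j - 1] else log_change
                terms.append(d[i - 1][j - 1] + step)
            if i > 0:
                terms.append(d[i - 1][j] + log_indel)  # delete a[i-1]
            if j > 0:
                terms.append(d[i][j - 1] + log_indel)  # insert b[j-1]
            d[i][j] = log2_sum(*terms)
    return -d[len(a)][len(b)]


def null_theory_bits(a, b):
    """Message length in bits when the strings are encoded independently."""
    return (len(a) + len(b)) * math.log2(ALPHABET)


def log_odds_bits(a, b, **params):
    """Positive values favour the r-theory, negative values the null-theory."""
    return null_theory_bits(a, b) - r_theory_bits(a, b, **params)
```

Under these assumptions, `log_odds_bits("ACGTACGT", "ACGTACGT")` is positive, while `log_odds_bits("AAAAAAAA", "CCCCCCCC")` is negative. A three- or five-state model would add further dynamic-programming states so that a run of insertions or deletions can be costed differently from isolated ones.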

Key words

Alignment, Edit distance, Homology, Inductive inference, Minimum message length, Similarity, String





Copyright information

© Springer-Verlag New York Inc 1992

Authors and Affiliations

  • L. Allison (1)
  • C. S. Wallace (1)
  • C. N. Yee (1)
  1. Department of Computer Science, Monash University, Australia
