Bulletin of Mathematical Biology, Volume 52, Issue 3, pp 431–453

Minimum message length encoding and the comparison of macromolecules

  • L. Allison
  • C. N. Yee


A method of inductive inference known as minimum message length encoding is applied to string comparison in molecular biology. The question of whether or not two strings are related and, if so, of how they are related, and the problem of finding a good theory of string mutation, are treated as inductive inference problems. The method allows the posterior odds-ratio of two string alignments or of two models of string mutation to be computed. The connection between models of mutation and existing string alignment algorithms is made explicit. A fast minimum message length alignment algorithm is also described.
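The link between message lengths and posterior odds can be illustrated with a small sketch. It is not the authors' algorithm: the function names, the one-state mutation model, and the probabilities (`p_match`, `p_change`, `p_indel`) are assumptions chosen for illustration, and the cost of stating string lengths and model parameters is ignored. Each edit operation is charged -log2 of its assumed probability plus the bits needed for its characters, so an ordinary dynamic programming alignment yields the length in bits of the shortest single-alignment message. Comparing that length with the cost of encoding the two strings independently gives a log odds-ratio, assuming equal prior probability for the two hypotheses.

```python
import math

CHAR_BITS = 2.0  # log2(4): bits to state one DNA character under a uniform model


def alignment_message_length(a, b, p_match=0.7, p_change=0.1, p_indel=0.1):
    """Bits needed to encode a and b via their cheapest alignment.

    Assumed model (illustrative only): p_match + p_change + 2 * p_indel == 1,
    with insert and delete each given probability p_indel.
    """
    c_match = -math.log2(p_match) + CHAR_BITS                  # one shared character
    c_change = -math.log2(p_change) + CHAR_BITS + math.log2(3)  # a character, then one of 3 others
    c_indel = -math.log2(p_indel) + CHAR_BITS                  # one inserted or deleted character

    # Standard edit-distance recurrence, keeping only the previous row.
    prev = [j * c_indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * c_indel]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (c_match if a[i - 1] == b[j - 1] else c_change)
            cur.append(min(diag, prev[j] + c_indel, cur[j - 1] + c_indel))
        prev = cur
    return prev[-1]


def null_message_length(a, b):
    """Bits needed to encode a and b independently (the 'unrelated' hypothesis)."""
    return CHAR_BITS * (len(a) + len(b))


def log_odds_bits(a, b):
    """log2 of the odds (best alignment : unrelated), assuming equal priors."""
    return null_message_length(a, b) - alignment_message_length(a, b)


if __name__ == "__main__":
    for s, t in [("ACGTACGT", "ACGTACGT"), ("ACGTACGT", "ACGAACGT"), ("AAAA", "CCCC")]:
        bits = log_odds_bits(s, t)
        print(s, t, f"log-odds = {bits:.2f} bits, odds = {2 ** bits:.3g}")
```

A positive log-odds value means the alignment hypothesis gives the shorter message and is therefore favoured; a negative value favours the hypothesis that the strings are unrelated. Because only one alignment is scored here, the sketch compares a single alignment against the null hypothesis rather than summing over all possible alignments.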


Keywords (added by machine, not by the authors): Dynamic Programming Algorithm; Edit Distance; Alignment Algorithm; Kolmogorov Complexity; Edit Operation





Copyright information

© Society for Mathematical Biology 1990

Authors and Affiliations

  • L. Allison, Department of Computer Science, Monash University, Australia
  • C. N. Yee, Department of Computer Science, Monash University, Australia
