Minimum message length encoding and the comparison of macromolecules

Bulletin of Mathematical Biology

Abstract

A method of inductive inference known as minimum message length encoding is applied to string comparison in molecular biology. The question of whether or not two strings are related and, if so, of how they are related and the problem of finding a good theory of string mutation are treated as inductive inference problems. The method allows the posterior odds-ratio of two string alignments or of two models of string mutation to be computed. The connection between models of mutation and existing string alignment algorithms is made explicit. A fast minimum message length alignment algorithm is also described.
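
The link between message lengths and posterior odds can be illustrated with a minimal sketch. It is not taken from the paper: it relies only on the standard relation that an optimal code word for an event of probability p is -log2(p) bits long, so that a difference in total message length between two hypotheses is the log of their posterior odds-ratio. The function names and the example lengths are hypothetical.

```python
import math


def message_length_bits(probability: float) -> float:
    """Length in bits of an optimal code word for an event of this probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def posterior_odds(length_a_bits: float, length_b_bits: float) -> float:
    """Posterior odds of hypothesis A over hypothesis B.

    Each argument is the total message length in bits (hypothesis plus
    data given the hypothesis). A shorter message means a more probable
    hypothesis, and every bit saved doubles the odds.
    """
    return 2.0 ** (length_b_bits - length_a_bits)


# Hypothetical lengths: alignment A encodes the two strings in 100 bits,
# alignment B in 103 bits, so A is favoured by a factor of 2**3.
odds = posterior_odds(100.0, 103.0)
```

In this sketch the lengths are plain numbers supplied by the caller; how the message length of a particular alignment or mutation model is obtained is the subject of the paper itself and is not reproduced here.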

Additional information

Supported by Australian Research Council grant A48830856.

About this article

Cite this article

Allison, L., Yee, C.N. Minimum message length encoding and the comparison of macromolecules. Bltn Mathcal Biology 52, 431–453 (1990). https://doi.org/10.1007/BF02458580

  • DOI: https://doi.org/10.1007/BF02458580
