Abstract
A method of inductive inference known as minimum message length encoding is applied to string comparison in molecular biology. Two questions are treated as inductive inference problems: whether two strings are related and, if so, how; and how to find a good theory of string mutation. The method allows the posterior odds-ratio of two string alignments, or of two models of string mutation, to be computed. The connection between models of mutation and existing string alignment algorithms is made explicit. A fast minimum message length alignment algorithm is also described.
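To make the odds-ratio idea concrete, here is a minimal sketch of how two alignments of the same pair of strings might be compared by message length. It assumes a simple one-state mutation model in which every alignment column is independently a match, a change, an insertion or a deletion. The probabilities (0.7, 0.1, 0.1, 0.1), the uniform four-letter alphabet, and the function names are illustrative assumptions, not values or code taken from the paper, and the cost of stating the alignment length is ignored.

```python
import math

def column_bits(a, b, p_match, p_change, p_indel, alphabet_size):
    """Bits needed to encode one alignment column ('-' marks a gap)."""
    char_bits = math.log2(alphabet_size)  # one character, uniform alphabet (assumption)
    if a == "-" or b == "-":
        # Insertion or deletion: state the operation, then the one character present.
        return -math.log2(p_indel) + char_bits
    if a == b:
        # Match: state the operation, then the shared character.
        return -math.log2(p_match) + char_bits
    # Change: state the operation, one character, then one of the remaining characters.
    return -math.log2(p_change) + char_bits + math.log2(alphabet_size - 1)

def alignment_bits(top, bottom, p_match=0.7, p_change=0.1, p_indel=0.1, alphabet_size=4):
    """Total message length in bits of an alignment under the assumed model.

    p_indel applies to insertions and to deletions separately, so the default
    probabilities sum to one: 0.7 + 0.1 + 0.1 + 0.1.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    return sum(
        column_bits(a, b, p_match, p_change, p_indel, alphabet_size)
        for a, b in zip(top, bottom)
    )

def posterior_odds(bits_a, bits_b):
    """Odds in favour of hypothesis A over B: each bit saved doubles the odds."""
    return 2.0 ** (bits_b - bits_a)

# Two candidate alignments of ACGT with AGT.
bits_a = alignment_bits("ACGT", "A-GT")  # three matches, one deletion
bits_b = alignment_bits("ACGT", "AGT-")  # one match, two changes, one deletion
print(f"A: {bits_a:.2f} bits, B: {bits_b:.2f} bits, odds A:B = {posterior_odds(bits_a, bits_b):.0f}")
```

Under these assumed parameters a match column is 21 times more probable than a change column, so replacing two changes by two matches favours the first alignment by 21 × 21 = 441 to 1. The same subtraction of message lengths would apply when comparing two models of mutation rather than two alignments.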
Additional information
Supported by Australian Research Council grant A48830856.
Cite this article
Allison, L., Yee, C.N. Minimum message length encoding and the comparison of macromolecules. Bulletin of Mathematical Biology 52, 431–453 (1990). https://doi.org/10.1007/BF02458580