Abstract
We apply the Minimal Length Encoding Principle to formalize inference about the evolution of macromolecular sequences. The Principle is shown to imply a combination of Weighted Parsimony and Compatibility methods that have long been used by biologists because of their good practical performance. The background assumptions are expressed as an encoding scheme for the observed data and as heuristic rules for selection of diagnostic positions in the sequences. The Principle was applied to discover new subfamilies of Alu sequences, the most numerous family of repetitive DNA sequences in the human genome.
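The idea of scoring a classification by the length of its encoding can be illustrated with a toy sketch. This is not the encoding scheme used in the article. It assumes pre-aligned, equal-length sequences, a flat cost of 2 bits per prototype base, and a simple cost per point mutation; the names `consensus` and `encoding_length` are hypothetical. Under those assumptions, splitting the sequences into subfamilies is preferred when the saving in mutation bits outweighs the cost of the extra prototype.

```python
import math
from collections import Counter


def consensus(seqs):
    """Most common base in each aligned column (ties go to the base seen first)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def encoding_length(groups):
    """Bits to encode every sequence as point mutations of its group's consensus.

    Assumed toy costs (not taken from the article):
      - each prototype: 2 bits per base
      - each sequence: log2(#groups) bits for its group label,
        log2(length + 1) bits for its mismatch count,
        and log2(length) + log2(3) bits per mismatch (position + new base)
    """
    groups = [g for g in groups if g]
    length = len(groups[0][0])
    bits = 0.0
    for group in groups:
        prototype = consensus(group)
        bits += 2 * length
        for seq in group:
            mismatches = sum(a != b for a, b in zip(seq, prototype))
            bits += math.log2(len(groups))
            bits += math.log2(length + 1)
            bits += mismatches * (math.log2(length) + math.log2(3))
    return bits


# Two made-up subfamilies that differ at three diagnostic positions.
family_a = ["ACGTACGTACGTACGT"] * 4
family_b = ["ACGTTCGTACCTACGA"] * 4

one_group = encoding_length([family_a + family_b])
two_groups = encoding_length([family_a, family_b])
print(f"one family: {one_group:.1f} bits, two subfamilies: {two_groups:.1f} bits")
```

With this made-up data the two-subfamily description is shorter, because the second prototype costs fewer bits than repeating the same three mutations in every sequence of the second group.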
Milosavljević, A., Jurka, J. Discovery by Minimal Length Encoding: A Case Study in Molecular Evolution. Machine Learning 12, 69–87 (1993). https://doi.org/10.1023/A:1022871401069
DOI: https://doi.org/10.1023/A:1022871401069