Discovery by Minimal Length Encoding: A Case Study in Molecular Evolution

Milosavljević, Aleksandar; Jurka, Jerzy

doi:10.1023/A:1022871401069

Published: August 1993

Machine Learning, Volume 12, pages 69–87 (1993)

Abstract

We apply the Minimal Length Encoding Principle to formalize inference about the evolution of macromolecular sequences. The Principle is shown to imply a combination of Weighted Parsimony and Compatibility methods that have long been used by biologists because of their good practical performance. The background assumptions are expressed as an encoding scheme for the observed data and as heuristic rules for selection of diagnostic positions in the sequences. The Principle was applied to discover new subfamilies of Alu sequences, the most numerous family of repetitive DNA sequences in the human genome.
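
As a rough illustration of the general idea only, and not the encoding scheme used in the article, the sketch below compares two ways of describing a toy set of aligned sequences: as edits against a single consensus, or as edits against two subfamily consensus sequences. The function names, the toy sequences, and the cost model (2 bits per consensus base, a uniform code for the mismatch count, and position plus substitute base for each mismatch) are all assumptions made for this example. Under those assumptions, the grouping with the shorter total description is preferred.

```python
from math import log2


def consensus(seqs):
    """Column-wise majority base; ties are broken alphabetically."""
    return "".join(
        max(sorted(set(col)), key=col.count) for col in zip(*seqs)
    )


def mismatches(seq, ref):
    """Number of aligned positions where seq differs from ref."""
    return sum(a != b for a, b in zip(seq, ref))


def description_length(groups):
    """Total bits of a toy two-part code (an assumed cost model).

    Model part: each group's consensus at 2 bits per base.
    Data part: for each sequence, the group label, the mismatch
    count, and for each mismatch its position and substitute base.
    """
    length = len(groups[0][0])
    n_groups = len(groups)
    bits = 0.0
    for group in groups:
        ref = consensus(group)
        bits += 2 * length                       # consensus over {A, C, G, T}
        for seq in group:
            bits += log2(n_groups)               # which consensus is used
            bits += log2(length + 1)             # mismatch count in 0..length
            d = mismatches(seq, ref)
            bits += d * (log2(length) + log2(3))  # position + new base
    return bits


# Hypothetical toy data: two groups differing at two diagnostic positions.
fam_a = ["ACGTACGT"] * 4
fam_b = ["ACGAACGA"] * 4

one_family = description_length([fam_a + fam_b])
two_subfamilies = description_length([fam_a, fam_b])

print(f"one consensus:  {one_family:.2f} bits")
print(f"two consensus:  {two_subfamilies:.2f} bits")
print("split preferred:", two_subfamilies < one_family)
```

In this toy setting the split pays for a second consensus and a label per sequence, so it is preferred only when the diagnostic positions save more bits than that overhead costs; splitting a set of identical sequences makes the description longer.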


Author information

Aleksandar Milosavljević
Present address: Genome Structure Group, Biological and Medical Research Division, Argonne National Laboratory, Argonne, IL, 60439

Authors and Affiliations

Linus Pauling Institute of Science and Medicine, 440 Page Mill Rd., Palo Alto, CA, 94306
Aleksandar Milosavljević & Jerzy Jurka

About this article

Cite this article

Milosavljević, A., Jurka, J. Discovery by Minimal Length Encoding: A Case Study in Molecular Evolution. Machine Learning 12, 69–87 (1993). https://doi.org/10.1023/A:1022871401069

Issue Date: August 1993
DOI: https://doi.org/10.1023/A:1022871401069
