Discovering Dependencies via Algorithmic Mutual Information: A Case Study in DNA Sequence Comparisons

Milosavljević, Aleksandar

doi:10.1023/A:1022665630550

Published: October 1995

Machine Learning, Volume 21, pages 35–50 (1995)

Abstract

Algorithmic mutual information is a central concept in algorithmic information theory and may be measured as the difference between independent and joint minimal encoding lengths of objects; it is also a central concept in Chaitin's fascinating mathematical definition of life. We explore the applicability of algorithmic mutual information as a tool for discovering dependencies in biology. In order to determine the significance of discovered dependencies, we extend the newly proposed algorithmic significance method. The main theorem of the extended method states that d bits of algorithmic mutual information imply dependency at the significance level 2^{−d+O(1)}. We apply a heuristic version of the method to one of the main problems in DNA and protein sequence comparisons: the problem of deciding whether observed similarity between sequences should be explained by their relatedness or by the mere presence of some shared internal structure, e.g., shared internal repetitive patterns. We take advantage of the fact that mutual information factors out sequence similarity that is due to shared internal structure and thus enables discovery of truly related sequences. In addition to providing a general framework for sequence comparisons, we also propose an efficient way to compare sequences based on their subword composition that does not require any a priori assumptions about k-tuple length.
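
To make the central quantity concrete, the following is a minimal sketch, not the encoding scheme used in the article. It assumes that a general-purpose compressor (zlib) can stand in for the minimal encoding length, which is only a crude computable proxy. The function names, the parameter c_bits (a stand-in for the unspecified O(1) constant) and the synthetic test sequences are all hypothetical and chosen for illustration.

```python
import random
import zlib


def encoded_bits(data: bytes) -> int:
    """Length in bits of a zlib encoding (assumed proxy for minimal encoding length)."""
    return 8 * len(zlib.compress(data, 9))


def mutual_information_bits(x: bytes, y: bytes) -> int:
    """Independent minus joint encoding length: C(x) + C(y) - C(xy)."""
    return encoded_bits(x) + encoded_bits(y) - encoded_bits(x + y)


def significance_bound(d_bits: float, c_bits: float = 0.0) -> float:
    """Evaluate 2^(-d + c), capped at 1; c_bits is a placeholder for the O(1) term."""
    return min(1.0, 2.0 ** (c_bits - d_bits))


# Synthetic data (an assumption for illustration, not data from the article).
rng = random.Random(0)
x = bytes(rng.choice(b"ACGT") for _ in range(2000))

# y is a near copy of x: every 100th base is substituted.
y_buf = bytearray(x)
for i in range(0, len(y_buf), 100):
    y_buf[i] = ord("C") if y_buf[i] == ord("A") else ord("A")
y = bytes(y_buf)

# z is drawn independently of x.
z = bytes(rng.choice(b"ACGT") for _ in range(2000))

d_related = mutual_information_bits(x, y)
d_unrelated = mutual_information_bits(x, z)
print("related pair:  ", d_related, "bits")
print("unrelated pair:", d_unrelated, "bits")
print("bound for related pair:", significance_bound(d_related))
```

Under these assumptions the near-copy pair should yield far more estimated mutual information than the independent pair, because the joint encoding of x followed by y can reuse long stretches of x. The bound printed at the end ignores the O(1) constant, so it should be read as an order-of-magnitude illustration only.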

References

Allison, L. & Yee, C.N. (1990). Minimum message length encoding and the comparison of macromolecules. Bulletin of Mathematical Biology, 52:431–453.
Altschul, S.F., Boguski, M.S., Gish, W. & Wootton, J.C. (1994). Issues in searching molecular sequence databases. Nature Genetics, 6:119–129.
Bains, W. (1986). The multiple origins of human Alu sequences. Journal of Molecular Evolution, 23:189–199.
Bilofsky, H.S. & Burks, C. (1988). The GenBank (R) genetic sequence data bank. Nucleic Acids Research, 16:1861–1864.
Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M.T. & Seiferas, J. (1985). The smallest automaton recognizing the subwords of a text. Theoretical Computer Science, 40:31–55.
Chaitin, G.J. (1979). Toward a mathematical definition of life. In Levine, R.D. & Tribus, M. (eds), The Maximum Entropy Formalism, pages 477–498. MIT Press.
Chaitin, G.J. (1987). Algorithmic Information Theory. Cambridge University Press.
Chaitin, G.J. (1987). Information, Randomness and Incompleteness: Papers on Algorithmic Information Theory. World Scientific.
Claverie, J.-M. & States, D.J. (1993). Information enhancement methods for large scale sequence analysis. Computers in Chemistry, 17:191–201.
Cover, T. & Thomas, J. (1991). Elements of Information Theory. Wiley.
Friezner-Degen, S.J., Rajput, B. & Reich, E. (1986). The human tissue plasminogen activator gene. Journal of Biological Chemistry, 261:6972–6985.
Jurka, J. & Milosavljević, A. (1991). Reconstruction and analysis of human Alu genes. Journal of Molecular Evolution, 32:105–121.
Li, M. & Vitányi, P.M.B. (1993). An Introduction to Kolmogorov Complexity and its Applications. Springer Verlag.
Milosavljević, A. (1993). Discovering sequence similarity by the algorithmic significance method. Proceedings of the First International Conference on Intelligent Systems for Molecular Biology. Bethesda, MD: AAAI Press.
Milosavljević, A. (to appear). Repeat analysis. In Imperial Cancer Research Fund Handbook of Genome Analysis. Blackwell Scientific Publications.
Milosavljević, A. & Jurka, J. (1993a). Discovering simple DNA sequences by the algorithmic significance method. Computer Applications in Biosciences, 9:407–411.
Milosavljević, A. & Jurka, J. (1993b). Discovery by minimal length encoding: A case study in molecular evolution. Machine Learning, 12:69–87.
Pevzner, P.A. (1992). Statistical distance between texts and filtration methods in sequence comparison. Computer Applications in Biosciences, 8:121–127.
Storer, J.A. (1988). Data Compression: Methods and Theory. Computer Science Press.
Wootton, J.C. & Federhen, S. (1993). Statistics of local complexity in amino acid sequences and sequence databases. Computers in Chemistry, 17:149–163.

Author information

Authors and Affiliations

Genome Structure Group, Center for Mechanistic Biology and Biotechnology, Argonne National Laboratory, Argonne, Illinois, 60439-4833
Aleksandar Milosavljević
CuraGen Corporation, 322 East Main Street, Branford, CT 06405
Aleksandar Milosavljević

About this article

Cite this article

Milosavljević, A. Discovering Dependencies via Algorithmic Mutual Information: A Case Study in DNA Sequence Comparisons. Machine Learning 21, 35–50 (1995). https://doi.org/10.1023/A:1022665630550

Issue Date: October 1995
DOI: https://doi.org/10.1023/A:1022665630550