Abstract
Algorithmic mutual information is a central concept in algorithmic information theory and may be measured as the difference between the independent and joint minimal encoding lengths of objects; it is also a central concept in Chaitin's fascinating mathematical definition of life. We explore the applicability of algorithmic mutual information as a tool for discovering dependencies in biology. To determine the significance of discovered dependencies, we extend the newly proposed algorithmic significance method. The main theorem of the extended method states that d bits of algorithmic mutual information imply dependency at the significance level $2^{-d+O(1)}$. We apply a heuristic version of the method to one of the main problems in DNA and protein sequence comparisons: deciding whether observed similarity between sequences should be explained by their relatedness or by the mere presence of some shared internal structure, e.g., shared internal repetitive patterns. We take advantage of the fact that mutual information factors out sequence similarity that is due to shared internal structure and thus enables the discovery of truly related sequences. In addition to providing a general framework for sequence comparisons, we also propose an efficient way to compare sequences based on their subword composition that does not require any a priori assumptions about k-tuple length.
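As a rough illustration of the quantity described above, the following Python sketch estimates mutual information as the difference between independent and joint encoding lengths. It is not the heuristic encoding used in the article: zlib is swapped in as a generic, computable stand-in for the minimal encoding length, and the function names `encoded_bits`, `mutual_information_bits`, and `significance_bound` are hypothetical. The bound ignores the O(1) term in the exponent, so it should be read as indicative only.

```python
import random
import zlib


def encoded_bits(data: bytes) -> int:
    """Bits in a zlib encoding of data.

    Assumption: zlib is only a crude, computable upper-bound proxy for
    the (uncomputable) minimal encoding length.
    """
    return 8 * len(zlib.compress(data, 9))


def mutual_information_bits(x: bytes, y: bytes) -> int:
    """Estimate mutual information as independent minus joint length."""
    independent = encoded_bits(x) + encoded_bits(y)
    joint = encoded_bits(x + y)
    # Floor at zero: a negative estimate carries no evidence of dependency.
    return max(0, independent - joint)


def significance_bound(d_bits: int) -> float:
    """Return 2^(-d), dropping the O(1) term of the theorem (assumption)."""
    return 2.0 ** (-d_bits)


# Illustrative data only: two pseudo-random DNA strings from a fixed seed.
rng = random.Random(0)
x = "".join(rng.choice("ACGT") for _ in range(400)).encode()
z = "".join(rng.choice("ACGT") for _ in range(400)).encode()

d_related = mutual_information_bits(x, x)    # a sequence against a copy of itself
d_unrelated = mutual_information_bits(x, z)  # two independently drawn sequences

print("related:  ", d_related, "bits, bound", significance_bound(d_related))
print("unrelated:", d_unrelated, "bits, bound", significance_bound(d_unrelated))
```

With a real compressor the unrelated pair can still score a few bits, because the joint encoding pays fixed header costs only once. That constant overhead plays the role of the O(1) term, so in this sketch the estimate is more meaningful relative to an unrelated baseline than as an absolute value.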
References
Allison, L. & Yee, C.N. (1990). Minimum message length encoding and the comparison of macromolecules. Bulletin of Mathematical Biology, 52:431–453.
Altschul, S.F., Boguski, M.S., Gish, W. & Wootton, J.C. (1994). Issues in searching molecular sequence databases. Nature Genetics, 6:119–129.
Bains, W. (1986). The multiple origins of human Alu sequences. Journal of Molecular Evolution, 23:189–199.
Bilofsky, H.S. & Burks, C. (1988). The GenBank (R) genetic sequence data bank. Nucleic Acids Research, 16:1861–1864.
Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M.T. & Seiferas, J. (1985). The smallest automaton recognizing the subwords of a text. Theoretical Computer Science, 40:31–55.
Chaitin, G.J. (1979). Toward a mathematical definition of life. In Levine, R.D. & Tribus, M. (eds), The Maximum Entropy Formalism, pages 477–498. MIT Press.
Chaitin, G.J. (1987). Algorithmic Information Theory. Cambridge University Press.
Chaitin, G.J. (1987). Information, Randomness and Incompleteness: Papers on Algorithmic Information Theory. World Scientific.
Claverie, J.-M. & States, D.J. (1993). Information enhancement methods for large scale sequence analysis. Computers in Chemistry, 17:191–201.
Cover, T. & Thomas, J. (1991). Elements of Information Theory. Wiley.
Friezner-Degen, S.J., Rajput, B. & Reich, E. (1986). The human tissue plasminogen activator gene. Journal of Biological Chemistry, 261:6972–6985.
Jurka, J. & Milosavljević, A. (1991). Reconstruction and analysis of human Alu genes. Journal of Molecular Evolution, 32:105–121.
Li, M. & Vitányi, P.M.B. (1993). An Introduction to Kolmogorov Complexity and its Applications. Springer Verlag.
Milosavljević, A. (1993). Discovering sequence similarity by the algorithmic significance method. Proceedings of the First International Conference on Intelligent Systems for Molecular Biology. Bethesda, MD: AAAI Press.
Milosavljević, A. (to appear). Repeat analysis. In Imperial Cancer Research Fund Handbook of Genome Analysis. Blackwell Scientific Publications.
Milosavljević, A. & Jurka, J. (1993a). Discovering simple DNA sequences by the algorithmic significance method. Computer Applications in Biosciences, 9:407–411.
Milosavljević, A. & Jurka, J. (1993b). Discovery by minimal length encoding: A case study in molecular evolution. Machine Learning, 12:69–87.
Pevzner, P.A. (1992). Statistical distance between texts and filtration methods in sequence comparison. Computer Applications in Biosciences, 8:121–127.
Storer, J.A. (1988). Data Compression: Methods and Theory. Computer Science Press.
Wootton, J.C. & Federhen, S. (1993). Statistics of local complexity in amino acid sequences and sequence databases. Computers in Chemistry, 17:149–163.
Cite this article
Milosavljević, A. Discovering Dependencies via Algorithmic Mutual Information: A Case Study in DNA Sequence Comparisons. Machine Learning 21, 35–50 (1995). https://doi.org/10.1023/A:1022665630550
DOI: https://doi.org/10.1023/A:1022665630550