Clustering is used in a huge variety of applications. We review a conceptually simple algorithm for hierarchical clustering, which we call the mutual information clustering (MIC) algorithm. It uses mutual information (MI) as a similarity measure and exploits its grouping property: the MI between three objects X, Y, and Z is equal to the sum of the MI between X and Y and the MI between Z and the combined object (XY). We apply MIC both in the Shannon (probabilistic) version of information theory, where the "objects" are probability distributions represented by random samples, and in the Kolmogorov (algorithmic) version, where the "objects" are symbol sequences. As applications, we construct phylogenetic trees from mitochondrial DNA sequences, and we reconstruct the fetal electrocardiogram (ECG) from the output of independent component analysis (ICA) applied to the ECG of a pregnant woman.
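The grouping property can be checked numerically. Writing I(X,Y,Z) for the multi-information H(X)+H(Y)+H(Z)-H(X,Y,Z), it states that I(X,Y,Z) = I(X,Y) + I((XY),Z), i.e. no information is lost when X and Y are merged into one combined object. The sketch below verifies this with naive plug-in entropy estimates on discrete samples; the function names and toy data are our own illustration, not code from the chapter.

```python
import numpy as np
from collections import Counter

def entropy(*cols):
    """Plug-in Shannon entropy (bits) of the joint distribution of the given
    sample columns, estimated from empirical frequencies."""
    joint = list(zip(*cols))
    n = len(joint)
    p = np.array([c / n for c in Counter(joint).values()])
    return float(-(p * np.log2(p)).sum())

def mutual_info(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return entropy(a) + entropy(b) - entropy(a, b)

rng = np.random.default_rng(0)
x = rng.integers(0, 4, 5000)
y = (x + rng.integers(0, 2, 5000)) % 4   # y depends on x
z = (y + rng.integers(0, 3, 5000)) % 4   # z depends on y

# Multi-information I(X,Y,Z) = H(X) + H(Y) + H(Z) - H(X,Y,Z)
lhs = entropy(x) + entropy(y) + entropy(z) - entropy(x, y, z)

# Grouping property: I(X,Y) + I((XY),Z), where (XY) is the combined object,
# so I((XY);Z) = H(X,Y) + H(Z) - H(X,Y,Z)
rhs = mutual_info(x, y) + (entropy(x, y) + entropy(z) - entropy(x, y, z))

print(lhs, rhs)   # the two sides agree up to floating-point rounding
```

Because the identity is algebraic (both sides reduce to the same sum of entropies), it holds exactly for any sample, independent of estimation bias; this is what makes MI usable as a similarity measure during agglomerative merging.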
© 2009 Springer Science+Business Media, LLC
Kraskov, A., Grassberger, P. (2009). MIC: Mutual Information Based Hierarchical Clustering. In: Emmert-Streib, F., Dehmer, M. (eds) Information Theory and Statistical Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-84816-7_5
Print ISBN: 978-0-387-84815-0
Online ISBN: 978-0-387-84816-7