
Mutual information, phi-squared and model-based co-clustering for contingency tables

  • Regular Article
  • Published in Advances in Data Analysis and Classification

Abstract

Many of the datasets encountered in statistics are two-dimensional in nature and can be represented by a matrix. Classical clustering procedures seek to construct separately an optimal partition of rows or, sometimes, of columns. In contrast, co-clustering methods cluster the rows and the columns simultaneously and organize the data into homogeneous blocks (after suitable permutations). Methods of this kind have practical importance in a wide variety of applications such as document clustering, where data are typically organized in two-way contingency tables. Our goal is to offer coherent frameworks for understanding some existing criteria and algorithms for co-clustering contingency tables, and to propose new ones. We look at two different frameworks for the problem of co-clustering. The first involves minimizing an objective function based on measures of association and in particular on phi-squared and mutual information. The second uses a model-based co-clustering approach, and we consider two models: the block model and the latent block model. We establish connections between different approaches, criteria and algorithms, and we highlight a number of implicit assumptions in some commonly used algorithms. Our contribution is illustrated by numerical experiments on simulated and real-world datasets that show the relevance of the presented methods in the document clustering field.



Acknowledgements

We thank the referees and editors for their valuable suggestions. We acknowledge funding support from AAP Sorbonne Paris Cité.

Author information


Corresponding author

Correspondence to Mohamed Nadif.

Appendices

Appendix A: Proof of Proposition 1

Denoting \(e_{ij}=p_{ij}-p_{i.} p_{.j}\) \(\forall i,j\) and using the expansion \(\log (1+x)=x-x^2/2 + O(x^3)\) as \(x \rightarrow 0\), we can write \(\log (1+\frac{e_{ij}}{p_{i.}p_{.j}})=\frac{e_{ij}}{p_{i.}p_{.j}} - \frac{1}{2} \left( \frac{e_{ij}}{p_{i.}p_{.j}}\right) ^2 + O(e_{ij}^3)\) for \(e_{ij} \rightarrow 0\) and

$$\begin{aligned} (p_{i.}p_{.j} +e_{ij})\log \left( 1+\frac{e_{ij}}{p_{i.}p_{.j}}\right)= & {} e_{ij} - \frac{1}{2} \frac{e_{ij}^2}{p_{i.}p_{.j}} + \frac{e_{ij}^2}{p_{i.}p_{.j}} -\frac{1}{2} \frac{e_{ij}^3}{p_{i.}^2 p_{.j}^2}\\&+\, O(e_{ij}^3) =e_{ij}+ \frac{1}{2} \frac{e_{ij}^2}{p_{i.}p_{.j}} + O(e_{ij}^3)\\ \sum _{i,j} (p_{i.}p_{.j} +e_{ij})\log \left( 1+\frac{e_{ij}}{p_{i.}p_{.j}}\right)= & {} \underbrace{\sum _{i,j} e_{ij}}_{=0} + \frac{1}{2} \sum _{i,j}\frac{e_{ij}^2}{p_{i.}p_{.j}} + O\left( \sum _{i,j}|e_{ij}|^3\right) \end{aligned}$$

this yields

$$\begin{aligned} \mathcal {I}(P_{IJ})=\frac{1}{2} \Phi ^2(P_{IJ}) + O\left( \sum _{i,j}|p_{ij}-p_{i.}p_{.j}|^3\right) \quad \text{ for } \quad (p_{ij}-p_{i.}p_{.j}) \rightarrow 0 \end{aligned}$$
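This approximation is easy to check numerically. The sketch below (Python/NumPy, with an illustrative perturbation scheme) builds a joint distribution close to independence whose marginals are exactly \(p_{i.}\) and \(p_{.j}\), then compares \(\mathcal {I}\) with \(\Phi ^2/2\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 5

# Marginals bounded away from zero
pi = 1.0 + rng.random(n); pi /= pi.sum()
pj = 1.0 + rng.random(d); pj /= pj.sum()

# Perturbation e_ij with zero row and column sums, so the joint
# distribution p keeps exactly the marginals (pi, pj)
u = rng.standard_normal(n); u -= u.mean()
v = rng.standard_normal(d); v -= v.mean()
e = 1e-5 * np.outer(u, v)

prod = np.outer(pi, pj)
p = prod + e                            # joint distribution close to independence

I = np.sum(p * np.log(p / prod))        # mutual information I(P_IJ)
phi2 = np.sum(p ** 2 / prod) - 1.0      # phi-squared coefficient

assert phi2 > 0 and abs(I - phi2 / 2) < 0.05 * phi2   # I is close to phi^2 / 2
```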

Appendix B: Properties of \(P_{KL}^{\mathbf {z}\mathbf {w}}\)

$$\begin{aligned} \sum _{k,\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}}=1 \quad (P_{KL}^{\mathbf {z}\mathbf {w}} \hbox { is a distribution}), \qquad \sum _{\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}}=\sum _i z_{ik} p_{i.} \quad \text {and} \quad \sum _{k} p_{k\ell }^{\mathbf {z}\mathbf {w}}=\sum _j w_{j\ell } p_{.j}. \end{aligned}$$

Proof

We have

$$\begin{aligned} \sum _{k,\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}}= \sum _{i,j,k,\ell } z_{ik}w_{j\ell } p_{ij} =\sum _{ij}p_{ij} \sum _{k,\ell } z_{ik}w_{j\ell }=1\quad \text{ since } \sum _{k,\ell } z_{ik}w_{j\ell }=1 \end{aligned}$$
$$\begin{aligned} \sum _\ell p_{k\ell }^{\mathbf {z}\mathbf {w}}= & {} \sum _{i,j,\ell } z_{ik}w_{j\ell } \; p_{ij} = \sum _i\left( z_{ik} \sum _j \left( p_{ij} \sum _\ell w_{j\ell }\right) \right) \\ {}= & {} \sum _i z_{ik} p_{i.}\quad \text{ since } \sum _\ell w_{j\ell }=1 \end{aligned}$$
$$\begin{aligned} \sum _k p_{k\ell }^{\mathbf {z}\mathbf {w}}= & {} \sum _{i,j,k} z_{ik}w_{j\ell } \; p_{ij} = \sum _j\left( w_{j\ell } \sum _i \left( p_{ij} \sum _k z_{ik}\right) \right) \\ {}= & {} \sum _j w_{j\ell } p_{.j}\quad \text{ since } \sum _k z_{ik}=1 \end{aligned}$$

\(\square \)
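These identities can be verified directly on a small example; the sketch below (illustrative sizes and partitions) aggregates a joint distribution into its block distribution and checks the three properties.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, g, m = 8, 6, 3, 2

p = rng.random((n, d)); p /= p.sum()     # joint distribution p_ij
z = np.eye(g)[np.arange(n) % g]          # row partition, one-hot (n x g)
w = np.eye(m)[np.arange(d) % m]          # column partition, one-hot (d x m)

pkl = z.T @ p @ w                        # p_kl^{zw} = sum_ij z_ik w_jl p_ij

assert np.isclose(pkl.sum(), 1.0)                         # a distribution
assert np.allclose(pkl.sum(axis=1), z.T @ p.sum(axis=1))  # sum_l p_kl = sum_i z_ik p_i.
assert np.allclose(pkl.sum(axis=0), w.T @ p.sum(axis=0))  # sum_k p_kl = sum_j w_jl p_.j
```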

Appendix C: Properties of \(Q_{IJ}^{\mathbf {z}\mathbf {w}}\)

$$\begin{aligned} \sum _{i,j} q_{ij}^{\mathbf {z}\mathbf {w}}=1, \quad q_{i.}^{\mathbf {z}}=p_{i.} \quad \text {and} \quad q_{.j}^{\mathbf {w}}=p_{.j} \qquad \forall i,j, \end{aligned}$$

Proof

$$\begin{aligned} \sum _{i,j} q_{ij}^{\mathbf {z}\mathbf {w}}= \sum _{i,j} p_{i.}p_{.j} \sum _{k,\ell } z_{ik} w_{j\ell }\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}} =\sum _{k,\ell } \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}} \underbrace{\sum _{i,j} p_{i.}p_{.j}z_{ik} w_{j\ell }}_{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}=\sum _{k,\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}}=1 \end{aligned}$$
$$\begin{aligned} q_{i.}^{\mathbf {z}}=\sum _j q_{ij}^{\mathbf {z}\mathbf {w}}= \sum _j p_{i.}p_{.j} \sum _{k,\ell } z_{ik} w_{j\ell }\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}= p_{i.} \sum _k \frac{z_{ik}}{p_{k.}} \underbrace{\sum _{\ell } \frac{\overbrace{\sum _j (w_{j\ell } p_{.j} )}^{p_{.\ell }^{\mathbf {w}}} p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{.\ell }^{\mathbf {w}}}}_{=p_{k.}^{\mathbf {z}}}=p_{i.}. \end{aligned}$$

and, symmetrically, \(q_{.j}^{\mathbf {w}}=p_{.j}.\) \(\square \)
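A numerical check of these properties (sizes and partitions illustrative): the distribution \(Q_{IJ}^{\mathbf {z}\mathbf {w}}\) is built cell by cell and its marginals are compared with those of \(P_{IJ}\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, g, m = 8, 6, 3, 2

p = rng.random((n, d)); p /= p.sum()
z = np.eye(g)[np.arange(n) % g]             # row partition (every cluster non-empty)
w = np.eye(m)[np.arange(d) % m]             # column partition

pi, pj = p.sum(axis=1), p.sum(axis=0)       # marginals p_i., p_.j
pkl = z.T @ p @ w                           # block distribution p_kl^{zw}
pk, pl = z.T @ pi, w.T @ pj                 # p_k.^z and p_.l^w

# q_ij = p_i. p_.j * p_kl / (p_k. p_.l) for the block (k,l) containing cell (i,j)
ratio = z @ (pkl / np.outer(pk, pl)) @ w.T
q = np.outer(pi, pj) * ratio

assert np.isclose(q.sum(), 1.0)
assert np.allclose(q.sum(axis=1), pi)       # q_i. = p_i.
assert np.allclose(q.sum(axis=0), pj)       # q_.j = p_.j
```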

Appendix D: Proof of Proposition 2

Lemma 1

$$\begin{aligned} \sum _{k,\ell } \frac{(p_{k\ell }^{\mathbf {z}\mathbf {w}})^2}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}= \sum _{i,j} \frac{p_{ij} q_{ij}^{\mathbf {z}\mathbf {w}}}{p_{i.}p_{.j}}=\sum _{i,j} \frac{(q_{ij}^{\mathbf {z}\mathbf {w}})^2}{p_{i.}p_{.j}} \end{aligned}$$

Proof

$$\begin{aligned} \sum _{i,j} \frac{p_{ij} q_{ij}^{\mathbf {z}\mathbf {w}}}{p_{i.}p_{.j}}= & {} \sum _{i,j} p_{ij} \frac{p_{i.}p_{.j} \sum _{k,\ell }z_{ik} w_{j\ell } \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}}{p_{i.}p_{.j}}= \sum _{k,\ell } \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}} \sum _{i,j} z_{ik}w_{j\ell } p_{ij}\\ {}= & {} \sum _{k,\ell } \frac{(p_{k\ell }^{\mathbf {z}\mathbf {w}})^2}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}} \end{aligned}$$
$$\begin{aligned} \sum _{i,j} \frac{(q_{ij}^{\mathbf {z}\mathbf {w}})^2}{p_{i.}p_{.j}}= & {} \sum _{i,j} \frac{ \left( p_{i.}p_{.j} \sum _{k,\ell } z_{ik}w_{j\ell } \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\right) ^2}{p_{i.}p_{.j}} =\sum _{i,j} p_{i.}p_{.j}\sum _{k,\ell } z_{ik}w_{j\ell } \frac{(p_{k\ell }^{\mathbf {z}\mathbf {w}})^2}{(p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}})^2}\\= & {} \sum _{k,\ell } \frac{(p_{k\ell }^{\mathbf {z}\mathbf {w}})^2}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}. \end{aligned}$$

\(\square \)

Proof of Proposition 2

  • The first equation (a) can easily be deduced from Lemma 1

  • Second equation (b):

    $$\begin{aligned} \Phi ^2(P_{IJ})-\Phi ^2(Q_{IJ}^{\mathbf {z}\mathbf {w}})= & {} \left( \sum _{i,j} \frac{p_{ij}^2}{p_{i.}p_{.j}} -1\right) - \left( \sum _{i,j} \frac{(q_{ij}^{\mathbf {z}\mathbf {w}})^2}{q_{i.}^{\mathbf {z}}q_{.j}^{\mathbf {w}}} -1\right) \\= & {} \sum _{i,j} \frac{p_{ij}^2}{p_{i.}p_{.j}} - \sum _{i,j} \frac{(q_{ij}^{\mathbf {z}\mathbf {w}})^2}{p_{i.}p_{.j}}. \end{aligned}$$

    Using Lemma 1, this relation can be written

    $$\begin{aligned} \sum _{i,j} \frac{p_{ij}^2}{p_{i.}p_{.j}} - \sum _{i,j} \frac{p_{ij}q_{ij}^{\mathbf {z}\mathbf {w}}}{p_{i.}p_{.j}} = \sum _{i,j} p_{ij} \left( \frac{p_{ij}}{p_{i.}p_{.j}} - \frac{q_{ij}^{\mathbf {z}\mathbf {w}}}{p_{i.}p_{.j}}\right) \end{aligned}$$

    and

    $$\begin{aligned}&\sum _{i,j} \frac{p_{ij}^2}{p_{i.}p_{.j}} - 2 \sum _{i,j} \frac{p_{ij}q_{ij}^{\mathbf {z}\mathbf {w}}}{p_{i.}p_{.j}} + \sum _{i,j} \frac{(q_{ij}^{\mathbf {z}\mathbf {w}})^2}{p_{i.}p_{.j}}=\sum _{i,j} \frac{(p_{ij}-q_{ij}^{\mathbf {z}\mathbf {w}})^2}{p_{i.}p_{.j}}\\&\quad =\chi ^2(P_{IJ}||Q_{IJ}^{\mathbf {z}\mathbf {w}})\ge 0. \end{aligned}$$
  • The two inequalities (c) can easily be deduced from the previous relation.

Appendix E: Proof of Proposition 3

  • First equation (d):

    $$\begin{aligned} \mathcal {I}(Q_{IJ}^{\mathbf {z}\mathbf {w}})= & {} \sum _{i,j} q_{ij}^{\mathbf {z}\mathbf {w}} \log \frac{q_{ij}^{\mathbf {z}\mathbf {w}}}{p_{i.}p_{.j}} =\sum _{i,j}\left( p_{i.}p_{.j}\left( \sum _{k,\ell }z_{ik} w_{j\ell }\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\right) \right. \\&\left. \times \log \left( \sum _{k,\ell } z_{ik}w_{j\ell }\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\right) \right) \\= & {} \sum _{i,j}\left( p_{i.}p_{.j}\left( \sum _{k,\ell }z_{ik} w_{j\ell }\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}} \log \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\right) \right) \\= & {} \sum _{k,\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}} \log \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\\= & {} \mathcal {I}(P_{KL}^{\mathbf {z}\mathbf {w}}). \end{aligned}$$
  • Second equation (e):

    $$\begin{aligned} \mathcal {I}(P_{IJ})-\mathcal {I}(Q_{IJ}^{\mathbf {z}\mathbf {w}})= & {} \mathcal {I}(P_{IJ})-\mathcal {I}(P_{KL}^{\mathbf {z}\mathbf {w}}) =\sum _{i,j} p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j}} -\sum _{k,\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}} \log \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\\= & {} \sum _{i,j,k,\ell } z_{ik}w_{j\ell } p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j}} -\sum _{i,j,k,\ell }(z_{ik}w_{j\ell } p_{ij}) \log \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\\= & {} \sum _{i,j,k,\ell } z_{ik}w_{j\ell } p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j}} \frac{p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}}}{p_{k\ell }^{\mathbf {z}\mathbf {w}}}\\= & {} \sum _{i,j,k,\ell } z_{ik}w_{j\ell } p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j} \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}}}}\\= & {} \sum _{i,j} p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j}\sum _{k,\ell } z_{ik} w_{j\ell } \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}}}}\\= & {} \sum _{i,j} p_{ij} \log \frac{p_{ij}}{q_{ij}^{\mathbf {z}\mathbf {w}}} =\text {KL}(P_{IJ}||Q_{IJ}^{\mathbf {z}\mathbf {w}}) \ge 0. \end{aligned}$$
  • The two inequalities (f) can easily be deduced from the previous relation.
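Both decompositions (Propositions 2 and 3) can be confirmed numerically; the sketch below (illustrative sizes) forms \(Q_{IJ}^{\mathbf {z}\mathbf {w}}\) and checks that \(\Phi ^2(P_{IJ})-\Phi ^2(Q_{IJ}^{\mathbf {z}\mathbf {w}})=\chi ^2(P_{IJ}||Q_{IJ}^{\mathbf {z}\mathbf {w}})\) and \(\mathcal {I}(P_{IJ})-\mathcal {I}(Q_{IJ}^{\mathbf {z}\mathbf {w}})=\text {KL}(P_{IJ}||Q_{IJ}^{\mathbf {z}\mathbf {w}})\ge 0\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, g, m = 8, 6, 3, 2

p = rng.random((n, d)); p /= p.sum()
z = np.eye(g)[np.arange(n) % g]
w = np.eye(m)[np.arange(d) % m]

pi, pj = p.sum(axis=1), p.sum(axis=0)
prod = np.outer(pi, pj)
pkl = z.T @ p @ w
pk, pl = z.T @ pi, w.T @ pj
q = prod * (z @ (pkl / np.outer(pk, pl)) @ w.T)   # q_ij^{zw}

# Proposition 2 (b): Phi^2(P) - Phi^2(Q) = chi^2(P || Q)
# (Q has the same marginals as P, so both Phi^2 use prod as denominator)
phi2 = lambda t: np.sum(t ** 2 / prod) - 1.0
chi2 = np.sum((p - q) ** 2 / prod)
assert np.isclose(phi2(p) - phi2(q), chi2)

# Proposition 3 (e): I(P) - I(Q) = KL(P || Q) >= 0
mi = lambda t: np.sum(t * np.log(t / prod))
kl = np.sum(p * np.log(p / q))
assert np.isclose(mi(p) - mi(q), kl) and kl >= 0
```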

Appendix F: Properties of \(R_{IJ}^{\mathbf {z}\mathbf {w}\varvec{\delta }}\)

$$\begin{aligned} \sum _{i,j} r_{ij}^{\mathbf {z}\mathbf {w}\varvec{\delta }}=1 \quad r_{i.}^{\mathbf {z}\mathbf {w}\varvec{\delta }}=p_{i.} \quad \text {and} \quad r_{.j}^{\mathbf {z}\mathbf {w}\varvec{\delta }}=p_{.j} \qquad \forall i,j \end{aligned}$$

Proof

$$\begin{aligned} \sum _{ij}r_{ij}^{\mathbf {z}\mathbf {w}\varvec{\delta }}= & {} \sum _{i,j}p_{i.}p_{.j}\sum _{k,\ell }z_{ik}w_{j\ell }\delta _{k\ell }= \sum _{k,\ell } \delta _{k\ell }\sum _{i,j}p_{i.}p_{.j}z_{ik}w_{j\ell }= \sum _{k,\ell }p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}} \delta _{k\ell }=1\\ r_{i.}^{\mathbf {z}\mathbf {w}\varvec{\delta }}= & {} \sum _j r_{ij}^{\mathbf {z}\mathbf {w}\varvec{\delta }}= \sum _j p_{i.}p_{.j} \sum _{k,\ell } z_{ik} w_{j\ell }\delta _{k\ell }=p_{i.} \sum _k z_{ik} \sum _\ell p_{.\ell }^{\mathbf {w}} \delta _{k\ell }=p_{i.} \end{aligned}$$

since \(\sum _\ell p_{.\ell }^{\mathbf {w}} \delta _{k\ell }=1\) for all \(k\),

and, symmetrically, \(r_{.j}^{\mathbf {z}\mathbf {w}\varvec{\delta }}=p_{.j}.\) \(\square \)

Appendix G: Proof of Eq. (7)

$$\begin{aligned} \widetilde{W}_{\Phi ^2}(\mathbf {z},\mathbf {w},\varvec{\delta })&=D_{\Phi ^2}(P_{IJ}||R_{KL}^{\mathbf {z}\mathbf {w}\varvec{\delta }}) =\sum _{i,j} \frac{(p_{ij}-r_{ij}^{\mathbf {z}\mathbf {w}\varvec{\delta }})^2}{p_{i.}p_{.j}} \\&=\sum _{i,j} \frac{(p_{ij}-p_{i.}p_{.j}\sum _{k,\ell }z_{ik}w_{j\ell }\delta _{k\ell })^2}{p_{i.}p_{.j}} \\&=\sum _{i,j} p_{i.}p_{.j} \left( \frac{p_{ij}}{p_{i.}p_{.j}}-\sum _{k,\ell }z_{ik}w_{j\ell }\delta _{k\ell }\right) ^2\\&=\sum _{i,j}p_{i.}p_{.j}\left( \frac{(\sum _{k,\ell }z_{ik}w_{j\ell })p_{ij}}{p_{i.}p_{.j}} -\sum _{k,\ell }z_{ik}w_{j\ell }\delta _{k\ell }\right) ^2 \end{aligned}$$

and since \(z_{ik}w_{j\ell } \in \{0,1\}\), we obtain \(\widetilde{W}_{\Phi ^2}(\mathbf {z},\mathbf {w},\varvec{\delta }) =\sum _{i,j,k,\ell } z_{ik} w_{j\ell } \,p_{i.}p_{.j} \left( \frac{p_{ij}}{p_{i.}p_{.j}}-\delta _{k\ell }\right) ^2\)
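This identity can be checked numerically for hard partitions and an arbitrary \(\varvec{\delta }\); the right-hand side is evaluated below as the explicit four-index sum (sizes illustrative).

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, g, m = 8, 6, 3, 2

p = rng.random((n, d)); p /= p.sum()
z = np.eye(g)[np.arange(n) % g]          # hard row partition
w = np.eye(m)[np.arange(d) % m]          # hard column partition
pi, pj = p.sum(axis=1), p.sum(axis=0)
prod = np.outer(pi, pj)
delta = rng.random((g, m))               # arbitrary block parameters delta_kl

r = prod * (z @ delta @ w.T)             # r_ij^{zw,delta}
lhs = np.sum((p - r) ** 2 / prod)        # D_{Phi^2}(P || R)

# right-hand side of Eq. (7) as an explicit sum over i, j, k, l
A = (p / prod)[:, :, None, None] - delta[None, None, :, :]
rhs = np.einsum('ik,jl,i,j,ijkl->', z, w, pi, pj, A ** 2)

assert np.isclose(lhs, rhs)
```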

Appendix H: Proof of Eq. (8)

$$\begin{aligned} \widetilde{W}_{\mathcal {I}}(\mathbf {z},\mathbf {w},\varvec{\delta })&=KL(P_{IJ}||R_{IJ}^{\mathbf {z}\mathbf {w}\varvec{\delta }}) =\sum _{i,j} p_{ij} \log \frac{p_{ij}}{r_{ij}^{\mathbf {z}\mathbf {w}\varvec{\delta }}}\\&=\sum _{i,j} p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j} \sum _{k,\ell } z_{ik} w_{j\ell }\delta _{k\ell }}\\&=\sum _{i,j} p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j}} - \sum _{i,j} p_{ij} \log \sum _{k,\ell } z_{ik} w_{j\ell }\delta _{k\ell }\\&=\sum _{i,j} p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j}} -\sum _{i,k} z_{ik} \sum _\ell p_{i\ell }^{\mathbf {w}} \log \delta _{k\ell }\\&=\sum _{i,j} p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j}} -\sum _{k,\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}} \log \delta _{k\ell }. \end{aligned}$$
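The same kind of numerical check works for Eq. (8): for hard partitions, \(\text {KL}(P_{IJ}||R_{IJ}^{\mathbf {z}\mathbf {w}\varvec{\delta }})\) equals \(\mathcal {I}(P_{IJ})-\sum _{k,\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}} \log \delta _{k\ell }\) (sizes and \(\varvec{\delta }\) illustrative).

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, g, m = 8, 6, 3, 2

p = rng.random((n, d)); p /= p.sum()
z = np.eye(g)[np.arange(n) % g]
w = np.eye(m)[np.arange(d) % m]
pi, pj = p.sum(axis=1), p.sum(axis=0)
prod = np.outer(pi, pj)
delta = 0.5 + rng.random((g, m))         # positive block parameters

r = prod * (z @ delta @ w.T)
lhs = np.sum(p * np.log(p / r))          # KL(P || R)

pkl = z.T @ p @ w
rhs = np.sum(p * np.log(p / prod)) - np.sum(pkl * np.log(delta))

assert np.isclose(lhs, rhs)
```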

Appendix I: Proof of Proposition 4

Using Eq. (7), the problem can be formulated as \({{\mathrm{argmin}}}_{\delta _{k\ell }} F(\delta _{k\ell }) \quad \forall k,\ell \) where

$$\begin{aligned} F(\delta _{k\ell })= & {} \sum _{i,j}z_{ik}w_{j\ell }p_{i.}p_{.j}\left( \frac{p_{ij}}{p_{i.}p_{.j}}-\delta _{k\ell }\right) ^2 =p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}} \delta _{k\ell }^2 -2 p_{k\ell }^{\mathbf {z}\mathbf {w}} \delta _{k\ell }\\&+\sum _{i,j}z_{ik}w_{j\ell } \frac{p_{ij}^2}{p_{i.}p_{.j}} \end{aligned}$$

is a quadratic function of \(\delta _{k\ell }\) with positive leading coefficient \(p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}}\), which attains its minimum at \(\delta _{k\ell }=-\frac{- 2 p_{k\ell }^{\mathbf {z}\mathbf {w}}}{2 (p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}})} =\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}.\)

Appendix J: Proof of Proposition 5

The problem can be formulated as \({{\mathrm{argmax}}}_{\delta _{k\ell }} \sum _{k,\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}} \log \delta _{k\ell } \ \text {with} \ \sum _{k,\ell }p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}} \delta _{k\ell }=1.\) We solve it by the method of Lagrange multipliers: introducing a new variable \(\lambda \), we study the Lagrange function \(F(\varvec{\delta },\lambda )=\sum _{k,\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}}\log \delta _{k\ell } -\lambda (\sum _{k,\ell }p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}\delta _{k\ell }-1).\) Then we have, \(\forall k,\ell \), \( \frac{\partial F}{\partial \delta _{k\ell }}=0 \Rightarrow \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{\delta _{k\ell }} - \lambda p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}=0 \Rightarrow \delta _{k\ell }=\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{\lambda p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}}}. \) The constraint \(\sum _{k,\ell } p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}} \delta _{k\ell }=1\) then yields \(\lambda =\sum _{k,\ell }p_{k\ell }^{\mathbf {z}\mathbf {w}}=1\), and therefore \(\delta _{k\ell }=\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}.\)
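Propositions 4 and 5 both single out \(\delta _{k\ell }=p_{k\ell }^{\mathbf {z}\mathbf {w}}/(p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}})\); the sketch below (illustrative sizes) checks numerically that this choice beats randomly perturbed alternatives for both criteria.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, g, m = 8, 6, 3, 2

p = rng.random((n, d)); p /= p.sum()
z = np.eye(g)[np.arange(n) % g]
w = np.eye(m)[np.arange(d) % m]
pi, pj = p.sum(axis=1), p.sum(axis=0)
prod = np.outer(pi, pj)
pkl = z.T @ p @ w
pk, pl = z.T @ pi, w.T @ pj

delta_star = pkl / np.outer(pk, pl)      # candidate optimum

def w_phi2(delta):                       # criterion of Eq. (7)  (Proposition 4)
    return np.sum(prod * (p / prod - z @ delta @ w.T) ** 2)

def obj_mi(delta):                       # sum_kl p_kl log delta_kl  (Proposition 5)
    return np.sum(pkl * np.log(delta))

for _ in range(200):
    other = np.abs(delta_star + 0.2 * rng.standard_normal((g, m))) + 1e-9
    assert w_phi2(delta_star) <= w_phi2(other)
    # rescale to satisfy the constraint sum_kl p_k. p_.l delta_kl = 1
    feas = other / np.sum(np.outer(pk, pl) * other)
    assert obj_mi(delta_star) >= obj_mi(feas)
```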

Appendix K: VEM algorithm

Knowing that \(y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}}=\sum _{i,j} \widetilde{z}_{ik} \widetilde{w}_{j\ell } \, x_{ij}\), \(y_{k.}^{\widetilde{\mathbf {z}}}=\sum _{\ell } y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}}\) and \(y_{.\ell }^{\widetilde{\mathbf {w}}}=\sum _k y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}}\), VEM alternates the following steps

  • Update of \(\varvec{\theta }\): \(\pi _k=\frac{\widetilde{z}_{.k}}{n}\), \(\rho _\ell =\frac{\widetilde{w}_{.\ell }}{d}\) and \(\gamma _{k\ell }=\frac{y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}}}{y_{k.}^{\widetilde{\mathbf {z}}}y_{.\ell }^{\widetilde{\mathbf {w}}}}\)

  • Update of \(\widetilde{\mathbf {z}}\): \(\tilde{z}_{ik} \propto \pi _k \exp (\sum _{j,\ell } \widetilde{w}_{j\ell }\, x_{ij} \log \gamma _{k\ell })\)

  • Update of \(\widetilde{\mathbf {w}}\): \(\tilde{w}_{j\ell } \propto \rho _\ell \exp (\sum _{i,k} \widetilde{z}_{ik}\, x_{ij} \log \gamma _{k\ell })\)

Proof

  • Update of \(\varvec{\theta }\):

    • Equations (16) and (17) lead to \({{\mathrm{argmax}}}_{\varvec{\pi }} \sum _k \widetilde{z}_k \log \pi _k\) and then to \(\pi _k=\frac{\widetilde{z}_{.k}}{n}\) \(\forall k.\)

    • Similarly, we have \(\rho _\ell =\frac{\widetilde{w}_{.\ell }}{d}\qquad \forall \ell .\)

    • For \(\gamma _{k\ell }\), we have \(\forall k,\ell \),

    $$\begin{aligned} \hat{\gamma }_{k\ell }= & {} {{\mathrm{argmax}}}_{\gamma _{k\ell }} \sum _{i,j} \widetilde{z}_{ik} \widetilde{w}_{j\ell }(x_{ij} \log \gamma _{k\ell }-x_{i.} x_{.j} \gamma _{k\ell })\\= & {} {{\mathrm{argmax}}}_{\gamma _{k\ell }} \left( y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}} \log \gamma _{k\ell }-y_{k.}^{\widetilde{\mathbf {z}}}y_{.\ell }^{\widetilde{\mathbf {w}}}\, \gamma _{k\ell }\right) =\frac{y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}}}{y_{k.}^{\widetilde{\mathbf {z}}}y_{.\ell }^{\widetilde{\mathbf {w}}}}. \end{aligned}$$
  • Update of \(\widetilde{\mathbf {z}}\): Eqs. (16) and (17) lead, for all i, to

    $$\begin{aligned} {{\mathrm{argmax}}}_{\widetilde{z}_{ik}} \sum _k \left( \widetilde{z}_{ik} \log \pi _k + \widetilde{z}_{ik}\sum _{j,\ell } \widetilde{w}_{j\ell } (x_{ij} \log \gamma _{k\ell } -x_{i.}x_{.j} \gamma _{k\ell }) - \widetilde{z}_{ik} \log \widetilde{z}_{ik}\right) \end{aligned}$$

    \(\forall i,k\) under the constraint \(\sum _k \widetilde{z}_{ik}=1\). This takes the following form

    $$\begin{aligned} {{\mathrm{argmax}}}_{\widetilde{z}_{ik}}\sum _k\left( \widetilde{z}_{ik} \log \pi _k + \widetilde{z}_{ik} \sum _\ell \left( \Big (\sum _j \widetilde{w}_{j\ell } x_{ij}\Big ) \log \gamma _{k\ell } - x_{i.} y_{.\ell }^{\widetilde{\mathbf {w}}} \gamma _{k\ell }\right) - \widetilde{z}_{ik} \log \widetilde{z}_{ik}\right) \end{aligned}$$

    and since \(\gamma _{k\ell }=\frac{y_{k\ell }^{\widetilde{\mathbf {z}}'\widetilde{\mathbf {w}}}}{y_{k.}^{\widetilde{\mathbf {z}}'}y_{.\ell }^{\widetilde{\mathbf {w}}}}\) where \(\widetilde{\mathbf {z}}'\) means the vector of membership values computed above, we obtain

    $$\begin{aligned} \sum _\ell x_{i.} y_{.\ell }^{\widetilde{\mathbf {w}}} \gamma _{k\ell }= \sum _\ell x_{i.} y_{.\ell }^{\widetilde{\mathbf {w}}} \frac{y_{k\ell }^{\widetilde{\mathbf {z}}'\widetilde{\mathbf {w}}}}{y_{k.}^{\widetilde{\mathbf {z}}'}y_{.\ell }^{\widetilde{\mathbf {w}}}}=x_{i.} \frac{\sum _\ell y_{k\ell }^{\widetilde{\mathbf {z}}'\widetilde{\mathbf {w}}}}{y_{k.}^{\widetilde{\mathbf {z}}'}}=x_{i.} \end{aligned}$$

    which does not depend on k and then

    $$\begin{aligned} {{\mathrm{argmax}}}_{\widetilde{z}_{ik}}\sum _k\left( \widetilde{z}_{ik} \log \pi _k + \widetilde{z}_{ik} \sum _{j,\ell } \widetilde{w}_{j\ell }\, x_{ij} \log \gamma _{k\ell } - \widetilde{z}_{ik} \log \widetilde{z}_{ik}\right) \end{aligned}$$

under the constraint \(\sum _k \widetilde{z}_{ik}=1\). Using a Lagrange multiplier, we obtain

    $$\begin{aligned} \tilde{z}_{ik} =\frac{\pi _k \exp (\sum _{j,\ell } \widetilde{w}_{j\ell }\, x_{ij} \log \gamma _{k\ell })}{\sum _{k'} \pi _{k'} \exp (\sum _{j,\ell } \widetilde{w}_{j\ell }\, x_{ij} \log \gamma _{k'\ell })}\propto \pi _k \exp \left( \sum _{j,\ell } \widetilde{w}_{j\ell }\, x_{ij} \log \gamma _{k\ell }\right) \end{aligned}$$
  • The update of \(\widetilde{\mathbf {w}}\) is proven in a similar way.

\(\square \)


Cite this article

Govaert, G., Nadif, M. Mutual information, phi-squared and model-based co-clustering for contingency tables. Adv Data Anal Classif 12, 455–488 (2018). https://doi.org/10.1007/s11634-016-0274-6

