1 Introduction

Co-clustering, referred to by a variety of names, aims at simultaneously partitioning both dimensions of a data matrix (Madeira & Oliveira, 2004; Van Mechelen et al., 2004; Govaert & Nadif, 2013). It has demonstrated better performance than one-sided clustering for high-dimensional data (Banerjee et al., 2004; Deodhar & Ghosh, 2010; Yu et al., 2019; Bock, 2020; Affeldt et al., 2021). Several approaches are dedicated to co-clustering. Among these, the Latent Block Model (LBM) (Govaert & Nadif, 2005, 2018) is a probabilistic model devoted to co-clustering that allows modeling data of different types with an appropriate mixture distribution (Govaert & Nadif, 2013; Ailem et al., 2017), deriving efficient co-clustering algorithms based on variants of the EM algorithm (Dempster et al., 1977), and performing model selection to determine the appropriate numbers of row and column clusters.

Semi-supervised (or constrained) clustering (Pensa & Boulicaut, 2008; Basu et al., 2008) allows clustering algorithms to better recover the clusters of a dataset using partial supervision on the set of data points. Co-clustering algorithms can benefit from side information in both the row and column spaces (Song et al., 2010; Salah & Nadif, 2017; Salah et al., 2018; Affeldt et al., 2021). However, the existing semi-supervised approaches have not been formulated in the probabilistic setting of the LBM.

In this paper, we propose a general model, namely HLBM, to incorporate side information in the form of pairwise constraints between the rows and the columns of a data matrix into the LBM. This semi-supervision is formulated in a probabilistic setup using Hidden Markov Random Fields (HMRF). We instantiate this model for count data with a Poisson distribution and propose two algorithms based on Classification EM and Variational EM. We analyze the behavior of these algorithms when varying the trade-off between the semi-supervision and the data likelihood on data simulated with the model. We apply our algorithm to real-world attributed networks and compare its clustering performance to that of existing algorithms.

2 Related work

In the domain of unsupervised image segmentation, Ambroise and Govaert (1998) propose to introduce similarity constraints between the data points in EM by optimizing a penalized variational criterion for one-sided clustering, and they investigate the convergence properties of their algorithm. Celeux et al. (2003) present three algorithms for clustering with mixture models and HMRFs (mean field, mode field and simulated field EM) and compare them to the iterated conditional modes (ICM) algorithm (Besag, 1986), which maximizes the pseudo-likelihood using a Maximum A Posteriori classification rule. The authors report the good performance of the mean field and simulated field EM algorithms compared with ICM.

Wagstaff et al. (2001) propose to add Must Link (ML) and Cannot Link (CL) constraints to the k-means algorithm. These constraints cannot be violated and are not formulated in a probabilistic context. Basu et al. (2004) present a probabilistic framework that uses an HMRF to include ML and CL relationships and propose an algorithm based on ICM. This algorithm is compared with Belief Propagation and a linear programming relaxation of the objective function in (Bilenko & Basu, 2004), where the authors present empirical evidence that ICM gives results similar to these more complex algorithms when the number of constraints is large enough. Lange et al. (2005) experimentally show that their algorithm, based on deterministic annealing with mean-field variational inference, generally gives better clustering performance than the ICM-based algorithm of Basu et al. (2004), even with a large number of constraints. Tang et al. (2009) consider ML and CL relationships in the context of graph clustering with matrix factorization by adding penalty terms computed with the Euclidean distance between the learned factors.

Pensa and Boulicaut (2008) present a constrained co-clustering algorithm based on a metric approach that includes ML, CL as well as interval constraints. As in Wagstaff et al. (2001), these constraints cannot be violated. An interval constraint is defined from an ordering on the set of rows (or columns) and requires the obtained row (or column) clusters to be intervals w.r.t. the given ordering. Kilic et al. (2016) propose a semi-supervised co-clustering algorithm in a "fuzzy" context in which the supervision is expressed by fixing, during inference, the labels of the data points whose cluster is known. This approach is also used in Nam et al. (2020) in the context of the LBM. Yan et al. (2013) introduce "fuzzy" ML and CL relationships in their metric-based approach to co-clustering. The additive penalty terms introduced by the constraints in their objective functions (without justification) can be seen as a special case of ours where all relationships have the same weight.

Another approach to incorporating constraints in an unsupervised setting is based on manifold learning with Laplacian regularization. It consists in building a k-nearest-neighbors graph that describes the intrinsic geometry of the data. It has been proposed in Zhu and Lafferty (2005) and He et al. (2011) for Gaussian mixture models, but the E or M steps cannot be expressed in closed form, requiring gradient methods or heuristics. This approach has been applied in Salah and Nadif (2017) to von Mises-Fisher mixture models, in the context of item recommendation, to incorporate constraints from a social network that connects the users.

As we focus on the co-clustering task, the closest work to ours is the CITTC model of Song et al. (2010), where the authors propose a constrained version of the information theoretic co-clustering (ITCC) model of Dhillon et al. (2003) by using two HMRFs. However, Govaert and Nadif (2018) proved that ITCC with the Kullback-Leibler divergence as the chosen Bregman divergence is equivalent to the Poisson LBM with equal mixture proportions. This assumption of equal mixture proportions makes it difficult for the algorithm to recover unbalanced clusters. Moreover, from an algorithmic point of view, ITCC does not benefit from reduced intermediate matrices (see Sect. 4.1). As presented in Appendix 1, CITTC can be seen as a particular case of our model.

3 The proposed model

The data is represented by a matrix \({{\varvec{X}}= (x_{ij})}\) of size \(n \times d\), where the \(x_{ij}\) are assumed to be sampled from a given parametric distribution of density \(\phi\). The value of each entry of the data matrix depends on the latent row and column partitions and on the model parameters.

3.1 Definition of the model

3.1.1 Sampling the latent variables in the HMRF

The partition of the set of rows into g clusters is represented by the latent classification matrix \({\varvec{Z}}= (z_{ik})\), with \(\sum _{k=1}^g z_{ik} = 1\), where \(z_{ik} = 1\) if row i belongs to row cluster k and \(z_{ik} = 0\) otherwise. Alternatively, we write \(z_i \in \{1, \dots , g\}\) for the cluster index of row i. Similarly, the partition of the set of columns into m clusters is represented by the classification matrix \({\varvec{W}}= (w_{j\ell })\), where \(w_{j\ell } = 1\) or \(w_j = \ell\) if column j belongs to column cluster \(\ell\). Denoting by \({\mathcal {Z}}\) and \({\mathcal {W}}\) the sets of possible partitions of the rows and columns of \({\varvec{X}}\) into respectively g and m clusters, the latent space of the model is \({\mathcal {Z}} \times {\mathcal {W}}\). Let \({\varvec{\varTheta }}\) be the vector of parameters of the model. The classification matrices of the rows and the columns are a priori independent:

$$\begin{aligned} p({\varvec{Z}}, {\varvec{W}}; {\varvec{\varTheta }}) = p({\varvec{Z}}; {\varvec{\varTheta }})p({\varvec{W}}; {\varvec{\varTheta }}). \end{aligned}$$

In the following, the semi-supervision over the set of rows (resp. columns) is expressed using pairwise and symmetric relationships between the latent classification vectors \({\varvec{z}}_i\) (resp. \({\varvec{w}}_j\)). Let \(\varUpsilon _r\) (resp. \(\varUpsilon _c\)) be the set of rows (resp. columns) that are in a semi-supervision relationship. We write \({\mathcal {N}}_r(i) \subset \varUpsilon _r\) for the set of neighbors of row i, where \(i \not \in {\mathcal {N}}_r(i)\), and define a Markov Random Field (MRF) on \(\varUpsilon _r\), where for \(i \in \varUpsilon _r\), the random variable \({\varvec{z}}_i\) depends on a set of neighboring random variables \(\{{\varvec{z}}_{i'} \, |\, i'\in {\mathcal {N}}_r(i)\}\). In an MRF, the graph of conditional independence is undirected, that is \(i' \in {\mathcal {N}}_r(i) \iff i \in {\mathcal {N}}_r(i')\). The Hammersley-Clifford theorem implies that the joint distribution of the MRF can be represented as a product of factors, one per maximal clique of the graph. In the following, we restrict the model to be a pairwise MRF, that is, the parameterization of the joint distribution is restricted to the edges of the graph, rather than the maximal cliques. For \(i \in \varUpsilon _r\) and \(i' \in {\mathcal {N}}_r(i)\), we define the edge potential functions \(\psi _{ii'}^r\) depending on the latent classification matrix \({\varvec{Z}}\) and the potential parameter \({\varvec{\varXi }}^r\). For the other nodes, \(i \notin \varUpsilon _r\), we let the latent variables be independent random variables following a categorical distribution of parameter \({\varvec{\alpha }}\), where \(\sum _k \alpha _k = 1\), as in a classical LBM (Govaert & Nadif, 2008). Thus, the joint distribution over \({\mathcal {Z}}\) is given by:

$$\begin{aligned} p({\varvec{Z}}; {\varvec{\varTheta }}) = \varGamma _r({\varvec{\varXi }}^r)^{-1} \exp \Big (\sum _{i \notin \varUpsilon _r} \log \alpha _{z_{i}} + \frac{1}{2}\sum _{i\in \varUpsilon _r}\sum _{i' \in {\mathcal {N}}_r(i)} \log \psi _{ii'}^r({\varvec{z}}_{i}, {\varvec{z}}_{i'} ;{\varvec{\varXi }}^r)\Big ), \end{aligned}$$
(1)

where \(\varGamma _r\) is the partition function of the HMRF on the rows. We can show that \(\varGamma _r\) only depends on \({\varvec{\varXi }}^r\) (see Appendix 1). We define a similar HMRF on \(\varUpsilon _c\) for the columns, with potentials \(\psi _{jj'}^c\), mixture proportions \({\varvec{\beta }}\) and partition function \(\varGamma _c({\varvec{\varXi }}^c)\).

In the following, we consider two types of relationships, ML and CL, and define the edge potential in the MRF so that nodes in an ML relationship are more likely to be in the same cluster and nodes in a CL relationship are more likely to be in different clusters. In order to define the potential functions, we consider a given symmetric weight matrix \({\varvec{\varXi }}^r = (\xi _{ii'}^r)\), where \(\xi _{ii'}^r \ge 0\) corresponds to the weight of the ML or CL relationship between row i and row \(i'\):

$$\begin{aligned} \log \psi _{ii'}^r({\varvec{z}}_{i}, {\varvec{z}}_{i'} ; {\varvec{\varXi }}^r) = {\left\{ \begin{array}{ll} - \xi _{ii'}^r \mathbbm {1}(z_{i'} \ne z_i) &{} (i, i') \in {\mathcal {M}}^r \\ - \xi _{ii'}^r \mathbbm {1}(z_{i'} = z_i) &{} (i, i') \in {\mathcal {C}}^r. \end{array}\right. } \end{aligned}$$

where \({\mathcal {M}}^r\) (resp. \({\mathcal {C}}^r\)) denotes the set of undirected edges representing a ML (resp. CL) relationship and \(\mathbbm {1}(.)\) returns 1 if its argument is true and 0 otherwise. In the same way, we define the edge potential for the set of columns \(\psi _{jj'}^c({\varvec{w}}_{j}, {\varvec{w}}_{j'}; {\varvec{\varXi }}^c)\), with parameter matrix \({\varvec{\varXi }}^c = (\xi _{jj'}^c)\). This defines a distribution on the latent space.

Fig. 1 Graphical model of the HLBM, where \({\varvec{\epsilon }}\) is the parameter of the mixture distribution, \({\varvec{\alpha }}=(\alpha _1,\ldots ,\alpha _g)\) and \({\varvec{\beta }}=(\beta _1,\ldots ,\beta _m)\) are the mixture proportions, and \({\varvec{\varXi }}^r\) and \({\varvec{\varXi }}^c\) are the parameter matrices of the potentials of the HMRFs

3.1.2 Sampling the observed variables

The univariate random variables \(x_{ij}\) are conditionally independent given \({\varvec{Z}}\) and \({\varvec{W}}\) and follow a probability distribution of density function \(\phi\) and parameter \({\varvec{\epsilon }}= (\epsilon _{ijk\ell })\): \(x_{ij} |(z_{ik} = 1, w_{j\ell } =1)\; \sim \; \phi (.; \epsilon _{ijk\ell }).\) Note that the general definition with parameter \(\epsilon _{ijk\ell }\) includes the more classical parameterization \(\epsilon _{ijk\ell } = \gamma _{k\ell }\) and is not intended for practical use as such. The graphical model is depicted in Fig. 1. Thus,

$$\begin{aligned} p({\varvec{X}}| {\varvec{Z}}, {\varvec{W}}; {\varvec{\varTheta }}) = \prod _{ij} \phi (x_{ij}; \epsilon _{ijz_i w_j}) = \prod _{ijk\ell } \phi (x_{ij}; \epsilon _{ijk\ell })^{z_{ik}w_{j\ell }}. \end{aligned}$$
(2)

3.1.3 Complete data log-likelihood

The vector of parameters of the model is \({\varvec{\varTheta }}= \{{\varvec{\alpha }},\, {\varvec{\beta }},\,{\varvec{\epsilon }},\, {\varvec{\varXi }}^r,\, {\varvec{\varXi }}^c\}\). Using (1) and (2), the complete data log-likelihood is given by:

$$\begin{aligned} \log p({\varvec{X}}, {\varvec{Z}}, {\varvec{W}}; {\varvec{\varTheta }})&= \log p({\varvec{Z}}; {\varvec{\varTheta }}) + \log p({\varvec{W}}; {\varvec{\varTheta }}) + \log p({\varvec{X}}| {\varvec{Z}}, {\varvec{W}}; {\varvec{\varTheta }})\\&= \sum _{i \notin \varUpsilon _r} \sum _k z_{ik} \log \alpha _k + \sum _{j \notin \varUpsilon _c} \sum _\ell w_{j\ell } \log \beta _\ell \\&\quad- \sum _{(i, i') \in {\mathcal {M}}^r} \xi _{ii'}^r \mathbbm {1}(z_{i'} \ne z_i) - \sum _{(i, i') \in {\mathcal {C}}^r} \xi _{ii'}^r \mathbbm {1}(z_{i'} = z_i)\\&\quad- \sum _{(j, j') \in {\mathcal {M}}^c} \xi _{jj'}^c \mathbbm {1}(w_{j'} \ne w_j) - \sum _{(j, j') \in {\mathcal {C}}^c} \xi _{jj'}^c \mathbbm {1}(w_{j'} = w_j)\\&\quad+ \sum _{ijk\ell } z_{ik}w_{j\ell }\log \phi (x_{ij}; \epsilon _{ijk\ell }) - \log \big (\varGamma _r({\varvec{\varXi }}^r) \varGamma _c({\varvec{\varXi }}^c)\big ). \end{aligned}$$

In this semi-supervised setup, \({\varvec{\varTheta }}\) can be decomposed as \({\varvec{\varTheta }}= \{{\varvec{\varTheta }}_{L}, {\varvec{\varTheta }}_{F}\}\), where \({\varvec{\varTheta }}_{L} = \{{\varvec{\alpha }}, \, {\varvec{\beta }}, \,{\varvec{\epsilon }}\}\) is to be learned and \({\varvec{\varTheta }}_{F} = \{{\varvec{\varXi }}^r, \, {\varvec{\varXi }}^c\}\) is fixed since it is given as input of the algorithms. In the following, we define \({\varvec{S}}^r = (s_{ii'}^r)\) such that:

$$\begin{aligned} s_{ii'}^r = {\left\{ \begin{array}{ll} \lambda _r^{-1} \xi _{ii'}^r &{} (i, i') \in {\mathcal {M}}^r \\ - \lambda _r^{-1} \xi _{ii'}^r &{} (i, i') \in {\mathcal {C}}^r \\ 0 &{} \text{ otherwise, } \end{array}\right. } \end{aligned}$$

where \(\lambda _r > 0\) is a scaling factor for all the weights. The model is not identifiable for \(\lambda _r\) and \({\varvec{S}}^r\), but this is not a problem since these parameters are fixed. Thereby, the log-potential can be simply written up to a constant as \(\log \psi _{ii'}^r({\varvec{z}}_{i}, {\varvec{z}}_{i'} ;\lambda _r, {\varvec{S}}^r) = \lambda _r s_{ii'}^r \sum _k z_{ik} z_{i'k}\). Similarly, we can define \({\varvec{S}}^c = (s_{jj'}^c)\) with scaling factor \(\lambda _c\) for the weights on the column space. Since \(\varGamma _r\) (resp. \(\varGamma _c\)) depends only on \({\varvec{\varXi }}^r\) (resp. \({\varvec{\varXi }}^c\)) (see Appendix 1), the complete data log-likelihood reduces, up to a constant, to:

$$\begin{aligned}&\log p({\varvec{X}}, {\varvec{Z}}, {\varvec{W}}; {\varvec{\varTheta }}) = \sum _{i \notin \varUpsilon _r} \sum _k z_{ik} \log \alpha _k + \sum _{j \notin \varUpsilon _c} \sum _\ell w_{j\ell } \log \beta _\ell \\&+ \frac{\lambda _r}{2} \sum _{ii'k} s_{ii'}^r z_{ik}z_{i'k} + \frac{\lambda _c}{2} \sum _{jj'\ell } s_{jj'}^c w_{j\ell }w_{j'\ell } + \sum _{ijk\ell } z_{ik}w_{j\ell }\log \phi (x_{ij}; \epsilon _{ijk\ell }) + C. \nonumber \end{aligned}$$
(3)
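To fix ideas, the signed weight matrix \({\varvec{S}}^r\) defined above can be assembled from lists of ML and CL pairs as in the following minimal sketch (ours, not part of the paper); the helper name and the constant weight \(\xi\) are illustrative assumptions.

```python
import numpy as np

def build_constraint_matrix(n, ml_pairs, cl_pairs, xi=1.0):
    """Assemble the symmetric matrix S^r = (s_{ii'}) of Sect. 3.1.3.

    ml_pairs / cl_pairs: iterables of (i, i') index pairs (ML and CL).
    xi: common relationship weight (constant here for simplicity).
    The scaling factor lambda_r is kept outside S^r, as in the text.
    """
    S = np.zeros((n, n))
    for i, ip in ml_pairs:           # ML relationship: positive entry
        S[i, ip] = S[ip, i] = xi
    for i, ip in cl_pairs:           # CL relationship: negative entry
        S[i, ip] = S[ip, i] = -xi
    return S

# Example: 5 rows, one ML pair and one CL pair
S_r = build_constraint_matrix(5, ml_pairs=[(0, 1)], cl_pairs=[(2, 3)])
```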

3.1.4 Including an external field in the HMRF

In the proposed model, the rows or columns involved in the HMRFs do not contribute to the mixture proportions of the model. As proposed in (Celeux et al., 2002), we can address this problem by including all nodes in the MRFs and defining mixture-like parameters \({\varvec{\alpha }}\) as an external field, using node potentials. Thus, we can define the following variant of the model, where \(\varUpsilon _r = \{1, \ldots , n\}\) and the joint distribution writes:

$$\begin{aligned} \log p({\varvec{Z}}; {\varvec{\varTheta }}) = - \log \varGamma _r({\varvec{\alpha }}, \lambda _r, {\varvec{S}}^r) + \sum _{ik} z_{ik} \log \alpha _k + \frac{\lambda _r}{2}\sum _{ii'k} s_{ii'}^r z_{ik}z_{i'k}. \end{aligned}$$

The specificity of this model is that the nodes that are in a semi-supervision relationship also contribute to the mixture-like parameter \({\varvec{\alpha }}\). It must however be noted that the normalization constant \(\varGamma _r\) now depends on \({\varvec{\alpha }}\), \(\lambda _r\) and \({\varvec{S}}^r\).

3.2 Inference with the EM algorithm

In order to use the model in a clustering setting, we want to jointly infer the latent variables \({\varvec{Z}}, {\varvec{W}}\) and to learn the model parameters. In the following, we develop two approaches, based respectively on Classification EM (Celeux & Govaert, 1992) and on Variational EM (Govaert & Nadif, 2005).

3.2.1 Classification EM approach

In the Classification EM (CEM) approach (Govaert & Nadif, 2008), we maximize \(\log p({\varvec{X}}, {\varvec{Z}}, {\varvec{W}}; {\varvec{\varTheta }})\) by alternating the maximization w.r.t. \({\varvec{Z}}\), \({\varvec{\varTheta }}\), \({\varvec{W}}\) and \({\varvec{\varTheta }}\). However, here, the maximization w.r.t. a classification matrix, e.g. \({\varvec{Z}}\), is not tractable because of the dependencies introduced by the HMRF. Thus, we use ICM, a coordinate ascent algorithm in which we maximize \(\log p({\varvec{X}}, {\varvec{Z}}, {\varvec{W}}; {\varvec{\varTheta }})\) w.r.t. \(z_i \in \{1, \dots , g\}\), keeping \((z_{i'})_{i' \ne i}\) fixed. For \(i \in \varUpsilon _r\), the classification E-step is given by: \(z_i = {{\,\mathrm{arg\!\max }\,}}_{z_i}\log p({\varvec{X}}, z_i, (z_{i'})_{i' \ne i}, {\varvec{W}}; {\varvec{\varTheta }})\). Thus, we can show that the CE-step is:

$$\begin{aligned} z_i = {{\,\mathrm{arg\!\max }\,}}_{k} \big (\lambda _r \sum _{i'} s_{ii'}^r z_{i'k} + \sum _{j\ell } w_{j\ell }\log \phi (x_{ij}; \epsilon _{ijk\ell }) \big ). \end{aligned}$$

The M-step is given by \({\varvec{\varTheta }}= {{\,\mathrm{arg\!\max }\,}}_{{\varvec{\varTheta }}} \log p({\varvec{X}}, {\varvec{Z}}, {\varvec{W}}; {\varvec{\varTheta }})\) and depends on the class-conditional densities.
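As an illustration, here is a hedged sketch of this CE-step for a single row i, written for a generic class-conditional log-density; the function and argument names (log_phi in particular) are ours and stand in for whichever parameterization \(\phi(.; \epsilon _{ijk\ell })\) is chosen.

```python
import numpy as np

def ce_step_row(i, z, w, X, S_r, lam_r, g, log_phi):
    """ICM update of z_i (Sect. 3.2.1): keep the other rows and the columns
    fixed and pick the row cluster k maximising the supervision term plus
    the data term.

    z, w    : current hard row / column labels (integer arrays)
    X       : (n, d) data matrix
    S_r     : (n, n) signed constraint weights s_{ii'} (zero diagonal)
    log_phi : callable (x, k, l) -> log phi(x; eps_{ijkl})
    """
    scores = np.empty(g)
    for k in range(g):
        supervision = lam_r * np.sum(S_r[i] * (z == k))   # lambda_r * sum_i' s_{ii'} z_{i'k}
        data = sum(log_phi(X[i, j], k, w[j]) for j in range(X.shape[1]))
        scores[k] = supervision + data
    return int(np.argmax(scores))
```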

3.2.2 Variational EM approach

A variational approximation of the posterior distribution can be used as in (Govaert & Nadif, 2005). Let Q be a probability over the latent space \({\mathcal {Z}} \times {\mathcal {W}}\), parameterized by \(\widetilde{\varvec{Z}}= (\tilde{z}_{ik})\) and \(\widetilde{{\varvec{W}}}= (\tilde{w}_{j\ell })\), such that

$$\begin{aligned} Q({\varvec{Z}}, {\varvec{W}}; \widetilde{\varvec{Z}}, \widetilde{{\varvec{W}}}) = Q({\varvec{Z}}; \widetilde{\varvec{Z}})Q({\varvec{W}}; \widetilde{{\varvec{W}}}) = \prod _{ik} \tilde{z}_{ik}^{z_{ik}} \prod _{j\ell } \tilde{w}_{j\ell }^{w_{j\ell }}, \end{aligned}$$

and where \(\sum _k \tilde{z}_{ik} = 1\). We can show that \(Q(z_{ik}=1) = \tilde{z}_{ik}\). The objective function \(F(\widetilde{\varvec{Z}}, \widetilde{{\varvec{W}}}, {\varvec{\varTheta }})\) of the Variational EM algorithm is:

$$\begin{aligned} F(\widetilde{\varvec{Z}}, \widetilde{{\varvec{W}}}, {\varvec{\varTheta }}) = \mathbb {E}_Q \big (\log p({\varvec{X}}, {\varvec{Z}}, {\varvec{W}}; {\varvec{\varTheta }}) \big ) + H(Q). \end{aligned}$$

The objective of this approach is to obtain \(\widetilde{\varvec{Z}}, \widetilde{{\varvec{W}}}, {\varvec{\varTheta }}= {{\,\mathrm{arg\!\max }\,}}_{\widetilde{\varvec{Z}}, \widetilde{{\varvec{W}}}, {\varvec{\varTheta }}} F(\widetilde{\varvec{Z}}, \widetilde{{\varvec{W}}}, {\varvec{\varTheta }})\), which can be reached by alternating the maximization of F w.r.t. \(\widetilde{\varvec{Z}}\), \({\varvec{\varTheta }}\), \(\widetilde{{\varvec{W}}}\), and \({\varvec{\varTheta }}\). Contrary to the classical LBM (Govaert & Nadif, 2008), the maximization of \(F(\widetilde{\varvec{Z}}, \widetilde{{\varvec{W}}}, {\varvec{\varTheta }})\) w.r.t. \(\widetilde{\varvec{Z}}\) cannot be decomposed for each row \(i \in \varUpsilon _{r}\) because of the dependencies introduced in the HMRF. However, we can apply coordinate ascent on \(f(\tilde{{\varvec{z}}}_1, \dots , \tilde{{\varvec{z}}}_n) = F(\widetilde{\varvec{Z}}, \widetilde{{\varvec{W}}}, {\varvec{\varTheta }})\), maximizing over \(\tilde{{\varvec{z}}}_i = (\tilde{z}_{i1}, \dots , \tilde{z}_{ig})^{\top }\) while keeping \((\tilde{{\varvec{z}}}_{i'})_{i' \ne i}\) fixed, such that the Lagrangian of each optimization problem is (see Appendix 1):

$$\begin{aligned} {\mathcal {L}}_{\mu }&= \frac{\lambda _r}{2} \sum _{i'} s_{i'i}^r \sum _k \tilde{z}_{i'k}\tilde{z}_{ik} + \frac{\lambda _r}{2} \sum _{i'} s_{ii'}^r \sum _k \tilde{z}_{ik}\tilde{z}_{i'k} \\& \quad+ \sum _k \tilde{z}_{ik} \sum _{j\ell } \tilde{w}_{j\ell }\log \phi (x_{ij}; \epsilon _{ijk\ell }) - \sum _{k} \tilde{z}_{ik} \log \tilde{z}_{ik} - \mu (1 - \sum _{k} \tilde{z}_{ik}). \end{aligned}$$

Thus, the VE-step is given by the following fixed-point:

$$\begin{aligned} \tilde{z}_{ik} \propto {\left\{ \begin{array}{ll} \exp {\big (\lambda _r \sum _{i'} s_{ii'}^r\tilde{z}_{i'k}\big ) \prod _{j\ell } \phi (x_{ij}; \epsilon _{ijk\ell })^{\tilde{w}_{j\ell }}} &{} i \in \varUpsilon _{r} \\ \alpha _k \prod _{j\ell } \phi (x_{ij}; \epsilon _{ijk\ell })^{\tilde{w}_{j\ell }} &{} i \notin \varUpsilon _r. \end{array}\right. } \end{aligned}$$

Note that the expression of \(\tilde{w}_{j\ell }\) can be derived in the same way. On the other hand, in the general formulation of Celeux et al. (2003), the proposed VE-step corresponds to a mean field approximation followed by a regular E-step. The M-step is given by \({\varvec{\varTheta }}= {{\,\mathrm{arg\!\max }\,}}_{{\varvec{\varTheta }}} F(\widetilde{\varvec{Z}}, \widetilde{{\varvec{W}}}, {\varvec{\varTheta }})\) and depends on the class-conditional densities.

3.2.3 With an external field

For a model with an external field, the E-steps for a row i are:

$$\begin{aligned} {\left\{ \begin{array}{ll} \tilde{z}_{ik} \propto \alpha _k \exp {\big (\lambda _r \sum _{i'} s_{ii'}^r\tilde{z}_{i'k}\big ) \prod _{j\ell } \phi (x_{ij}; \epsilon _{ijk\ell })^{\tilde{w}_{j\ell }}} &{}\text { VE-step}\\ z_i = {{\,\mathrm{arg\!\max }\,}}_{k}\big (\log \alpha _k + \lambda _r \sum _{i'} s_{ii'}^r z_{i'k} + \sum _{j\ell } w_{j\ell }\log \phi (x_{ij}; \epsilon _{ijk\ell }) \big ) &{}\text { CE-step}. \end{array}\right. } \end{aligned}$$

The new potential function includes the mixture parameters as a node potential, which makes the partition function dependent on the mixture parameters \({\varvec{\alpha }}\) and \({\varvec{\beta }}\). Thus, the M-step for the mixture parameters has no closed form, and we propose to ignore this dependence in the M-step. Experimentally, we did not observe any significant difference between the two versions of the model in terms of clustering performance, but the external-field variant offers easier computations. Consequently, we use it in the rest of the paper.

3.3 The proposed algorithm

3.3.1 Iteration of the fixed point in the VE-step

The proposed E-step corresponds to a fixed-point equation of the form \(f(\widetilde{\varvec{Z}}) = \widetilde{\varvec{Z}}\). This is similar to the E-step of the Neighborhood EM algorithm of Ambroise and Govaert (1998) in the case of one-sided mixture models, where only ML relationships are considered. The authors prove that, for \(\lambda _r < (\max _i \sum _{i'} |s_{ii'}^r|)^{-1}\), f is a contraction mapping and the corresponding fixed point is the maximum of the objective criterion for the E-step. In our experiments, we observed that a single iteration of this fixed point seems to suffice, as also suggested in (Celeux et al., 2003). Contrary to (Ambroise & Govaert, 1998), we observed that this sufficient condition on \(\lambda _r\) was too restrictive in our case and did not allow enough regularization (see Sect. 5).

3.3.2 Parallel updates

For efficiency reasons, we use parallel updates in the E-step for the VEM and CEM algorithms in our implementation. This procedure trades the convergence properties of the ICM algorithm (for CEM) or the fixed-point iteration (for VEM) for the benefit of parallel computations.

For VEM, as suggested in (Hinton et al., 2005), we use damping for the parallel updates in the VE-step in order to avoid oscillations. Let \(\tilde{z}_{ik}^{(c+ 1 / 2)}\) be the variational probability obtained after one iteration of the fixed point in the E-step and let \({\eta \in (0, 1)}\). The damped VE-step is given by: \(\tilde{z}_{ik}^{(c+1)} = (1-\eta ) \tilde{z}_{ik}^{(c+1/2)} + \eta \tilde{z}_{ik}^{(c)}\). For CEM, we propose to use sequential updates after a given number of iterations if convergence has not been reached yet. The order of the updates is randomly selected at each E-step. Note that this choice of parallel updates does not allow the use of stochastic variants of EM based on variational inference as explained in (Celeux et al., 2003) for the simulated field algorithm.
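For clarity, here is a minimal sketch of the damped parallel VE-step described above; Z_half denotes the variational probabilities obtained after one parallel fixed-point iteration and eta is the damping coefficient (the names are ours).

```python
import numpy as np

def damped_ve_step(Z_half, Z_prev, eta=0.7):
    """Damped parallel VE-step: a convex combination of the new and previous
    variational probabilities, which limits oscillations.  Rows remain valid
    probability vectors since the combination is convex."""
    return (1.0 - eta) * Z_half + eta * Z_prev

# Usage (with arrays of shape (n, g)): Z_new = damped_ve_step(Z_half, Z_prev)
```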

3.4 Connection to other models

The proposed algorithms can be related to graph convolutional neural networks, where the constraints matrices play the role of the graph adjacency matrix, and to Laplacian regularization, where the ML constraints can be viewed as a k-nearest-neighbors graph describing the manifold on which the data lies. We detail these connections below.

3.4.1 Graph convolutional neural networks

Let \(\varLambda _{ik}^{\widetilde{{\varvec{W}}}^{(c)}} = \prod _{j\ell } \phi (x_{ij}; \epsilon _{ijk\ell }^{(c)})^{\tilde{w}_{j\ell }^{(c)}}\) (resp. \(\varLambda _{j\ell }^{\widetilde{\varvec{Z}}^{(c)}} = \prod _{ik} \phi (x_{ij}; \epsilon _{ijk\ell }^{(c)})^{\tilde{z}_{ik}^{(c)}}\)) and \({\varvec{A}}^{(c)}\) (resp. \({\varvec{B}}^{(c)}\)) be the \(n \times g\) (resp. \(d \times m\)) matrix such that each row of the matrix is \({\varvec{\alpha }}^{(c)}\) (resp. \({\varvec{\beta }}^{(c)}\)). The unnormalized variational probabilities at iteration c of the VEM algorithm can be written:

$$\begin{aligned} {\left\{ \begin{array}{ll} \widetilde{\varvec{Z}}_u^{(c+1)} &{} = {\varvec{A}}^{(c)} \odot \exp {\Big (\lambda _r {\varvec{S}}^r \widetilde{\varvec{Z}}^{(c)}}\Big ) \odot {\varvec{\varLambda }}^{\widetilde{{\varvec{W}}}^{(c)}} \\ \widetilde{{\varvec{W}}}_u^{(c+1)} &{} = {\varvec{B}}^{(c)} \odot \exp {\Big (\lambda _c {\varvec{S}}^c \widetilde{{\varvec{W}}}^{(c)}}\Big ) \odot {\varvec{\varLambda }}^{\widetilde{\varvec{Z}}^{(c+1)}}. \end{array}\right. } \end{aligned}$$

This can be compared to the graph convolutional neural networks (GCN) of Kipf and Welling (2016a), in a supervised context, where the \((c+1)\)th hidden layer \({\varvec{H}}^{(c+1)}\) is given by \({\varvec{H}}^{(c+1)} = \text {ReLU}(\tilde{{\varvec{S}}} {\varvec{H}}^{(c)} {\varvec{\varOmega }}^{(c)})\), where \(\tilde{{\varvec{S}}} = {\varvec{D}}^{-\frac{1}{2}} {\varvec{S}}_I {\varvec{D}}^{-\frac{1}{2}}\) with \({\varvec{S}}_I = {\varvec{S}} + {\varvec{I}}\), \({\varvec{D}}\) is the diagonal degree matrix of \({\varvec{S}}_I\), \({\varvec{S}}\) is the adjacency matrix of the attributed graph, \({\varvec{H}}^{(0)} = {\varvec{X}}\) contains the attributes of the graph, and \({\varvec{\varOmega }}^{(c)}\) is the weight matrix of layer c. The node features are propagated through the nodes' neighbors and, at layer c, each node i has a latent representation \({\varvec{h}}_i^{(c)}\) which aggregates the features of the nodes c steps away in the adjacency matrix \({\varvec{S}}\). In our model, we do not propagate the node features through the node's neighbors (in the observed graph); instead, we propagate the posterior probabilities through the node's neighbors (in the HMRF). At iteration c of EM, each node has aggregated the posterior probabilities of nodes c steps away in the HMRF. In our model, the node features are modeled in the generative part of the E-step, \({\varvec{A}}\odot {\varvec{\varLambda }}^{\widetilde{{\varvec{W}}}^{(c)}}\), and the nodes aggregate their neighbors' posterior probabilities with the kernel \({\varvec{S}}^r\) (or \({\varvec{S}}^c\)), which is not learned, contrary to GCNs (which require a set of labeled data).
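To make the analogy concrete, here is a hedged NumPy sketch (ours, not from the paper) that contrasts the matrix form of the unnormalized row VE-step with a single GCN layer; A_mix stands for the matrix \({\varvec{A}}^{(c)}\) of repeated mixture proportions, Lambda_W for \({\varvec{\varLambda }}^{\widetilde{{\varvec{W}}}^{(c)}}\), and Omega for a learned GCN weight matrix.

```python
import numpy as np

def hlbm_row_ve_step(A_mix, S_r, Z_tilde, Lambda_W, lam_r):
    """Matrix form of the row VE-step (Sect. 3.4.1): posterior probabilities
    are propagated through the HMRF kernel S^r, then combined with the
    generative term Lambda_W and normalised row-wise.  (In practice the
    computation is done in the log domain for numerical stability.)"""
    Z_u = A_mix * np.exp(lam_r * (S_r @ Z_tilde)) * Lambda_W
    return Z_u / Z_u.sum(axis=1, keepdims=True)

def gcn_layer(S_norm, H, Omega):
    """One GCN layer (Kipf & Welling): node features H are propagated through
    the normalised adjacency S_norm and mixed by a learned weight matrix."""
    return np.maximum(S_norm @ H @ Omega, 0.0)   # ReLU
```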

3.4.2 Laplacian regularization

A straightforward extension of Zhu and Lafferty (2005), He et al. (2011) and Salah and Nadif (2017) to a semi-supervised context for the LBM is, given weighted adjacency matrices (\({\varvec{S}}^r\) for the rows and \({\varvec{S}}^c\) for the columns) that represent the ML relationships between the data points, to consider that two rows in an ML relationship have to lie close in the latent space. To this end, one can optimize a penalized log-likelihood \(\log p({\varvec{X}}, {\varvec{Z}}, {\varvec{W}}; {\varvec{\varTheta }}) - \lambda _r{\mathcal {R}}_r - \lambda _c{\mathcal {R}}_c\), where the penalty for rows is \({\mathcal {R}}_r = \frac{1}{2} \sum _{ii'}\sum _k s_{ii'}^r(\tilde{z}_{ik} - \tilde{z}_{i'k})^2 = {{\,\mathrm{Tr}\,}}(\widetilde{\varvec{Z}}^\top {\varvec{L}}^r \widetilde{\varvec{Z}})\), where \({\varvec{L}}^r\) is the Laplacian matrix associated with the adjacency matrix \({\varvec{S}}^r\). As mentioned in (Zhu & Lafferty, 2005), \({\mathcal {R}}_r\) may seem to act as a prior on the latent variables of the model, of the form \(\log p({\varvec{Z}}; {\varvec{\varTheta }}) \propto \sum _{ik} z_{ik}\log \alpha _k - \lambda _r{{\,\mathrm{Tr}\,}}(\widetilde{\varvec{Z}}^\top {\varvec{L}}^r \widetilde{\varvec{Z}})\), but it actually depends on the posterior probabilities and is thus best thought of as a discriminative component in the objective function. In our model, HLBM, the semi-supervision is expressed in a generative way with the HMRFs, but EM algorithms for the two models can be compared. In the CEM approach for the Laplacian regularization, we optimize the regularized complete data log-likelihood over the latent classification matrices and we can then consider \(\tilde{z}_{ik} = z_{ik} \in \{0, 1\}\). Thus, \({\mathcal {R}}_r = \frac{1}{2} \sum _{ii'}s_{ii'}^r\mathbbm {1}(z_{i'} \ne z_i)\). It can then be shown that the corresponding algorithm is equivalent to the algorithm of HLBM for CEM. In the VEM approach, due to the non-linearities in the latent variables introduced by the Laplacian regularization term, there is no closed form for the variational E-step. The solution proposed in (He et al., 2011; Salah & Nadif, 2017) is to maximize the variational objective and minimize the regularization term sequentially. This strategy, unfortunately, did not yield convincing results for our model.

4 Co-clustering of count data with the Poisson HLBM

With the appropriate mixture distribution, the proposed model can be applied to different types of data, as in classical mixture models: Gaussian distributions can be chosen to model microarray data, or multinomial distributions for categorical data. In the following, we develop the proposed model for count data with a mixture of Poisson distributions, as in (Govaert & Nadif, 2018). This model has the advantage of being suited to high-dimensional text data (Ailem et al., 2017).

4.1 Algorithm for the Poisson HLBM

The data distribution, conditionally on the clusters, is \(x_{ij} |z_{ik} w_{j\ell } = 1\; \sim \; \phi (.; \mu _i \nu _j \gamma _{k\ell }),\) where \(\phi\) is the probability mass function of a Poisson distribution. The model is parameterized by \({\varvec{\varTheta }}= \{{\varvec{\alpha }}, \, {\varvec{\beta }}, \, {\varvec{\mu }}, \, {\varvec{\nu }}, \, {\varvec{\gamma }}, \, {\varvec{\varXi }}^r, \, {\varvec{\varXi }}^c\}\). For identifiability, following (Govaert & Nadif, 2018), we impose the following constraint: for \(M > 0\),

$$\begin{aligned} {\varvec{\varTheta }}\in \{{\varvec{\varTheta }}| \mu _.=\nu _.= M, \forall k \; \sum _\ell \beta _\ell \gamma _{k\ell } = M^{-1}, \forall \ell \; \sum _k \alpha _k \gamma _{k\ell } = M^{-1} \}. \end{aligned}$$

It can then be shown that \(\mathbb {E}{(x_{i.})} = \mu _i\) and \(\mathbb {E}{(x_{.j})} = \nu _j\), and the marginals \(\mu _i\) and \(\nu _j\) can then be replaced by \(x_{i.}=\sum _j x_{ij}\) and \(x_{.j}=\sum _i x_{ij}\). Thereby, we define respectively the row, column and block reduced matrices \({\varvec{X}}^{\widetilde{\varvec{Z}}} = (x_{kj}^{\widetilde{\varvec{Z}}})\), \({\varvec{X}}^{\widetilde{{\varvec{W}}}} = (x_{i\ell }^{\widetilde{{\varvec{W}}}})\) and \({\varvec{X}}^{\widetilde{\varvec{Z}}\widetilde{{\varvec{W}}}} = (x_{k\ell }^{\widetilde{\varvec{Z}}\widetilde{{\varvec{W}}}})\) such that \({\varvec{X}}^{\widetilde{\varvec{Z}}} = \widetilde{\varvec{Z}}^\top {\varvec{X}}\), \({\varvec{X}}^{\widetilde{{\varvec{W}}}} = {\varvec{X}}\widetilde{{\varvec{W}}}\) and \({\varvec{X}}^{\widetilde{\varvec{Z}}\widetilde{{\varvec{W}}}} = \widetilde{\varvec{Z}}^\top {\varvec{X}}\widetilde{{\varvec{W}}}\). The matrices \(\widetilde{\varvec{Z}}\) and \(\widetilde{{\varvec{W}}}\) contain the variational probabilities, as defined in Sect. 3.2. Equivalent reduced matrices \({\varvec{X}}^{{\varvec{Z}}}\), \({\varvec{X}}^{{\varvec{W}}}\) and \({\varvec{X}}^{{\varvec{Z}}{\varvec{W}}}\) can be defined in a CEM setup. We can show that the VE and CE-steps are respectively:

$$\begin{aligned} {\left\{ \begin{array}{ll} \tilde{z}_{ik} \propto \alpha _k \exp {\big (\lambda _r \sum _{i'} s_{ii'}^r\tilde{z}_{i'k} + \sum _\ell x_{i\ell }^{\widetilde{{\varvec{W}}}} \log \gamma _{k\ell }\big )} &{}\text { VE-step}\\ z_i = {{\,\mathrm{arg\!\max }\,}}_{k}\big (\log \alpha _k + \lambda _r \sum _{i'} s_{ii'}^r z_{i'k} + \sum _\ell x_{i\ell }^{{\varvec{W}}} \log \gamma _{k\ell } \big ) &{}\text { CE-step}. \end{array}\right. } \end{aligned}$$

Since the constraints matrices are sparse, these updates are efficient. The M-step in a VEM setting leads to: \(\alpha _k = \frac{\sum _i\tilde{z}_{ik}}{n}=\frac{\tilde{z}_{.k}}{n} ,\; \beta _\ell =\frac{\sum _j\tilde{w}_{j\ell }}{d}= \frac{\tilde{w}_{.\ell }}{d} ,\; \gamma _{k\ell } = \frac{x_{k\ell }^{\widetilde{\varvec{Z}}\widetilde{{\varvec{W}}}}}{x_{k.}^{\widetilde{\varvec{Z}}}x_{.\ell }^{\widetilde{{\varvec{W}}}}}.\) The M-step for CEM is similar, with hard assignments. Following (Govaert & Nadif, 2008), we propose the following algorithm for the Poisson HLBM VEM (PHLBMVEM) (see Algorithm 1). Note that the row and column M-step for \({\varvec{\gamma }}\) can benefit from the reduced matrix \({\varvec{X}}^{{\varvec{Z}}}\) or \({\varvec{X}}^{{\varvec{W}}}\) computed for its corresponding E-step. The algorithm for CEM is similar, but considers hard assignment matrices, does not apply damping and switches to a sequential E-step after a given number of iterations.

Algorithm 1 Poisson HLBM VEM (PHLBMVEM)
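Since Algorithm 1 is only rendered as an image here, the following hedged NumPy sketch (ours) shows one row VE-step followed by the M-step of the Poisson HLBM, using the reduced matrices of Sect. 4.1; damping, the symmetric column step, convergence checks and sparse data structures are omitted for readability.

```python
import numpy as np

def poisson_vem_iteration(X, Zt, Wt, alpha, beta, gamma, S_r, lam_r, eps=1e-12):
    """One row VE-step and M-step of the Poisson HLBM (Sect. 4.1), dense version.

    X     : (n, d) count matrix
    Zt/Wt : (n, g) and (d, m) variational probabilities
    gamma : (g, m) block effects, alpha/beta : mixture proportions
    S_r   : (n, n) signed constraint weights, lam_r : regularisation scale
    """
    X_W = X @ Wt                                        # reduced matrix X^{W~}, shape (n, m)

    # --- Row VE-step (parallel form), computed in the log domain ---
    log_Zu = np.log(alpha + eps) + lam_r * (S_r @ Zt) + X_W @ np.log(gamma + eps).T
    log_Zu -= log_Zu.max(axis=1, keepdims=True)         # avoid overflow before exp
    Zt = np.exp(log_Zu)
    Zt /= Zt.sum(axis=1, keepdims=True)

    # --- M-step ---
    alpha = Zt.mean(axis=0)
    beta = Wt.mean(axis=0)
    X_ZW = Zt.T @ X @ Wt                                # reduced matrix X^{Z~W~}, shape (g, m)
    x_k = Zt.T @ X.sum(axis=1)                          # x_{k.}^{Z~}
    x_l = Wt.T @ X.sum(axis=0)                          # x_{.l}^{W~}
    gamma = X_ZW / (np.outer(x_k, x_l) + eps)
    return Zt, Wt, alpha, beta, gamma
```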

4.1.1 Initialization

We can use the constraints matrices \({\varvec{S}}^r\) and \({\varvec{S}}^c\) to provide a better initialization of the row and column partitions. We build a stochastic matrix \({\varvec{M}}^{r} = {\varvec{S}}_I^{r+}{\varvec{\varDelta }}_{r+}\), where \({\varvec{S}}_I^{r+}= {\varvec{S}}^{r+} + {\varvec{I}}\), \({\varvec{S}}^{r+}\) contains only the non-negative values of \({\varvec{S}}^{r}\), and \({\varvec{\varDelta }}_{r+}\) is the inverse of the diagonal degree matrix of \({\varvec{S}}_I^{r+}\). This corresponds to averaging the features of neighboring nodes. We apply a clustering algorithm, namely skmeans for a Poisson model (see Sect. 4), on the rows of \({\varvec{M}}^{r}{\varvec{X}}\) to get an initial partition for \({\varvec{Z}}\). We apply the same procedure for the columns, using \({\varvec{S}}^c\), and apply a clustering algorithm on the columns of \({\varvec{X}}{\varvec{M}}^{c}\). The obtained matrix has values \(({\varvec{M}}^{r}{\varvec{X}})_{ij} = x_{ij} + \frac{1}{s^{r+}_{i.}}\sum _{i'} s^{r+}_{ii'} x_{i'j}\), where the dot indicates the sum over a given index (\(s_{i.} = \sum _{i'} s_{ii'}\)). This method also has the advantage of reducing the sparsity of the data matrix used for initialization and thus provides a better initial partition.
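A hedged sketch of this initialization for the rows is given below; sklearn's KMeans is used only as a stand-in for skmeans, and the row-stochastic normalization of \({\varvec{S}}_I^{r+}\) is one plausible reading of the averaging step (the exact normalization in the text differs slightly).

```python
import numpy as np
from sklearn.cluster import KMeans

def init_row_partition(X, S_r, g, seed=0):
    """Constrained initialization of the row partition (Sect. 4.1.1):
    average each row's features with those of its ML neighbours, then
    cluster the smoothed rows.  KMeans stands in for skmeans here."""
    S_pos = np.maximum(S_r, 0.0)                 # keep only the ML (non-negative) weights
    S_I = S_pos + np.eye(S_r.shape[0])           # add self-loops
    M_r = S_I / S_I.sum(axis=1, keepdims=True)   # row-stochastic averaging matrix
    X_smooth = M_r @ X                           # neighbour-averaged features
    return KMeans(n_clusters=g, n_init=10, random_state=seed).fit_predict(X_smooth)
```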

4.2 Algorithmic complexity

The proposed algorithms can benefit from the sparse structure of \({\varvec{X}}\), \({\varvec{S}}^r\) and \({\varvec{S}}^c\). The computations in the E-steps and M-steps are based on reduced matrices \({\varvec{X}}^{{\varvec{Z}}} = {\varvec{Z}}^\top {\varvec{X}}\), \({\varvec{X}}^{{\varvec{W}}} = {\varvec{X}}{\varvec{W}}\) and \({\varvec{X}}^{{\varvec{Z}}{\varvec{W}}} = {\varvec{Z}}^\top {\varvec{X}}{\varvec{W}}\) of respective sizes \(g \times d\), \(n \times m\) and \(g \times m\). Let \(n_\text {it}\) denote the number of iterations of the EM algorithm, \(n_{\text {NZX}}\) and \(n_{\text {NZS}_r}\) denote respectively the number of non-zero values in \({\varvec{X}}\) and \({\varvec{S}}^r\) and \({\mathcal {N}}_r^{\text {max}} = \max _i |{\mathcal {N}}_r(i)|\).

For an iteration of the VEM algorithm, the computational bottleneck is the row and column cluster assignments, which cost \(O(gn({\mathcal {N}}_r^{\text {max}} + m))\) for the rows and \(O(md({\mathcal {N}}_c^{\text {max}} + g))\) for the columns, and the computation of the reduced matrices, which costs \(O(n_{\text {NZX}}(g + m))\). The time complexity of the Poisson VEM algorithm is thus \(O\big (n_\text {it}\big (n_{\text {NZX}}(g + m) + gn({\mathcal {N}}_r^{\text {max}} + m) + md({\mathcal {N}}_c^{\text {max}} + g)\big )\big )\). The Poisson CEM complexity is similar to that of VEM, but the algorithm benefits from a faster convergence and from sparse structures for the classification matrices that speed up the computations.

The space complexity of the Poisson VEM algorithm is related to the data matrix, the constraint matrices, the reduced matrices and the classification matrices. It is thus \(O(n_{\text {NZX}} + n_{\text {NZS}_r} + n_{\text {NZS}_c} + nm + dg + ng + dm)\). Sparse structures for CEM can reduce it to \(O(n_{\text {NZX}} + n_{\text {NZS}} + nm + dg + n + d)\).

5 Experiments on simulated data

5.1 Sampling and experiment plan

Here, we evaluate our algorithms in terms of co-clustering performance on simulated data. In a semi-supervised setting, the difficulty of the co-clustering problem will depend on both the data and the given constraints. In the following, we describe the procedures to sample different co-clustering problems from the model, to generate different constraints matrices from the true clusters and to evaluate the obtained partitions against the true clusters.

5.1.1 Sampling the data

We can use the generative part of the model to sample simulated data. To this end, we do not include the HMRF when sampling the data and, given a vector of parameters \({\varvec{\varTheta }}\), we can sample the complete data \(({\varvec{X}}, {\varvec{Z}}, {\varvec{W}})\). The margins \(\mu _i\) and \(\nu _j\) are sampled from \(\{1,\, \dots , \, 100\}\) with a power law \(p(k) \propto k^{-\frac{3}{2}}\), resulting in skewed margins. The experiments are carried out with \(n = 100\) rows, \(d = 200\) columns, \(g = 3\) row clusters, \(m = 4\) column clusters, mixture proportions \({\varvec{\alpha }}\) and \({\varvec{\beta }}\) drawn from a symmetric Dirichlet distribution of parameter \(\delta = 4\), and \({\varvec{\gamma }}= \gamma ^{0} \left( \begin{array}{cccc} 1 & 2 & 3 & 1 \\ 3 & 1 & 2 & 3 \\ 2 & 3 & 1 & 3 \end{array} \right)\), where \(\gamma ^{0}>0\) controls the class overlap. In order to obtain 3 sets of parameters with increasing overlap \(\{{\varvec{\varTheta }}_{+}, {\varvec{\varTheta }}_{++}, {\varvec{\varTheta }}_{+++}\}\), we measure the linear separability of the clusters with Linear Discriminant Analysis, computed as the ratio between the inter-cluster variance and the total variance of the data projected onto each of the factorial axes. Each of the ratios is in \([0,\, 1]\): a ratio of 1 means that the intra-cluster variance on the factorial axis is null (i.e. the clusters are linearly separable), and a ratio of 0 means that the centers of gravity of the clusters are projected onto the same point on the factorial axis. Using this criterion, we define \(\gamma ^{0}_{+} = 2\times 10^{-2}\), \(\gamma ^{0}_{++} = 2\times 10^{-3}\) and \(\gamma ^{0}_{+++} = 1\times 10^{-3}\).
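A hedged sketch of this sampling scheme is given below (power-law margins, Dirichlet mixture proportions, Poisson blocks); the numerical choices follow the text, but the helper name and code organisation are ours.

```python
import numpy as np

def sample_lbm(n=100, d=200, g=3, m=4, gamma0=1e-3, delta=4.0, seed=0):
    """Sample (X, z, w) from the Poisson LBM of Sect. 5.1.1 (no HMRF).
    The gamma matrix below follows the text and assumes g = 3, m = 4."""
    rng = np.random.default_rng(seed)
    # Skewed margins: power law p(k) ~ k^{-3/2} on {1, ..., 100}
    support = np.arange(1, 101)
    p = support ** -1.5
    p /= p.sum()
    mu = rng.choice(support, size=n, p=p)
    nu = rng.choice(support, size=d, p=p)
    # Mixture proportions from a symmetric Dirichlet(delta)
    alpha = rng.dirichlet(delta * np.ones(g))
    beta = rng.dirichlet(delta * np.ones(m))
    z = rng.choice(g, size=n, p=alpha)
    w = rng.choice(m, size=d, p=beta)
    # Block effects scaled by gamma0 (controls the class overlap)
    gamma = gamma0 * np.array([[1, 2, 3, 1],
                               [3, 1, 2, 3],
                               [2, 3, 1, 3]], dtype=float)
    lam = np.outer(mu, nu) * gamma[z][:, w]      # Poisson means mu_i nu_j gamma_{z_i w_j}
    X = rng.poisson(lam)
    return X, z, w
```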

5.1.2 Sampling the constraints matrix

For each set of complete data, we can build row and column binary constraints matrices from some of the true classes, by setting \(s_{ii'}^r = 1\) if \(z_{i'} = z_i\) or \(s_{ii'}^r = -1\) if \(z_{i'} \ne z_i\). To this end, we sample a fraction \(f_S\) of all the \(\left({\begin{array}{c}n\\ 2\end{array}}\right)\) or \(\left({\begin{array}{c}d\\ 2\end{array}}\right)\) pairwise ML and CL constraints that can be formulated from the true clusters. It must be noted that the models will be influenced differently depending on which relationships are sampled and that CL relationships are sampled more often than ML relationships (with respective probabilities \(1 - \sum _k \alpha _k^2\) and \(\sum _k \alpha _k^2\) for a relationship on the set of rows). Thus, for each set of complete data, we sample 50 row and column constraints matrices. We use an identical regularization parameter for rows and columns, \(\lambda = \lambda _r = \lambda _c\).

In order to evaluate the sensitivity of the models w.r.t. noise in the constraints matrix, we sample a fraction \(f_{\text {noise}}\) of the pairs \((i, i')\) and set \(s_{ii'}^{\text {noise}} = - s_{ii'}\). For the other \((i, i')\) pairs, we set \(s_{ii'}^{\text {noise}} = s_{ii'}\). Finally, given a set of ML and CL relationships, we can choose whether to apply transitive closure on these relationships as described in (Basu et al., 2004). This allows us to test whether the algorithm applies the transitive closure implicitly or whether these supplementary relationships convey new information.
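The constraint sampling and the noise injection can be sketched as follows (a hedged illustration; f_S and f_noise follow the notation of the text, the helper name is ours).

```python
import numpy as np
from itertools import combinations

def sample_constraints(z, f_S, f_noise=0.0, seed=0):
    """Sample a fraction f_S of all pairwise constraints from the true labels z
    (Sect. 5.1.2): +1 for ML pairs, -1 for CL pairs, then flip a fraction
    f_noise of the sampled constraints to simulate noisy supervision."""
    rng = np.random.default_rng(seed)
    n = len(z)
    pairs = list(combinations(range(n), 2))
    chosen = rng.choice(len(pairs), size=int(f_S * len(pairs)), replace=False)
    S = np.zeros((n, n))
    for idx in chosen:
        i, ip = pairs[idx]
        sign = 1.0 if z[i] == z[ip] else -1.0    # ML if same cluster, CL otherwise
        if rng.random() < f_noise:               # noisy constraint: flip its sign
            sign = -sign
        S[i, ip] = S[ip, i] = sign
    return S
```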

5.1.3 Measuring the information in the constraints matrices

For a given algorithm, the ML and CL relationships in a constraints matrix can convey more or less information on a clustering problem. They can be of limited use for an algorithm which naturally recovers these constraints without supervision and they can be noisy and convey wrong information about the true clusters—and thus be contradictory with the data—so that some constraints are not satisfied after convergence of the algorithm with supervision.

For a partition \({\mathcal {P}}\) of the rows or columns of the data matrix, where \({\mathcal {P}}_i\) is the cluster of node i, and a constraints matrix \({\varvec{S}}\), we define the ratio (4), where \(\text {unsat}({\mathcal {P}}_i, {\mathcal {P}}_{i'}, s_{ii'})\) equals 1 if the constraint \(s_{ii'}\) is not satisfied and 0 if the constraint is satisfied or if \(s_{ii'} = 0\).

$$\begin{aligned} {\mathcal {R}}({\varvec{S}}, {\mathcal {P}})&= \frac{\sum _{ii'} |s_{ii'}| \text {unsat}({\mathcal {P}}_i, {\mathcal {P}}_{i'}, s_{ii'})}{\sum _{ii'} |s_{ii'}|} \end{aligned}$$
(4)

Depending on the nature of \({\mathcal {P}}\), this criterion can have different meanings. If \({\mathcal {P}}\) is a partition returned by the algorithm without regularization, \({\mathcal {R}}\) is the weighted proportion of constraints that are not already in the data. This criterion corresponds to a weighted version of the informativeness criterion of Davidson et al. (2006). If \({\mathcal {P}}\) is the true partition, \({\mathcal {R}}\) is a measure of noise in the constraints matrix and corresponds to a weighted version of the "spatial discordance" criterion of Miele et al. (2014). If \({\mathcal {P}}\) is the partition returned by the algorithm with regularization, \({\mathcal {R}}\) corresponds to the weighted proportion of constraints that have been violated by the algorithm after convergence. It can be noted that \({\mathcal {R}}\) is then proportional to the log prior for the rows or columns, in a setup without mixture proportions. This last criterion can also be used as a heuristic to suggest appropriate values for the regularization parameters \(\lambda _r\) and \(\lambda _c\).
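Criterion (4) is straightforward to compute; a hedged implementation (ours) is shown below, where labels is any hard partition and S a signed constraints matrix.

```python
import numpy as np

def discordance(S, labels):
    """Weighted fraction of unsatisfied constraints, Eq. (4).
    An ML constraint (s > 0) is unsatisfied when the two labels differ;
    a CL constraint (s < 0) is unsatisfied when the two labels agree."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    unsat = ((S > 0) & ~same) | ((S < 0) & same)
    weight = np.abs(S)
    return float((weight * unsat).sum() / weight.sum())
```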

5.1.4 Experiment plan

We sample data from 3 sets of parameters with increasing overlap \(\{{\varvec{\varTheta }}_{+}, {\varvec{\varTheta }}_{++}, {\varvec{\varTheta }}_{+++}\}\). For each \({\varvec{\varTheta }}\), we sample 50 sets of complete data \(\{{\varvec{X}}, {\varvec{Z}}, {\varvec{W}}\}\). For each set of complete data, we initialize the row and column partitions \({\varvec{Z}}^{(0)}\) and \({\varvec{W}}^{(0)}\) by applying skmeans (50 initializations) on the sets of rows and columns, and for each \(f_S \in \{0\%,\, 1\%,\, 2\%,\, 3\%,\, 4\%,\, 5\%\}\), we repeat the following procedure 20 times: sample the row and column constraints matrices, then fit the model for each \(\lambda \in \{0,\, 10^{-3},\, 10^{-2},\, 10^{-1},\, 1,\, 10^{1},\, 10^{2},\, 10^{3},\, 10^{4}\}\). We report the classification results in terms of the Co-clustering ARI (CARI) (Robert et al., 2021) between the true partitions and the partitions returned by our algorithm for each run. This criterion is an extension of the Adjusted Rand Index (Hubert & Arabie, 1985) to the context of co-clustering. The CARI index varies between 0 and 1, where 1 means a perfect match and 0 corresponds to the worst scenarios, including independent partitions; it is related to the number of misclassified cells in the data matrix. We identify a working point between two values of \(\lambda\) on the log scale and also report the CARI values on a linear scale between these two values. In our experiments, we use parallel E-steps with a damping coefficient \(\eta = 0.7\) in the case of VE-steps. We set the numbers of row and column clusters to their true values.

Fig. 2 Comparison of CARI values for the CEM and VEM algorithms and for different values of the regularization parameter \(\lambda\). The results are presented for different fractions of ML and CL constraints sampled. The data is sampled with \({\varvec{\varTheta }}_{+++}\). First row: \(\lambda\) in log-scale; second row: \(\lambda\) in linear scale around a working point. From left to right: increasing fraction \(f_S\) of sampled relationships

5.2 Evaluation of the CEM and VEM algorithms in terms of co-clustering

In Fig. 2, we compare, for a difficult clustering problem \({\varvec{\varTheta }}_{+++}\), the CARI values of the CEM and VEM algorithms in the absence of noise, as a function of \(f_S\) and \(\lambda\), the points with \(\lambda = 0\) being our baseline without regularization. First, we observe that the regularization is almost always beneficial to the clustering performance. Also, we note that VEM gives better CARI values than CEM. We observe that the performance is sensitive to the choice of \(\lambda\). In our experimental setup, the optimal working point of our algorithms is located between 1 and 10, independently of the class overlap. For the easier problems \({\varvec{\varTheta }}_{+}\) and \({\varvec{\varTheta }}_{++}\), we observed, as expected, that fewer constraints are needed: the fraction of constraints \(f_S\) required to reach CARI \(\sim 1\) increases with the complexity of the clustering problem.

An interesting result is that the performances are more sensitive to the choice of \(\lambda\) for lower constraint fractions. When there is more supervision, this phenomenon is less prominent, and \(\lambda\) can more freely be set anywhere above a threshold value. Also, surprisingly, the CARI difference between VEM and CEM does not seem to increase with the fraction of constraints sampled (i.e. the density of the edges in the HMRF). It should however be noted that the drawbacks of CEM in terms of classification performance are counterbalanced by its fast convergence and its use of sparse data structures.

Fig. 3 Median CARI values for different values of noise in \({\varvec{S}}^r\) and \({\varvec{S}}^c\). The results are presented for different fractions of ML and CL constraints sampled and for different values of the regularization parameter \(\lambda\). The data is sampled with \({\varvec{\varTheta }}_{+++}\). From left to right: increasing fraction \(f_S\) of sampled relationships

In Fig. 3, we observe that, when noise is added to the constraint matrices \({\varvec{S}}^{r}\) and \({\varvec{S}}^{c}\), the algorithm can still benefit from the regularization, but the choice of \(\lambda\) becomes more critical and too high values of \(\lambda\) are detrimental to the clustering performance of the algorithm. As the fraction of true constraints increases, the algorithm becomes less dependent on the value of \(\lambda\). In Fig. 4, we observe that applying the transitive closure of the ML and CL relationships does not seem to increase the clustering performance of the algorithm, as mentioned in (Lange et al., 2005). Moreover, it decreases the sparsity of the constraints matrices and thus slows down the computations. Also, it increases the weight of the discriminative component w.r.t. the generative component, so the value of \(\lambda\) must be changed accordingly (to a smaller value than without transitive closure). In practice, depending on which relationships are sampled, the constraints matrices after transitive closure will have different sparsity values and will consequently require different values of \(\lambda\), which is not convenient. Finally, the results presented in Fig. 4 suggest that the transitive closure of the relationships is applied implicitly by the algorithm.

Fig. 4 Median CARI values with and without applying the transitive closure of the ML and CL relationships. Error bars correspond to the 1st and 3rd quartiles. The results are presented for different fractions of ML and CL constraints sampled and for different values of the regularization parameter \(\lambda\). The data is sampled with \({\varvec{\varTheta }}_{+++}\). From left to right: increasing fraction \(f_S\) of sampled relationships

Thus, the regularization parameters \(\lambda _r\) and \(\lambda _c\), which correspond to a scaling factor of the weights of the ML and CL relationships, must be set, as a first approximation, according to the confidence we have in the given ML and CL relationships. The interval [0, 10] seems to be a suitable range for this parameter. It should however be noted that, even for ML and CL relationships sampled from the true clusters (i.e. without noise in \({\varvec{S}}^r\) and \({\varvec{S}}^c\)), the algorithms remain sensitive, to a minor extent, to the choice of the regularization parameter, but are rarely affected negatively by the regularization. Finally, this dependency on the regularization parameters is reduced when the number of ML and CL constraints increases.

6 Experiments on real world data: attributed network clustering

In the previous section, we evaluated our algorithms, in a semi-supervised setting, in terms of co-clustering performance on data sampled from the model. Here, we compare our algorithms to other algorithms from the literature in terms of one-sided clustering on real-world data. We focus on the task of Attributed Network Clustering, where the data is of the form \(({\varvec{A}}, {\varvec{X}})\), with \({\varvec{A}}\) a graph adjacency matrix and \({\varvec{X}}\) a data matrix containing a feature vector for each node in the network. We evaluate our algorithms on the task of clustering the nodes of these networks. Note that several studies have demonstrated the importance of co-clustering even when only object clusters are sought (one-sided clustering). Indeed, co-clustering is often more effective than one-way clustering, especially for sparse high-dimensional data.

6.1 Experimental setup

We evaluate the one-sided row clustering performance of the VEM and CEM algorithms on datasets commonly used in the field of Attributed Network Embedding (ANE). These datasets are four citation networks: Cora, Citeseer, Wiki and Pubmed. In each case, \({\varvec{A}}\) is a sparse graph adjacency matrix in which each node corresponds to a document and edges correspond to citations, and \({\varvec{X}}\) is a data matrix containing a bag-of-words feature vector for each node in the network. Although our model is not directly suited to this kind of task, we can see the adjacency matrix \({\varvec{A}}\) as an ML constraint matrix \({\varvec{S}}^r\) on the set of rows. This is less expressive than graph convolutions, where the model can learn complex aggregations of a node's neighbors' features, but it can still lead to satisfactory results.

Table 1 Datasets characteristics

The dataset characteristics are reported in Table 1, where \(n_\text {NZX} (\%)\) corresponds to the sparsity percentage of the data matrix \({\varvec{X}}\) and the balance coefficient is defined as the ratio of the number of documents in the smallest class to the number of documents in the largest class. The attribute matrices of the datasets are high-dimensional; co-clustering is thus an appropriate approach for these datasets.

Note that here, on binary data, we use the Poisson HLBM instead of a Bernoulli HLBM. It gives better results since the margins \({\varvec{\mu }}\) and \({\varvec{\nu }}\) implicitly perform a normalization of the data. This is relevant for bag-of-words data, since the number of words in a document is not informative for determining its cluster. Note also that the Poisson model is even used on tf-idf normalized data (Wiki and Pubmed), which happens to give satisfactory results.

6.2 Model selection

To assess the number of row and column clusters g and m, we rely on the asymptotic integrated classification likelihood (ICL) (Biernacki et al., 2000), as in (Brault et al., 2014). We here propose to use the ICL criterion of a model without HMRF. For a model \({\mathcal {M}}_{gm}\) with g row clusters and m column clusters, we compute ICL with (5), using \(\widetilde{\varvec{Z}}\) and \(\widetilde{{\varvec{W}}}\), the matrices of variational posterior probabilities obtained with VEM.

$$\begin{aligned} \text {ICL}(g, m)&= \log p({\varvec{X}}, {\varvec{Z}}, {\varvec{W}}|{\mathcal {M}}_{gm})\\&\approx \max _{{\varvec{\varTheta }}} \log p({\varvec{X}}, \widetilde{\varvec{Z}}, \widetilde{{\varvec{W}}}| {\varvec{\varTheta }}, {\mathcal {M}}_{gm}) - \frac{g-1}{2} \log n - \frac{m-1}{2} \log d - \frac{gm}{2} \log (nd). \end{aligned}$$
(5)
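To make (5) concrete, here is a small hedged helper (ours) that evaluates the criterion from quantities a fitted model already provides; complete_ll stands for the maximized complete-data log-likelihood term and is an assumed input.

```python
import numpy as np

def icl(complete_ll, n, d, g, m):
    """Asymptotic ICL of Eq. (5) for a model with g row and m column clusters;
    complete_ll is max_Theta log p(X, Z~, W~ | Theta, M_gm)."""
    penalty = ((g - 1) / 2) * np.log(n) + ((m - 1) / 2) * np.log(d) \
              + (g * m / 2) * np.log(n * d)
    return complete_ll - penalty
```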

We computed the ICL for each dataset with \(g \in \{g_{\text {True}} - 4, \dots , g_{\text {True}} + 4 \}\) and \(m \in \{4, \dots , 12\}\) to determine both g and m. The results are presented in Appendix 1. Since the clustering problems are difficult, and even if the algorithm compares well to the literature, some classes that the algorithm cannot distinguish are merged. This results in an underestimated number of row clusters.

In the following experiment (see Sect. 6.4), we set the number of row clusters to its true value, and determine an appropriate number of column clusters \(m_{\text {ICL}}\) using the ICL. We found respectively 6, 7, 4 and 5 column clusters for Cora, Citeseer, Wiki and Pubmed with the ICL criterion.

6.3 Setting the hyper-parameters

For all the datasets, we use our VEM algorithm with \(\eta = 0.7\). Based on our study on simulated data, we set \(\lambda _r = 3\) and use a symmetric adjacency matrix \({\varvec{A}}\). As can be seen in Fig. 5a, the algorithm benefits from a positive value of \(\lambda\) in terms of clustering performance, and the choice of \(\lambda _r\) does not appear to be too critical in the range [1, 4], except for Pubmed which seems to require more regularization. In fact, we observe that our algorithm does not recover any meaningful structure with \(\lambda _r = 0\) on Pubmed, probably due to a low class separability related to the relatively low dimension of the word vectors, but the addition of the adjacency matrix allows it to overcome this issue.

Fig. 5 a Row clustering performance in terms of NMI for different values of \(\lambda _r\) on the datasets. b Discordance criteria \({\mathcal {R}}({\varvec{A}}, {\mathcal {P}})\) with \({\mathcal {P}}\) denoting, for each dataset: the true partition (True), the partition obtained with \(\lambda _r = 0\) (LBM) and the partition obtained with \(\lambda _r = 3\) (HLBM)

6.4 Evaluation of the CEM and VEM algorithms in terms of clustering

We compare our algorithms (Poisson HLBM VEM: PHLBMVEM and Poisson HLBM CEM: PHLBMCEM) to the following deep learning algorithms: GAE (Kipf & Welling, 2016b), VGAE (Kipf & Welling, 2016a), MGAE (Wang et al., 2017), ARGA and ARVGA (Pan et al., 2018), AGC (Zhang et al., 2019) and DAEGC (Wang et al., 2019). All these algorithms are unsupervised, which enables a fair comparison. Further comparisons to weakly-supervised methods are presented in Appendix 1. As discussed in Sect. 2 and Appendix 1, the CITTC model of Song et al. (2010) is a special case of ours; consequently, we do not compare our algorithms to CITTC. We run our algorithms 20 times and report the results in terms of clustering accuracy (ACC) and Normalized Mutual Information (NMI) in Table 2; the results of the other methods are reported from their original papers.

Table 2 Attributed network clustering metrics (mean ± std, higher is better)

We observe that our algorithms perform well compared with most of these more complex algorithms. We also note that CEM performs comparably to VEM on this task. In comparison to the baseline deep-learning models, ours has the advantage of being interpretable and parsimonious, of relying on a simple input parameter (\(\lambda _r\)), and of being able to perform model selection.

We can measure the information conveyed by the adjacency matrix, considered as an ML constraints matrix. In Fig. 5b, we observe, using the true clusters as reference, that approximately 70% of the edges of the adjacency matrices actually encode a true ML relationship. We also find that a substantial fraction of the ML relationships of \({\varvec{A}}\) are not already inferred from the data by the algorithm with \(\lambda _r = 0\) (the LBM returns only one non-empty cluster on Pubmed, so all ML relationships are trivially satisfied). Finally, we note that most of the ML constraints are satisfied after fitting the model with \(\lambda _r = 3\). The results are more nuanced on Wiki, where \({\varvec{A}}\) encodes fewer true ML relationships and the partition learned with the HLBM only satisfies approximately 80% of the constraints of \({\varvec{A}}\). This is probably due to the high true number of row clusters \(g = 17\), which makes a random ML relationship less likely to be true.

7 Conclusion

We have introduced a general probabilistic framework for co-clustering that incorporates ML and CL relationships in the LBM based on HMRFs. We presented two efficient inference algorithms based on Variational and Classification EM that also benefit from the supervision in the initialization. We showed that connections can be established between our algorithms and GCNs as well as manifold regularization. We instantiated this framework on a model for count data and presented detailed VEM and CEM algorithms, for which we analyzed the time and space complexity. We studied the behavior of these algorithms on simulated data when varying the trade-off between the discriminative and the generative components of the model. Our algorithms also demonstrated good clustering performance in comparison with deep learning algorithms devoted to the task of attributed network clustering.

In future work, the model can be extended to represent more complex relationships in the latent space with a more general MRF. Stochastic variants of the EM algorithm based on simulated field EM (Celeux et al., 2003) could also be investigated. Finally, the definition of column constraints in the context of clustering with pairwise semi-supervision, only available in the row space, is an important problem-specific research track.