Abstract
Multiview clustering is a framework for grouping objects given multiple views, e.g., text and image views describing the same set of entities. This paper introduces co-regularization techniques for multiview clustering that explicitly minimize a weighted sum of divergences to impose coherence between per-view learned models. Specifically, we iteratively minimize a weighted sum of divergences between posterior memberships of clusterings, thus learning view-specific parameters that produce similar clusterings across views. We explore a flexible family of divergences, namely Rényi divergences, for co-regularization. An existing method of probabilistic multiview clustering is recovered as a special case of the proposed method. Extensive empirical evaluation suggests improved performance over a variety of existing multiview clustering techniques as well as related methods developed for information fusion with multiview data.
1 Introduction
Multiple views of entities are often readily available in modern datasets; for example, a webpage entity has text, images and hyperlinks, each of which can be considered a view of the webpage entity. A problem of practical interest is to harness the complementary information available in multiple views to improve over conventional learning algorithms. Multiview learning has been studied as a potential framework to achieve such improved performance. Multiview methods operate with the assumption that different views cluster or label entities similarly. Such similarities have been exploited via co-training (Blum and Mitchell 1998) and co-regularization (Sindhwani and Rosenberg 2008). Co-training learns one hypothesis for each view, which then bootstraps the other views to converge to a coherent model (Blum and Mitchell 1998). Co-regularization, on the other hand, explicitly minimizes disagreement between views during training. Multiview methods have substantial theoretical and practical advantages over learning a single hypothesis by concatenating views (Nigam and Ghani 2000). For instance, Dasgupta et al. (2001) show that for semi-supervised multiview learning with two views, the probability of disagreement between views is an upper bound on the probability of error of either view’s hypothesis.
Due to noisy measurements or unknown biases, different views may not cluster the entities similarly. To be robust to such misspecification in the model assumptions made by multiview clustering, we propose a method that maintains a separate posterior distribution for each view. In the proposed method, clustering coherence is imposed by encouraging the posterior distributions of view-specific cluster memberships to be ‘close’ to each other, where closeness is measured via suitable divergences. Specifically, a weighted sum of divergences between current posterior estimates of cluster memberships is minimized. This co-regularization technique is combined with Expectation-Maximization (EM) (Dempster et al. 1977) to maximize the log-likelihood. The training process thus alternates between an inference phase that estimates and updates view-wise posterior distributions to encourage coherence, followed by per-view parameter updates.
To specifically account for potential incoherence among views, we formulate the cost function as a weighted sum of Rényi divergences (Rényi 1960). Storkey et al. (2014) have observed that when aggregating opinions from biased experts or agents, the maximum entropy distribution is obtained via Rényi divergence aggregation (see Definition 1). An extreme case is when views do not agree on the cluster memberships, in which case linear aggregation provides the best aggregate posterior. For instance, Bickel and Scheffer use linear aggregation in Co-EM (Bickel and Scheffer 2005), inadvertently assuming that different views mostly do not agree with respect to the cluster membership. Instead of assuming a bias-free condition, we explore the utility of various aggregation strategies applied to the co-regularization framework. Hence, the proposed method can be applied with the Rényi divergence best suited for the level of discordance in view memberships. Co-EM (Bickel and Scheffer 2005) is a special case of our framework, as it can be recovered for a specific setting of the Rényi divergence parameter and a fixed parametrization of weights, as shown in Sect. 4.3. Extensive empirical evaluations are presented to demonstrate improved performance over existing multiview clustering methods as well as other methods of fusing information from multiple views.
Our main contributions are highlighted in the following:

We propose a novel co-regularized multiview clustering algorithm that minimizes weighted sums of Rényi divergences.

We show that an existing approach to probabilistic multiview clustering, namely Co-EM, can be recovered as a special case of our framework.

We present extensive empirical evaluation showing that the proposed class of methods significantly outperforms strong baselines. Moreover, the choice of Rényi divergence can affect clustering performance, while simultaneously capturing biases in view-specific posterior cluster memberships. The empirical evaluation also demonstrates that our methods handle mixed data, e.g., discrete and continuous features, very well.
The rest of the manuscript is organized as follows. A brief survey of existing approaches to multiview clustering is provided in Sect. 2. Background and notation are given in Sect. 3. The proposed methods, along with other modeling choices, are detailed in Sect. 4. Extensive empirical evaluation on several datasets is presented in Sect. 5, followed by a discussion and conclusion in Sect. 6.
2 Related work
Related work in multiview unsupervised learning goes back to neural network models, a few of which are noted here. Becker and Hinton (1992) and Schmidhuber and Prelinger (1993) maximize agreement between a given neural network module and a weighted output of its neighbors. De Sa and Ballard (1993) take advantage of the complementary information available in different views by using separate modules for the views, feeding into a common output. Bickel and Scheffer (2005) introduced probabilistic multiview clustering using co-training.
Relatively recent models, like those proposed by Chaudhuri et al. (2009) and de Sa (2005), construct lower-dimensional projections using multiple views of data. However, these methods are only applicable when at most two views are available. Kumar et al. (2010) and Tzortzis and Likas (2012) explore multiple kernel learning techniques where each view is represented as a kernel. Closely related to kernel techniques are multiview spectral clustering methods, described in detail below.
Zhou and Burges (2007) propose a multiview spectral clustering method as a generalization of the normalized cuts algorithm. In a similar vein, Kumar and Daumé III (2011) iteratively update the similarity matrix of a given view based on the clustering of another view to produce a coherent clustering. Kumar et al. (2011) minimize disagreement between views by constraining the similarity matrices of views to be close in the Frobenius norm. While spectral methods are effective, they do not estimate cluster centroids, making interpretation and out-of-sample cluster assignment more challenging to implement. Our empirical studies show that the proposed methods outperform spectral multiview clustering methods.
Further, connections between nonnegative matrix factorization and clustering have also been utilized when multiple views are observed. For example, Liu et al. (2013) have shown that modeling user-feature matrices via multiview clustering based on nonnegative matrix factorization (NMF) admits better empirical clustering performance than collective matrix factorization (Akata et al. 2011), a popular method for combining information from multiple sources. This further illustrates the advantage of multiview clustering over other related methods. Another approach for multiview clustering, using convex subspace representation learning, has been proposed by Guo (2013). These methods estimate a subspace where different views are clustered similarly. Many of these methods, however, provide little insight into how views interact within the data. Probabilistic techniques such as ours are particularly useful when such exploration is required. Our empirical evaluation suggests improved performance of our models over NMF-based multiview clustering. In addition, many models also deal with partially missing views (Eaton et al. 2014; Li et al. 2014) and demonstrate improved performance using multiview clustering. Lian et al. (2015) use a shared latent factor model to model heterogeneous multiview data and can also handle arbitrarily missing views, i.e., the case when a complete view may be missing for a sample. However, this model assumes a shared latent matrix across all views, as opposed to the proposed method, which maintains separate cluster membership variables for each view. Our proposed methods can easily extend to handle missing views by simply not co-regularizing over the missing view.
Note that multiview clustering is distinct from cluster ensemble methods (Ghosh and Acharya 2011; Strehl and Ghosh 2003), which learn hypotheses for each view independently and find a consensus among the per-view results post-training. The latter methods do not share information during training and are thus more suitable for knowledge reuse (Strehl and Ghosh 2003).
One of the more popular applications of multiview clustering is to jointly model images and annotations, each constituting a view. The objective is to utilize annotations and images to learn the underlying clustering of images. This problem has been modeled in varied ways using unsupervised as well as supervised methods. We compare our multiview clustering framework to other relevant methods in the context of this application to motivate the differences in model assumptions. Recently, much supervised work has explored the utility of rich representations of label words and/or annotations in a high-dimensional embedding space (Mikolov et al. 2013). A mapping is learned directly from the image view to the word embedding space (annotation view) so that relevant tags or labels are closer under some similarity metric (Frome et al. 2013; Akata et al. 2013) or ranked higher compared to the rest (Weston et al. 2011). Additionally, Akata et al. (2013) learn a mapping to a predefined attribute space to extend supervised image classification to unseen labels. Thus, in this case, the target labeling is the same as the text or label view. In contrast, multiview clustering models aim to find the best underlying grouping of the data jointly, thus differing in the underlying modeling assumptions. The multiview clustering methods presented in this paper are for completely unsupervised scenarios, and thus do not assume availability of labels for images. Further, the target clustering need not have a one-to-one mapping to the annotation views. Hence, in our empirical evaluation, we only compare our models to unsupervised multiview methods with similar modeling and data assumptions as ours.
3 Preliminaries
For nonnegative integers K, vectors in \({\mathbb {R}}^K\) are denoted by lowercase bold (e.g., \({\mathbf {x}}\) with components \(x_1, \ldots , x_K\)). The set \(\{1, 2, \ldots , K\}\) will be denoted [K]. The simplex \(\varDelta ^{K}\) is the set \(\{{\mathbf {x}}\in {\mathbb {R}}^K : x_k \ge 0 \ \forall k \in [K], \ \sum _{k \in [K]} x_k = 1\}\).
A categorical distribution is a discrete distribution over outcomes \(\omega \in [K]\) parameterized by \(\mathbf {\theta } \in \varDelta ^{K}\) so that \(Pr (\omega =k) = \theta _k\). It is a member of the exponential family of distributions. The natural parameters of a categorical distribution are \(\log {\mathbf {\theta }} = (\log \theta _k)_{k\in [K]}\), and the sufficient statistics are given by the vector of indicator functions for each outcome \(\omega \in [K]\), denoted by \({\mathbf {z}}(\omega ) \in \{0,1\}^K\) with \(z_k (\omega ) = 1\) if \(\omega = k\) and \(z_k (\omega ) = 0\) otherwise.
Given two categorical distributions \(p (\omega )\) and \(q (\omega )\) over the categorical random variable \(\omega \), the divergence of \(p (\omega )\) from \(q (\omega )\), denoted \(\mathcal {D} (p (\omega )\Vert q (\omega ))\), is a non-symmetric measure of the difference between the two probability distributions. The Kullback-Leibler or KL-divergence is a specific divergence, denoted by \({\mathrm {KL}\!\left( {p (\omega )}\Vert {q (\omega )}\right) }\), and is defined as follows.
The KL-divergence of \(p (\omega )\) from \(q (\omega )\) is given by \({\mathrm {KL}\!\left( {p (\omega )}\Vert {q (\omega )}\right) } = \sum _{\omega \in [K]} p (\omega ) \log \frac{p (\omega )}{q (\omega )}\).
This is also known as the relative entropy between \(p (\omega )\) and \(q (\omega )\). The relative entropy is nonnegative and jointly convex in both arguments. Further, \({\mathrm {KL}\!\left( {p (\omega )}\Vert {q (\omega )}\right) }=0\) iff \(p (\omega ) = q (\omega )\) for all \(\omega \).
The Rényi divergences (Rényi 1960) are a parametric family of divergences with many properties similar to those of the KL-divergence. Since our focus is on using these divergences to measure distances between distributions over cluster labels, we restrict attention to Rényi divergences for distributions over discrete random variables.
Definition 1
(van Erven and Harremoës 2012) Let p, q be two distributions for a random variable \(\omega \in [K]\). The Rényi divergence of order \(\gamma \in (0,1)\cup (1,\infty )\) of \(p (\omega )\) from \(q (\omega )\) is \(\mathcal {D}_{\gamma }\!\left( p (\omega )\Vert q (\omega )\right) = \frac{1}{\gamma - 1} \log \sum _{\omega \in [K]} p (\omega )^{\gamma } \, q (\omega )^{1-\gamma }\).
The definition may be extended to other orders, such as \(\gamma = 0, \ \gamma \rightarrow 1,\ \text {and} \ \gamma \rightarrow \infty \) (van Erven and Harremoës 2012). Rényi divergences are nonnegative \(\forall \gamma \in [0,\infty ]\). In addition, they are jointly convex in \( (p,\ q) \ \forall \gamma \in [0, 1]\) and convex in the second argument \(q \ \forall \gamma \in [0,\infty ]\). As discussed in the comprehensive survey of Rényi divergences by van Erven and Harremoës (2012), many other commonly used divergences are recovered as special cases for specific choices of \(\gamma \). For example, \(\gamma = \frac{1}{2}\) and \(\gamma = 2\) give Rényi divergences that are closely related to the Hellinger and \(\chi ^2\) divergences, respectively, and the KL-divergence is recovered as a limiting case when \(\gamma {\,\rightarrow \,}1\). For the rest of the manuscript, we will abuse notation slightly and use \(p (\omega )\) and \(p ({\mathbf {z}})\) interchangeably to denote the same categorical distribution over outcomes in [K].
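As a concrete numerical illustration of Definition 1 (for exposition only; `renyi_divergence` is an illustrative helper, not part of any released implementation), the following sketch computes the order-\(\gamma \) Rényi divergence between two categorical distributions and checks numerically that the KL-divergence is approached as \(\gamma \rightarrow 1\):

```python
import numpy as np

def renyi_divergence(p, q, gamma):
    """Rényi divergence of order gamma of p from q (discrete distributions).

    Near gamma = 1 we fall back to the KL-divergence, its limiting case.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    if abs(gamma - 1.0) < 1e-8:  # limiting case: KL-divergence
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
    return float(np.log(np.sum(p**gamma * q**(1.0 - gamma))) / (gamma - 1.0))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
kl = renyi_divergence(p, q, 1.0)
# the order gamma -> 1 divergence approaches the KL-divergence:
assert abs(kl - renyi_divergence(p, q, 1.0 - 1e-6)) < 1e-4
# nonnegativity (here for gamma = 1/2, related to the Hellinger distance):
assert renyi_divergence(p, q, 0.5) >= 0.0
```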
4 Co-regularized multiview clustering using Rényi divergence minimization
We propose a co-regularization technique for multiview clustering using Rényi divergences. The generative model of the data is assumed to be a mixture model in each view. Let N be the total number of samples and V be the total number of views. For \(n \in [N]\) and \(v \in [V]\), let \({\mathbf {x}}_n^v\) represent the feature vector observed at view v for sample n. If data sample n lies in cluster k in view v, the latent membership is indicated by the categorical random variable \({\mathbf {z}}_n^v \in \{0, 1\}^{K}\), where the kth element of the vector, denoted by \(z_{n,k}^v\), is 1 and the rest are 0. The vector \({\varvec{\pi }}_n \in \varDelta ^{K}\) parametrizes the prior distribution over the categorical variable \({\mathbf {z}}_n^v\) and is the same for each view v. Each data sample is generated independently of the others. Also, each view of a sample is generated independently conditioned on \({\mathbf {z}}_n^v\). Let the \(k^{th}\) cluster distribution at view v be parametrized by \({\varvec{\varPsi }}_k^v\). Let the set of all parameters for view v be denoted by \({\varvec{\varPsi }}^v\), i.e., \({\varvec{\varPsi }}^v = \{{\varvec{\varPsi }}_k^v\}\). The generative process can be represented by the plate model in Fig. 1 and is described as follows:

For each n:

For each view v:

Choose \({\mathbf {z}}_n^v \sim p ({\mathbf {z}}_n^v ; {\varvec{\pi }}_n)\), a categorical distribution parametrized by \({\varvec{\pi }}_n\).

Choose \({\mathbf {x}}_n^v \sim p ({\mathbf {x}}_n^v \mid z_{n,k}^v=1 , {\varvec{\varPsi }}_{k}^v)\), i.e., sample the feature vector from the \(k^{th}\) cluster.


If no coherence conditions are imposed, each view can be modeled independently by maximizing the complete log-likelihood \(\sum _{n \in [N]} \log {p ({\mathbf {x}}_n^v, {\mathbf {z}}_n^v \mid {\varvec{\varPsi }}^v)}\) using Expectation-Maximization (EM) (Dempster et al. 1977). Let \({\varvec{\varPsi }}_t^v\) be an estimate of the parameter \({\varvec{\varPsi }}^v\) at iteration t. The Expectation (E) step estimates the posterior probabilities \(p ({\mathbf {z}}_n^v \mid {\mathbf {x}}_n^v, {\varvec{\varPsi }}_t^v) \, \forall n \in [N], \, \forall v \, \in [V]\). A new estimate of the cluster memberships can then be obtained for each view that decreases the Rényi divergence between all view-specific posteriors to encourage coherence. This core idea motivates the co-regularization technique proposed in the following subsections.
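For concreteness, the unregularized per-view E-step can be sketched as follows for a Gaussian-mixture view (an illustrative modeling choice; the framework allows any mixture family, and the helper name and array layout are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, means, covs):
    """Unregularized E-step for one view: posterior responsibilities p(z | x).

    X: (N, d) features of this view; pi: (K,) mixing weights;
    means, covs: per-cluster Gaussian parameters of this view.
    """
    N, K = X.shape[0], len(pi)
    log_resp = np.empty((N, K))
    for k in range(K):
        log_resp[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, means[k], covs[k])
    log_resp -= log_resp.max(axis=1, keepdims=True)   # numerical stability
    resp = np.exp(log_resp)
    return resp / resp.sum(axis=1, keepdims=True)     # each row lies on the simplex

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
resp = e_step(X, np.array([0.5, 0.5]),
              [np.zeros(2), np.ones(2)], [np.eye(2), np.eye(2)])
assert np.allclose(resp.sum(axis=1), 1.0)
```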
4.1 Global co-regularization
The proposed method minimizes a weighted sum of divergences between the current posterior (cluster membership) estimates available at all views to estimate a new ‘global’ categorical distribution. We would like to trade off between the ‘global’ posterior (accounting for co-regularization) and the view-specific unregularized posteriors. A new posterior distribution is estimated for every view \(v \in [V]\) by minimizing the sum of divergences between the global categorical distribution and the view-specific posteriors \(p ({\mathbf {z}}_n^i \mid {\mathbf {x}}_n^i, {\varvec{\varPsi }}^i) \ \forall i \in [V]\).
At any iteration t, let \(g_{t} ({\mathbf {z}}_n)\) be the global categorical posterior that is to be estimated from the independent posteriors \(p({\mathbf {z}}_{n}^{i} \mid \mathbf{{x}}_{n}^{i}, {\varvec{\varPsi }}_{t}^{i})\). Let \({\mathbf {w}}\in \varDelta ^{V}\) denote a (known) weight vector of dimension V that determines the contribution of each view; the ith element of \({\mathbf {w}}\) is denoted by \(w_i\). Let the choice of divergence be known and given by its order \(\gamma \). The global distribution \(g_{t}^* ({\mathbf {z}}_n)\) can be obtained by minimizing (3).
To solve the weighted divergence minimization, a modified version of the variational algorithm proposed by Storkey et al. (2014) is used. Detailed derivation of the variational algorithm and specific updates are provided in Appendix 1.
For any given view v, it is desirable that its posterior be close to the global distribution \(g_t^* ({\mathbf {z}})\) as well as to its local estimate \(p ({\mathbf {z}}_n^v \mid {\mathbf {x}}_n^v, {\varvec{\varPsi }}_t^v)\). Thus, we introduce a new weight vector \([w_g, (1-w_g)] \in \varDelta ^2, \, 0 < w_g < 1\), that achieves this trade-off. Let the co-regularized posterior obtained for the current view v being updated be denoted by \(q_t ({\mathbf {z}}_n^v)\). Then, given \(\gamma \) and the weights \([w_g, (1-w_g)]\), \(q_t ({\mathbf {z}}_n^v)\) can be estimated using:
Given \(g^*_{t} ({\mathbf {z}}_n)\), (4) can be solved to minimize a weighted sum of only two divergences. Since all views admit separate aggregate posteriors, the M-step can be executed for each view independently of all other views using the conventional M-step for a mixture model. So that all views remain coherent with the latest posterior beliefs of the other views, an M-step for a view should follow every view’s posterior update. This prevents the algorithm from getting stuck in local minima by avoiding parameter estimates that may agree with old and potentially discordant beliefs of other views. For the specific updates in the M-step based on the chosen mixture model, please refer to Appendix 4.
The above procedure of inferring per-view responsibilities independently, followed by a coherence-enforcing step using (3) and (4) and a view-specific M-step, is computed for each view. The complete algorithm for the proposed global co-regularization, called GRECO (Global REnyi divergence based CO-regularization), is provided in Algorithm 1. The inference step for GRECO, specifically the inner loop update, is shown in Fig. 2a for a toy problem with three views and \(K = 3\). Note that (3) and (4) can be computed in parallel for data samples \(n \in [N]\). This is because our target co-regularized posterior is independent for each sample (and each view) and can be factored in product form over samples as well as views. A detailed proof of how this leads to an embarrassingly parallel co-regularization algorithm is provided in Appendix 2.
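As an illustration of the inner-loop inference, and anticipating Sect. 4.3, consider the \(\gamma \rightarrow 1\) limit, where both aggregation steps admit closed forms as weighted linear combinations; the general \(\gamma \) requires the variational updates of Appendix 1. The sketch below uses our own function name and array layout:

```python
import numpy as np

def greco_estep_gamma1(view_posteriors, w, v, w_g):
    """Coherence-enforcing E-step for view v in the gamma -> 1 limit.

    view_posteriors: (V, K) per-view posteriors for one sample;
    w: (V,) view weights on the simplex; w_g: global/local trade-off weight.
    Both minimizations reduce to weighted linear aggregation in this limit.
    """
    g_star = w @ view_posteriors                            # global aggregate
    return w_g * g_star + (1.0 - w_g) * view_posteriors[v]  # trade-off with local posterior

posteriors = np.array([[0.8, 0.1, 0.1],
                       [0.2, 0.6, 0.2],
                       [0.3, 0.3, 0.4]])
q = greco_estep_gamma1(posteriors, np.ones(3) / 3, v=0, w_g=0.5)
assert np.isclose(q.sum(), 1.0)   # the result stays on the simplex
```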
4.2 Local co-regularization
We now consider two limiting cases, \(w_g = 0\) and \(w_g = 1\), in (5). The first case (\(w_g = 0\)) is trivial, as it does not use co-regularization at all and is therefore equivalent to the ensemble method (Strehl and Ghosh 2003). The latter recovers a new method. We consider the nontrivial case \(w_g=1\) separately for several reasons. First, we are able to recover an existing multiview clustering algorithm, Co-EM, as a special case of this setting for a particular choice of Rényi divergence (\(\gamma \rightarrow 1\)). Thus Co-EM is also a special case of our most general setting, GRECO, for \(\gamma \rightarrow 1\) and \(w_g = 1\). Further, our empirical evaluation suggests better performance of the most general case of the proposed method (GRECO) in most cases, as opposed to this special case (\(w_g = 1\)), indicating that a nontrivial trade-off between the global posterior and the view-specific unregularized posterior in the E-step is advantageous. A useful analogy here is Gaussian mixture models versus the k-means algorithm, used for soft and hard clustering respectively. The latter is a limiting case of the former (as widths go to zero, using identical, isotropic covariances) but is considered a separate algorithm because of its special properties. The co-regularization framework in GRECO with \(w_g = 1\) does not involve an additional trade-off between the global posterior and the unregularized view-specific posterior. The minimizer of (5) in this case is exactly equal to \(g^* ({\mathbf {z}}_n)\). Thus, the view-specific co-regularized posterior \(q ({\mathbf {z}}_n^v)\) is equal to the global posterior \(g^* ({\mathbf {z}}_n)\). Note that in a given iteration, only view v is co-regularized, so that \(q ({\mathbf {z}}_n^v) = g^* ({\mathbf {z}}_n)\); no other view is updated in the inference and learning steps of that iteration. The procedure is subsequently repeated for all views.
We call this algorithm LYRIC (LocallY weighted Rényi dIvergence Coregularization). As in GRECO, the outer loop of LYRIC iterates over each view v, and the inner loop carries out a coherence-enforcing E-step for the given view followed by an M-step. The E-step comprises estimating independent view-specific posteriors, followed by a local co-regularization step that updates the current view’s posterior. It is important to highlight that LYRIC does not result in the same estimates as GRECO in every iteration. This is because the view-specific posteriors will be different in each iteration for GRECO and LYRIC, owing to the different stages of co-regularization. The details of local co-regularization (LYRIC) are explained in the following.
Let v be the current view to be updated at any iteration t, and let \(q_t ({\mathbf {z}}_n^v)\) denote the newly obtained posterior at view v for sample n. Local co-regularization solves the coherence equation given by (5).
Similar to GRECO, a per-view M-step can now be executed to update the per-view parameters according to the modified responsibilities. The procedure is repeated iteratively for all views \(v \in [V]\). The final algorithm, LYRIC, is illustrated in Algorithm 2. Figure 2b shows the inference step for a single view in LYRIC.
4.3 Special case I: \(\gamma \rightarrow 1\)
If \(\gamma \) is chosen such that \(\gamma \rightarrow 1\), the minimizer of the weighted sum of Rényi divergences admits a closed-form solution. Specifically, \(\gamma \rightarrow 1\) reduces the cost to a weighted sum of KL-divergences with the target distribution on the right-hand side of each KL-divergence (Storkey et al. 2014). Consider (5), for instance, with \(\gamma \rightarrow 1\). Let the per-view posterior \(p ({\mathbf {z}}^i \mid {\mathbf {x}}^i, {\varvec{\varPsi }}^i)\) be parametrized by \({\varvec{\theta }}^i \in \varDelta ^{K}\), and let the target distribution \(q ({\mathbf {z}}_n^v)\) be parametrized by \({\varvec{\phi }}^v \in \varDelta ^{K}\). The cost function given by (5) can then be simplified to (6).
For categorical distributions, the closed-form solution of (6) is given by (7), as derived by Garg et al. (2004). Refer to Appendix 3 for a proof.
Note that the linear aggregation closed form is not specific to LYRIC and generalizes to GRECO for the choice \(\gamma \rightarrow 1\). In GRECO, (3) reduces to linear aggregation over the per-view posteriors, and (4) reduces to the corresponding weighted combination of the global posterior and the current view’s posterior.
Specifically, if \(w_v = (1 - \alpha )\) for the view v currently being updated and \(w_i = \frac{\alpha }{V-1}\) for \(i \ne v, i \in [V]\), where \(0 \le \alpha \le 1\), the LYRIC algorithm recovers Co-EM when \(\gamma \rightarrow 1\). Thus Co-EM is a special case of LYRIC.
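Concretely, with this weight parametrization the \(\gamma \rightarrow 1\) update for the current view is a simple weighted average of the per-view posteriors (a sketch; the function name and array layout are illustrative):

```python
import numpy as np

def lyric_update_gamma1(thetas, v, alpha):
    """LYRIC posterior update for view v in the gamma -> 1 limit.

    thetas: (V, K) per-view posterior parameters for one sample. With
    w_v = 1 - alpha for the current view and alpha / (V - 1) elsewhere,
    the closed form is the weighted average of the per-view posteriors,
    matching the Co-EM style update.
    """
    V = thetas.shape[0]
    w = np.full(V, alpha / (V - 1))
    w[v] = 1.0 - alpha
    return w @ thetas   # linear aggregation; remains on the simplex

thetas = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.6, 0.2],
                   [0.3, 0.3, 0.4]])
q = lyric_update_gamma1(thetas, v=0, alpha=0.5)
assert np.isclose(q.sum(), 1.0)
```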
4.4 Special case II: \(\gamma \rightarrow 0\)
When \(\gamma \rightarrow 0\), (5) has been shown by Storkey et al. (2014) to be equivalent to a minimization of a weighted sum of KL-divergences with the target distribution as the argument on the left-hand side of the KL terms. The closed-form solution in this case is an averaging of the parameters \({\varvec{\theta }}^i \ \forall i \in [V]\) in log-space, weighted by \(w_i \ \forall i \in [V]\) (Garg et al. 2004), as shown in (8). The proof is detailed in Appendix 3.
This result is also general and applicable to (3) and (4) with appropriate weighting. For these special cases, the variational updates can be avoided in favor of the simpler closed-form updates for GRECO and LYRIC. Note that (8) can be equivalently written as:
This further suggests that when \(\gamma \rightarrow 1\) the views contribute linearly owing to linear averaging, as opposed to \(\gamma \rightarrow 0\) (9), where extreme values of the posteriors may dominate the product. Conventionally, a product-of-experts model (Hinton 2002; Storkey et al. 2014) uses such a product to combine beliefs from independently trained models, for example in an ensemble setting.
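The contrast between the two limits can be seen in a small sketch (illustrative only): linear averaging for \(\gamma \rightarrow 1\) versus the normalized weighted geometric mean, i.e., a product-of-experts style combination, for \(\gamma \rightarrow 0\):

```python
import numpy as np

def aggregate(thetas, w, limit):
    """Closed-form aggregation of per-view categorical posteriors.

    limit='one'  -> linear averaging (the gamma -> 1 limit);
    limit='zero' -> normalized weighted geometric mean, a product-of-experts
    style combination (the gamma -> 0 limit), where near-zero posterior
    entries in any view can dominate the result.
    """
    if limit == 'one':
        return w @ thetas
    log_agg = w @ np.log(thetas)            # weighted averaging in log-space
    agg = np.exp(log_agg - log_agg.max())   # exponentiate stably
    return agg / agg.sum()                  # renormalize onto the simplex

thetas = np.array([[0.9, 0.05, 0.05],
                   [0.1, 0.80, 0.10]])
w = np.array([0.5, 0.5])
lin = aggregate(thetas, w, 'one')
geo = aggregate(thetas, w, 'zero')
assert np.isclose(lin.sum(), 1.0) and np.isclose(geo.sum(), 1.0)
```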
Note that co-regularization in each of GRECO and LYRIC adds an additional complexity of \(\mathcal {O} (NKV^2)\) per iteration, where N is the sample size, K is the number of clusters and V is the number of views, compared to the unregularized method. As suggested before, the operations can be trivially parallelized over data samples, as well as over the calculations required to estimate the unnormalized variational parameters for each cluster (see Appendices 1, 2). For the case where all views are Gaussian mixtures, the complexity per outer iteration is \(\mathcal {O} (NKV^2T_{inner} + NKV + \sum _{v \in [V]}d_v^2 K )\), where \(T_{inner}\) is the number of inner iterations for variational estimation of co-regularized posteriors and \(d_v\) is the dimension of view v. In each of the special cases described earlier, i.e., when \(\gamma \rightarrow 0\) and \(\gamma \rightarrow 1\), the complexity reduces to \(\mathcal {O} (NKV + \sum _{v \in [V]}d_v^2 K )\) per iteration, the same as that of Co-EM, due to the closed-form solutions available for co-regularization. In the general case, the largest source of computational overhead in the proposed algorithm is the variational procedure currently employed to impose co-regularization. However, we are not bound to such a procedure, and any accelerated methods for solving (3), (4) and (5) can be adopted. Further, our variational procedure is trivially parallelizable over samples (see Appendix 2 for the relevant proof), whereas the co-regularization/co-training techniques for the baselines (see Sect. 5.1) are not. This allows us to improve training efficiency and scale to large datasets.
4.5 Choice of weights and Rényi divergences
For empirical studies, we parametrize the weights for easy comparison with baselines. Let \(0 \le \alpha \le 1\) be a scalar. For every view \(v \in [V]\) being updated, \(w_v = 1 - \alpha \). For all other views, \(w_i = \frac{\alpha }{V-1} \, \forall i \, \in [V], \ i \ne v\). At every stage in the outer loop of either GRECO or LYRIC, the current view being updated is weighted by \(1-\alpha \) and the rest are weighted equally by \(\frac{\alpha }{V-1}\). This also ensures a fair comparison with Co-EM by maintaining the same parametrization of weights. The experiments demonstrate that a significant boost in clustering performance can be obtained via a suitable choice of Rényi divergence. We evaluated the performance of GRECO and LYRIC for different choices of \(\alpha \) and \(\gamma \). Section 5.3 shows the performance of the model across different choices of the divergence parameter. Specifically, for comparison with baselines, we choose the best performing set of \(\alpha \) and \(\gamma \) based on the average accuracy of holdout clustering assignment across five trials.
4.6 Prediction on holdout samples
For out-of-sample cluster assignment, the conventional E-step with the learned parameters is used to obtain per-view posteriors for a test sample for all views independently. It is then desirable to obtain one aggregate posterior \(q ({\mathbf {z}})\), as follows.
For LYRIC, a global posterior can then be obtained using (10) for a given choice of \(\gamma \) and \({\mathbf {w}}\) (see Sect. 4.5) and the set of corresponding learned parameters from LYRIC, \({{\varvec{\varPsi }}^v}^*\). Similarly for GRECO, the E-step is run for all views independently, followed by executing (10) to obtain a global posterior. A hard clustering is simply the MAP assignment of \({\mathbf {z}}\) w.r.t. the distribution \(q ({\mathbf {z}})\). Empirical performance of LYRIC at \(\gamma \rightarrow 1\) can differ from Co-EM due to the different methods of obtaining the consensus clustering. Specifically, let \({\varvec{\pi }}_n^c\) and \({{\varvec{\varPsi }}^v}^c, \forall v \in [V]\) be the estimates of the prior distribution parameter and the mixture model parameters learned by Co-EM, respectively. Then the consensus clustering distribution and the corresponding MAP assignment w.r.t. the consensus distribution in Co-EM are given by (11) for a data sample \({\mathbf {x}}\equiv \{{\mathbf {x}}^v ,\, \forall v \in [V]\}\),
Note that this method of obtaining a consensus clustering, used by Co-EM, is equivalent to the E-step of a multiview latent variable model that shares a single latent clustering variable across all views. As opposed to Co-EM, GRECO and LYRIC obtain a consensus via linear aggregation for \(\gamma \rightarrow 1\) and weights \({\mathbf {w}}\), as shown in (12).
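A minimal sketch of the holdout assignment for the \(\gamma \rightarrow 1\) consensus (linear aggregation followed by a MAP assignment; the function name and array layout are our own):

```python
import numpy as np

def holdout_assign(view_posteriors, w):
    """Out-of-sample hard assignment via the gamma -> 1 consensus:
    linearly aggregate the per-view posteriors from each view's learned
    E-step, then take the MAP cluster for every sample.

    view_posteriors: (V, N, K); w: (V,) weights on the simplex.
    """
    q = np.tensordot(w, view_posteriors, axes=1)  # (N, K) consensus posterior
    return q.argmax(axis=1)                       # MAP cluster per sample

vp = np.array([[[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
               [[0.6, 0.4], [0.1, 0.9], [0.3, 0.7]]])
labels = holdout_assign(vp, np.array([0.5, 0.5]))
assert labels.tolist() == [0, 1, 1]
```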
5 Experiments
The proposed methods have been extensively compared with existing multiview clustering models to show that the choice of divergence obtained by tuning \(\gamma \) is of significance, as well as to demonstrate that the Rényi divergence is a reasonable choice for co-regularization. All datasets were trained using both the LYRIC and GRECO algorithms for different values of \(\gamma \in [0, 1]\), discretized in the corresponding log-space. Very high orders of the Rényi divergence did not significantly affect the performance. The weights \({\mathbf {w}}\) are reparametrized as described in Sect. 4.5. For all datasets, ground-truth cluster labels are known and utilized for objective evaluation and comparison to baselines. All models and baselines were trained on the same training and holdout data for five trials, with the best performing models chosen based on average clustering accuracy for comparison purposes. The mapping between cluster labels and ground-truth labels is solved using Hungarian matching (Kuhn 1955). For comparison to baselines, we only report the best performance obtained across different choices of \({\mathbf {w}}\) and \(\gamma \). Holdout assignment results are only compared to baselines that explicitly mention a mechanism to obtain holdout cluster assignments and empirically test the same. We report Clustering Accuracy, Precision, Recall, F-measure, NMI (Strehl and Ghosh 2003) and Entropy (Bickel and Scheffer 2005) for our evaluation. Lower entropy is better, while higher values of the other metrics indicate a better performing algorithm. All metrics are defined in Appendix 5. Note that the empirical evaluation here maintains the prior cluster distribution \({\varvec{\pi }}_n\) to be equal for all samples n for all probabilistic models, including GRECO and LYRIC, without loss of generality.
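For reference, the label-matching step of the evaluation can be sketched with the Hungarian algorithm as implemented in SciPy (`clustering_accuracy` is an illustrative helper, not the exact evaluation code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred, K):
    """Clustering accuracy after mapping predicted cluster ids to
    ground-truth labels with Hungarian matching (Kuhn 1955)."""
    cost = np.zeros((K, K))
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                        # contingency counts
    rows, cols = linear_sum_assignment(-cost)  # maximize matched counts
    mapping = dict(zip(rows, cols))
    return float(np.mean([mapping[p] == t for t, p in zip(y_true, y_pred)]))

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [2, 2, 0, 0, 1, 1]   # the same partition with permuted cluster ids
assert clustering_accuracy(y_true, y_pred, 3) == 1.0
```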
Results demonstrating empirical convergence (in negative log-likelihood) for a sample fold with multiple initializations of GRECO and LYRIC are included in Appendix 6.^{Footnote 1} To the best of our knowledge, our empirical evaluation is the most extensive evaluation of multiview clustering methods compared to prior work in terms of the number of datasets, number of views and comparison to existing baselines.
5.1 Baselines
The proposed methods are compared to an extensive set of baselines. The baselines are briefly described here.

Shared Latent Variable Model (Joint): An alternative way of modeling multiple views is to use a single latent variable that denotes the cluster membership across all views; we call this the 'Joint' model. This model is equivalent to concatenating views in the most commonly assumed scenario, i.e. when all views are Gaussian mixtures with diagonal covariances.

Ensemble Clustering Model (Ensemble) (Strehl and Ghosh 2003): This model trains each view independently, followed by a consensus evaluation. To predict the hard clustering assignment, the label correspondence among views is obtained using Hungarian matching (Kuhn 1955). A single posterior is obtained using the same equation as (10) with KL-divergence (log-aggregation), followed by a MAP assignment. This method is compared to only when at most two views are available.

CoEM (Bickel and Scheffer 2005): CoEM estimates a mixture model per view subject to cross-entropy constraints. The weights for each view are parametrized by \(\eta \in [0,1]\) and the results corresponding to the best performing \(\eta \) are reported.

Coregularized Spectral Clustering (Coreg (Sp)) (Kumar et al. 2011): This is a state-of-the-art spectral multiview clustering method. The results corresponding to the best performing \(\lambda \) parameter (between 0.01 and 0.1, as suggested by the authors) are reported. The implementation provided by the authors is used.^{Footnote 2}

Minimizing Disagreement (Mindis (Sp)) (Sa 2005): This is another spectral clustering technique, proposed for two views only. We use the implementation employed for comparison by Kumar et al. (2011).

CCA for Mixture Models (CCAmvc) (Chaudhuri et al. 2009): This method uses Canonical Correlation Analysis to project views onto a lower-dimensional space. This model can be used for two views only.

NMF-based Multiview Clustering (NMFmvc) (Liu et al. 2013): This method uses non-negative matrix factorization for multiview clustering. The original implementation provided by the authors was used for empirical evaluation.^{Footnote 3}
A k-means clustering algorithm is run independently on each view to initialize the distribution parameters of all probabilistic models. An approximate Hungarian matching problem is solved on the k-means cluster assignments to align cluster indices across views for initialization.
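The per-view initialization can be sketched with a plain Lloyd's k-means; the deterministic first-k-rows initialization here is an illustrative choice of ours, not the paper's exact recipe:

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain Lloyd's k-means. Initialized from the first k rows for
    determinism (an illustrative choice); practical runs would use a
    randomized or k-means++-style initialization."""
    centers = X[:k].astype(float).copy()
    for _ in range(iters):
        # squared Euclidean distance of every point to every center
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels, centers
```

For each view v one would run `kmeans(X_v, K)` and use the resulting cluster means (or per-cluster feature frequencies, for multinomial views) as initial mixture parameters, then align cluster indices across views by matching.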
5.2 Datasets
The datasets are chosen referencing prior work in multiview clustering. Details of the datasets are provided in the following.

Twitter multiview ^{Footnote 4} (Greene and Cunningham 2013): This is a collection of Twitter datasets in five topical areas (politics-UK, politics-Ireland, football, etc.). Each user has views corresponding to the users they follow, their followers, mentions, tweet content, etc. We use the politicsuk dataset with three views (mentions, retweets and follows). The labels correspond to one of five party memberships for each user. Each view is a bag-of-words vector, modeled as a mixture of multinomials for the probabilistic models.

WebKB ^{Footnote 5}: This dataset consists of webpage information from four university websites: Cornell, Texas, Washington and Wisconsin. We show results for the Cornell dataset. Each sample is a webpage with two views: the text content (a bag-of-words vector) and the web links into and out of the page (a binary bag-of-words vector). Each webpage belongs to one of five topics. Each view is modeled as a mixture of multinomials.

NUSWide Object ^{Footnote 6} (Chua et al. 2009): This dataset consists of 31 object classes. Of these, we subsample 10 classes in a balanced manner, with 50 samples per class. We use six views, namely edge histograms (mixture of Gaussians), bag-of-visual-words of SIFT features (mixture of multinomials), normalized correlogram (mixture of Gaussians), color histogram (mixture of multinomials), wavelet texture (mixture of Gaussians) and block-wise color moments (mixture of Gaussians).

CUB-200-2011 ^{Footnote 7} (Wah et al. 2011): This dataset consists of 200 classes and 11,800 samples. We use binary attributes and Fisher vector representations of the images as our views. The binary attributes are modeled as a mixture of multinomials and the Fisher vectors as Gaussian mixtures. We assume diagonal covariances for all views modeled as Gaussian mixtures, in all datasets.
5.3 Results
Tables 1, 2, 3 and 4 show clustering and out-of-sample cluster assignment results for the datasets described in Sect. 5.2, in that order. Note that results are marked NA if a baseline method could not be extended to more than two views or could not be applied due to limiting model assumptions, e.g. the non-negativity required by NMFmvc (Liu et al. 2013). The tables report only the results corresponding to the \(\gamma \) parameter that performed best across different choices of \(\gamma \) for both GRECO and LYRIC on a holdout dataset. Additionally, Fig. 3 shows the performance of GRECO and LYRIC using different Rényi divergences parametrized by \(\log { (\gamma )}\), in comparison with CoEM, which uses linear aggregation, corresponding to \(\gamma \rightarrow 1\). The performance across different \(\gamma \) provides further insights into the behavior of the proposed coregularization method.
The proposed methods outperform almost all the baselines consistently across different datasets. In addition, holdout cluster assignment performance is better for both models across most datasets. The improvement over ensemble methods suggests coregularization improves on view-wise clustering approaches. Results also suggest that sharing a single latent variable across views (see the Joint model) is restrictive. In the low bias regime, GRECO has particular advantages over LYRIC because of the additional trade-off in regularization. When the bias across views is low, the additional regularization potentially accelerates convergence by restricting the deviation from view-specific unregularized posteriors, especially when initial model parameters may be noisy. In the high bias case, LYRIC shows some advantage (see Table 2, WebKB data). It is important to note that overall, the general trend of performance of both GRECO and LYRIC is consistent for each dataset (see Fig. 3). In particular, for all datasets the performance of both algorithms peaks at the choice of \(\gamma \) that best captures the inherent biases across views, and this choice of divergence is the same for GRECO and LYRIC.
For Twitter data, the \(\gamma \) parameter of 0.01 resulted in the best clustering accuracy as measured on the holdout set (see Table 1). This suggests that the views do have some bias in the latent clustering distribution; in the absence of such a bias, the best clustering parameter would have corresponded to \(\gamma \rightarrow 0\). Thus the value of the divergence parameter \(\gamma \) provides an intuitive measure of the inherent incoherence in clustering beliefs in the data. Notably, characterizing this bias resulted in almost an order of magnitude increase in clustering accuracy compared to baselines like multiview NMF and spectral clustering methods. To the best of our knowledge, there is little work on designing robust learning models for settings where the underlying model assumptions may be violated. The results on Twitter data strongly highlight the significance of such an approach.
Similar observations on the WebKB data suggest a high degree of incoherence across views in the clustering distributions, indicated by the fact that linear aggregation (\(\gamma \rightarrow 1\)) provides the best results on the holdout dataset. Note that in such a scenario, i.e. when views completely disagree (in terms of the MAP estimate of the clustering), learning each view independently is equally useful, as demonstrated by the competitive performance of Ensemble methods relative to GRECO and LYRIC. Again, this further reinforces the advantage of our model in terms of robustness to violations of model assumptions. Figure 3 also suggests that as the assumed underlying bias increases, the performance of both LYRIC and GRECO consistently improves. In addition, the improvement over CoEM at \(\gamma \rightarrow 1\) suggests that the proposed method of estimating a holdout clustering assignment using (10) is better than or comparable to that of CoEM. Note that although GRECO and LYRIC do not perform best on training data in terms of NMI and entropy, the results on the holdout set are competitive, suggesting that the models do not overfit the training data.
The results on the NUSWide Object dataset, where six views are modeled jointly, show a significant improvement in performance when an appropriate divergence parameter \(\gamma \) is used, compared to CoEM, which enforces linear aggregation, and to the Joint model, which estimates a single clustering posterior across all views. This further suggests advantages of GRECO and LYRIC when the number of available views is large. The best performing divergence parameter is relatively high (\(\gamma = 0.1\)). This also suggests that as the number of modeled views increases, the views are likely to be more incoherent, and assuming a high bias (higher \(\gamma \)) is the better modeling choice; this is also apparent from the deteriorated performance of the Joint model. Both GRECO and LYRIC perform best at the limiting case \(w_g = 1\), as expected in a moderately high bias case, where the additional regularization of GRECO is not necessarily advantageous. Figure 3 also suggests that at lower values of \(\gamma \) both LYRIC and GRECO may be getting stuck in local minima (indicated by the high observed variance at \(\gamma = 0.01\)), potentially reflecting sensitivity to the choice of \(\gamma \) for this data.
For a large dataset like CUB-200-2011, with 200 clusters, \(\sim \)11,000 samples and high dimensionality (\(\sim \)8000), the improvement in unsupervised learning performance of GRECO and LYRIC over CoEM is more pronounced, even though the best performance is obtained at \(\gamma \rightarrow 1\). This suggests that our inference on the holdout set works better than that of CoEM (see Sect. 4.6 for details). Further, the best performing divergence parameter \(\gamma \rightarrow 1\) suggests that the attribute view and the Fisher vector view used from the CUB-200-2011 data are potentially incoherent in terms of the latent clustering distribution. Comparison to the other probabilistic methods, i.e. the Joint and Ensemble models, suggests that restrictive model assumptions may fail and that general methods like GRECO and LYRIC may be more reliable in large scale settings. The Ensemble model also relies on Hungarian matching to solve the correspondence problem between cluster indices (200 clusters) across views. The improved performance of GRECO and LYRIC comes at a significant computational cost compared to CCAmvc, which provides comparable performance very quickly. This corroborates that the model assumptions made by CCAmvc, namely that views of a sample are uncorrelated conditioned on the cluster identity of the sample (weaker assumptions than those made by the Joint model), can provide improvements in unsupervised learning performance. Faster inference for GRECO/LYRIC in such settings can be obtained by parallelization and/or improvements to the variational inference procedure used to impose coregularization.
Overall, the best Rényi divergence suitable for a particular dataset differs, indicating that GRECO and LYRIC capture potential differences in coherence between views with respect to cluster memberships significantly better than comparable methods. The biases between views demonstrably affect clustering performance. This also suggests that the multiview assumption of a single underlying cluster membership distribution is not always satisfied in real data. Thus flexible models such as GRECO and LYRIC are preferable. All results further show that the choice of the class of Rényi divergences is beneficial for improving multiview clustering performance and both methods generalize better to unseen data compared to baselines.
A comparison of training time suggests that the increased accuracy of GRECO and LYRIC is obtained at the cost of increased training time. However, as suggested in Sect. 4, the variational update required for coregularization is the major contributing factor to training time. Since these updates can be trivially executed in a distributed setting across samples as well as for estimating unnormalized cluster membership distributions, the training time can be easily improved. Further, any alternative inference procedure to solve the coregularization constraint will directly improve training times for the proposed method. Also note that training times are comparable to CoEM and other baselines for special cases (see Tables 2, 4).
Additional advantages of GRECO and LYRIC compared to other methods are noteworthy. Both the Twitter and WebKB datasets contain at least one view with relational data. The Twitter data is sparse (as is typical of social network data), i.e. most entries are zero. In these cases probabilistic methods outperform other methods, suggesting the importance of probabilistic models in general. The NUSWide Object and CUB datasets have mixed views, i.e. bag-of-words as well as numeric features (e.g. Fisher vector representations). The empirical evaluation demonstrates that our methods also handle mixed data well.
Some limitations of the proposed methods arise in selecting appropriate weights and the best suited Rényi divergence parameter for a given dataset. Storkey et al. (2014) have proposed a method for automatic selection of weights, which can be easily incorporated in GRECO or LYRIC via minor changes to the variational procedures described in Appendix 1. However, we chose manual selection of weights in order to highlight the significance of the choice of Rényi divergence rather than a finer choice of weights, and in particular the generalization over CoEM. Note that automatic selection, or learning the divergence parameter best suited to given data in an unsupervised setting, is a challenging and novel problem that this work exposes. In particular, conventional model selection methods that trade off model complexity and likelihood are not applicable here, as model complexity does not change with \(\gamma \). Automatic selection of this parameter is deferred to future work. We point out, however, that both GRECO and LYRIC provide better performance than all existing baselines for all choices of \(\gamma \) that we tested; a more appropriate choice of \(\gamma \) further boosts performance. When computational constraints exist, we suggest using either of the closed form methods of Sect. 4.
6 Discussion and conclusion
This work proposed a coregularization approach to multiview clustering that builds on a novel idea of directly minimizing a weighted sum of divergences between view-specific posteriors that indicate probabilities of cluster memberships. This approach encourages coherence between the posterior memberships by bringing them 'closer' in distribution. The resulting coregularization techniques, GRECO and LYRIC, significantly improve performance over existing multiview clustering methods. By maintaining per-view posteriors and using a flexible choice of Rényi divergences for imposing coherence, these models are robust to incoherence among views. In addition, CoEM is recovered as a special case of LYRIC. CoEM uses linear aggregation of posteriors, which is best suited when aggregating among incoherent posterior memberships. We show empirically that better performance can be achieved by accounting for incoherence via a flexible family of divergences. We also derive closed form updates to impose coregularization for two special cases, when the divergence parameter \(\gamma \rightarrow 0\) and \(\gamma \rightarrow 1\).
For future work, a more general framework for multiview parameter estimation that accounts for divergence aggregation can be explored. Additional performance and computational gains may be obtained by learning the regularization weights and the divergence parameter \(\gamma \). Theoretical analyses of special cases and studying the effects of other classes of divergences can provide insights for further developing such flexible models. Such a framework could also offer advantages when views are arbitrarily missing, or in distributed settings where minimal interaction between views is expected due to communication constraints.
Notes
For the CUB dataset, we only have results with a single initialization for a single traintest split. However, average over different splits shows the same trend.
References
Akata, Z., Perronnin, F., Harchaoui, Z., & Schmid, C. (2013). Labelembedding for attributebased classification. In IEEE conference on computer vision and pattern recognition (CVPR), 2013, pp. 819–826.
Akata, Z., Thurau, C., & Bauckhage, C. (2011). Nonnegative matrix factorization in multimodality data for segmentation and label prediction. In: W. Andreas, S. Sabine & G. Martin (Eds.), 16th Computer vision winter workshop.
Becker, S., & Hinton, G. E. (1992). Selforganizing neural network that discovers surfaces in randomdot stereograms. Nature, 355, 161–163.
Bickel, S., & Scheffer, T. (2005). Estimation of mixture models using CoEM. In Machine learning: ECML 2005, 16th European conference on machine learning, Porto, Portugal.
Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with cotraining. In Proceedings of the eleventh annual conference on computational learning theory. ACM.
Chaudhuri, K., Kakade, S. M., Livescu, K., & Sridharan, K. (2009). Multiview clustering via canonical correlation analysis. In Proceedings of the 26th annual international conference on machine learning.
Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., & Zheng, Y.T. (2009). NUSWIDE: A realworld web image database from National University of Singapore. In Proceedings of ACM conference on image and video retrieval.
Dasgupta, S., Littman, M. L., & McAllester, D. A. (2001). PAC generalization bounds for cotraining. In Advances in neural information processing systems NIPS.
de Sa, V. R., & Ballard, D. H. (1993). Selfteaching through correlated input. In Computation and neural systems. New York: Springer.
De Sa, V. R. (2005). Spectral clustering with two views. In Proceedings of the workshop on learning with multiple views, international conference on machine learning.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B, 39, 1–38.
Eaton, E., des Jardins, M., & Jacob, S. (2014). Multiview constrained clustering with an incomplete mapping between views. Knowledge and Information Systems, 38, 231–257.
Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., & Mikolov, T., et al. (2013). Devise: A deep visualsemantic embedding model. In Advances in neural information processing systems, pp. 2121–2129.
Garg, A., Jayram, T. S., Vaithyanathan, S., & Zhu, H. (2004). Generalized opinion pooling. In AMAI.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
Ghosh, J., & Acharya, A. (2011). Cluster ensembles. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1, 305–315.
Greene, D., & Cunningham, P. (2013). Producing a unified graph representation from multiple social network views. In Proceedings of the 5th annual ACM web science conference.
Guo, Y. (2013). Convex subspace representation learning from multiview data. In Proceedings of the twentyseventh AAAI conference on artificial intelligence, 2013.
Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1), 83–97.
Kumar, A., & DaumeIII, H. (2011). A cotraining approach for multiview spectral clustering. In Proceedings of the 28th international conference on machine learning.
Kumar, A., Rai, P., & Daumé III, H. (2010). Coregularized spectral clustering with multiple kernels. In NIPS Workshop on new directions in multiple kernel learning.
Kumar, A., Rai, P., & Daumé, H. (2011). Coregularized multiview spectral clustering. In Advances in neural information processing systems 24. Curran Associates, Inc.
Li, S.Y., Jiang, Y., & Zhou, Z.H. (2014). Partial multiview clustering. In AAAI Conference on artificial intelligence.
Lian, W., Rai, P., Salazar, E., & Carin, L. (2015). Integrating features and similarities: Flexible models for heterogeneous multiview data. In Twentyninth AAAI conference on artificial intelligence.
Liu, J., Wang, C., Gao, J., & Han, J. (2013). Multiview clustering via joint nonnegative matrix factorization. In Proceedings of 2013 SIAM data mining conference.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Nigam, K., & Ghani, R. (2000). Analyzing the effectiveness and applicability of cotraining. In Proceedings of the ninth international conference on information and knowledge management. ACM.
Rényi, A. (1960). On measures of entropy and information. In Proceedings of the 4th Berkeley symposium on mathematics, statistics and probability.
Schmidhuber, J., & Prelinger, D. (1993). A novel unsupervised classification method. In Third international conference on artificial neural networks, 1993.
Sindhwani, V., & Rosenberg, D. S. (2008). An RKHS for multiview learning and manifold coregularization. In Proceedings of the 25th international conference on machine learning. ACM.
Storkey, A., Zhu, Z., & Hu, J. (2014). A continuum from mixtures to products: Aggregation under bias. In ICML workshop on divergence methods for probabilistic inference.
Strehl, A., & Ghosh, J. (2003). Cluster ensembles–A knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3, 583–617.
Tzortzis, G., & Likas, A. (2012). Kernel-based weighted multi-view clustering. In 2012 IEEE 12th international conference on data mining.
van Erven, T., & Harremoës, P. (2012). Rényi divergence and KullbackLeibler divergence. ArXiv eprints.
Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The CaltechUCSD Birds2002011 Dataset. Technical Report CNSTR2011001, California Institute of Technology.
Weston, J., Bengio, S., & Usunier, N. (2011). Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, (Vol. 11, pp. 2764–2770).
Zhou, D., & Burges, C. J. C. (2007). Spectral clustering and transductive learning with multiple views. In Proceedings of the 24th international conference on machine learning.
Acknowledgments
This work is supported in part by NSF SCH1417697 and the United States Office of Naval Research, Grant No. N000141410039. MDR is supported by an ARC Discovery Early Career Research Award (DE130101605). The authors would also like to acknowledge Suriya Gunasekar for reviewing the manuscript for clarity.
Additional information
Editors: Concha Bielza, Joao Gama, Alipio Jorge, and Indrė Žliobaité.
Appendices
Appendix 1: Derivation of variational inference for weighted sum of divergence minimization
We minimize the weighted sum of Rényi divergences between M distributions \(p^i ({\mathbf {z}}), \, i \in [M]\), given the divergence parameter \(\gamma \). Let \(q^* ({\mathbf {z}})\) be the corresponding minimizing distribution and \(w \in \varDelta ^{M}\) the known weight vector determining the importance of each distribution. The specific cost function is given by (13). Consider the case where each distribution is a categorical distribution over clusters [K].
Let \(\kappa ^i ({\mathbf {z}})\) be a variational distribution corresponding to \(p^i ({\mathbf {z}})\). Using the logsum inequality, we have a lower bound on (13) given by (14).
The lower bound is optimized by iteratively estimating the \(\kappa ^i ({\mathbf {z}})\)'s and \(q ({\mathbf {z}})\). To update \(\kappa ^i ({\mathbf {z}})\), the distributions \(\kappa ^j ({\mathbf {z}}) \, \forall j \in [M], \ j \ne i\) and \(q ({\mathbf {z}})\) are held fixed. Setting the gradient w.r.t. \(\kappa ^i ({\mathbf {z}})\) to 0, the iterative update is given by \(\kappa ^i ({\mathbf {z}}) \propto p^i ({\mathbf {z}})^{\gamma }q ({\mathbf {z}})^{ (1-\gamma )}\). When all \(\kappa ^i ({\mathbf {z}})\) are held fixed, \(q ({\mathbf {z}})\) is likewise obtained by setting the gradient of the bound w.r.t. \(q ({\mathbf {z}})\) to 0, and is given by (15). The complete variational update is described by Algorithm 3. Note that all distributions should be appropriately renormalized.
Appropriate variants of Algorithm 3 are used by GRECO and LYRIC. To estimate the centroid distribution of GRECO, Algorithm 4 is used. To estimate the view-specific distributions \(q_t ({\mathbf {z}}^v) \forall v \in [V]\), i.e. (4), Algorithm 5 is used. In the case of LYRIC, Algorithm 4 is used, except that the target distribution is \(q_t ({\mathbf {z}}^v)\).
All the proposed variational updates (3) can be run in parallel for each sample \(n \in [N]\). Further, for each sample, calculation of \(\kappa ^i ({\mathbf {z}})\) for each \(i \in [M]\) and each \(k \in [K]\) can be estimated in parallel up to proportionality. Similarly for the target variable \(q ({\mathbf {z}})\), the estimates are trivially parallelizable for each \(k \in [K]\) up to proportionality.
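The alternating updates can be sketched for a single sample's categorical posteriors. The fixed-point form of the q-update below (a weighted average of the \(\kappa ^i\)'s) is our reading of (15), obtained by minimizing the bound over q under the simplex constraint, and should be checked against the paper's algorithm listing:

```python
import numpy as np

def renyi_consensus(P, w, gamma, iters=200):
    """Alternating variational updates for min_q sum_i w_i D_gamma(p^i || q)
    over categorical distributions:
        kappa^i ∝ (p^i)^gamma * q^(1 - gamma)
        q       ∝ sum_i w_i * kappa^i
    P: (V, K) per-view posteriors (rows on the simplex); w: (V,) weights."""
    q = P.mean(axis=0)  # any interior starting point works
    for _ in range(iters):
        kappa = (P ** gamma) * (q ** (1.0 - gamma))
        kappa /= kappa.sum(axis=1, keepdims=True)  # renormalize each kappa^i
        q = w @ kappa
        q /= q.sum()
    return q
```

At \(\gamma = 1\) the update reduces to linear aggregation in a single step, and small \(\gamma \) drives q toward geometric pooling, consistent with the two closed-form special cases of Appendix 3.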
Appendix 2: Detailed derivation of parallel (coregularization) Estep over samples, for GRECO and LYRIC
Let \({\mathbf {z}}^i = \{ {\mathbf {z}}_n^i : n \in [N]\}\) and \({\mathbf {x}}^i = \{ {\mathbf {x}}_n^i : n \in [N]\}\). Let \({\mathbf {z}}= \{ {\mathbf {z}}^i : i \in [V]\}, {\mathbf {x}}= \{ {\mathbf {x}}^i : i \in [V]\}\) and \({\varvec{\varPsi }}= \{ {\varvec{\varPsi }}^i : i \in [V] \}\). Let \(g ({\mathbf {z}})\) be the target posterior for GRECO that is obtained by solving (16).
We estimate \(g ({\mathbf {z}})\) such that it is independent across all samples, i.e. \(g ({\mathbf {z}}) = \prod _{n \in [N]} g ({\mathbf {z}}_n)\). By the IID assumption on the log-likelihood, the posterior \(p ({\mathbf {z}} \mid {\mathbf {x}}, {\varvec{\varPsi }})\) can be factored into per-view per-sample posteriors as in (17).
Therefore,
Equation (18) can now be solved in parallel for each sample n to obtain \(g ({\mathbf {z}}) = \prod _{n \in [N]} g ({\mathbf {z}}_n)\). This completes the proof, and the result can be proved analogously for LYRIC and the view-specific updates.
Appendix 3: Special cases of Rényi divergence aggregation
1.1 Case I : \(\gamma \rightarrow 1\):
Storkey et al. (2014) have shown that weighted Rényi divergence aggregation when \(\gamma \rightarrow 1\) is equivalent to (19)
For multiview clustering, we aggregate between categorical distributions. Let \(p ({\mathbf {z}}^i)\) be a categorical distribution parametrized by \({\varvec{\theta }}^i\) so that \(Pr (z_k^i=1) = \theta _k^i, \ {\varvec{\theta }}^i \in \varDelta ^{K}\). The target distribution \(q ({\mathbf {z}})\), also categorical, is parametrized by \({\varvec{\phi }}\). Then the KL-divergence aggregation of (19) is given by (20).
(20) is convex in \({\varvec{\phi }}\). The corresponding Lagrangian function is given by (21)
Setting the gradient of (21) to 0,
Setting \(\lambda \mathbf {1} + \beta = 1\) yields \({\varvec{\phi }}= \sum _{i \in [V]} w_i {\varvec{\theta }}^i\) with \({\varvec{\phi }}\in \varDelta ^K\), a feasible solution.
1.2 Case II : \(\gamma \rightarrow 0\):
Storkey et al. (2014) have shown that weighted Rényi divergence aggregation when \(\gamma \rightarrow 0\) is equivalent to (23).
For multiview clustering, we aggregate between categorical distributions. Let \(p ({\mathbf {z}}^i)\) be a categorical distribution parametrized by \({\varvec{\theta }}^i\) so that \(Pr (z_k^i=1) = \theta _k^i, \ {\varvec{\theta }}^i \in \varDelta ^{K}\). The target distribution \(q ({\mathbf {z}})\), also categorical, is parametrized by \({\varvec{\phi }}\). Then the KL-divergence aggregation of (23) is given by (24).
(24) is convex in \({\varvec{\phi }}\). The corresponding Lagrangian function is given by (25)
Setting the gradient of (25) to 0 as before, we have,
Setting \(\lambda \mathbf {1} + \beta + 1= 0\) yields \(\log {{\varvec{\phi }}} = \sum _{i \in [V]} w_i \log {{\varvec{\theta }}^i}\) with \({\varvec{\phi }}\in \varDelta ^K\).
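The two closed forms can be checked numerically. Assuming (20) is the weighted forward KL \(\sum _i w_i \mathrm {KL} ({\varvec{\theta }}^i \Vert {\varvec{\phi }})\) and (24) the weighted reverse KL \(\sum _i w_i \mathrm {KL} ({\varvec{\phi }} \Vert {\varvec{\theta }}^i)\), as the derived minimizers indicate, the linear mixture and the (renormalized) geometric mean should beat any other candidate on the simplex; the random test setup below is our own:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two strictly positive categoricals."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(4), size=3)  # three per-view categoricals, (V, K)
w = np.array([0.5, 0.3, 0.2])              # simplex weights

# gamma -> 1 closed form: phi = sum_i w_i theta^i
phi_lin = w @ theta
fwd = lambda phi: sum(wi * kl(ti, phi) for wi, ti in zip(w, theta))

# gamma -> 0 closed form: log phi = sum_i w_i log theta^i, renormalized
phi_geo = np.exp(w @ np.log(theta))
phi_geo /= phi_geo.sum()
rev = lambda phi: sum(wi * kl(phi, ti) for wi, ti in zip(w, theta))

# each closed form beats random candidates on its own objective
for _ in range(200):
    cand = rng.dirichlet(np.ones(4))
    assert fwd(phi_lin) <= fwd(cand) + 1e-9
    assert rev(phi_geo) <= rev(cand) + 1e-9
```

Both objectives are strictly convex in \({\varvec{\phi }}\), so the stationary points found via the Lagrangian are the unique minimizers, which is what the sampled comparisons confirm.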
Appendix 4: Mstep for standard mixture models
Let N be the total number of samples in a mixture model with K classes. At any iteration t, let \(q ({\mathbf {z}}_n)\) be the posterior responsibilities computed using the current model parameters of the mixture model. Let \({\mathbf {x}}_n \in {\mathbb {R}}^D\) represent the observed features, e.g. numeric data modeled as a Gaussian mixture, or bag-of-words data modeled as a mixture of multinomials.

Gaussian mixture models: If the mixture model is a Gaussian mixture with parameters \({\varvec{\mu }}_k\) and \({\varvec{\varSigma }}_k \, \forall k \in [K]\), the mean \({\varvec{\mu }}_k\) and covariance \({\varvec{\varSigma }}_k\) are updated using (27) and (28) respectively.
$$\begin{aligned} {\varvec{\mu }}_{t+1,k}= & {} \frac{\sum _{n \in [N]} q ({\mathbf {z}}_{n,k}) {\mathbf {x}}_n}{\sum _{n \in [N]} q ({\mathbf {z}}_{n,k})}\end{aligned}$$(27)$$\begin{aligned} {\varvec{\varSigma }}_{t+1,k}= & {} \frac{\sum _{n \in [N]} q ({\mathbf {z}}_{n,k}) ({\mathbf {x}}_n - {\varvec{\mu }}_{t+1,k}) ({\mathbf {x}}_n - {\varvec{\mu }}_{t+1,k})^T}{\sum _{n \in [N]} q ({\mathbf {z}}_{n,k})} \end{aligned}$$(28) 
Multinomial mixture models: The multinomial distribution parameters for each cluster \({\varvec{\theta }}_k \, \forall k \in [K]\) can be updated using (29)
$$\begin{aligned} {\varvec{\theta }}_{t+1,k} = \frac{\sum _{n \in [N]} q ({\mathbf {z}}_{n,k}) {\mathbf {x}}_n}{\sum _{n \in [N]} q ({\mathbf {z}}_{n,k}) \sum _{d \in [D]} {\mathbf {x}}_{n,d}} \end{aligned}$$(29)
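A vectorized sketch of the updates (27)-(29); the array conventions (responsibilities as an N-by-K matrix) are ours:

```python
import numpy as np

def gaussian_mstep(X, R):
    """Weighted mean and covariance updates (27)-(28).
    X: (N, D) data; R: (N, K) responsibilities q(z_{n,k})."""
    Nk = R.sum(axis=0)                       # effective cluster counts
    mu = (R.T @ X) / Nk[:, None]             # (27): responsibility-weighted means
    cov = np.empty((R.shape[1], X.shape[1], X.shape[1]))
    for k in range(R.shape[1]):
        d = X - mu[k]
        cov[k] = (R[:, k, None] * d).T @ d / Nk[k]  # (28): weighted scatter
    return mu, cov

def multinomial_mstep(X, R):
    """Multinomial parameter update (29). X: (N, D) count vectors."""
    num = R.T @ X                            # weighted feature counts, (K, D)
    den = R.T @ X.sum(axis=1)                # weighted total counts, (K,)
    return num / den[:, None]                # rows lie on the simplex
```

In the coregularized algorithms these standard updates are unchanged; only the responsibilities R are replaced by the coregularized posteriors from the variational E-step.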
Appendix 5: Formulae of evaluation metrics
All evaluation metrics assume that ground-truth cluster memberships are known and that the correspondence between clustering labels and ground-truth labels has been estimated. The number of learned clusters is the same as the number of ground-truth clusters.
Definition 2
If \(C_n\) represents the cluster label determined by the learning algorithm and \(\omega _n\) represents the groundtruth clustering, the clustering accuracy for a dataset with N samples and K clusters is given by,
where,
The following terms are defined per cluster \(k \in [K]\):

True Positives (\(TP_k\)): the number of samples correctly assigned to cluster k by the learning model.

False Positives (\(FP_k\)): the number of samples assigned to cluster k that do not belong to it.

True Negatives (\(TN_k\)): the number of samples not belonging to cluster k that were correctly assigned to a different cluster.

False Negatives (\(FN_k\)): the number of samples belonging to cluster k that were not assigned to it by the learning algorithm.
Definition 3
Definition 4
Definition 5
The following metrics do not assume a correspondence between groundtruth labels and learned cluster labels.
Definition 6
Let C be the categorical random variable over K clusters with the distribution obtained from the clustering, i.e. \(Pr (C = k)\) is the fraction of samples assigned to cluster k by the learning algorithm. Let \(\omega \) represent the categorical variable with the distribution obtained from the true clustering. The joint distribution \(p (C, \omega )\) is the fraction of samples clustered as C that lie in ground-truth cluster \(\omega \). The mutual information \(I (C, \omega )\) is given by,
The entropy is \(H (C) = - \sum _{k \in [K]} p (C = k)\log {p (C = k)}\), and \(H (\omega )\) is defined analogously. Normalized Mutual Information (NMI) (Strehl and Ghosh 2003) is the symmetrized and normalized mutual information between C and \(\omega \).
Definition 7
Average Entropy (Bickel and Scheffer 2005)
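Since Definitions 6 and 7 are stated in prose, here is one concrete reading, using the square-root normalization of Strehl and Ghosh (2003) for NMI; the array conventions are ours:

```python
import numpy as np

def nmi(pred, truth):
    """NMI with the sqrt normalization I(C, w) / sqrt(H(C) H(w))."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    n = len(pred)
    joint = np.zeros((pred.max() + 1, truth.max() + 1))
    for c, t in zip(pred, truth):
        joint[c, t] += 1.0 / n               # empirical joint p(C, w)
    pc, pt = joint.sum(1), joint.sum(0)      # marginals
    mask = joint > 0
    mi = np.sum(joint[mask] * np.log(joint[mask] / np.outer(pc, pt)[mask]))
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return mi / np.sqrt(h(pc) * h(pt))

def avg_entropy(pred, truth):
    """Cluster-size-weighted entropy of ground-truth labels within each cluster."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    total = 0.0
    for c in np.unique(pred):
        labels = truth[pred == c]
        p = np.bincount(labels) / len(labels)
        total += len(labels) / len(pred) * -np.sum(p[p > 0] * np.log(p[p > 0]))
    return total
```

A perfect clustering (up to label permutation) gives NMI of 1 and average entropy of 0; a clustering independent of the ground truth gives NMI of 0.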
Appendix 6: Empirical convergence of loglikelihood
In order to validate empirical convergence in log-likelihood for GRECO and LYRIC for a fixed divergence parameter \(\gamma \) and fixed coregularizing weights, we conduct the following experiment. For all datasets, a single train-test split is chosen, the parameters are initialized differently (using k-means clustering) for five trials, and the negative log-likelihood \((-\sum _{v \in [V]} \sum _{n \in [N]} \log {p ({\mathbf {x}}_n^v , {\mathbf {z}}_n^v ; {\varvec{\varPsi }}_n^v )})\) is observed over iterations until convergence. Figures 4, 5, 6 and 7 show the negative log-likelihood observed at each alternating EM iteration for both GRECO and LYRIC on the Twitter-politicsuk, Cornell (WebKB), NUSWideObj and CUB-200-2011 datasets respectively.
Joshi, S., Ghosh, J., Reid, M. et al. Rényi divergence minimization based coregularized multiview clustering. Mach Learn 104, 411–439 (2016). https://doi.org/10.1007/s1099401655432