1 Introduction

Multiple views of entities are often readily available in modern datasets; for example, a web-page entity has text, images and hyper-links, each of which can be considered a view of the web-page entity. A problem of practical interest is to harness the complementary information available in multiple views to improve over conventional learning algorithms. Multiview learning has been studied as a potential framework to achieve such improved performance. Multiview methods operate under the assumption that different views cluster or label entities similarly. Such similarities have been exploited via co-training (Blum and Mitchell 1998) and co-regularization (Sindhwani and Rosenberg 2008). Co-training learns one hypothesis for each view; the hypotheses then bootstrap one another to converge to a coherent model (Blum and Mitchell 1998). Co-regularization, on the other hand, explicitly minimizes disagreement between views during training. Multiview methods have substantial theoretical and practical advantages over learning a single hypothesis by concatenating views (Nigam and Ghani 2000). For instance, Dasgupta et al. (2001) show that for semi-supervised multiview learning with two views, the probability of disagreement between views is an upper bound on the probability of error of either view’s hypothesis.

Due to noisy measurements or unknown biases, different views may not cluster the entities similarly. To be robust to such mis-specification of the model assumptions made by multiview clustering, we propose a method that maintains a separate posterior distribution for each view. In the proposed method, clustering coherence is imposed by encouraging the posterior distributions of view-specific cluster memberships to be ‘close’ to each other, where closeness is measured via suitable divergences. Specifically, a weighted sum of divergences between the current posterior estimates of cluster memberships is minimized. This co-regularization technique is combined with Expectation-Maximization (EM) (Dempster et al. 1977) to maximize the log-likelihood. The training process thus alternates between an inference phase, which estimates and updates view-wise posterior distributions to encourage coherence, and per-view parameter updates.

To specifically account for potential incoherence among views, we formulate the cost function as a weighted sum of Rényi divergences (Rényi 1960). Storkey et al. (2014) have observed that when aggregating opinions from biased experts or agents, the maximum entropy distribution is obtained via Rényi divergence aggregation (see Definition 1). An extreme case is when views do not agree on the cluster memberships, in which case linear aggregation provides the best aggregate posterior. For instance, Bickel and Scheffer use linear aggregation in Co-EM (Bickel and Scheffer 2005), inadvertently assuming that different views mostly do not agree with respect to the cluster membership. Instead of assuming a bias-free condition, we explore the utility of various aggregation strategies applied to the co-regularization framework. Hence, the proposed method can be applied with the Rényi divergence best suited to the level of discordance in view memberships. Co-EM (Bickel and Scheffer 2005) is a special case of our framework, as it can be recovered for a specific setting of the Rényi divergence parameter and a fixed parametrization of weights, as shown in Sect. 4.3. Extensive empirical evaluations are presented to demonstrate improved performance over existing multiview clustering methods as well as other methods of fusing information from multiple views.

Our main contributions are highlighted in the following:

  • We propose a novel co-regularized multiview clustering algorithm that minimizes weighted sums of Rényi divergences.

  • We show that an existing approach to probabilistic multiview clustering, namely Co-EM can be recovered as a special case of our framework.

  • We present extensive empirical evaluation showing that the proposed class of methods significantly outperforms strong baselines. Moreover, the choice of Rényi divergence can affect clustering performance, while simultaneously capturing biases in view-specific posterior cluster memberships. The empirical evaluation also demonstrates that our methods handle mixed data, e.g., discrete and continuous features, very well.

The rest of the manuscript is organized as follows. A brief survey of existing approaches to multiview clustering is provided in Sect. 2. Background and notation are given in Sect. 3. The proposed methods along with other modeling choices are detailed in Sect. 4. Extensive empirical evaluation on several data sets is presented in Sect. 5, followed by a discussion and conclusion in Sect. 6.

2 Related work

Related work in multiview unsupervised learning goes back to neural network models, a few of which are noted here. Becker and Hinton (1992), Schmidhuber and Prelinger (1993) maximize agreement between a given neural network module and a weighted output of its neighbors. De Sa and Ballard (1993) take advantage of complementary information available in different views by using separate modules for views feeding into a common output. Bickel and Scheffer (2005) introduced probabilistic multiview clustering using co-training.

Relatively recent models, like those proposed by Chaudhuri et al. (2009), Sa (2005) construct lower dimensional projections using multiple views of data. However, these methods are only applicable when at most two views are available. Kumar et al. (2010) and Tzortzis and Likas (2012) explore multiple kernel learning techniques where each view is represented as a kernel. Closely related to kernel techniques are multiview spectral clustering methods described in detail below.

Zhou and Burges (2007) propose a multiview spectral clustering method as a generalization of the normalized cuts algorithm. In a similar vein, Kumar and Daume-III (2011) update the similarity matrix of a given view based on the clustering of another view iteratively to produce a coherent clustering. Kumar et al. (2011) minimize disagreement between views by constraining the similarity matrices of views to be close in the Frobenius norm. While spectral methods are effective, they do not estimate cluster centroids, making interpretation and out-of-sample cluster assignments more challenging to implement. Our empirical studies show that proposed methods outperform spectral multiview clustering methods.

Further, connections between non-negative matrix factorization and clustering have also been utilized when multiple views are observed. For example, Liu et al. (2013) have shown that modeling user-feature matrices via multiview clustering based on non-negative matrix factorization (NMF) admits better empirical clustering performance compared to collective matrix factorization (Akata et al. 2011), a popular method for combining information from multiple sources. This further illustrates the advantage of multiview clustering over other related methods. Another approach to multiview clustering using convex subspace representation learning has been proposed by Guo (2013). These methods estimate a subspace in which different views are clustered similarly. Many of these methods, however, provide little insight into how views interact within the data. Probabilistic techniques such as ours are particularly useful when such exploration is required. Our empirical evaluation suggests improved performance of our models over NMF based multiview clustering. In addition, many models also deal with partially missing views (Eaton et al. 2014; Li et al. 2014) and demonstrate improved performance using multiview clustering. Lian et al. (2015) use a shared latent factor model to model heterogeneous multiview data and can also handle arbitrarily missing views, i.e., the case when a complete view may be missing for a sample. However, this model assumes a shared latent matrix across all views, as opposed to the proposed method, which maintains separate cluster membership variables for each view. Our proposed methods can easily extend to handle missing views by simply not co-regularizing over the missing view.

Note that multiview clustering is distinct from cluster ensemble methods  (Ghosh and Acharya 2011; Strehl and Ghosh 2003) that learn hypotheses for each view independently and find a consensus among the per-view results post-training. The latter methods do not share information during training and are thus more suitable for knowledge reuse  (Strehl and Ghosh 2003).

One of the more popular applications of multiview clustering is to jointly model images and annotations, each constituting a view. The objective is to utilize annotations and images to learn the underlying clustering of images. This problem has been modeled in varied ways using unsupervised as well as supervised methods. We compare our multiview clustering framework to other relevant methods in the context of this application to motivate the differences in model assumptions. Recently, much supervised work has explored the utility of rich representations of label words and/or annotations in a high dimensional embedding space  (Mikolov et al. 2013). A mapping is learned directly from the image view to the word embedding space (annotation view) so that relevant tags or labels are closer under some similarity metric  (Frome et al. 2013; Akata et al. 2013) or ranked higher compared to the rest  (Weston et al. 2011). Additionally, Akata et al. (2013) learn a mapping to a pre-defined attribute space to extend supervised image classification to unseen labels. Thus in this case, the target labeling is the same as the text or label view. In contrast, multiview clustering models aim to find the best underlying grouping of data jointly, thus differing in the underlying modeling assumptions. The multiview clustering methods presented in this paper are for completely unsupervised scenarios, and thus do not assume availability of labels for images. Further, the target clustering does not necessarily have to have a one-to-one mapping to the annotation views. Hence, in our empirical evaluation, we only compare our models to unsupervised multiview methods with similar modeling and data assumptions as ours.

3 Preliminaries

For non-negative integers K, vectors in \({\mathbb {R}}^K\) are denoted by lower-case bold (e.g., \({\mathbf {x}}\) with components \(x_1, \ldots , x_K\)). The set \(\{1, 2, \ldots , K\}\) will be denoted [K]. The simplex \(\varDelta ^{K}\) is the set:

$$\begin{aligned} \varDelta ^{K} = \left\{ {\mathbf {a}}\in {\mathbb {R}}^{K} \ :\ a_k \ge 0 \ \forall k \in [K],\ \sum _{k \in [K]} a_k = 1\right\} \end{aligned}$$

A categorical distribution is a discrete distribution over outcomes \(\omega \in [K]\) parameterized by \(\mathbf {\theta } \in \varDelta ^{K}\) so that \(Pr (\omega =k) = \theta _k\). It is a member of the exponential family of distributions. The natural parameters of a categorical distribution are \(\log {\mathbf {\theta }} = (\log \theta _k)_{k\in [K]}\) and the sufficient statistics are given by the vector of indicator functions for each outcome \(\omega \in [K]\), denoted by \({\mathbf {z}}(\omega ) \in \{0,1\}^K\) with:

$$\begin{aligned} z_k (\omega ) = {\left\{ \begin{array}{ll} 1, &{} \text {if } \omega =k, \\ 0, &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

Given two categorical distributions \(p (\omega )\) and \(q (\omega )\), describing the distribution over the categorical random variable \(\omega \), the divergence of \(p (\omega )\) from \(q (\omega )\), denoted \(\mathcal {D} (p (\omega )\Vert q (\omega ))\), is a non-symmetric measure of the difference between the two probability distributions. The Kullback-Leibler or KL-divergence is a specific divergence denoted by \({\mathrm {KL}\!\left( {p (\omega )}\Vert {q (\omega )}\right) }\) and is defined as follows.

KL-divergence of \(p (\omega )\) from \(q (\omega )\) is given by:

$$\begin{aligned} {\mathrm {KL}\!\left( {p (\omega )}\Vert {q (\omega )}\right) } = \mathrm {E}_{ p (\omega )}\left[ \, \log p (\omega ) - \log q (\omega ) \, \right] \end{aligned}$$
(1)

This is also known as the relative entropy between \(p (\omega )\) and \(q (\omega )\). The relative entropy is non-negative and jointly convex with respect to both arguments. Further, we have that \({\mathrm {KL}\!\left( {p (\omega )}\Vert {q (\omega )}\right) }=0\) iff \(p (\omega ) = q (\omega )\), for all \(\omega \).

The Rényi divergences (Rényi 1960) are a parametric family of divergences with many similar properties to the KL-divergence. Since our focus is on using these divergences to measure distances of distributions over cluster labels, we will focus on Rényi divergences for distributions over discrete random variables.

Definition 1

  (van Erven and Harremoës 2012) Let \(p, q\) be two distributions for a random variable \(\omega \in [K]\). The Rényi divergence of order \(\gamma \in (0,1)\cup (1,\infty )\) of \(p (\omega )\) from \(q (\omega )\) is,

$$\begin{aligned} \mathcal {D}_{\gamma } (p (\omega ) \Vert q (\omega )) = \frac{1}{\gamma -1} \log { \Big ( \sum _{\omega = 1}^K p (\omega )^{\gamma } q (\omega )^{ (1-\gamma )}\Big )} \end{aligned}$$
(2)

The definition may be extended to divergences of other orders such as \(\gamma = 0, \ \gamma \rightarrow 1,\ \text {and} \ \gamma \rightarrow \infty \)  (van Erven and Harremoës 2012). Rényi divergences are non-negative \(\forall \gamma \in [0,\infty ]\). In addition, they are jointly convex in \( (p,\ q) \ \forall \gamma \in [0, 1]\) and convex in the second argument \(q \ \forall \gamma \in [0,\infty ]\). As discussed in the comprehensive survey of Rényi divergences by van Erven and Harremoës (2012), many other commonly used divergences are recovered as special cases for specific choices of \(\gamma \). For example, \(\gamma = \frac{1}{2}\) and \(\gamma = 2\) give Rényi divergences which are closely related to the Hellinger and \(\chi ^2\) divergences, respectively, and the KL-divergence is recovered as a limiting case when \(\gamma {\,\rightarrow \,}1\). For the rest of the manuscript, we will abuse notation slightly and use \(p (\omega )\) and \(p ({\mathbf {z}})\) interchangeably to denote the same categorical distribution over outcomes in [K].
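
As a concrete illustration of Definition 1, the following minimal Python sketch (our own, not the authors' implementation) computes \(\mathcal {D}_{\gamma }\) for two categorical distributions given as probability vectors and checks that it approaches the KL-divergence as \(\gamma \rightarrow 1\).

```python
import numpy as np

def renyi_divergence(p, q, gamma, eps=1e-12):
    """Rényi divergence D_gamma(p || q) of order gamma between two
    categorical distributions given as probability vectors (Eq. 2).
    eps guards against zero entries."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    if np.isclose(gamma, 1.0):
        # limiting case gamma -> 1: the KL-divergence of Eq. (1)
        return float(np.sum(p * (np.log(p) - np.log(q))))
    return float(np.log(np.sum(p**gamma * q**(1.0 - gamma))) / (gamma - 1.0))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(renyi_divergence(p, q, 0.5))    # related to the Hellinger distance
print(renyi_divergence(p, q, 2.0))    # related to the chi-squared divergence
print(renyi_divergence(p, q, 0.999))  # close to KL(p || q)
print(renyi_divergence(p, q, 1.0))    # KL(p || q)
```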

4 Co-regularized multiview clustering using Rényi divergence minimization

We propose a co-regularization technique for multiview clustering using Rényi divergences. The generative model of the data is assumed to be a mixture model in each view. Let N be the total number of samples and V be the total number of views. For \(n \in [N]\) and \(v \in [V]\), let \({\mathbf {x}}_n^v\) represent the feature vector observed at view v for sample n. If data sample n lies in cluster k in view v, the latent membership is indicated by the categorical random variable \({\mathbf {z}}_n^v \in \{0, 1\}^{K}\), where the k-th element of the vector, denoted by \(z_{n,k}^v\), is 1 and the rest are 0. The vector \({\varvec{\pi }}_n \in \varDelta ^{K}\) parametrizes the prior distribution over the categorical variable \({\mathbf {z}}_n^v\) and is the same for each view v. Each data sample is generated independently of the others, and each view of a sample is generated independently conditioned on \({\mathbf {z}}_n^v\). Let the \(k^{th}\) cluster distribution at view v be parametrized by \({\varvec{\varPsi }}_k^v\). Let the set of all parameters for view v be denoted by \({\varvec{\varPsi }}^v\), i.e. \({\varvec{\varPsi }}^v = \{{\varvec{\varPsi }}_k^v\}\). The generative process can be represented by the plate model in Fig. 1 and is described as follows:

  • For each n:

    • For each view v:

      • Choose \({\mathbf {z}}_n^v \sim p ({\mathbf {z}}_n^v ; {\varvec{\pi }}_n)\) categorical distribution parametrized by \({\varvec{\pi }}_n\).

      • Choose \({\mathbf {x}}_n^v \sim p ({\mathbf {x}}_n^v | z_{n,k}^v=1 , \varPsi _{k}^v)\) i.e., sample feature from the \(k^{th}\) cluster.

If no coherence conditions are imposed, each view can be modeled independently by maximizing the complete log-likelihood \(\sum _{n \in [N]} \log {p ({\mathbf {x}}_n^v, {\mathbf {z}}_n^v | {\varvec{\varPsi }}^v)}\) using Expectation-Maximization (EM) (Dempster et al. 1977). Let \({\varvec{\varPsi }}_t^v\) be an estimate of the parameter \({\varvec{\varPsi }}^v\) at iteration t. The Expectation (E) step estimates the posterior probabilities \(p ({\mathbf {z}}_n^v | {\mathbf {x}}_n^v, {\varvec{\varPsi }}_t^v) \, \forall n \in [N], \, \forall v \, \in [V]\). A new estimate of cluster memberships can then be obtained for each view that decreases the Rényi divergence between all view-specific posteriors, so as to encourage coherence. This core idea motivates the co-regularization technique proposed in the following subsections.
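
For concreteness, a minimal per-view E-step for the common case where a view is modeled as a diagonal-covariance Gaussian mixture might look as follows (an illustrative sketch under our own naming, not the authors' code).

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step_view(X, pi, means, variances):
    """Unregularized E-step for one view: posterior p(z_n = k | x_n, Psi)
    under a diagonal-covariance Gaussian mixture.
    X: (N, d) data for this view; pi: (K,) mixing weights;
    means: (K, d) cluster means; variances: (K, d) diagonal covariances."""
    N, K = X.shape[0], len(pi)
    log_resp = np.zeros((N, K))
    for k in range(K):
        log_resp[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(
            X, mean=means[k], cov=np.diag(variances[k]))
    log_resp -= log_resp.max(axis=1, keepdims=True)  # stabilize before exponentiating
    resp = np.exp(log_resp)
    return resp / resp.sum(axis=1, keepdims=True)    # rows are p(z_n | x_n, Psi)
```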

Fig. 1 Generative model for multi-view clustering

4.1 Global co-regularization

The proposed method minimizes a weighted sum of divergences between the current posterior cluster membership estimates available at all views to estimate a new ‘global’ categorical distribution. We would like to trade off between this ‘global’ posterior (which accounts for co-regularization) and the view-specific unregularized posteriors. A new posterior distribution is therefore estimated for every view \(v \in [V]\) by minimizing a weighted sum of divergences between the global categorical distribution and the view-specific posterior \(p ({\mathbf {z}}_n^v | {\mathbf {x}}_n^v, {\varvec{\varPsi }}^v)\).

At any iteration t, let \(g_{t} ({\mathbf {z}}_n)\) be the global categorical posterior that is to be estimated from the independent posteriors \(p({\mathbf {z}}_{n}^{i} | {\mathbf {x}}_{n}^{i}, {\varvec{\varPsi }}_{t}^{i})\). Let \({\mathbf {w}}\in \varDelta ^{V}\) denote a (known) weight vector of dimension V that determines the contribution of each view, with i-th element denoted by \(w_i\). Let the choice of divergence be fixed and given by the order \(\gamma \). The global distribution \(g_{t}^* ({\mathbf {z}}_n)\) can then be obtained by minimizing (3).

$$\begin{aligned} {g}_{t}^* ({\mathbf {z}}_n) = \mathop {\hbox {arg min}}\limits _{g ({\mathbf {z}}_n)} \sum _{i \in [V]} \frac{w_i}{\gamma } \mathcal {D}_{\gamma } (p ({\mathbf {z}}_n^i | {\mathbf {x}}_n^i, {\varvec{\varPsi }}_t^i) || g ({\mathbf {z}}_n)) \end{aligned}$$
(3)

To solve the weighted divergence minimization, a modified version of the variational algorithm proposed by Storkey et al. (2014) is used. Detailed derivation of the variational algorithm and specific updates are provided in Appendix 1.

For any given view v, it is desirable that its posterior be close to the global distribution \(g_t^* ({\mathbf {z}})\) as well as to its local estimate \(p ({\mathbf {z}}_n^v | {\mathbf {x}}_n^v, {\varvec{\varPsi }}_t^v)\). Thus, we introduce a new weight vector \([w_g, (1-w_g)] \in \varDelta ^2, \, 0 < w_g < 1\), that achieves this trade-off. Let the co-regularized posterior obtained for the current view v being updated be denoted by \(q_t ({\mathbf {z}}_n^v)\). Then, given \(\gamma \) and weights \([w_g, (1-w_g)]\), \(q_t ({\mathbf {z}}_n^v)\) can be estimated using:

$$\begin{aligned} q_t ({\mathbf {z}}_n^v) = {\mathop {\hbox {arg min}}\limits }_{q ({\mathbf {z}}_n^v)} \frac{w_g}{\gamma } \mathcal {D}_{\gamma } ({g^*}_t ({\mathbf {z}}_n) || q ({\mathbf {z}}_n^v)) + \frac{ (1-w_g)}{\gamma } \mathcal {D}_{\gamma } (p ({\mathbf {z}}_n^v | {\mathbf {x}}_n^v, {\varvec{\varPsi }}_t^v) || q ({\mathbf {z}}_n^v))\quad \end{aligned}$$
(4)

Given \(g^*_{t} ({\mathbf {z}}_n)\), (4) only requires minimizing a weighted sum of two divergences. Since all views admit separate aggregate posteriors, the M-step can be executed for each view independently of all other views, using the conventional M-step for a mixture model. So that every view stays coherent with the latest posterior beliefs of all other views, an M-step for a view should immediately follow that view’s posterior update. This helps prevent the algorithm from getting stuck at local minima by avoiding parameter estimates that agree with old and potentially discordant beliefs of other views. For the specific updates in the M-step based on the chosen mixture model, please refer to Appendix 4.

The above procedure of inferring per-view responsibilities independently, followed by a coherence enforcing step using (3) and (4) and a view-specific M-step, is repeated for each view. The complete algorithm for the proposed global co-regularization, called GRECO (Global REnyi divergence based CO-regularization), is provided in Algorithm 1; a sketch of its control flow is given below. The inference step for GRECO, specifically the inner loop update, is shown in Fig. 2a for a toy problem with three views and \(K = 3\). Note that (3) and (4) can be computed in parallel for data samples \(n \in [N]\). This is because our target co-regularized posterior is independent for each sample (and each view) and factors in product form over samples as well as views. A detailed proof of how this leads to an embarrassingly parallel co-regularization algorithm is provided in Appendix 2.
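
The control flow of one outer iteration of GRECO can be sketched as follows. The helper `minimize_weighted_renyi` stands in for a solver of a weighted Rényi divergence minimization such as (3) or (4) (the variational procedure of Appendix 1, or the closed forms of Sects. 4.3 and 4.4), and `e_step_view` / `m_step_view` denote the conventional per-view mixture model updates; the function names are our own, not the paper's.

```python
def greco_iteration(views, params, w, w_g, gamma,
                    e_step_view, m_step_view, minimize_weighted_renyi):
    """One outer iteration of GRECO (structural sketch).
    views[v]: data for view v; params[v]: its mixture parameters;
    w: per-view weights on the simplex; w_g: global/local trade-off weight."""
    V = len(views)
    for v in range(V):
        # independent per-view posteriors p(z_n^i | x_n^i, Psi_t^i)
        posteriors = [e_step_view(views[i], params[i]) for i in range(V)]
        # Eq. (3): global aggregate posterior g_t^*(z_n)
        g_star = minimize_weighted_renyi(posteriors, w, gamma)
        # Eq. (4): trade off the global posterior against view v's own posterior
        q_v = minimize_weighted_renyi([g_star, posteriors[v]],
                                      [w_g, 1.0 - w_g], gamma)
        # per-view M-step using the co-regularized responsibilities q_v
        params[v] = m_step_view(views[v], q_v)
    return params
```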

Algorithm 1 GRECO

Fig. 2

a Left Inner loop of inference in GRECO for \(V = 3\) and \(K = 3\) at iteration t. The inference shows updates for view 1. \(p (z^i | x^i, {\varvec{\varPsi }}_t^i)\) for \(i \in [3]\) are the independent view-wise posteriors, \(g_{t}^* (z)\) is the global distribution for GRECO. \(q_t ({\mathbf {z}}^1)\) is the re-estimated posterior for view 1. b Right Inner loop of inference for LYRIC

4.2 Local co-regularization

We now consider two limiting cases, \(w_g = 0\) and \(w_g = 1\), in (4). The first case (\(w_g = 0\)) is trivial as it does not use co-regularization at all and is therefore equivalent to the ensemble method (Strehl and Ghosh 2003). The latter recovers a new method. We consider the non-trivial case \(w_g=1\) separately for several reasons. First, we are able to recover an existing multiview clustering algorithm, Co-EM, as a special case of this setting for a certain choice of Rényi divergence (\(\gamma \rightarrow 1\)). Thus Co-EM is also a special case of our most general setting, GRECO, for \(\gamma \rightarrow 1\) and \(w_g = 1\). Further, our empirical evaluation suggests that the most general case of the proposed method (GRECO) performs better than this special case (\(w_g = 1\)) in most settings, suggesting that a non-trivial trade-off between the global posterior and the view-specific unregularized posterior in the E-step is advantageous. A useful analogy here is that between Gaussian mixture models and the k-means algorithm, used for soft and hard clustering respectively: the latter is a limiting case of the former (as cluster widths go to zero, with identical, isotropic covariances) but is considered a separate algorithm because of its special properties.

The co-regularization framework in GRECO with \(w_g = 1\) does not involve an additional trade-off between the global posterior and the unregularized view-specific posterior. The minimizer of (4) in this case is exactly equal to \(g^* ({\mathbf {z}}_n)\). Thus, the view-specific co-regularized posterior \(q ({\mathbf {z}}_n^v)\) is equal to the global posterior \(g^* ({\mathbf {z}}_n)\). Note that in a given iteration, only view v is co-regularized so that \(q ({\mathbf {z}}_n^v) = g^* ({\mathbf {z}}_n)\); all other views are not updated in the inference and learning steps of that iteration. The procedure is repeated subsequently for all views. We call this algorithm LYRIC (LocallY weighted Rényi dIvergence Co-regularization). As in GRECO, the outer loop of LYRIC iterates over each view v and the inner loop carries out a coherence enforcing E-step for the given view followed by an M-step. The E-step comprises estimating independent view-specific posteriors followed by a local co-regularization step that updates the current view’s posterior. It is important to highlight that LYRIC does not produce the same estimates as GRECO in every iteration, because the view-specific posteriors differ in each iteration for GRECO and LYRIC owing to the different stages of co-regularization. The details of local co-regularization (LYRIC) are explained in the following.

Let v be the current view to be updated at any iteration t and let \(q_t ({\mathbf {z}}_n^v)\) denote the newly obtained posterior at view v for sample n. Local co-regularization solves the coherence equation given by (5).

$$\begin{aligned} q_t ({\mathbf {z}}_n^v) = \mathop {\hbox {arg min}}\limits _{q ({\mathbf {z}}_n^v)} \sum _{i \in [V]} \frac{w_i}{\gamma } \mathcal {D}_{\gamma } (p ({\mathbf {z}}_n^i | {\mathbf {x}}_n^i, {\varvec{\varPsi }}_t^i) || q ({\mathbf {z}}_n^v)) \end{aligned}$$
(5)

Similar to GRECO, a per-view M-step can now be executed to update per-view parameters according to the modified responsibilities. The procedure is repeated iteratively for all views \(v \in [V]\). The final algorithm, LYRIC, is illustrated in Algorithm 2. Figure 2b shows the inference step for a single view in LYRIC.
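
Under the same conventions as the GRECO sketch above, one outer iteration of LYRIC reduces to solving (5) directly for the view being updated (again a structural sketch, not the authors' code).

```python
def lyric_iteration(views, params, w, gamma,
                    e_step_view, m_step_view, minimize_weighted_renyi):
    """One outer iteration of LYRIC (structural sketch): Eq. (5) followed by
    a per-view M-step for the view currently being updated."""
    V = len(views)
    for v in range(V):
        # independent per-view posteriors p(z_n^i | x_n^i, Psi_t^i)
        posteriors = [e_step_view(views[i], params[i]) for i in range(V)]
        # Eq. (5): co-regularize view v against all per-view posteriors
        q_v = minimize_weighted_renyi(posteriors, w, gamma)
        params[v] = m_step_view(views[v], q_v)
    return params
```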

Algorithm 2 LYRIC

4.3 Special case I: \(\gamma \rightarrow 1\)

If \(\gamma \) is chosen such that \(\gamma \rightarrow 1\), the minimizer of the weighted sum of Rényi divergences admits a closed form solution. Specifically, \(\gamma \rightarrow 1\) reduces the cost to a weighted sum of KL-divergences with the target distribution on the right-hand side of each KL term (Storkey et al. 2014). Consider (5), for instance, with \(\gamma \rightarrow 1\). Let the per-view posterior \(p ({\mathbf {z}}^i | {\mathbf {x}}^i, {\varvec{\varPsi }}^i)\) be parametrized by \({\varvec{\theta }}^i \in \varDelta ^{K}\), and let the target distribution \(q ({\mathbf {z}}_n^v)\) be parametrized by \({\varvec{\phi }}^v \in \varDelta ^{K}\). The cost function given by (5) then simplifies to (6).

$$\begin{aligned} q ({\mathbf {z}}^v) = \mathop {\hbox {arg min}}\limits _{q ({\mathbf {z}}^v)} \sum _{i \in [V]} w_{i} {\mathrm {KL}\!\left( {p ({\mathbf {z}}^i | {\mathbf {x}}^i, {\varvec{\varPsi }}^i)}\Vert {q ({\mathbf {z}}^v)}\right) } \end{aligned}$$
(6)

For categorical distributions, the closed form solution of (6) is given by (7) as was derived by Garg et al. (2004). Refer to Appendix 3 for a proof.

$$\begin{aligned} {\varvec{\phi }}^v = \sum _{i \in [V]} w_i {\varvec{\theta }}^i \end{aligned}$$
(7)

Note that the linear aggregation closed form is not specific to LYRIC and generalizes to GRECO for the choice of \(\gamma \rightarrow 1\). In GRECO, (3) then reduces to linearly aggregating the per-view posteriors, and (4) reduces to a convex combination of the global posterior and the current view’s unregularized posterior with weights \(w_g\) and \(1-w_g\).

Specifically, if \(w_v = (1 - \alpha )\) for the view v currently being updated, and \(w_i = \frac{\alpha }{V-1}\), where \(0 \le \alpha \le 1\) for \(i \ne v, i \in [V]\), the LYRIC algorithm recovers Co-EM when \(\gamma \rightarrow 1\). Thus Co-EM is a special case of LYRIC.
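
For illustration, the closed form (7) and the weight parametrization that recovers Co-EM can be written as below (a sketch under our own naming; `theta` stacks the per-view posterior parameters \({\varvec{\theta }}^i\) row-wise).

```python
import numpy as np

def aggregate_linear(theta, w):
    """Closed-form minimizer of (6) for gamma -> 1 (Eq. 7): a weighted
    linear mixture of the per-view categorical parameters.
    theta: (V, K) per-view parameters; w: (V,) weights summing to 1."""
    return np.asarray(w) @ np.asarray(theta)

def coem_weights(V, v, alpha):
    """Weights under which LYRIC with gamma -> 1 recovers Co-EM:
    w_v = 1 - alpha for the current view v, alpha / (V - 1) for all others."""
    w = np.full(V, alpha / (V - 1))
    w[v] = 1.0 - alpha
    return w
```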

4.4 Special case II: \(\gamma \rightarrow 0\)

When \(\gamma \rightarrow 0\), (5) has been shown by Storkey et al. (2014) to be equivalent to minimizing a weighted sum of KL-divergences with the target distribution as the left-hand argument of each KL term. The closed form solution in this case is a weighted average of the parameters \({\varvec{\theta }}^i \ \forall i \in [V]\) in log-space, with weights \(w_i \ \forall i \in [V]\)  (Garg et al. 2004), as shown in (8). The proof is detailed in Appendix 3.

$$\begin{aligned} \log {{\varvec{\phi }}^v} = \sum _{i \in [V]} w_i \log {{\varvec{\theta }}^i} \end{aligned}$$
(8)

This result is also general and applies to (3) and (4) with appropriate weighting. For these special cases, the variational updates can therefore be replaced by the simpler closed form updates for GRECO and LYRIC. Note that (8) can be equivalently written as:

$$\begin{aligned} {\varvec{\phi }}^v = \prod _{i \in [V]} {{\varvec{\theta }}^i}^{w_i } \end{aligned}$$
(9)

This further suggests that when \(\gamma \rightarrow 1\) the views contribute additively through linear averaging, whereas when \(\gamma \rightarrow 0\), the product form in (9) lets extreme values of the posteriors dominate the aggregate. Conventionally, a product of experts model (Hinton 2002; Storkey et al. 2014) uses such a product to combine beliefs from independently trained models, for example in an ensemble setting.
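
A corresponding sketch of the \(\gamma \rightarrow 0\) closed form (Eqs. 8–9) is given below; we renormalize the weighted geometric mean onto the simplex so that the aggregate can be used directly as a categorical distribution (a choice on our part for numerical use).

```python
import numpy as np

def aggregate_log_space(theta, w, eps=1e-12):
    """gamma -> 0 closed form (Eqs. 8-9): weighted average of log-parameters,
    i.e. a product-of-experts style weighted geometric mean.
    theta: (V, K) per-view categorical parameters; w: (V,) weights summing to 1."""
    log_phi = np.asarray(w) @ np.log(np.asarray(theta) + eps)
    phi = np.exp(log_phi - log_phi.max())  # subtract max for numerical stability
    return phi / phi.sum()                 # renormalize onto the simplex
```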

Note that co-regularization in each of GRECO and LYRIC adds an additional complexity of \(\mathcal {O} (NKV^2)\) per iteration, where N is the sample size, K is the number of clusters and V is the number of views, compared to the unregularized method. As suggested before, the operations can be trivially parallelized over data samples, as well as over the calculations required to estimate unnormalized variational parameters for each cluster (see Appendices 1, 2). For the case where all views are Gaussian mixtures, the complexity per outer iteration is \(\mathcal {O} (NKV^2T_{inner} + NKV + \sum _{v \in [V]}d_v^2 K )\), where \(T_{inner}\) is the number of inner iterations for the variational estimation of co-regularized posteriors and \(d_v\) is the dimension of view v. In each of the special cases described earlier, i.e. when \(\gamma \rightarrow 0\) and \(\gamma \rightarrow 1\), the complexity reduces to \(\mathcal {O} (NKV + \sum _{v \in [V]}d_v^2 K )\) per iteration, the same as that of Co-EM, owing to the closed form solutions available for co-regularization. In the general case, the largest source of computational overhead in the proposed algorithm is the variational procedure currently employed to impose co-regularization. However, we are not bound to this procedure, and any accelerated method for solving (3), (4) and (5) can be adopted, if available. Further, our variational procedure is trivially parallelizable over samples (see Appendix 2 for the relevant proof), whereas the co-regularization/co-training techniques used by the baselines (see Sect. 5.1) are not. This allows training to be made efficient enough to scale to large datasets.

4.5 Choice of weights and Rényi divergences

For the empirical studies, we parametrize the weights for easy comparison with baselines. Let \(0 \le \alpha \le 1\) be a scalar. For every view \(v \in [V]\) being updated, \(w_v = 1 - \alpha \), and for all other views \(w_i = \frac{\alpha }{V-1} \, \forall i \, \in [V], \ i \ne v\). At every stage in the outer loop of either GRECO or LYRIC, the current view being updated is thus weighted by \(1-\alpha \) and the rest are weighted equally by \(\frac{\alpha }{V-1}\). This also ensures a fair comparison with Co-EM by maintaining the same parametrization of weights, so that the experiments can isolate the boost in clustering performance obtained via a suitable choice of Rényi divergence. We evaluated the performance of GRECO and LYRIC for different choices of \(\alpha \) and \(\gamma \). Section 5.3 shows the performance of the model across different choices of the divergence parameter. Specifically, for comparison with baselines, we choose the best performing pair of \(\alpha \) and \(\gamma \) based on the average accuracy of hold-out clustering assignment across five trials.

4.6 Prediction on hold-out samples

For out-of-sample cluster assignment, the conventional E-step with the learned parameters is used to obtain per-view posteriors for a test sample for all views independently. It is now desirable to obtain one aggregate posterior \(q ({\mathbf {z}})\) as follows.

$$\begin{aligned} q ({\mathbf {z}}) = \mathop {\hbox {arg min}}\limits _{q ({\mathbf {z}})} \sum _{v \in [V]} w_v \mathcal {D}_{\gamma } (p ({\mathbf {z}}^v | {\mathbf {x}}^v, {{\varvec{\varPsi }}^v}^*) || q ({\mathbf {z}})) \end{aligned}$$
(10)

For LYRIC, a global posterior can be obtained using (10) for a given choice of \(\gamma \) and \({\mathbf {w}}\) (see Sect. 4.5) and the corresponding set of parameters \({{\varvec{\varPsi }}^v}^*\) learned by LYRIC. Similarly for GRECO, the E-step is run for all views independently, followed by solving (10) to obtain a global posterior. A hard clustering is simply the MAP assignment of \({\mathbf {z}}\) w.r.t. the distribution \(q ({\mathbf {z}})\). The empirical performance of LYRIC at \(\gamma \rightarrow 1\) can differ from Co-EM due to the different methods of obtaining the consensus clustering. Specifically, let \({\varvec{\pi }}_n^c\) and \({{\varvec{\varPsi }}^v}^c, \forall v \in [V]\), be the estimates of the prior distribution parameter and the mixture model parameters learned by Co-EM, respectively. Then the MAP assignment w.r.t. the consensus clustering distribution in Co-EM is given by (11) for a data sample \({\mathbf {x}}\equiv \{{\mathbf {x}}^v ,\, \forall v \in [V]\}\),

$$\begin{aligned} {\hbox {arg max}}_{k \in [K]} \frac{\pi _{k}^c \prod _{v \in [V]} p ({\mathbf {x}}^v | {{\varvec{\varPsi }}^v}^c, {\mathbf {z}}_{k}=1)}{\sum _{k' \in [K]} \pi _{k'}^c \prod _{v \in [V]} p ({\mathbf {x}}^v | {{\varvec{\varPsi }}^v}^c, {\mathbf {z}}_{k'}=1)} \end{aligned}$$
(11)

Note that this method of obtaining a consensus clustering used by Co-EM is equivalent to the E-step of a multiview latent variable model that shares a single latent clustering variable across all views. As opposed to Co-EM, GRECO and LYRIC obtain a consensus via linear aggregation for \(\gamma \rightarrow 1\) and weights \({\mathbf {w}}\) as shown in (12).

$$\begin{aligned} \mathop {\hbox {arg max}}\limits _{k \in [K]} \sum _{v \in [V]} w_v p (z_k^v = 1 | {\mathbf {x}}^v, {{\varvec{\varPsi }}^v}^*) \end{aligned}$$
(12)
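
As an illustration (under our own naming), the two consensus rules compare as follows: (12) linearly aggregates the per-view posteriors, while (11) multiplies per-view likelihoods under a single shared latent variable.

```python
import numpy as np

def consensus_greco_lyric(view_posteriors, w):
    """Hold-out MAP assignment via Eq. (12): weighted linear aggregation of the
    per-view posteriors p(z_k^v = 1 | x^v, Psi^v) (the gamma -> 1 case of Eq. 10).
    view_posteriors: (V, K); w: (V,) weights."""
    return int(np.argmax(np.asarray(w) @ np.asarray(view_posteriors)))

def consensus_coem(pi, view_likelihoods):
    """Co-EM consensus of Eq. (11): a single shared latent variable, so the
    per-view likelihoods p(x^v | Psi^v, z_k = 1) multiply before normalization.
    pi: (K,) prior; view_likelihoods: (V, K)."""
    unnorm = np.asarray(pi) * np.prod(np.asarray(view_likelihoods), axis=0)
    return int(np.argmax(unnorm / unnorm.sum()))
```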

5 Experiments

The proposed methods have been extensively compared with existing multiview clustering models to show that the choice of divergence obtained by tuning \(\gamma \) is significant, as well as to demonstrate that Rényi divergence is a reasonable choice for co-regularization. All datasets were trained using both the LYRIC and GRECO algorithms for different values of \(\gamma \in [0, 1]\), discretized in the corresponding log-space. Very high values of the Rényi divergence parameter did not significantly affect the performance. The weights \({\mathbf {w}}\) are reparametrized as described in Sect. 4.5. For all datasets, ground-truth cluster labels are known and utilized for objective evaluation and comparison to baselines. All models and baselines were trained on the same training and hold-out data for five trials, with the best performing models chosen based on average clustering accuracy for comparison purposes. The mapping from cluster labels to ground-truth labels is solved using Hungarian matching (Kuhn 1955), as sketched below. For comparison to baselines, we only report the best performance obtained across different choices of \({\mathbf {w}}\) and \(\gamma \). Hold-out assignment results are only compared to baselines that explicitly mention a mechanism to obtain hold-out cluster assignments and empirically test the same. We report Clustering Accuracy, Precision, Recall, F-measure, NMI (Strehl and Ghosh 2003) and Entropy (Bickel and Scheffer 2005) for our evaluation. Lower entropy is better, while higher values of the other metrics indicate a better performing algorithm. All metrics are defined in Appendix 5. Note that the empirical evaluation here keeps the prior cluster distribution \({\varvec{\pi }}_n\) equal for all samples n for all probabilistic models, including GRECO and LYRIC, without loss of generality. Results demonstrating empirical convergence (in negative log-likelihood) of GRECO and LYRIC for a sample fold with multiple initializations are included in Appendix 6. To the best of our knowledge, our empirical evaluation is the most extensive evaluation of multiview clustering methods compared to prior work in terms of the number of datasets, number of views and comparison to existing baselines.
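
For reference, the cluster-to-label mapping via Hungarian matching used for evaluating clustering accuracy can be computed as in the following sketch (our own code using scipy, not the authors' evaluation script).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred, K):
    """Map predicted cluster indices to ground-truth labels via Hungarian
    matching (Kuhn 1955), then report accuracy.
    y_true, y_pred: integer arrays with values in [0, K)."""
    # contingency[i, j] = number of samples in predicted cluster i with true label j
    contingency = np.zeros((K, K), dtype=int)
    for pred, true in zip(y_pred, y_true):
        contingency[pred, true] += 1
    # maximizing matched counts = minimizing the negated contingency matrix
    rows, cols = linear_sum_assignment(-contingency)
    return contingency[rows, cols].sum() / len(y_true)
```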

5.1 Baselines

The proposed methods are compared to an extensive set of baselines. The baselines are briefly described here.

  • Shared Latent Variable Model (Joint): An alternative way of modeling multiple views is to have one latent variable that denotes the cluster membership across all views. This is called the ‘Joint’ model. This model is equivalent to concatenating views, especially in the most commonly assumed scenario, i.e., when all views are Gaussian mixtures with diagonal covariances.

  • Ensemble Clustering Model (Ensemble)  (Strehl and Ghosh 2003): This model trains each view independently, followed by a consensus evaluation. To predict the hard clustering assignment, the label correspondence among views is obtained using Hungarian matching  (Kuhn 1955). A single posterior is obtained using the same equation as (10) with the KL-divergence (log-aggregation), followed by a MAP assignment. We compare to this method only when at most two views are available.

  • Co-EM  (Bickel and Scheffer 2005): Co-EM estimates a mixture model per view subject to cross-entropy constraints. The weights for each view are parametrized by \(\eta \in [0,1]\) and the results corresponding to the best performing \(\eta \) are reported.

  • Co-regularized Spectral Clustering (Co-reg (Sp))  (Kumar et al. 2011): This is the state-of-the-art spectral multiview clustering method. The results corresponding to the best performing \(\lambda \) parameter (between 0.01 and 0.1, as suggested by the authors) are reported. The implementation provided by the authors is used.

  • Minimizing Disagreement (Min-dis (Sp))  (Sa 2005): This is another spectral clustering technique, proposed by Sa (2005) for two views only. We use the implementation that Kumar et al. (2011) implemented and compared against.

  • CCA for Mixture Models (CCA-mvc)  (Chaudhuri et al. 2009): This method uses Canonical Correlation Analysis to project views onto a lower dimensional space. This model can be used for two views only.

  • NMF based Multiview Clustering (NMF-mvc) (Liu et al. 2013): This method uses non-negative matrix factorization for multiview clustering. The original implementation provided by the authors was used for empirical evaluation.

A k-means clustering algorithm is used independently for each view to initialize distribution parameters for all probabilistic models. An approximate Hungarian matching problem is solved using the k-means cluster assignments for initialization.

5.2 Datasets

The datasets are chosen referencing prior work in multiview clustering. Details of the datasets are provided in the following.

  • Twitter multiview (Greene and Cunningham 2013): This is a collection of Twitter datasets in five topical areas (politics-UK, politics-Ireland, football, etc.). Each user has views corresponding to the users they follow, their followers, mentions, tweet content, etc. We use the politics-uk dataset with three views (mentions, re-tweets and follows). The labels correspond to one of five party memberships of each user. Each view is a bag-of-words vector and is modeled as a mixture of multinomials for the probabilistic models.

  • WebKB: This dataset consists of webpage information from four university websites: Cornell, Texas, Washington and Wisconsin. We show results for the Cornell dataset. Each sample is a webpage with two views: the text content (a bag-of-words vector) and the web-links into and out of the webpage (a binary bag-of-words vector). Each webpage can be clustered into one of five topics. Each view is modeled as a mixture of multinomials.

  • NUS-Wide Object (Chua et al. 2009): This dataset consists of 31 object classes. Of these, we sub-sample in a balanced manner 10 classes, with 50 samples belonging to each class. We use 6 views, namely edge histograms (mixture of Gaussians), bag-of-visual-words of SIFT features (mixture of multinomials), normalized correlogram (mixture of Gaussians), color histogram (mixture of multinomials), wavelet texture (mixture of Gaussians) and block-wise color moments (mixture of Gaussians).

  • CUB-200-2011 (Wah et al. 2011): This dataset consists of 200 classes and 11,800 data samples. We use the binary attributes and Fisher Vector representations of images as our views. The binary attributes are modeled as a mixture of multinomials and the Fisher vectors as Gaussian mixtures. For all datasets, we assume diagonal covariances for every view modeled as a mixture of Gaussians.

5.3 Results

Tables 1, 2, 3 and 4 show clustering and out-of-sample cluster assignment results for the datasets described in Sect. 5.2, in that order. Note that results are marked NA if a baseline method was not extendable to more than two views or could not be compared due to limiting model assumptions, e.g., the non-negativity required by NMF-mvc (Liu et al. 2013). The tables report only the results corresponding to the \(\gamma \) parameter that performed best on a hold-out dataset, for both GRECO and LYRIC. Additionally, Fig. 3 shows the performance of GRECO and LYRIC using different Rényi divergences parametrized by \(\log { (\gamma )}\), in comparison with Co-EM, which uses linear aggregation (corresponding to \(\gamma \rightarrow 1\)). The performance across different \(\gamma \) provides further insights into the behavior of the proposed co-regularization method.

Table 1 Twitter data (politics-uk, 3 views), best results obtained for \(\gamma = 0.01\) for GRECO and LYRIC
Table 2 Cornell (WebKB 2-views), best results obtained for \(\gamma = 0.1\) for GRECO and \(\gamma \rightarrow 1\) for LYRIC
Table 3 NUSWideObj Dataset (6 views), best results obtained for \(\gamma = 0.1\) for GRECO and LYRIC
Table 4 CUB-200-2011 (2 views), best results obtained for \(\gamma \rightarrow 1\) for GRECO and LYRIC

The proposed methods outperform almost all the baselines consistently across different datasets. In addition, hold-out cluster assignment performance is better for both models across most datasets. Improved performance over ensemble methods suggests that co-regularization improves on view-wise clustering approaches. The results also suggest that sharing a single latent variable (see the Joint model) across views is restrictive. In the low bias regime, GRECO has particular advantages over LYRIC because of the additional trade-off in regularization. When the bias across views is low, the additional regularization potentially accelerates convergence by restricting the deviation from the view-specific unregularized posteriors, especially when initial model parameters may be noisy. In the high bias case, LYRIC shows some advantage (see Table 2, WebKB data). It is important to note that, overall, the general performance trend of both GRECO and LYRIC is consistent for each dataset (see Fig. 3). In particular, for every dataset the performance of both algorithms peaks at the choice of \(\gamma \) that best captures the inherent biases across views, and this choice of divergence is the same for GRECO and LYRIC.

For the Twitter data, a \(\gamma \) parameter of 0.01 resulted in the best clustering accuracy as measured on the hold-out set (see Table 1). This provides further insight that the views do have some bias in the latent clustering distribution; in the absence of such a bias, the best clustering parameter should have corresponded to \(\gamma \rightarrow 0\). Thus the value of the divergence parameter \(\gamma \) provides an intuitive understanding of the inherent incoherence in clustering beliefs in the data. It is notable that characterizing this bias has resulted in almost an order of magnitude increase in clustering accuracy compared to baselines like multiview NMF and the spectral clustering methods. To the best of our knowledge, there is little work on designing learning models that are robust when underlying model assumptions may be violated. The results on the Twitter data strongly highlight the significance of such an approach.

Fig. 3

Clustering accuracy of GRECO and LYRIC w.r.t. \(\log {\gamma }\) on a Twitter data, b WebKB data, c NUSWideObj data and d CUB_200_2011 data

Similar observations on the WebKB data suggest a high degree of incoherence across views in the clustering distributions, indicated by the fact that linear aggregation (\(\gamma \rightarrow 1\)) provides the best results on the hold-out dataset. Note that in such a scenario, i.e. when views completely disagree in terms of the MAP estimate of the clustering, learning each view independently is equally useful, as demonstrated by the competitive performance of the Ensemble method relative to GRECO and LYRIC. Again, this further reinforces the advantage of our model in terms of robustness to violations of model assumptions. Figure 3 also suggests that as the assumed underlying bias increases, the performance of both LYRIC and GRECO consistently improves. In addition, the improvement over Co-EM at \(\gamma \rightarrow 1\) suggests that the method proposed to estimate a hold-out clustering assignment using (10) is better than or comparable to that of Co-EM. Note that although GRECO and LYRIC do not perform the best on training data in terms of NMI and Entropy, the results on the hold-out set are competitive, suggesting that the models do not overfit the training data.

The results on the NUS-Wide Object dataset, where six views are modeled jointly, show that the improvement in performance is significant when an appropriate divergence parameter \(\gamma \) is used, compared to Co-EM, which enforces linear aggregation, and the Joint model, which estimates a single clustering posterior across all views. This further suggests advantages of GRECO and LYRIC when the number of views available is large. The best performing divergence parameter is relatively high (\(\gamma = 0.1\)). This also suggests that as the number of views being modeled increases, the views are likely to be more incoherent, and an assumption of a high bias (higher \(\gamma \)) is a better modeling assumption. This is also apparent from the deteriorated performance of the Joint model. Both GRECO and LYRIC perform best at the limiting case \(w_g = 1\), as expected in a somewhat high bias case, where the additional regularization of GRECO is not necessarily advantageous. Figure 3 also suggests that at lower values of \(\gamma \) both LYRIC and GRECO may be getting stuck in local minima (suggested by the high observed variance at \(\gamma = 0.01\)), potentially reflecting sensitivity to the choice of \(\gamma \) for this data.

For a large dataset like CUB-200-2011, with 200 clusters, \(\sim \)11,000 samples and high dimensionality (\(\sim \)8000), the improvement in unsupervised learning performance of GRECO and LYRIC over Co-EM is more pronounced even though the best performance is obtained at \(\gamma \rightarrow 1\). This suggests that our inference on the hold-out set works better than that of Co-EM (see Sect. 4.6 for details). Further, the best performing divergence parameter \(\gamma \rightarrow 1\) suggests that the attribute view and the Fisher vector view used from the CUB_200_2011 data are potentially incoherent in terms of the latent clustering distribution. Comparison to the other probabilistic methods, i.e. the Joint and Ensemble models, suggests that restrictive model assumptions may fail and that general methods like GRECO and LYRIC may be more reliable in large scale settings. The Ensemble model also relies on Hungarian matching to solve the correspondence problem between cluster indices (200 clusters) across views. The improved performance of GRECO and LYRIC is obtained at a significant computational cost compared to CCA-mvc, which provides comparable performance very quickly. This corroborates that the model assumptions made by CCA-mvc, namely that the views of a sample are uncorrelated conditioned on its cluster identity (weaker assumptions than those made by the Joint model), can provide improvements in unsupervised learning performance. Faster inference for GRECO/LYRIC in such settings can be obtained by parallelization and/or improvements to the variational inference procedure used to impose co-regularization.

Overall, the best suited Rényi divergence differs across datasets, indicating that GRECO and LYRIC capture potential differences in coherence between views with respect to cluster memberships significantly better than comparable methods. The biases between views demonstrably affect clustering performance. This also suggests that the multiview assumption of a single underlying cluster membership distribution is not always satisfied in real data, making flexible models such as GRECO and LYRIC preferable. All results further show that the choice of the class of Rényi divergences is beneficial for improving multiview clustering performance and that both methods generalize better to unseen data compared to the baselines.

A comparison of training time suggests that the increased accuracy of GRECO and LYRIC is obtained at the cost of increased training time. However, as suggested in Sect. 4, the variational update required for co-regularization is the major contributing factor to training time. Since these updates can be trivially executed in a distributed setting across samples as well as for estimating unnormalized cluster membership distributions, the training time can be easily improved. Further, any alternative inference procedure to solve the co-regularization constraint will directly improve training times for the proposed method. Also note that training times are comparable to Co-EM and other baselines for special cases (see Tables 2,  4).

Additional advantages of GRECO and LYRIC compared to other methods are noteworthy. Both the Twitter and WebKB datasets consist of at least one view with relational data. The Twitter data is sparse (as is typical of social network data), i.e., a large fraction of the entries are 0. In these cases probabilistic methods outperform other methods, suggesting the importance of probabilistic models in general. The NUS-Wide Object and CUB datasets have mixed views, i.e. bag-of-words as well as numeric features (e.g. Fisher vector representations). The empirical evaluation also demonstrates that our methods handle such mixed data well.

Some limitations of the proposed methods arise in selecting an appropriate choice of weights and the best suited Rényi divergence parameter for a given dataset. Storkey et al. (2014) have proposed a method for automatic selection of weights which can be easily incorporated into GRECO or LYRIC via minor changes to the variational procedures described in Appendix 1. However, we chose manual selection of weights in order to highlight the significance of the choice of Rényi divergences as opposed to a finer choice of weights, and especially to highlight the generalization over Co-EM. Note that automatically selecting or learning the divergence parameter best suited to given data in an unsupervised setting is a challenging and novel problem that this work exposes. In particular, conventional model selection methods that trade off model complexity and likelihood are not applicable in this scenario, as model complexity does not change w.r.t. different \(\gamma \). Automatic selection of such a model parameter is deferred to future work. However, we point out that both GRECO and LYRIC provide better performance compared to all existing baselines for all choices of \(\gamma \) that we tested; a more appropriate choice of \(\gamma \) further boosts performance. If computational constraints exist, we suggest using either of the closed form special cases described in Sect. 4.

6 Discussion and conclusion

This work proposed a co-regularization approach to multiview clustering that builds on the novel idea of directly minimizing a weighted sum of divergences between view-specific posteriors that indicate probabilities of cluster memberships. This approach encourages coherence between the posterior memberships by bringing them ‘closer’ in distribution. The resulting co-regularization techniques, GRECO and LYRIC, significantly improve performance over existing multiview clustering methods. By maintaining per-view posteriors and using a flexible choice of Rényi divergences for imposing coherence, these models are robust to incoherence among views. In addition, Co-EM is recovered as a special case of LYRIC. Co-EM uses linear aggregation of posteriors, which is best suited when aggregating among incoherent posterior memberships. We show empirically that better performance can be achieved by accounting for incoherence via a flexible family of divergences. We also derive closed form updates to impose co-regularization for two special cases of the divergence parameter, \(\gamma \rightarrow 0\) and \(\gamma \rightarrow 1\).

For future work, a more general framework for multiview parameter estimation that accounts for divergence aggregation can be explored. Additional performance and computational gains may be obtained by learning the regularization weights and the divergence parameter \(\gamma \). Theoretical analyses of special cases and studies of the effects of other classes of divergences can provide insights for further developing such flexible models. Such a framework could also offer advantages when views may be arbitrarily missing, or in distributed settings where minimal interaction between views is expected due to communication constraints.