Rényi divergence minimization based coregularized multiview clustering
Abstract
Multiview clustering is a framework for grouping objects given multiple views, e.g. text and image views describing the same set of entities. This paper introduces coregularization techniques for multiview clustering that explicitly minimize a weighted sum of divergences to impose coherence between per-view learned models. Specifically, we iteratively minimize a weighted sum of divergences between posterior memberships of clusterings, thus learning view-specific parameters that produce similar clusterings across views. We explore a flexible family of divergences, namely Rényi divergences, for coregularization. An existing method of probabilistic multiview clustering is recovered as a special case of the proposed method. Extensive empirical evaluation suggests improved performance over a variety of existing multiview clustering techniques as well as related methods developed for information fusion with multiview data.
Keywords
Multiview clustering, Rényi divergence, Coregularization
1 Introduction
Multiple views of entities are often readily available in modern datasets. For example, a webpage entity has text, images and hyperlinks, each of which can be considered a view of the webpage entity. A problem of practical interest is to harness complementary information available in multiple views to improve over conventional learning algorithms. Multiview learning has been studied as a potential framework to achieve such improved performance. Multiview methods operate with the assumption that different views cluster or label entities similarly. Such similarities have been exploited via cotraining (Blum and Mitchell 1998) and coregularization (Sindhwani and Rosenberg 2008). Cotraining learns one hypothesis for each view; these hypotheses then bootstrap one another to converge to a coherent model (Blum and Mitchell 1998). Coregularization, on the other hand, explicitly minimizes disagreement between views during training. Multiview methods have substantial theoretical and practical advantages over learning a single hypothesis by concatenating views (Nigam and Ghani 2000). For instance, Dasgupta et al. (2001) show that for semi-supervised multiview learning with two views, the probability of disagreement between views is an upper bound on the probability of error of either view’s hypothesis.
Due to noisy measurements or unknown biases, different views may not cluster the entities similarly. To be robust to such misspecification in model assumptions made by multiview clustering, we propose a method that maintains a separate posterior distribution for each view. In the proposed method, clustering coherence is imposed by encouraging posterior distributions of view-specific cluster memberships to be ‘close’ to each other, where closeness is measured via suitable divergences. Specifically, a weighted sum of divergences between current posterior estimates of cluster memberships is minimized. This coregularization technique is combined with Expectation-Maximization (EM) (Dempster et al. 1977) to maximize the log-likelihood. The training process thus alternates between an inference phase that estimates and updates view-wise posterior distributions to encourage coherence, followed by per-view parameter updates.
To specifically account for potential incoherence among views, we formulate the cost function as a weighted sum of Rényi divergences (Rényi 1960). Storkey et al. (2014) have observed that when aggregating opinions from biased experts or agents, the maximum entropy distribution is obtained via Rényi divergence aggregation (see Definition 1). An extreme case is when views do not agree on the cluster memberships, in which case linear aggregation provides the best aggregate posterior. For instance, Bickel and Scheffer use linear aggregation in CoEM (Bickel and Scheffer 2005), inadvertently assuming that different views mostly do not agree with respect to the cluster membership. Instead of assuming a bias-free condition, we explore the utility of various aggregation strategies applied to the coregularization framework. Hence, the proposed method can be applied with the Rényi divergence best suited to the level of discordance in view memberships. CoEM (Bickel and Scheffer 2005) is a special case of our framework, as it can be recovered as a specific setting of the Rényi divergence parameter for a fixed parametrization of weights, as shown in Sect. 4.3. Extensive empirical evaluations are presented to demonstrate improved performance over existing multiview clustering methods as well as other methods of fusing information from multiple views.

We propose a novel coregularized multiview clustering algorithm that minimizes weighted sums of Rényi divergences.

We show that an existing approach to probabilistic multiview clustering, namely CoEM can be recovered as a special case of our framework.

We present extensive empirical evaluation showing that the proposed class of methods significantly outperforms strong baselines. Moreover, the choice of Rényi divergence can affect clustering performance, while simultaneously capturing biases in view-specific posterior cluster memberships. Empirical evaluation also demonstrates that our methods handle mixed data, e.g. discrete and continuous data, very well.
2 Related work
Related work in multiview unsupervised learning goes back to neural network models, a few of which are noted here. Becker and Hinton (1992) and Schmidhuber and Prelinger (1993) maximize agreement between a given neural network module and a weighted output of its neighbors. De Sa and Ballard (1993) take advantage of complementary information available in different views by using separate modules for views feeding into a common output. Bickel and Scheffer (2005) introduced probabilistic multiview clustering using cotraining.
Relatively recent models, like those proposed by Chaudhuri et al. (2009) and Sa (2005), construct lower-dimensional projections using multiple views of data. However, these methods are only applicable when at most two views are available. Kumar et al. (2010) and Tzortzis and Likas (2012) explore multiple kernel learning techniques where each view is represented as a kernel. Closely related to kernel techniques are multiview spectral clustering methods, described in detail below.
Zhou and Burges (2007) propose a multiview spectral clustering method as a generalization of the normalized cuts algorithm. In a similar vein, Kumar and Daumé III (2011) iteratively update the similarity matrix of a given view based on the clustering of another view to produce a coherent clustering. Kumar et al. (2011) minimize disagreement between views by constraining the similarity matrices of views to be close in the Frobenius norm. While spectral methods are effective, they do not estimate cluster centroids, making interpretation and out-of-sample cluster assignments more challenging to implement. Our empirical studies show that the proposed methods outperform spectral multiview clustering methods.
Further, connections between nonnegative matrix factorization and clustering have also been utilized when multiple views are observed. For example, Liu et al. (2013) have shown that modeling user-feature matrices via multiview clustering based on nonnegative matrix factorization (NMF) admits better empirical clustering performance compared to collective matrix factorization (Akata et al. 2011), a popular method for combining information from multiple sources. This further illustrates the advantage of multiview clustering over other related methods. Another approach for multiview clustering using convex subspace representation learning has been proposed by Guo (2013). These methods estimate a subspace where different views are clustered similarly. Many of these methods, however, provide little insight into how views interact within the data. Probabilistic techniques such as ours are particularly useful when such exploration is required. Our empirical evaluation suggests improved performance of our models over NMF-based multiview clustering. In addition, many models also deal with partially missing views (Eaton et al. 2014; Li et al. 2014) and demonstrate improved performance using multiview clustering. Lian et al. (2015) use a shared latent factor model to model heterogeneous multiview data and can also handle arbitrarily missing views, i.e., the case when a complete view may be missing for a sample. However, this model assumes a shared latent matrix across all views, as opposed to the proposed method, which maintains separate cluster membership variables for each view. Our proposed methods can easily be extended to handle missing views by simply not coregularizing over the missing view.
Note that multiview clustering is distinct from cluster ensemble methods (Ghosh and Acharya 2011; Strehl and Ghosh 2003) that learn hypotheses for each view independently and find a consensus among the per-view results post-training. The latter methods do not share information during training and are thus more suitable for knowledge reuse (Strehl and Ghosh 2003).
One of the more popular applications of multiview clustering is to jointly model images and annotations, each constituting a view. The objective is to utilize annotations and images to learn the underlying clustering of images. This problem has been modeled in varied ways using unsupervised as well as supervised methods. We compare our multiview clustering framework to other relevant methods in the context of this application to motivate the differences in model assumptions. Recently, much supervised work has explored the utility of rich representations of label words and/or annotations in a high-dimensional embedding space (Mikolov et al. 2013). A mapping is learned directly from the image view to the word embedding space (annotation view) so that relevant tags or labels are closer under some similarity metric (Frome et al. 2013; Akata et al. 2013) or ranked higher compared to the rest (Weston et al. 2011). Additionally, Akata et al. (2013) learn a mapping to a predefined attribute space to extend supervised image classification to unseen labels. Thus in this case, the target labeling is the same as the text or label view. In contrast, multiview clustering models aim to find the best underlying grouping of data jointly, thus differing in the underlying modeling assumptions. The multiview clustering methods presented in this paper are for completely unsupervised scenarios, and thus do not assume availability of labels for images. Further, the target clustering does not necessarily have a one-to-one mapping to the annotation views. Hence, in our empirical evaluation, we only compare our models to unsupervised multiview methods with modeling and data assumptions similar to ours.
3 Preliminaries
The Rényi divergences (Rényi 1960) are a parametric family of divergences that share many properties with the KL-divergence. Since our focus is on using these divergences to measure distances between distributions over cluster labels, we will focus on Rényi divergences for distributions over discrete random variables.
Definition 1
The definition may be extended to divergences of other orders such as \(\gamma = 0, \ \gamma \rightarrow 1,\ \text {and} \ \gamma \rightarrow \infty \) (van Erven and Harremoës 2012). Rényi divergences are nonnegative \(\forall \gamma \in [0,\infty ]\). In addition, they are jointly convex in \( (p,\ q) \ \forall \gamma \in [0, 1]\) and convex in the second argument \(q \ \forall \gamma \in [0,\infty ]\). As discussed in the comprehensive survey of Rényi divergences by van Erven and Harremoës (2012), many commonly used divergences are recovered for specific choices of \(\gamma \). For example, \(\gamma = \frac{1}{2}\) and \(\gamma = 2\) give Rényi divergences which are closely related to the Hellinger and \(\chi ^2\) divergences, respectively, and the KL-divergence is recovered as a limiting case when \(\gamma {\,\rightarrow \,}1\). For the rest of the manuscript, we will abuse notation slightly and use \(p (\omega )\) and \(p ({\mathbf {z}})\) interchangeably to denote the same categorical distribution over outcomes in [K].
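As a point of reference, the standard form of the Rényi divergence of order \(\gamma \) between two categorical distributions (van Erven and Harremoës 2012) is \(D_\gamma (p \Vert q) = \frac{1}{\gamma - 1} \log \sum _{k=1}^{K} p_k^{\gamma } q_k^{1-\gamma }\), with the KL-divergence as the limit \(\gamma \rightarrow 1\). The sketch below evaluates this standard form. The function name `renyi_divergence` and the clipping constant `eps` are illustrative assumptions, and the argument order or scaling used in Definition 1 of the paper may differ.

```python
import math


def renyi_divergence(p, q, gamma, eps=1e-12):
    """Standard Rényi divergence D_gamma(p || q) for categorical distributions.

    p, q  : sequences of probabilities over the same K outcomes.
    gamma : order, gamma > 0; gamma == 1 is treated as the KL limit.
    eps   : hypothetical clipping constant that keeps q away from zero.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    if gamma <= 0:
        raise ValueError("this sketch only covers gamma > 0")
    if abs(gamma - 1.0) < 1e-9:
        # Limiting case gamma -> 1: KL divergence, using 0 * log 0 = 0.
        return sum(pk * math.log(pk / max(qk, eps))
                   for pk, qk in zip(p, q) if pk > 0)
    total = sum((pk ** gamma) * (max(qk, eps) ** (1.0 - gamma))
                for pk, qk in zip(p, q) if pk > 0)
    return math.log(total) / (gamma - 1.0)


# Two posteriors over K = 3 clusters.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
for order in (0.5, 1.0, 2.0):
    print(order, renyi_divergence(p, q, order))
```

The divergence is nondecreasing in the order, so the printed values grow from \(\gamma = 0.5\) to \(\gamma = 2\).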
4 Coregularized multiview clustering using Rényi divergence minimization
 For each n:
 For each view v:

Choose \({\mathbf {z}}_n^v \sim p ({\mathbf {z}}_n^v ; {\varvec{\pi }}_n)\), a categorical distribution parametrized by \({\varvec{\pi }}_n\).

Choose \({\mathbf {x}}_n^v \sim p ({\mathbf {x}}_n^v \mid z_{n,k}^v=1 , \varPsi _{k}^v)\), i.e., sample the feature vector from the \(k^{th}\) cluster.
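The generative steps above can be sketched as follows. This is a minimal illustration, assuming every view is a mixture of diagonal Gaussians. The function name `sample_multiview` and the containers `means` and `stds` are hypothetical stand-ins for the view-specific parameters \(\varPsi _{k}^v\); other view types such as multinomials would need a different sampling line.

```python
import random


def sample_multiview(pi, means, stds, rng):
    """Sample one entity with V views from the per-view mixture model.

    pi    : prior cluster probabilities pi_n (length K), shared by all views.
    means : means[v][k] is the mean vector of cluster k in view v
            (hypothetical Gaussian parametrization of Psi_k^v).
    stds  : stds[v][k] holds the matching per-dimension standard deviations.
    rng   : a random.Random instance.
    """
    num_clusters = len(pi)
    memberships, features = [], []
    for v in range(len(means)):
        # z_n^v ~ Categorical(pi_n): each view draws its own membership.
        k = rng.choices(range(num_clusters), weights=pi, k=1)[0]
        # x_n^v ~ p(x | z_{n,k}^v = 1, Psi_k^v): here a diagonal Gaussian.
        x = [rng.gauss(m, s) for m, s in zip(means[v][k], stds[v][k])]
        memberships.append(k)
        features.append(x)
    return memberships, features


rng = random.Random(0)
pi = [0.5, 0.5]
means = [[[0.0, 0.0], [5.0, 5.0]],  # view 0: two clusters in 2 dimensions
         [[-1.0], [1.0]]]            # view 1: two clusters in 1 dimension
stds = [[[1.0, 1.0], [1.0, 1.0]],
        [[0.5], [0.5]]]
z, x = sample_multiview(pi, means, stds, rng)
```

Because each view draws its own membership variable, two views of the same entity can land in different clusters, which is the incoherence the coregularization step is meant to reduce.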


4.1 Global coregularization
The proposed method minimizes a weighted sum of divergences between the current posterior or cluster membership estimates available at all views to estimate a new ‘global’ categorical distribution. We would like to trade off between the ‘global’ posterior (accounting for coregularization) and the view-specific unregularized posteriors. A new posterior distribution is estimated for every view \(v \in [V]\) by minimizing the sum of divergences between the global categorical distribution and the view-specific posteriors \(p ({\mathbf {z}}_n^i \mid {\mathbf {x}}_n^i, {\varvec{\varPsi }}^i) \ \forall i \in [V]\).
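To make the quantity being minimized concrete, the sketch below evaluates a weighted sum of Rényi divergences from the view-specific posteriors to a candidate global distribution. It is an illustration under stated assumptions, not the paper's procedure: the argument order \(D_\gamma (p_i \Vert g)\), the standard unscaled divergence, and the names `renyi` and `coregularization_objective` are all assumptions, and the inputs are taken to be strictly positive.

```python
import math


def renyi(p, q, gamma):
    """Standard Rényi divergence D_gamma(p || q); gamma == 1 gives KL."""
    if abs(gamma - 1.0) < 1e-9:
        return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)
    total = sum(a ** gamma * b ** (1.0 - gamma) for a, b in zip(p, q) if a > 0)
    return math.log(total) / (gamma - 1.0)


def coregularization_objective(g, view_posteriors, weights, gamma):
    """Weighted sum of divergences from each view posterior to candidate g."""
    return sum(w * renyi(p, g, gamma)
               for w, p in zip(weights, view_posteriors))


# Three views, K = 3 clusters, one sample n.
posteriors = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
weights = [0.5, 0.25, 0.25]

# Compare a few candidate global distributions at gamma = 1 (KL case).
candidates = {
    "uniform": [1.0 / 3, 1.0 / 3, 1.0 / 3],
    "view 0 posterior": posteriors[0],
    "weighted average": [0.6, 0.25, 0.15],
}
for name, g in candidates.items():
    print(name, coregularization_objective(g, posteriors, weights, 1.0))
```

Under this argument order and \(\gamma = 1\), the weighted average of the posteriors attains the smallest objective among all candidates, which follows from Gibbs' inequality. For other orders a numerical or variational procedure is needed in general.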
4.2 Local coregularization
We now consider two limiting cases, \(w_g =0\) and \(w_g = 1\) in (5). The first case (\(w_g = 0\)) is trivial as it does not use coregularization at all and is therefore equivalent to the ensemble method (Strehl and Ghosh 2003). The latter recovers a new method. We consider the nontrivial case \(w_g=1\) separately for several reasons. First, we are able to recover an existing multiview clustering algorithm, CoEM, as a special case of this setting for a certain choice of Rényi divergence (\(\gamma \rightarrow 1\)). Thus CoEM is also a special case of our most general setting, GRECO, for \(\gamma \rightarrow 1\) and \(w_g = 1\). Further, our empirical evaluation suggests that the most general case of the proposed method (GRECO) performs better in most cases than this special case (\(w_g = 1\)), suggesting that a nontrivial trade-off between the global posterior and the view-specific unregularized posterior in the E-step is advantageous. A useful analogy is the relationship between Gaussian mixture models and the k-means algorithm, used for soft and hard clustering respectively. The latter is a limiting case of the former (as widths go to zero, and using identical, isotropic covariances) but is considered a separate algorithm because of its special properties. The coregularization framework in GRECO with \(w_g = 1\) does not involve an additional trade-off between the global posterior and the unregularized view-specific posterior. The minimizer of (5) in this case is exactly equal to \(g^* ({\mathbf {z}}_n)\). Thus, the view-specific coregularized posterior \(q ({\mathbf {z}}_n^v)\) is equal to the global posterior \(g^* ({\mathbf {z}}_n)\). Note that in a given iteration, only view v is coregularized so that \(q ({\mathbf {z}}_n^v) = g^* ({\mathbf {z}}_n)\). All other views are not updated in the inference and learning steps of the same iteration. The procedure is then repeated for all views.
We call this algorithm LYRIC (LocallY weighted Rényi dIvergence Coregularization). As in GRECO, the outer loop of LYRIC iterates over each view v and the inner loop carries out a coherence-enforcing E-step for the given view followed by an M-step. The E-step comprises estimating independent view-specific posteriors followed by a local coregularization step that updates the current view’s posterior. It is important to highlight that LYRIC does not produce the same estimates as GRECO at every iteration. This is because the view-specific posteriors differ in each iteration for GRECO and LYRIC owing to the different stages of coregularization. The details of local coregularization (LYRIC) are explained in the following.
4.3 Special case I: \(\gamma \rightarrow 1\)
Specifically, if \(w_v = (1 - \alpha )\) for the view v currently being updated, and \(w_i = \frac{\alpha }{V-1}\) for \(i \ne v, i \in [V]\), where \(0 \le \alpha \le 1\), the LYRIC algorithm recovers CoEM when \(\gamma \rightarrow 1\). Thus CoEM is a special case of LYRIC.
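The text elsewhere associates \(\gamma \rightarrow 1\) with linear aggregation, so this special case can be pictured as a weighted arithmetic mean of the view posteriors. The sketch below shows only that pooling step, assuming the closed form is a linear pool; the function name `linear_pool` is hypothetical and the full CoEM update is not reproduced here.

```python
def linear_pool(posteriors, weights):
    """Weighted arithmetic mean of categorical posteriors (linear aggregation).

    posteriors : list of V posteriors, each a list of K probabilities.
    weights    : list of V nonnegative weights, ideally summing to 1.
    """
    num_clusters = len(posteriors[0])
    pooled = [sum(w * p[k] for w, p in zip(weights, posteriors))
              for k in range(num_clusters)]
    # The total is 1 when the weights sum to 1; renormalize defensively.
    total = sum(pooled)
    return [value / total for value in pooled]


# View 0 is being updated with alpha = 0.5, so it keeps weight 1 - alpha
# and the two remaining views share alpha equally.
posteriors = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
weights = [0.5, 0.25, 0.25]
pooled = linear_pool(posteriors, weights)
```

With \(\alpha = 0\) the pooled posterior equals the current view's own posterior, and with \(\alpha = 1\) it ignores the current view entirely.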
4.4 Special case II: \(\gamma \rightarrow 0\)
Note that coregularization in both GRECO and LYRIC adds an additional complexity of \(\mathcal {O} (NKV^2)\) per iteration compared to the unregularized method, where N is the sample size, K is the number of clusters and V is the number of views. As suggested before, the operations can be trivially parallelized over data samples as well as over the calculations required to estimate unnormalized variational parameters for each cluster (see Appendices 1, 2). For the case where all views are Gaussian mixtures, the complexity per outer iteration is \(\mathcal {O} (NKV^2T_{inner} + NKV + \sum _{v \in [V]}d_v^2 K )\), where \(T_{inner}\) is the number of inner iterations for variational estimation of coregularized posteriors and \(d_v\) is the dimension of view v. In each of the special cases described earlier, i.e. when \(\gamma \rightarrow 0\) and \(\gamma \rightarrow 1\), the complexity reduces to \(\mathcal {O} (NKV + \sum _{v \in [V]}d_v^2 K )\) per iteration, the same as that of CoEM, due to the closed-form solutions available for coregularization. In the general case, the largest source of computational overhead in the proposed algorithm is the variational procedure currently employed to impose coregularization. However, we are not bound to such a procedure, and any accelerated method for solving (3), (4) and (5) can be adopted. Further, our variational procedure is trivially parallelizable over samples (see Appendix 2 for the relevant proof) whereas coregularization/cotraining techniques for the baselines (see Sect. 5.1) are not. This allows us to improve training efficiency to scale to large datasets.
4.5 Choice of weights and Rényi divergences
For empirical studies, we parametrize the weights for easy comparison with baselines. Let \(0 \le \alpha \le 1\) be a scalar. For every view \(v \in [V]\) being updated, \(w_v = 1 - \alpha \). For all other views, \(w_i = \frac{\alpha }{V-1} \, \forall i \, \in [V], \ i \ne v\). That is, at every stage in the outer loop of either GRECO or LYRIC, the current view being updated is weighted by \(1-\alpha \) and the rest are weighted equally by \(\frac{\alpha }{V-1}\). This also ensures fair comparison with CoEM by maintaining the same parametrization of weights. As a result, the experiments isolate the effect of the divergence and show that a suitable choice of Rényi divergence can yield a significant boost in clustering performance. We evaluated the performance of GRECO and LYRIC for different choices of \(\alpha \) and \(\gamma \). Section 5.3 shows the performance of the model across different choices of the divergence parameter. Specifically, for comparison with baselines, we choose the best performing pair of \(\alpha \) and \(\gamma \) based on average accuracy of holdout clustering assignment across five trials.
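The weight parametrization and the selection rule can be sketched as below. The function names `view_weights` and `select_best`, and the dictionary layout keyed by `(alpha, gamma)`, are illustrative assumptions rather than the authors' code, and the accuracy values are made-up placeholders.

```python
def view_weights(v, num_views, alpha):
    """Weights for one outer-loop stage: 1 - alpha for the view being
    updated, alpha / (V - 1) for every other view."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if num_views < 2:
        raise ValueError("at least two views are required")
    other = alpha / (num_views - 1)
    return [1.0 - alpha if i == v else other for i in range(num_views)]


def select_best(accuracies_by_setting):
    """Return the (alpha, gamma) pair with the highest mean accuracy.

    accuracies_by_setting maps (alpha, gamma) to a list of per-trial
    holdout accuracies, e.g. five values for five trials.
    """
    def mean_accuracy(setting):
        trials = accuracies_by_setting[setting]
        return sum(trials) / len(trials)
    return max(accuracies_by_setting, key=mean_accuracy)


weights = view_weights(v=1, num_views=3, alpha=0.4)  # about [0.2, 0.6, 0.2]

# Placeholder accuracies for two settings, two trials each.
results = {(0.2, 0.1): [0.80, 0.90], (0.5, 1.0): [0.70, 0.70]}
best_alpha, best_gamma = select_best(results)
```

The weights always sum to one, so the same \(\alpha \) can be reused across datasets with different numbers of views.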
4.6 Prediction on holdout samples
5 Experiments
The proposed methods have been extensively compared with existing multiview clustering models to show that the choice of divergence obtained by tuning \(\gamma \) is significant, as well as to demonstrate that Rényi divergence is a reasonable choice for coregularization. All datasets were trained using both LYRIC and GRECO algorithms for different values of \(\gamma \in [0, 1]\) discretized in the corresponding log-space. Very high values of the Rényi divergence parameter did not significantly affect performance. The weights \({\mathbf {w}}\) are reparametrized as described in Sect. 4.5. For all datasets, ground-truth cluster labels are known and utilized for objective evaluation and comparison to baselines. All models and baselines were trained on the same training and holdout data for five trials, with the best performing models chosen based on average clustering accuracy for comparison purposes. The mapping between cluster labels and ground-truth labels is solved using Hungarian matching (Kuhn 1955). For comparison to baselines, we only report the best performance obtained across different choices of \({\mathbf {w}}\) and \(\gamma \). Holdout assignment results have only been compared to baselines that explicitly mention a mechanism to obtain holdout cluster assignments and empirically test the same. We report Clustering Accuracy, Precision, Recall, F-measure, NMI (Strehl and Ghosh 2003) and Entropy (Bickel and Scheffer 2005) for our evaluation. Lower entropy is better, while higher values of the other metrics indicate a better performing algorithm. All metrics are defined in Appendix 5. Note that the empirical evaluation here keeps the prior cluster distribution \({\varvec{\pi }}_n\) equal for all samples n for all probabilistic models, including GRECO and LYRIC, without loss of generality.
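Clustering accuracy depends on how cluster labels are matched to ground-truth labels. The sketch below finds the best one-to-one matching by brute force over permutations, which is a simple stand-in for Hungarian matching and is only practical for a small number of clusters. The function name `clustering_accuracy` is hypothetical, and the exact metric definitions in Appendix 5 may differ.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_labels, num_clusters):
    """Accuracy under the best one-to-one cluster-to-label mapping.

    Labels are integers in range(num_clusters). The search enumerates all
    K! mappings, so it is a stand-in for Hungarian matching at small K.
    """
    # contingency[c][t] counts samples in cluster c with true label t.
    contingency = [[0] * num_clusters for _ in range(num_clusters)]
    for t, c in zip(true_labels, cluster_labels):
        contingency[c][t] += 1
    best_matches = max(
        sum(contingency[c][mapping[c]] for c in range(num_clusters))
        for mapping in permutations(range(num_clusters))
    )
    return best_matches / len(true_labels)


truth = [0, 0, 1, 1, 2, 2]
clusters = [1, 1, 0, 0, 2, 0]  # same grouping up to relabeling, one mistake
accuracy = clustering_accuracy(truth, clusters, num_clusters=3)
```

For the larger datasets in this section (for example 200 classes), a polynomial-time assignment solver is required, since enumerating permutations is infeasible.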
Results demonstrating empirical convergence (in negative log-likelihood) of GRECO and LYRIC for a sample fold with multiple initializations have been included in Appendix 6.^{1} To the best of our knowledge, our empirical evaluation is the most extensive evaluation of multiview clustering methods compared to prior work in terms of the number of datasets, number of views and comparison to existing baselines.
5.1 Baselines

Shared Latent Variable Model (Joint): An alternative way of modeling multiple views is to have one latent variable that denotes the cluster membership across all views. This is called the ‘Joint’ model. This model is equivalent to concatenating views, especially in the most commonly assumed scenario, i.e. when all views are Gaussian mixtures with diagonal covariances.

Ensemble Clustering Model (Ensemble) (Strehl and Ghosh 2003): This model trains each view independently, followed by a consensus evaluation. To predict the hard clustering assignment, the label correspondence among views is obtained using Hungarian matching (Kuhn 1955). A single posterior is obtained using the same equation as (10) with KL-divergence (log-aggregation), followed by a MAP assignment. We compare to this method only when at most two views are available.

CoEM (Bickel and Scheffer 2005): CoEM estimates a mixture model per view subject to cross-entropy constraints. The weights for each view are parametrized by \(\eta \in [0,1]\) and the results corresponding to the best performing \(\eta \) are reported.

Coregularized Spectral Clustering (Coreg (Sp)) (Kumar et al. 2011): This is the state-of-the-art spectral multiview clustering method. The results corresponding to the best performing \(\lambda \) parameter (between 0.01 and 0.1, as suggested by the authors) are reported. The implementation provided by the authors is used.^{2}

Minimizing Disagreement (Mindis (Sp)) (Sa 2005): This is another spectral clustering technique, proposed by Sa (2005) for 2 views only. We use the implementation that Kumar et al. (2011) wrote and used for comparison.

CCA for Mixture Models (CCAmvc) (Chaudhuri et al. 2009): This method uses Canonical Correlation Analysis to project views on a lower dimensional space. This model can be used for 2 views only.

NMF based Multiview Clustering (NMFmvc) (Liu et al. 2013): This method uses nonnegative matrix factorization for multiview clustering. The original implementation provided by the authors was used for empirical evaluation.^{3}
5.2 Datasets

Twitter multiview ^{4} (Greene and Cunningham 2013): This is a collection of Twitter datasets in five topical areas (politicsUK, politicsIreland, Football etc.). Each user has views corresponding to users they follow, their followers, mentions, tweet content etc. We use the politicsuk dataset with three views (mentions, retweets and follows). The labels correspond to one of five party memberships of each user. Each view is a bag-of-words vector and is modeled as a mixture of multinomials for the probabilistic models.

WebKB ^{5}: This dataset consists of webpage information from four university websites: Cornell, Texas, Washington and Wisconsin. We show results for the Cornell dataset. Each sample is a webpage with two views: the text content (in bag-of-words format) and the weblinks into and out of the webpage (binary bag-of-words vector). Each webpage can be clustered into one of five topics. Each view is modeled as a mixture of multinomials.

NUS-Wide Object ^{6} (Chua et al. 2009): This dataset consists of 31 object classes. Of these, we subsample 10 classes in a balanced manner, with 50 samples belonging to each class. We use 6 views, namely edge histograms (mixture of Gaussians), bag-of-visual-words of SIFT features (mixture of multinomials), normalized correlogram (mixture of Gaussians), color histogram (mixture of multinomials), wavelet texture (mixture of Gaussians) and block-wise color moments (mixture of Gaussians).

CUB-200-2011 ^{7} (Wah et al. 2011): This dataset consists of 200 classes and 11,800 data samples. We use the binary attributes and Fisher Vector representations of images as our views. The binary attributes are modeled as a mixture of multinomials and the Fisher vectors as Gaussian mixtures. We assume diagonal covariances for all views modeled as a mixture of Gaussians in all datasets.
5.3 Results
Twitter data (politicsuk, 3 views), best results obtained for \(\gamma = 0.01\) for GRECO and LYRIC
Clustering results  

Method  Accuracy  Precision  Recall  F-measure  NMI  Entropy  Time (s) 
GRECO  0.9075 (0.0201)  0.9403 (0.0316)  0.8713 (0.0366)  0.9039 (0.0217)  0.7887 (0.0478)  0.2971 (0.1001)  7.1241 (1.0122) 
LYRIC  0.886 (0.0284)  0.9601 (0.01)  0.8441 (0.0596)  0.8976 (0.0372)  0.8045 (0.0403)  0.2434 (0.0431)  7.0888 (0.9755) 
CoEM  0.8346 (0.0488)  0.8973 (0.0346)  0.7559 (0.0757)  0.8197 (0.0566)  0.7058 (0.0406)  0.3876 (0.0536)  2.3714 (0.9746) 
Joint  0.7893 (0.0491)  0.7737 (0.0792)  0.7167 (0.0679)  0.7413 (0.0535)  0.5806 (0.053)  0.6497 (0.11)  0.3623 (0.1047) 
Ensemble  NA  NA  NA  NA  NA  NA  NA 
Coreg (Sp)  0.557 (0.0221)  0.7122 (0.0215)  0.4326 (0.0197)  0.5382 (0.0213)  0.5079 (0.018)  0.6329 (0.0293)  1.6324 (0.1538) 
CCAmvc  NA  NA  NA  NA  NA  NA  NA 
Mindis (Sp)  NA  NA  NA  NA  NA  NA  NA 
NMFmvc  0.4418 (0)  0.3802 (0)  0.972 (0)  0.5466 (0)  0.0161 (0)  1.5769 (0)  6.0709 (0.0895) 
Holdout cluster assignment results  
GRECO  0.9238 (0.0136)  0.9047 (0.0384)  0.9021 (0.0559)  0.9022 (0.0307)  0.7784 (0.0418)  0.3417 (0.0703)  NA 
LYRIC  0.8452 (0.0854)  0.8537 (0.0283)  0.8123 (0.1353)  0.8291 (0.0888)  0.6803 (0.0635)  0.4438 (0.0697)  NA 
CoEM  0.781 (0.0287)  0.8282 (0.0406)  0.6735 (0.0425)  0.7425 (0.0379)  0.5916 (0.0566)  0.4988 (0.1454)  NA 
Joint  0.769 (0.0644)  0.6629 (0.064)  0.7093 (0.1399)  0.6797 (0.0828)  0.4916 (0.0829)  0.7895 (0.1495)  NA 
Ensemble  NA  NA  NA  NA  NA  NA  NA 
CCAmvc  NA  NA  NA  NA  NA  NA  NA 
Cornell (WebKB, 2 views), best results obtained for \(\gamma = 0.1\) for GRECO and \(\gamma \rightarrow 1\) for LYRIC
Clustering results  

Method  Accuracy  Precision  Recall  F-measure  NMI  Entropy  Time (s) 
GRECO  0.5859 (0.0148)  0.431 (0.0385)  0.617 (0.0451)  0.5066 (0.0327)  0.2747 (0.0145)  1.5578 (0.0379)  0.3404 (0.0323) 
LYRIC  0.5885 (0.0254)  0.4135 (0.0351)  0.6591 (0.0292)  0.5075 (0.0295)  0.2771 (0.024)  1.5697 (0.0489)  0.3174 (0.03) 
CoEM  0.5269 (0.0325)  0.3753 (0.0324)  0.5485 (0.0777)  0.4432 (0.0374)  0.1908 (0.0187)  1.7216 (0.0366)  0.8036 (0.1299) 
Joint  0.4179 (0.025)  0.3232 (0.0184)  0.4805 (0.0849)  0.3846 (0.0334)  0.1405 (0.0084)  1.8257 (0.0051)  0.1855 (0.0178) 
Ensemble  0.5064 (0.0304)  0.3535 (0.0199)  0.6008 (0.1335)  0.4376 (0.0326)  0.2099 (0.0352)  1.7026 (0.0592)  0.0341 (0.0011) 
Coreg (Sp)  0.5551 (0.0494)  0.5083 (0.0157)  0.4596 (0.0354)  0.4824 (0.0252)  0.3929 (0.0167)  1.2719 (0.0386)  2.065 (0.0201) 
CCAmvc  0.4526 (0.014)  0.3118 (7e-04)  0.4751 (0.0304)  0.3762 (0.0095)  0.1665 (0.0019)  1.2664 (0.0903)  0.0786 (0.0188) 
Mindis (Sp)  0.3756 (0.0154)  0.32 (0.0023)  0.3116 (0.0524)  0.3139 (0.0251)  0.1614 (0.0048)  1.7744 (0.0207)  0.0904 (0.0366) 
NMFmvc  0.4103 (0)  0.2606 (0)  0.9605 (0)  0.41 (0)  0.0569 (0)  2.0497 (0)  5.6911 (0) 
Holdout cluster assignment results  
GRECO  0.4513 (0.0739)  0.2995 (0.051)  0.5782 (0.1985)  0.3872 (0.0795)  0.1777 (0.0573)  1.7211 (0.147)  NA 
LYRIC  0.5026 (0.0693)  0.3493 (0.0683)  0.5541 (0.2054)  0.4238 (0.1034)  0.2223 (0.1096)  1.63 (0.2153)  NA 
CoEM  0.4205 (0.0862)  0.2788 (0.0606)  0.538 (0.1851)  0.3626 (0.0908)  0.1762 (0.035)  1.7269 (0.0966)  NA 
Joint  0.4564 (0.0585)  0.2861 (0.0344)  0.6214 (0.0806)  0.39 (0.0391)  0.1934 (0.0583)  1.7096 (0.0844)  NA 
Ensemble  0.5487 (0.1082)  0.4123 (0.1742)  0.7356 (0.107)  0.5016 (0.1051)  0.2981 (0.1633)  1.5027 (0.407)  NA 
CCAmvc  0.4103 (0)  0.3103 (0.007)  0.4 (0.0123)  0.3494 (0.0074)  0.1192 (0.0191)  1.7107 (0.0361)  NA 
NUS-Wide Object dataset (6 views), best results obtained for \(\gamma = 0.1\) for GRECO and LYRIC
Clustering results  

Method  Accuracy  Precision  Recall  F-measure  NMI  Entropy  Time (s) 
GRECO  0.3805 (0.0089)  0.245 (0.0058)  0.3362 (0.0347)  0.2829 (0.0146)  0.3276 (0.0199)  2.2687 (0.0574)  8.0385 (1.2579) 
LYRIC  0.3805 (0.0089)  0.245 (0.0058)  0.3362 (0.0347)  0.2829 (0.0146)  0.3276 (0.0199)  2.2687 (0.0574)  8.0099 (1.2586) 
CoEM  0.347 (0.0118)  0.2171 (0.011)  0.3006 (0.0184)  0.2518 (0.0092)  0.2903 (0.0089)  2.3918 (0.0319)  4.3041 (0.7188) 
Joint  0.3115 (0.0151)  0.1882 (0.016)  0.346 (0.0303)  0.2437 (0.0202)  0.2454 (0.0157)  2.5884 (0.0481)  2.8231 (0.7605) 
Ensemble  NA  NA  NA  NA  NA  NA  NA 
Coreg (Sp)  0.3785 (0.0202)  0.2629 (0.0128)  0.2816 (0.0196)  0.2718 (0.0153)  0.318 (0.0162)  2.273 (0.0531)  2.5275 (0.0541) 
CCAmvc  NA  NA  NA  NA  NA  NA  NA 
Mindis (Sp)  NA  NA  NA  NA  NA  NA  NA 
NMFmvc  NA  NA  NA  NA  NA  NA  NA 
Holdout cluster assignment results  
GRECO  0.412 (0.0409)  0.225 (0.0228)  0.3369 (0.0177)  0.2691 (0.017)  0.4178 (0.0246)  1.9934 (0.0893)  NA 
LYRIC  0.412 (0.0409)  0.225 (0.0228)  0.3369 (0.0177)  0.2691 (0.017)  0.4178 (0.0246)  1.9934 (0.0893)  NA 
CoEM  0.372 (0.0217)  0.2074 (0.0232)  0.2964 (0.0405)  0.2437 (0.0289)  0.3975 (0.026)  2.052 (0.0856)  NA 
Joint  0.334 (0.0241)  0.1806 (0.019)  0.352 (0.0374)  0.2387 (0.0248)  0.329 (0.0294)  2.3533 (0.092)  NA 
Ensemble  NA  NA  NA  NA  NA  NA  NA 
CCAmvc  NA  NA  NA  NA  NA  NA  NA 
CUB-200-2011 (2 views), best results obtained for \(\gamma \rightarrow 1\) for GRECO and LYRIC
Clustering results  

Method  Accuracy  Precision  Recall  F-measure  NMI  Entropy  Time (s) 
GRECO  0.2231 (0.0039)  0.1052 (0.0034)  0.1757 (0.005)  0.1316 (0.0038)  0.5109 (0.006)  3.8498 (0.0508)  2255.5 (169.34) 
LYRIC  0.2189 (0.0061)  0.099 (0.004)  0.1748 (0.005)  0.1264 (0.0036)  0.5071 (0.0051)  3.8867 (0.045)  2069.6 (143.74) 
CoEM  0.0939 (0.0104)  0.0111 (0.0014)  0.0891 (0.0135)  0.0197 (0.0019)  0.301 (0.0146)  5.5905 (0.1318)  3355.9 (2382.4) 
Joint  0.0715 (0.0035)  0.0109 (2e-04)  0.0582 (0.0035)  0.0183 (2e-04)  0.2473 (0.0063)  5.9822 (0.0511)  2004.1 (124.85) 
Ensemble  0.0432 (9e-04)  0.0084 (3e-04)  0.0809 (0.0119)  0.0151 (3e-04)  0.1756 (0.0067)  6.5442 (0.0589)  767.78 (56.32) 
Coreg (Sp)  0.2118 (0.0081)  0.1031 (0.0042)  0.118 (0.0053)  0.11 (0.0046)  0.4896 (0.0059)  3.9224 (0.0431)  901.21 (11.716) 
CCAmvc  0.2213 (0.007)  0.0759 (0.0066)  0.1527 (0.0069)  0.1012 (0.006)  0.5003 (0.0038)  3.4551 (0.0454)  4.8814 (0.1651) 
Mindis (Sp)  0.1994 (0.0093)  0.0795 (0.0043)  0.1214 (0.0077)  0.0961 (0.0054)  0.4691 (0.0055)  4.1377 (0.0408)  594.78 (20.514) 
NMFmvc  NA  NA  NA  NA  NA  NA  NA 
Holdout cluster assignment results  
GRECO  0.2133 (0.0078)  0.0601 (0.0046)  0.1304 (0.008)  0.0822 (0.0057)  0.57 (0.0048)  3.4714 (0.0417)  NA 
LYRIC  0.2066 (0.0085)  0.0531 (0.0027)  0.1284 (0.0045)  0.0751 (0.0025)  0.5644 (0.0043)  3.5276 (0.0417)  NA 
CoEM  0.0712 (0.0712)  0.0086 (0.0086)  0.1208 (0.1208)  0.0159 (0.0159)  0.3347 (0.02)  5.5129 (0.1822)  NA 
Joint  0.0603 (0.0603)  0.0093 (0.0093)  0.0671 (0.0671)  0.0163 (0.0163)  0.3259 (0.0116)  5.5296 (0.0935)  NA 
Ensemble  0.0508 (0.0508)  0.0088 (0.0088)  0.09 (0.09)  0.016 (0.016)  0.2808 (0.0182)  5.8884 (0.1399)  NA 
CCAmvc  0.2512 (0.0064)  0.0727 (0.0048)  0.1444 (0.0081)  0.0965 (0.004)  0.6043 (0.0023)  3.158 (0.0248)  NA 
The proposed methods outperform almost all the baselines consistently across datasets. In addition, holdout cluster assignment performance is better for both models across most datasets. Improved performance over ensemble methods suggests that coregularization improves on viewwise clustering approaches. The results also suggest that sharing a single latent variable across views (see Joint model) is restrictive. In the low bias regime, GRECO has particular advantages over LYRIC because of the additional tradeoff in regularization. When the bias across views is low, the additional regularization potentially accelerates convergence by restricting the deviation from viewspecific unregularized posteriors, especially when initial model parameters may be noisy. In the high bias case, LYRIC shows some advantage (see Table 2, WebKB data). Overall, the general trend of performance of both GRECO and LYRIC is consistent for each dataset (see Fig. 3). In particular, for every dataset the performance of both algorithms peaks at the choice of \(\gamma \) that best captures the inherent biases across views, and this choice of divergence is the same for GRECO and LYRIC.
Similar observations on the WebKB data suggest a high degree of incoherence across views in the clustering distributions, as indicated by the fact that linear aggregation (\(\gamma \rightarrow 1\)) provides the best results on the holdout dataset. Note that in such a scenario, i.e. when views completely disagree (in terms of the MAP estimate of the clustering), learning each view independently is equally useful, as demonstrated by the competitive performance of Ensemble methods relative to GRECO and LYRIC. This further reinforces the advantage of our model in terms of robustness to violations of model assumptions. Figure 3 also suggests that as the assumed underlying bias increases, the performance of both LYRIC and GRECO consistently improves. In addition, the improvement over CoEM at \(\gamma \rightarrow 1\) suggests that the method proposed to estimate a holdout clustering assignment using (10) is better than or comparable to that of CoEM. Note that although GRECO and LYRIC do not perform the best on training data in terms of NMI and Entropy, the results on the holdout set are competitive, suggesting that the models do not overfit the training data.
On the NUSWide Object dataset, where six views are modeled jointly, the improvement in performance is significant when an appropriate divergence parameter \(\gamma \) is used, as compared to CoEM, which enforces linear aggregation, and to the joint model, which estimates a single clustering posterior across all views. This further suggests advantages of GRECO and LYRIC when the number of views available is large. The best performing divergence parameter is relatively high (\(\gamma = 0.1\)). This also suggests that as the number of views being modeled increases, the views are likely to be more incoherent, and an assumption of high bias (higher \(\gamma \)) is a better modeling assumption. This is also apparent from the deteriorated performance of the joint model. Both GRECO and LYRIC perform best at the limiting case \(w_g = 1\), as expected in a slightly high bias case, where the additional regularization of GRECO is not necessarily advantageous. Figure 3 also suggests that at lower values of \(\gamma \) both LYRIC and GRECO may be getting stuck in local minima (suggested by the high observed variance at \(\gamma = 0.01\)), potentially reflecting sensitivity to the choice of \(\gamma \) for this data.
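To make the contrast between linear aggregation and other divergence choices concrete, the following is a minimal sketch of the standard Rényi divergence of order \(\alpha \) between two discrete membership distributions, together with linear aggregation (a weighted mixture) of per-view posteriors. It assumes the textbook definition \(D_\alpha (p \Vert q) = \frac{1}{\alpha - 1} \log \sum_k p_k^{\alpha } q_k^{1-\alpha }\), which tends to the KL divergence as \(\alpha \rightarrow 1\); the exact parameterization by \(\gamma \) used in GRECO and LYRIC is defined earlier in the paper and may differ. The function names and the example posteriors are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def renyi_divergence(p, q, alpha):
    """Textbook Renyi divergence of order alpha > 0 (assumes q strictly positive).

    This is an illustrative definition, not necessarily the paper's
    gamma-parameterized form.
    """
    if math.isclose(alpha, 1.0):
        return kl_divergence(p, q)  # limiting case alpha -> 1
    total = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(total) / (alpha - 1.0)


def linear_aggregate(posteriors, weights):
    """Weighted mixture of per-view posteriors (weights assumed to sum to 1)."""
    n_clusters = len(posteriors[0])
    return [
        sum(w * post[k] for w, post in zip(weights, posteriors))
        for k in range(n_clusters)
    ]


# Hypothetical posteriors of one sample under two views, three clusters.
view_1 = [0.7, 0.2, 0.1]
view_2 = [0.5, 0.3, 0.2]
weights = [0.5, 0.5]

mixture = linear_aggregate([view_1, view_2], weights)
for alpha in (0.1, 0.5, 0.999):
    print(alpha, renyi_divergence(view_1, view_2, alpha))
```

The mixture returned by `linear_aggregate` is the distribution \(q\) that minimizes the weighted sum \(\sum_v w_v \, \mathrm{KL}(p_v \Vert q)\), which is why linear aggregation appears as the limiting case associated with KL divergence.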
For a large dataset like CUB-200-2011, with 200 clusters, \(\sim \)11,000 samples and high dimensionality (\(\sim \)8000), the improvement in unsupervised learning performance of GRECO and LYRIC over CoEM is more pronounced, even though the best performance is obtained at \(\gamma \rightarrow 1\). This suggests that our inference on the holdout set works better than CoEM (see Sect. 4.6 for details). Further, the best performing divergence parameter \(\gamma \rightarrow 1\) suggests that the attribute view and the Fisher vector view, used from the CUB_200_2011 data, are potentially incoherent in terms of the latent clustering distribution. Comparison to other probabilistic methods, i.e. the Joint model and the Ensemble model, suggests that restrictive model assumptions may fail and that general methods like GRECO and LYRIC may be more reliable in large scale settings. The Ensemble model also relies on Hungarian matching to solve the correspondence problem between cluster indices (200 clusters) across views. The improved performance of GRECO and LYRIC is obtained at a significant computational cost compared to CCAmvc, which provides comparable performance very quickly. This corroborates the model assumption made by CCAmvc, namely that views of a sample are uncorrelated conditioned on the cluster identity of the sample (a weaker assumption than those made by the Joint model), and shows that it can provide improvement in unsupervised learning performance. Faster inference for GRECO/LYRIC in such settings can be obtained by parallelization and/or any improvements to the variational inference procedure used to impose coregularization.
Overall, the best suited Rényi divergence differs from dataset to dataset, indicating that GRECO and LYRIC capture potential differences in coherence between views with respect to cluster memberships significantly better than comparable methods. The biases between views demonstrably affect clustering performance. This also suggests that the multiview assumption of a single underlying cluster membership distribution is not always satisfied in real data. Thus flexible models such as GRECO and LYRIC are preferable. All results further show that the choice of the class of Rényi divergences is beneficial for improving multiview clustering performance, and that both methods generalize better to unseen data compared to baselines.
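The correspondence problem mentioned above for the Ensemble model can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes hard cluster labels per view, builds a contingency table of label overlaps, and uses SciPy's `linear_sum_assignment` (which solves the same assignment problem as the Hungarian method) to find the relabeling with maximum overlap. The helper name `align_cluster_labels` and the toy labels are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_cluster_labels(reference, other, n_clusters):
    """Relabel `other` so that its cluster indices best agree with `reference`."""
    reference = np.asarray(reference)
    other = np.asarray(other)
    # overlap[i, j] = number of samples labeled i in `other` and j in `reference`
    overlap = np.zeros((n_clusters, n_clusters), dtype=int)
    np.add.at(overlap, (other, reference), 1)
    # linear_sum_assignment minimizes total cost, so negate to maximize overlap
    rows, cols = linear_sum_assignment(-overlap)
    mapping = np.empty(n_clusters, dtype=int)
    mapping[rows] = cols
    return mapping[other], mapping


# Two views that found the same partition under permuted cluster indices.
view_a = [0, 0, 1, 1, 2, 2]
view_b = [2, 2, 0, 0, 1, 1]
aligned_b, mapping = align_cluster_labels(view_a, view_b, n_clusters=3)
print(aligned_b.tolist())  # relabeled view_b, comparable index-by-index with view_a
```

With 200 clusters the overlap table is only 200 by 200, so the matching itself is cheap; the difficulty for an ensemble lies in the views disagreeing, not in the cost of the assignment step.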
A comparison of training time suggests that the increased accuracy of GRECO and LYRIC is obtained at the cost of increased training time. However, as suggested in Sect. 4, the variational update required for coregularization is the major contributing factor to training time. Since these updates can be trivially executed in a distributed setting across samples as well as for estimating unnormalized cluster membership distributions, the training time can be easily improved. Further, any alternative inference procedure to solve the coregularization constraint will directly improve training times for the proposed method. Also note that training times are comparable to CoEM and other baselines for special cases (see Tables 2, 4).
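The independence across samples that makes these updates easy to distribute can be illustrated with a small sketch. It is not the variational update of Appendix 1; as a stand-in, each sample's unnormalized log memberships are normalized into a probability vector, which touches no other sample. The function names are hypothetical, and the thread pool only demonstrates the structure: in CPython an actual speedup would require processes or a cluster.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def normalize_log_memberships(log_scores):
    """Turn one sample's unnormalized log memberships into probabilities."""
    shift = max(log_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]


def update_all_samples(log_score_rows, max_workers=4):
    """Apply the per-sample update independently to every sample."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order of the samples
        return list(pool.map(normalize_log_memberships, log_score_rows))


# Hypothetical unnormalized log memberships: two samples, two clusters.
rows = [[0.0, 0.0], [-1000.0, -1001.0]]
memberships = update_all_samples(rows)
```

Because no state is shared between samples, the same map can be split across machines with only the model parameters broadcast to each worker.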
Additional advantages of GRECO and LYRIC compared to other methods are noteworthy. Both the Twitter and WebKB datasets contain at least one view with relational data. The Twitter data is sparse (as is typical of social network data), i.e., many of the entries are 0. In these cases probabilistic methods outperform other methods, suggesting the importance of probabilistic models in general. The NUSWide Object and CUB datasets have mixed views, i.e. bag-of-words as well as numeric features (e.g. Fisher vector representations). Empirical evaluation also demonstrates that our methods handle mixed data well.
Some limitations of the proposed methods arise in selecting appropriate weights and the best suited Rényi divergence parameter for a given dataset. Storkey et al. (2014) have proposed a method for automatic selection of weights, which can be easily incorporated in GRECO or LYRIC via minor changes to the variational procedures described in Appendix 1. However, we chose manual selection of weights in order to highlight the significance of the choice of Rényi divergence as opposed to a finer choice of weights, and especially to highlight the generalization over CoEM. Note that automatically selecting or learning the best divergence parameter for a given dataset in an unsupervised setting is a challenging and novel problem that we expose. In particular, conventional model selection methods that trade off model complexity and likelihood are not applicable in this scenario, as model complexity does not change w.r.t. different \(\gamma \). Automatic selection of such a model parameter is deferred to future work. However, we point out that both GRECO and LYRIC provide better performance compared to all existing baselines for all choices of \(\gamma \) that we tested. A more appropriate choice of \(\gamma \) further boosts performance. When computational constraints exist, we suggest using either of the closed form methods described in Sect. 4.
6 Discussion and conclusion
This work proposed a coregularization approach to multiview clustering that builds on a novel idea of directly minimizing a weighted sum of divergences between viewspecific posteriors that indicate probabilities of cluster memberships. This approach encourages coherence between the posterior memberships by bringing them ‘closer’ in distribution. The resulting coregularization techniques, GRECO and LYRIC, significantly improve performance over existing multiview clustering methods. By maintaining perview posteriors and using a flexible choice of Rényi divergences for imposing coherence, these models are robust to incoherence among views. In addition, CoEM is recovered as a special case of LYRIC. CoEM proposes linear aggregation of posteriors, which is best suited when aggregating among incoherent posterior memberships. We show empirically that better performance can be achieved by accounting for incoherence via a flexible family of divergences. We also obtain closed form updates to impose coregularization for two special cases, when the divergence parameter \(\gamma \rightarrow 0\) and \(\gamma \rightarrow 1\).
For future work, a more general framework for multiview parameter estimation that accounts for divergence aggregation can be explored. Additional performance and computational gains may be obtained by learning the regularization weights and the divergence parameter \(\gamma \). Theoretical analyses of special cases and studies of the effects of other classes of divergences can provide insights for further developing such flexible models. Such a framework could also offer advantages when views may be arbitrarily missing, or in distributed settings where minimal interaction between views is expected due to communication constraints.
Acknowledgments
This work is supported in part by NSF SCH1417697 and the United States Office of Naval Research, Grant No. N000141410039. MDR is supported by an ARC Discovery Early Career Research Award (DE130101605). Authors would also like to acknowledge Suriya Gunasekar for reviewing the manuscript for clarity.
References
 Akata, Z., Perronnin, F., Harchaoui, Z., & Schmid, C. (2013). Labelembedding for attributebased classification. In IEEE conference on computer vision and pattern recognition (CVPR), 2013, pp. 819–826.
 Akata, Z., Thurau, C., & Bauckhage, C. (2011). Nonnegative matrix factorization in multimodality data for segmentation and label prediction. In W. Andreas, S. Sabine & G. Martin (Eds.), 16th Computer vision winter workshop.
 Becker, S., & Hinton, G. E. (1992). Selforganizing neural network that discovers surfaces in randomdot stereograms. Nature, 355, 161–163.
 Bickel, S., & Scheffer, T. (2005). Estimation of mixture models using CoEM. In Machine learning: ECML 2005, 16th European conference on machine learning, Porto, Portugal.
 Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with cotraining. In Proceedings of the eleventh annual conference on computational learning theory. ACM.
 Chaudhuri, K., Kakade, S. M., Livescu, K., & Sridharan, K. (2009). Multiview clustering via canonical correlation analysis. In Proceedings of the 26th annual international conference on machine learning.
 Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., & Zheng, Y.T. (2009). NUSWIDE: A realworld web image database from National University of Singapore. In Proceedings of ACM conference on image and video retrieval.
 Dasgupta, S., Littman, M. L., & McAllester, D. A. (2001). PAC generalization bounds for cotraining. In Advances in neural information processing systems NIPS.
 de Sa, V. R., & Ballard, D. H. (1993). Selfteaching through correlated input. In Computation and neural systems. New York: Springer.
 De Sa, V. R. (2005). Spectral clustering with two views. In Proceedings of the workshop on learning with multiple views, international conference on machine learning.
 Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B, 39, 1–38.
 Eaton, E., des Jardins, M., & Jacob, S. (2014). Multiview constrained clustering with an incomplete mapping between views. Knowledge and Information Systems, 38, 231–257.
 Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., & Mikolov, T., et al. (2013). Devise: A deep visualsemantic embedding model. In Advances in neural information processing systems, pp. 2121–2129.
 Garg, A., Jayram, T. S., Vaithyanathan, S., & Zhu, H. (2004). Generalized opinion pooling. In AMAI.
 Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
 Ghosh, J., & Acharya, A. (2011). Cluster ensembles. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1, 305–315.
 Greene, D., & Cunningham, P. (2013). Producing a unified graph representation from multiple social network views. In Proceedings of the 5th annual ACM web science conference.
 Guo, Y. (2013). Convex subspace representation learning from multiview data. In Proceedings of the twentyseventh AAAI conference on artificial intelligence, 2013.
 Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1), 83–97.
 Kumar, A., & Daumé III, H. (2011). A cotraining approach for multiview spectral clustering. In Proceedings of the 28th international conference on machine learning.
 Kumar, A., Rai, P., & Daumé III, H. (2010). Coregularized spectral clustering with multiple kernels. In NIPS Workshop on new directions in multiple kernel learning.
 Kumar, A., Rai, P., & Daumé III, H. (2011). Coregularized multiview spectral clustering. In Advances in neural information processing systems 24. Curran Associates, Inc.
 Li, S.Y., Jiang, Y., & Zhou, Z.H. (2014). Partial multiview clustering. In AAAI Conference on artificial intelligence.
 Lian, W., Rai, P., Salazar, E., & Carin, L. (2015). Integrating features and similarities: Flexible models for heterogeneous multiview data. In Twentyninth AAAI conference on artificial intelligence.
 Liu, J., Wang, C., Gao, J., & Han, J. (2013). Multiview clustering via joint nonnegative matrix factorization. In Proceedings of 2013 SIAM data mining conference.
 Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
 Nigam, K., & Ghani, R. (2000). Analyzing the effectiveness and applicability of cotraining. In Proceedings of the ninth international conference on information and knowledge management. ACM.
 Rényi, A. (1960). On measures of entropy and information. In Proceedings of the 4th Berkeley symposium on mathematics, statistics and probability.
 Schmidhuber, J., & Prelinger, D. (1993). A novel unsupervised classification method. In Third international conference on artificial neural networks, 1993.
 Sindhwani, V., & Rosenberg, D. S. (2008). An RKHS for multiview learning and manifold coregularization. In Proceedings of the 25th international conference on machine learning. ACM.
 Storkey, A., Zhu, Z., & Hu, J. (2014). A continuum from mixtures to products: Aggregation under bias. In ICML workshop on divergence methods for probabilistic inference.
 Strehl, A., & Ghosh, J. (2003). Cluster ensembles–A knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3, 583–617.
 Tzortzis, G., & Likas, A. (2012). Kernelbased weighted multiview clustering. 2013 IEEE 13th international conference on data mining.
 van Erven, T., & Harremoës, P. (2012). Rényi divergence and KullbackLeibler divergence. ArXiv eprints.
 Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The CaltechUCSD Birds2002011 Dataset. Technical Report CNSTR2011001, California Institute of Technology.
 Weston, J., Bengio, S., & Usunier, N. (2011). Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, (Vol. 11, pp. 2764–2770).
 Zhou, D., & Burges, C. J. C. (2007). Spectral clustering and transductive learning with multiple views. In Proceedings of the 24th international conference on machine learning.