
Discriminative clustering with representation learning with any ratio of labeled to unlabeled data


Abstract

We present a discriminative clustering approach in which the feature representation can be learned from data and which can moreover leverage labeled data. Representation learning can give a similarity-based clustering method the ability to automatically adapt to an underlying, yet hidden, geometric structure of the data. The proposed approach augments the DIFFRAC method with a representation learning capability, using a gradient-based stochastic training algorithm and an optimal transport algorithm with entropic regularization to perform the cluster assignment step. The resulting method is evaluated on several real datasets when varying the ratio of labeled data to unlabeled data and thereby interpolating between the fully unsupervised regime and the fully supervised regime. The experimental results suggest that the proposed method can learn powerful feature representations even in the fully unsupervised regime and can leverage even small amounts of labeled data to improve the feature representations and to obtain better clusterings of complex datasets.


Notes

  1. The code to produce the plots was adapted from Andrej Karpathy’s Matlab code, which can be found here: https://cs.stanford.edu/people/karpathy/cnnembed/.

  2. Their code may be found here: https://github.com/facebookresearch/deepcluster.

References

  • Alayrac, J.B., Bojanowski, P., Agrawal, N., Sivic, J., Laptev, I., Lacoste-Julien, S.: Unsupervised learning from narrated instruction videos. In: Conference on Computer Vision and Pattern Recognition, pp. 4575–4583 (2016)

  • Ardila, D., Kiraly, A.P., Bharadwaj, S., Choi, B., Reicher, J.J., Peng, L., Tse, D., Etemadi, M., Ye, W., Corrado, G., Naidich, D.P., Shetty, S.: End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat. Med. 25(6), 954–961 (2019)

  • Asano, Y.M., Rupprecht, C., Vedaldi, A.: Self-labelling via simultaneous clustering and representation learning. In: International Conference on Learning Representations (2020)

  • Bach, F.R., Harchaoui, Z.: DIFFRAC: a discriminative and flexible framework for clustering. In: Advances in Neural Information Processing Systems, pp. 49–56 (2007)

  • Bach, F.R., Jordan, M.I.: Learning spectral clustering, with application to speech separation. J. Mach. Learn. Res. 7, 1963–2001 (2006)

  • Bachman, P., Alsharif, O., Precup, D.: Learning with pseudo-ensembles. In: Advances in Neural Information Processing Systems, pp. 3365–3373 (2014)

  • Bachman, P., Hjelm, R.D., Buchwalter, W.: Learning representations by maximizing mutual information across views. In: Advances in Neural Information Processing Systems, pp. 15509–15519 (2019)

  • Basu, S., Banerjee, A., Mooney, R.: Semi-supervised clustering by seeding. In: International Conference on Machine Learning, pp 27–34 (2002)

  • Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res. 7, 2399–2434 (2006)

  • Belkin, M., Ma, S., Mandal, S.: To understand deep learning we need to understand kernel learning. In: International Conference on Machine Learning, pp. 540–548 (2018)

  • Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.: MixMatch: A holistic approach to semi-supervised learning. In: Advances in Neural Information Processing Systems, pp 5050–5060 (2019)

  • Bertsekas, D.P.: Nonlinear programming, 3rd edn. Athena Scientific, Belmont (2016)

  • Beyer, L., Zhai, X., Oliver, A., Kolesnikov, A.: S4L: self-supervised semi-supervised learning. In: International Conference on Computer Vision, pp 1476–1485 (2019)

  • Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning in semi-supervised clustering. In: International Conference on Machine Learning (2004)

  • Bo, L., Lai, K., Ren, X., Fox, D.: Object recognition with hierarchical kernel descriptors. In: Conference on Computer Vision and Pattern Recognition, pp 1729–1736 (2011)

  • Bock, R., Chilingarian, A., Gaug, M., Hakl, F., Hengstebeck, T., Jirina, M., Klaschka, J., Kotrc, E., Savicky, P., Towers, S., Vaicilius, A., Wittek, W.: Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope. Nucl. Instrum. Methods Phys. Res. A 516(2), 511–528 (2004)

  • Bojanowski, P., Joulin, A.: Unsupervised learning by predicting noise. In: International Conference on Machine Learning, pp 517–526 (2017)

  • Bojanowski, P., Lajugie, R., Bach, F., Laptev, I., Ponce, J., Schmid, C., Sivic, J.: Weakly supervised action labeling in videos under ordering constraints. In: European Conference on Computer Vision, pp 628–643 (2014)

  • Bojanowski, P., Lajugie, R., Grave, E., Bach, F., Laptev, I., Ponce, J., Schmid, C.: Weakly-supervised alignment of video with text. In: International Conference on Computer Vision, pp 4462–4470 (2015)

  • Bouveyron, C., Celeux, G., Murphy, T.B., Raftery, A.E.: Model-based clustering and classification for data science. With applications in R. Cambridge University Press, Cambridge (2019)

  • Byerly, A., Kalganova, T., Dear, I.: A branching and merging convolutional network with homogeneous filter capsules. CoRR abs/2001.09136 (2020)

  • Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of visual features. In: European Conference on Computer Vision, pp 139–156 (2018)

  • Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 1–27 (2011)

  • Chapelle, O., Schölkopf, B., Zien, A.: Semi-supervised learning, 1st edn. The MIT Press, London (2010)

  • Daniely, A., Frostig, R., Singer, Y.: Toward deeper understanding of neural networks: the power of initialization and a dual view on expressivity. In: Advances in Neural Information Processing Systems, pp 2253–2261 (2016)

  • Daniely, A., Frostig, R., Gupta, V., Singer, Y.: Random features for compositional kernels. CoRR abs/1703.07872 (2017)

  • Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: International Conference on Computer Vision, pp 1422–1430 (2015)

  • Dosovitskiy, A., Fischer, P., Springenberg, J.T., Riedmiller, M.A., Brox, T.: Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 38(9), 1734–1747 (2016)

  • van Engelen, J.E., Hoos, H.H.: A survey on semi-supervised learning. Mach. Learn. 109(2), 373–440 (2020)

  • Flammarion, N., Palaniappan, B., Bach, F.: Robust discriminative clustering with sparse regularizers. J. Mach. Learn. Res. 18, 1–50 (2017)

  • Fukumizu, K., Gretton, A., Lanckriet, G., Schölkopf, B., Sriperumbudur, B.K.: Kernel choice and classifiability for RKHS embeddings of probability distributions. In: Advances in Neural Information Processing Systems, vol 22 (2009)

  • Ghasedi Dizaji, K., Herandi, A., Deng, C., Cai, W., Huang, H.: Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In: International Conference on Computer Vision, pp 5747–5756 (2017)

  • Goodfellow, I.J., Bengio, Y., Courville, A.C.: Deep learning, adaptive computation and machine learning. MIT Press, London (2016)

  • Grandvalet, Y., Bengio, Y.: Semi-supervised learning by entropy minimization. In: Advances in Neural Information Processing Systems, pp 529–536 (2004)

  • Guyon, I., Gunn, S.R., Ben-Hur, A., Dror, G.: Result analysis of the NIPS 2003 feature selection challenge. In: Advances in Neural Information Processing Systems, pp 545–552 (2004)

  • Hagen, L.W., Kahng, A.B.: New spectral methods for ratio cut partitioning and clustering. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 11(9), 1074–1085 (1992)

  • Häusser, P., Mordvintsev, A., Cremers, D.: Learning by association - a versatile semi-supervised training method for neural networks. In: Conference on Computer Vision and Pattern Recognition, pp 626–635 (2017)

  • Hennig, C., Meila, M., Murtagh, F., Rocci, R.: Handbook of cluster analysis. Handbooks of modern statistical methods, CRC Press, United States (2015)

  • Hyvärinen, A., Morioka, H.: Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In: Advances in Neural Information Processing Systems, pp 3765–3773 (2016)

  • Iscen, A., Tolias, G., Avrithis, Y., Chum, O.: Label propagation for deep semi-supervised learning. In: Conference on Computer Vision and Pattern Recognition, pp 5070–5079 (2019)

  • Jalali, A., Han, Q., Dumitriu, I., Fazel, M.: Exploiting tradeoffs for exact recovery in heterogeneous stochastic block models. In: Advances in Neural Information Processing Systems, pp 4871–4879 (2016)

  • Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. IEEE Trans Big Data 7(3), 535–547 (2021)

  • Jones, C.: Representation learning for partitioning problems. PhD thesis, University of Washington (2020)

  • Joulin, A., Bach, F.R.: A convex relaxation for weakly supervised classifiers. In: International Conference on Machine Learning (2012)

  • Joulin, A., Bach, F.R., Ponce, J.: Discriminative clustering for image co-segmentation. In: Conference on Computer Vision and Pattern Recognition, pp 1943–1950 (2010)

  • Kamnitsas, K., Castro, D.C., Folgoc, L.L., Walker, I., Tanno, R., Rueckert, D., Glocker, B., Criminisi, A., Nori, A.V.: Semi-supervised learning via compact latent space clustering. In: International Conference on Machine Learning, pp 2464–2473 (2018)

  • Karp, R.M.: Reducibility among combinatorial problems. In: Complexity of Computer Computations, pp. 85–103. Springer, Boston (1975)

  • Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. University of Toronto (2009)

  • Law, M.T., Urtasun, R., Zemel, R.S.: Deep spectral clustering learning. In: International Conference on Machine Learning, pp 1985–1994 (2017)

  • LeCun, Y.: Modeles connexionnistes de l’apprentissage. PhD thesis, Université P. et M. Curie (Paris 6) (1987)

  • LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. In: Intelligent Signal Processing, IEEE Press, pp 306–351 (2001)

  • Lee, D.H.: Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In: International Conference on Machine Learning Workshop on Challenges in Representation Learning (2013)

  • Lee, J., Bahri, Y., Novak, R., Schoenholz, S.S., Pennington, J., Sohl-Dickstein, J.: Deep neural networks as Gaussian processes. In: International Conference on Learning Representations (2018)

  • Li, Y., Wang, G., Ji, X., Xiang, Y., Fox, D.: DeepIM: Deep iterative matching for 6D pose estimation. In: European Conference on Computer Vision, pp 695–711 (2018)

  • Löwe, S., O’Connor, P., Veeling, B.: Putting an end to end-to-end: gradient-isolated learning of representations. In: Advances in Neural Information Processing Systems, pp 3033–3045 (2019)

  • Lütkepohl, H.: Handbook of matrices. Wiley, Chichester (1996)

  • von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)

  • MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Berkeley Symposium on Mathematical Statistics and Probability (1967)

  • Mairal, J.: End-to-end kernel learning with supervised convolutional kernel networks. In: Advances in Neural Information Processing Systems, pp 1399–1407 (2016)

  • Mairal, J., Koniusz, P., Harchaoui, Z., Schmid, C.: Convolutional kernel networks. In: Advances in Neural Information Processing Systems, pp 2627–2635 (2014)

  • Matthews, A., Hron, J., Rowland, M., Turner, R.E., Ghahramani, Z.: Gaussian process behaviour in wide deep neural networks. In: International Conference on Learning Representations (2018)

  • McQueen, J., Meilă, M., VanderPlas, J., Zhang, Z.: Megaman: Scalable manifold learning in Python. J. Mach. Learn. Res. 17(148), 1–5 (2016)

  • Meila, M.: Spectral clustering. In: Handbook of cluster analysis, pp. 125–141. CRC Press, Boca Raton, FL (2016)

  • Meila, M., Shortreed, S.M., Xu, L.: Regularized spectral learning. In: Workshop on Artificial Intelligence and Statistics (2005)

  • Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of machine learning, adaptive computation and machine learning. MIT Press, London (2012)

  • Nesterov, Y.: Lectures on convex optimization, 2nd edn. Springer, Cham (2018)

  • Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: European Conference on Computer Vision, pp 69–84 (2016)

  • Oglic, D., Gärtner, T.: Nyström method with kernel k-means++ samples as landmarks. In: International Conference on Machine Learning, pp 2652–2660 (2017)

  • Oliver, A., Odena, A., Raffel, C.A., Cubuk, E.D., Goodfellow, I.J.: Realistic evaluation of deep semi-supervised learning algorithms. In: Advances in Neural Information Processing Systems, pp 3239–3250 (2018)

  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp 8024–8035 (2019)

  • Perez-Cruz, F., Bousquet, O.: Kernel methods and their potential use in signal processing. IEEE Signal Process. Mag. 21(3), 57–65 (2004)

  • Peyré, G., Cuturi, M.: Computational optimal transport. Found. Trends Mach. Learn. 11(5–6), 355–607 (2019)

  • Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, pp 1177–1184 (2007)

  • Schölkopf, B., Smola, A., Müller, K.R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10(5), 1299–1319 (1998)

  • Schrijver, A.: Combinatorial optimization: polyhedra and efficiency. Algorithms and Combinatorics, Springer, Berlin (2003)

  • Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S., Levine, S.: Time-contrastive networks: Self-supervised learning from video. In: International Conference on Robotics and Automation, pp 1134–1141 (2018)

  • Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)

  • Sinkhorn, R., Knopp, P.: Concerning nonnegative matrices and doubly stochastic matrices. Pac. J. Math. 21, 343–348 (1967)

  • Swamy, C.: Correlation clustering: maximizing agreements via semidefinite programming. In: ACM-SIAM Symposium on Discrete Algorithms, pp 526–527 (2004)

  • Thickstun, J., Harchaoui, Z., Foster, D.P., Kakade, S.M.: Invariances and data augmentation for supervised music transcription. In: International Conference on Acoustics, Speech and Signal Processing, pp 2241–2245 (2018)

  • Van Der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)

  • Virtanen, P., Gommers, R., Oliphant, T.E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S.J., Brett, M., Wilson, J., Millman, K.J., Mayorov, N., Nelson, A.R.J., Jones, E., Kern, R., Larson, E., Vázquez-Baeza, Y.: Scipy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17(3), 261–272 (2020)

  • Vrbik, I., McNicholas, P.D.: Fractionally-supervised classification. J. Classif. 32(3), 359–381 (2015)

  • Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. In: International Conference on Computer Vision, pp 2794–2802 (2015)

  • White, M., Schuurmans, D.: Generalized optimal reverse prediction. In: International Conference on Artificial Intelligence and Statistics, pp 1305–1313 (2012)

  • Williams, C.K.I., Seeger, M.W.: Using the Nyström method to speed up kernel machines. In: Advances in Neural Information Processing Systems, pp 682–688 (2000)

  • Wu, Z., Leahy, R.M.: An optimal graph theoretic approach to data clustering: theory and its application to image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 15(11), 1101–1113 (1993)

  • Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: Conference on Computer Vision and Pattern Recognition, pp 3733–3742 (2018)

  • Xie, J., Girshick, R.B., Farhadi, A.: Unsupervised deep embedding for clustering analysis. In: International Conference on Machine Learning, pp 478–487 (2016)

  • Xing, E.P., Jordan, M.I.: On semidefinite relaxation for normalized k-cut and connections to spectral clustering. Tech. Rep. UCB/CSD-03-1265, EECS Department, University of California, Berkeley (2003)

  • Xu, L., White, M., Schuurmans, D.: Optimal reverse prediction: a unified perspective on supervised, unsupervised and semi-supervised learning. In: International Conference on Machine Learning, pp 1137–1144 (2009)

  • Yang, J., Parikh, D., Batra, D.: Joint unsupervised learning of deep representations and image clusters. In: Conference on Computer Vision and Pattern Recognition, pp 5147–5156 (2016)

  • Yoder, J., Priebe, C.E.: Semi-supervised \(k\)-means\(++\). J. Stat. Comput. Simul. 87(13), 2597–2608 (2017)

  • Zass, R., Shashua, A.: Doubly stochastic normalization for spectral clustering. In: Advances in Neural Information Processing Systems, pp 1569–1576 (2006)

  • Zha, H., He, X., Ding, C.H.Q., Gu, M., Simon, H.D.: Spectral relaxation for k-means clustering. In: Advances in Neural Information Processing Systems, pp 1057–1064 (2001)

  • Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: European Conference on Computer Vision, pp 649–666 (2016)

Acknowledgements

The authors thank the reviewers for their valuable comments that helped to improve the manuscript. The authors gratefully acknowledge support from the National Science Foundation under grants NSF CCF-1740551 and NSF DMS-1810975, the program “Learning in Machines and Brains” of the Canadian Institute For Advanced Research, and faculty research awards. This work was first presented at the Women in Machine Learning Workshop in December 2019, for which the first author received travel funding from the National Science Foundation under grant NSF IIS-1833154.

Corresponding author

Correspondence to Corinne Jones.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix

Smoothness of the objective function

In this appendix we estimate the smoothness constants for “forward prediction” regularized least squares and “reverse prediction” least squares, following the terminology of Section 3 of Xu et al. (2009). Regularized forward prediction least squares learns to predict the label matrix Y from the features \(\Phi \):

$$\begin{aligned} \min _W \frac{1}{n}\Vert Y-\Phi W -\mathbb {1}_n b^T\Vert _F^2+\lambda \Vert W\Vert ^2_F\;. \end{aligned}$$

In contrast, reverse prediction least squares learns to predict the features \(\Phi \) from the labels Y:

$$\begin{aligned} \min _{W}\frac{1}{n}\Vert \Phi -YW\Vert _F^2\,. \end{aligned}$$

As noted by Xu et al. (2009), the solution of the forward prediction problem can be recovered from the solution of the reverse prediction problem as long as \(\Phi \) is full rank.

Now we return to Proposition 1, which compares the Lipschitz constants of the two objectives, and provide its proof.

Proof

After minimizing in the classifier variable W, the forward prediction objective reads

$$\begin{aligned} F_f(\Phi ) = \lambda {{\,\mathrm{tr}\,}}[YY^T\Pi _n(\Pi _n\Phi \Phi ^T\Pi _n + n\lambda {\text { I}})^{-1}\Pi _n]\,. \end{aligned}$$

Define \(G(\Phi ) = \left( \Pi _n\Phi \Phi ^T\Pi _n+ n\lambda {\text { I}}_{n}\right) ^{-1}\). The gradient of \(F_f\) is then

$$\begin{aligned} \nabla F_f(\Phi ) = -2\lambda \Pi _nG(\Phi ) \Pi _nYY^T\Pi _n G(\Phi ) \Pi _n\Phi . \end{aligned}$$

Since \(\Vert G(\Phi )\Vert _2\le 1/(n\lambda )\), \(\Vert YY^T\Vert _2\le n_\text {max}\), \(\Vert \Pi _n\Vert _2 \le 1\) and \(\Vert G(\Phi ) \Pi _n\Phi \Vert _2\le {\Vert \Pi _n\Phi \Vert _2}/{(n \lambda )}\), we obtain

$$\begin{aligned} \Vert \nabla F_f(\Phi ) \Vert _2&\le \frac{2Bn_{\max }}{n^2\lambda } {=}{:}L_f\,. \end{aligned}$$

Recall that the reverse prediction objective for fixed cluster assignments Y may be written as \(F_r(\Phi )\! =\! \frac{1}{n}{{\,\mathrm{tr}\,}}[({\text { I}}-P_Y)\Phi \Phi ^T] \) where \(P_Y=Y(Y^TY)^{-1}Y^T\) is an orthonormal projector. Its gradient, \(\nabla F_r(\Phi ) = \frac{2}{n}({\text { I}}-P_Y)\Phi \), can therefore be bounded as

$$\begin{aligned} \Vert \nabla F_r(\Phi )\Vert _2 \le 2B/n {=}{:}L_r\,. \end{aligned}$$

Hence, taking \(\lambda \ge n_{\max }/n\), we have \(L_f\le L_r\). \(\square \)
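These bounds are straightforward to check numerically. The following minimal sketch (an illustration only; the problem sizes, the random data, and variable names such as `n_max` are our own choices, not taken from the paper) evaluates the gradients derived above on a random instance and verifies that their spectral norms respect \(L_f\) and \(L_r\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, k = 200, 30, 5
lam = 1.0                                  # any lambda >= n_max / n makes L_f <= L_r below

Phi = rng.normal(size=(n, D))              # random features
Y = np.eye(k)[rng.integers(0, k, size=n)]  # random one-hot cluster assignments (n x k)
n_max = Y.sum(axis=0).max()                # size of the largest cluster
B = np.linalg.norm(Phi, 2)                 # spectral-norm bound on Phi

Pi = np.eye(n) - np.ones((n, n)) / n       # centering projector Pi_n
G = np.linalg.inv(Pi @ Phi @ Phi.T @ Pi + n * lam * np.eye(n))

grad_Ff = -2 * lam * Pi @ G @ Pi @ Y @ Y.T @ Pi @ G @ Pi @ Phi
P_Y = Y @ np.linalg.inv(Y.T @ Y) @ Y.T
grad_Fr = (2 / n) * (np.eye(n) - P_Y) @ Phi

L_f = 2 * B * n_max / (n ** 2 * lam)
L_r = 2 * B / n
print(np.linalg.norm(grad_Ff, 2) <= L_f)   # True
print(np.linalg.norm(grad_Fr, 2) <= L_r)   # True
print(L_f <= L_r)                          # True since lambda >= n_max / n here
```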

Before moving on to the smoothness of the gradient we prove a lemma. The lemma estimates the Lipschitz constant of the gradient of the “forward prediction” objective function \(F_f(\Phi )\) from Sect. 4.1.

Lemma 1

Consider a feature matrix \(\Phi \in \mathbb {R}^{n\times D}\) from the set of all possible feature matrices \(\mathcal {Z}\). Define the function \(F_f:\mathbb {R}^{n\times D}\rightarrow \mathbb {R}\) by \(F_f(\Phi ) = \lambda {{\,\mathrm{tr}\,}}[YY^T\Pi _n(\Pi _n \Phi \Phi ^T\Pi _n + n\lambda {\text { I}})^{-1}\Pi _n].\) Assume there exists B such that for all \(\Phi \in \mathcal {Z}\), \(\Vert \Phi \Vert _2\le B\). Then for all \(\Phi _1,\Phi _2\in \mathcal {Z}\),

$$\begin{aligned}&\left\| \nabla F_f(\Phi _1)-\nabla F_f(\Phi _2)\right\| _2 \\ \le&\left( \frac{8B^2n_{\max }}{n^3\lambda ^2} + \frac{2n_{\max }}{n^2\lambda }\right) \Vert \Phi _1-\Phi _2\Vert _2\,. \end{aligned}$$

Hence, an upper bound on the Lipschitz constant of the gradient of \(F_f(\Phi )\) is given by

$$\begin{aligned} \ell _f {:}{=}\frac{2n_{\max }}{n^2\lambda }+\frac{8B^2n_{\max }}{n^3\lambda ^2} \,. \end{aligned}$$

Proof

Note that the gradient of \(F_f\) is given by \(\nabla F_f(\Phi ) = -2\lambda \Pi _n G(\Phi ) \Pi _nYY^T\Pi _n G(\Phi ) \Pi _n\Phi \), where we have defined \(G(\Phi )=(\Pi _n\Phi \Phi ^T\Pi _n+n\lambda I)^{-1}\). Now define \({\tilde{Y}} = \Pi _nY\), \({\tilde{\Phi }}_1= \Pi _n\Phi _1\), and \({\tilde{\Phi }}_2= \Pi _n\Phi _2\). Moreover, let \(\Vert \cdot \Vert =\Vert \cdot \Vert _2\) denote the spectral norm. Using the fact that \(\Vert \Pi _n\Vert \le 1\) since \(\Pi _n\) is a projection matrix, observe that

$$\begin{aligned}&\frac{1}{2\lambda }\left\| \nabla F_f(\Phi _1)-\nabla F_f(\Phi _2)\right\| \\ =\,&\left\| \Pi _n \left[ G(\Phi _1){\tilde{Y}}{\tilde{Y}}^TG(\Phi _1){\tilde{\Phi }}_1-G(\Phi _2){\tilde{Y}}{\tilde{Y}}^T G(\Phi _2){\tilde{\Phi }}_2\right] \right\| \\ \le \,&\underbrace{\left\| G(\Phi _1){\tilde{Y}}{\tilde{Y}}^TG(\Phi _1){\tilde{\Phi }}_1-G(\Phi _2){\tilde{Y}}{\tilde{Y}}^TG(\Phi _2){\tilde{\Phi }}_1 \right\| }_{(a)} \\&+ \underbrace{\left\| G(\Phi _2){\tilde{Y}}\tilde{Y}^TG(\Phi _2){\tilde{\Phi }}_1-G(\Phi _2){\tilde{Y}}{\tilde{Y}}^T G(\Phi _2){\tilde{\Phi }}_2 \right\| }_{(b)}\,. \end{aligned}$$

First consider term (a). We have that

$$\begin{aligned}&\left\| G(\Phi _1){\tilde{Y}}{\tilde{Y}}^TG(\Phi _1){\tilde{\Phi }}_1-G(\Phi _2){\tilde{Y}}{\tilde{Y}}^TG(\Phi _2){\tilde{\Phi }}_1 \right\| \\ \le \,&\underbrace{\left\| G(\Phi _1){\tilde{Y}}{\tilde{Y}}^TG(\Phi _1)-G(\Phi _2){\tilde{Y}}{\tilde{Y}}^TG(\Phi _2) \right\| }_{(c)} \Vert \Phi _1\Vert \,. \end{aligned}$$

We may bound term (c) by

$$\begin{aligned}&\left\| G(\Phi _1){\tilde{Y}}{\tilde{Y}}^TG(\Phi _1)-G(\Phi _2){\tilde{Y}}{\tilde{Y}}^TG(\Phi _2) \right\| \\&\quad \le \, \left\| G(\Phi _1){\tilde{Y}}{\tilde{Y}}^TG(\Phi _1)-G(\Phi _1){\tilde{Y}}{\tilde{Y}}^TG(\Phi _2) \right\| \\&\qquad +\left\| G(\Phi _1){\tilde{Y}}{\tilde{Y}}^TG(\Phi _2)-G(\Phi _2){\tilde{Y}}{\tilde{Y}}^T G(\Phi _2) \right\| \\&\quad \le \, \left\| G(\Phi _1){\tilde{Y}}{\tilde{Y}}^T\right\| \underbrace{\left\| G(\Phi _1)-G(\Phi _2) \right\| }_{(d)} \\&\qquad + \left\| G(\Phi _2){\tilde{Y}}{\tilde{Y}}^T\right\| \left\| G(\Phi _1)-G(\Phi _2) \right\| \,. \end{aligned}$$

Furthermore, we can bound term (d) via

$$\begin{aligned}&\left\| G(\Phi _1)-G(\Phi _2) \right\| \\&\quad =\, \left\| G(\Phi _1)\left[ G(\Phi _1)^{-1}-G(\Phi _2)^{-1}\right] G(\Phi _2) \right\| \\&\quad \le \, \left\| G(\Phi _1)\right\| \left\| G(\Phi _2)\right\| \underbrace{\left\| G(\Phi _1)^{-1}-G(\Phi _2)^{-1}\right\| }_{(e)} \;. \end{aligned}$$

Finally, we can bound term (e) using

$$\begin{aligned}&\left\| G(\Phi _1)^{-1}-G(\Phi _2)^{-1}\right\| \\&\quad =\, \left\| \Pi _n \Phi _1\Phi _1^T \Pi _n- \Pi _n\Phi _2\Phi _2^T \Pi _n\right\| \\&\quad \le \, \left\| \Phi _1\Phi _1^T-\Phi _1\Phi _2^T\right\| + \left\| \Phi _1\Phi _2^T-\Phi _2\Phi _2^T\right\| \\&\quad \le \, \Vert \Phi _1\Vert \Vert \Phi _1-\Phi _2\Vert + \Vert \Phi _2\Vert \Vert \Phi _1-\Phi _2\Vert \,. \end{aligned}$$

Using this above, a bound on term (a) is thus

$$\begin{aligned}&\left\| G(\Phi _1){\tilde{Y}}{\tilde{Y}}^TG(\Phi _1){\tilde{\Phi }}_1-G(\Phi _2){\tilde{Y}}{\tilde{Y}}^TG(\Phi _2){\tilde{\Phi }}_1 \right\| \\&\quad \le \, \left( \left\| G(\Phi _1){\tilde{Y}}{\tilde{Y}}^T\right\| + \left\| G(\Phi _2){\tilde{Y}}{\tilde{Y}}^T\right\| \right) \left\| G(\Phi _1)\right\| \left\| G(\Phi _2)\right\| \\&\qquad \times \left( \Vert \Phi _1\Vert +\Vert \Phi _2\Vert \right) \Vert \Phi _1\Vert \Vert \Phi _1-\Phi _2\Vert \,. \end{aligned}$$

Next, consider term (b). We have that

$$\begin{aligned}&\left\| G(\Phi _2){\tilde{Y}}{\tilde{Y}}^TG(\Phi _2){\tilde{\Phi }}_1-G(\Phi _2){\tilde{Y}}{\tilde{Y}}^T G(\Phi _2){\tilde{\Phi }}_2 \right\| \\&\quad \le \, \left\| G(\Phi _2){\tilde{Y}}{\tilde{Y}}^TG(\Phi _2) \right\| \left\| \Phi _1-\Phi _2\right\| \,. \end{aligned}$$

Therefore, returning to the original quantity of interest, we have

$$\begin{aligned}&\frac{1}{2\lambda }\left\| \nabla F_f(\Phi _1)-\nabla F_f(\Phi _2)\right\| \\&\quad \le \, \left\{ \left( \left\| G(\Phi _1){\tilde{Y}}{\tilde{Y}}^T\right\| + \left\| G(\Phi _2){\tilde{Y}}{\tilde{Y}}^T\right\| \right) \left\| G(\Phi _1)\right\| \left\| G(\Phi _2)\right\| \right. \\&\qquad \left. \times \left( \Vert \Phi _1\Vert +\Vert \Phi _2\Vert \right) \Vert \Phi _1\Vert + \left\| G(\Phi _2){\tilde{Y}}{\tilde{Y}}^TG(\Phi _2) \right\| \right\} \\&\qquad \times \Vert \Phi _1-\Phi _2\Vert \,. \end{aligned}$$

Next, note that \(\Vert YY^T\Vert _2\le n_{\max }\), where \(n_{\max }\) is a bound on the maximum size of the clusters. Lastly, \(\Vert G(\Phi _1)\Vert _2\le 1/(n\lambda )\) and \(\Vert G(\Phi _2)\Vert _2\le 1/(n\lambda )\). Therefore, we have

$$\begin{aligned}&\frac{1}{2\lambda }\left\| \nabla F_f(\Phi _1)-\nabla F_f(\Phi _2)\right\| _2\\&\quad \le \, \left( \frac{4B^2n_{\max }}{n^3\lambda ^3} + \frac{n_{\max }}{n^2\lambda ^2}\right) \Vert \Phi _1-\Phi _2\Vert _2\,, \end{aligned}$$

and so an upper bound on the Lipschitz constant is given by

$$\begin{aligned} \ell _f {:}{=}\frac{2n_{\max }}{n^2\lambda }+\frac{8B^2n_{\max }}{n^3\lambda ^2} \,. \end{aligned}$$

\(\square \)

Recall that Proposition 2 from Sect. 4.1 compares the Lipschitz constants of the gradients of the forward and reverse prediction objectives. Here we provide its proof.

Proof

The gradient of \(F_f\) is given by

$$\begin{aligned} \nabla F_f(\Phi ) = -2\lambda \Pi _nG(\Phi ) \Pi _nYY^T\Pi _n G(\Phi ) \Pi _n\Phi \,. \end{aligned}$$

By Lemma 1 we have that

$$\begin{aligned}&\left\| \nabla F_f(\Phi _1)-\nabla F_f(\Phi _2)\right\| _2\\&\quad \le \left( \frac{2n_{\max }}{n^2\lambda } + \frac{8B^2n_{\max }}{n^3\lambda ^2}\right) \Vert \Phi _1-\Phi _2\Vert _2\,. \end{aligned}$$

Next, observe that the gradient of \(F_r\) is

$$\begin{aligned} \nabla F_r(\Phi ) = \frac{2}{n}({\text { I}}-P_Y)\Phi \,. \end{aligned}$$

Hence, we have

$$\begin{aligned} \Vert \nabla F_r(\Phi _1)-\nabla F_r(\Phi _2)\Vert _2 \le \frac{2}{n}\Vert \Phi _1-\Phi _2\Vert _2\,. \end{aligned}$$

For \(\lambda \ge n_{\max }/(2n)+\sqrt{n_{\max }^2+16B^2n_{\max }}/(2n)\), we therefore have \(\ell _f\le \ell _r\). \(\square \)
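As with Proposition 1, the constants in Lemma 1 and Proposition 2 can be sanity-checked on random data. The sketch below (illustrative only; the finite-difference ratio is merely a lower bound on the true Lipschitz constant of \(\nabla F_f\)) estimates the variation of \(\nabla F_f\) between two nearby feature matrices and compares \(\ell _f\) with \(\ell _r=2/n\) for a \(\lambda \) slightly above the stated threshold.

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, k = 100, 20, 4
Phi1 = rng.normal(size=(n, D))
Phi2 = Phi1 + 0.01 * rng.normal(size=(n, D))       # a nearby feature matrix
Y = np.eye(k)[rng.integers(0, k, size=n)]
n_max = Y.sum(axis=0).max()
B = max(np.linalg.norm(Phi1, 2), np.linalg.norm(Phi2, 2))
Pi = np.eye(n) - np.ones((n, n)) / n

# lambda slightly above the threshold of Proposition 2
lam = 1.01 * (n_max / (2 * n) + np.sqrt(n_max ** 2 + 16 * B ** 2 * n_max) / (2 * n))

def grad_Ff(Phi):
    G = np.linalg.inv(Pi @ Phi @ Phi.T @ Pi + n * lam * np.eye(n))
    return -2 * lam * Pi @ G @ Pi @ Y @ Y.T @ Pi @ G @ Pi @ Phi

ratio = (np.linalg.norm(grad_Ff(Phi1) - grad_Ff(Phi2), 2)
         / np.linalg.norm(Phi1 - Phi2, 2))
ell_f = 2 * n_max / (n ** 2 * lam) + 8 * B ** 2 * n_max / (n ** 3 * lam ** 2)
ell_r = 2 / n
print(ratio <= ell_f <= ell_r)                     # True for this choice of lambda
```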

NP-completeness of the label assignment problem

Now we address the problem of optimizing the labels for the unlabeled data. The following proposition shows that this discrete problem is in general NP-complete for \(k>2\).

Proposition 3

Let \(A\in \mathbb {R}^{n\times n}\). The label assignment problem

$$\begin{aligned} \min _Y\,&{{\,\mathrm{tr}\,}}(YY^TA)\\ s.t.\,&\sum _{j=1}^k Y_{ij}=1, \quad i=1,\dots , n\\&Y_{ij}\in \{0,1\} \quad \forall \, i=1,\dots , n,\, j=1,\dots , k \end{aligned}$$

is NP-complete for \(k>2\).

Proof

The proof follows by showing that the k-coloring problem is a special case of the label assignment problem. Let G be an undirected, unweighted graph with no self-loops. Define \(A\in \{0,1\}^{n\times n}\) to be the adjacency matrix of G. Then G is k-colorable if and only if the following problem has minimum value zero:

$$\begin{aligned} \min _Y\,&\sum _{j=1}^k \sum _{i,i'\in A} Y_{i,j}Y_{i',j} \\ s.t.\,&\sum _{j=1}^k Y_{ij}=1, \quad i=1,\dots , n\\&Y_{ij}\in \{0,1\} \quad \forall \, i=1,\dots , n, \, j=1,\dots , k\,. \end{aligned}$$

Noting that

$$\begin{aligned} \sum _{j=1}^k \sum _{i,i'\in A} Y_{i,j}Y_{i',j} = {{\,\mathrm{tr}\,}}(YY^TA)\,, \end{aligned}$$

we may rewrite the above problem as

$$\begin{aligned} \min _Y\,&{{\,\mathrm{tr}\,}}(YY^TA) \\ s.t.\,&\sum _{j=1}^k Y_{ij}=1, \quad \forall \, i=1,\dots , n\\&Y_{ij}\in \{0,1\} \quad \forall \, i=1,\dots , n,\, j=1,\dots , k\,. \end{aligned}$$

This is the label assignment problem in the special case in which A is the adjacency matrix of a graph. Therefore, as the k-coloring problem is NP-complete for \(k>2\) (Karp 1975), the label assignment problem with discrete assignments is also NP-complete for \(k>2\). \(\square \)
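To make the reduction concrete, the following toy sketch (our own example, not part of the paper) brute-forces \({{\,\mathrm{tr}\,}}(YY^TA)\) over all assignment matrices Y for two small graphs and confirms that the minimum is zero exactly when a proper k-coloring exists.

```python
import itertools
import numpy as np

# Adjacency matrix of a 4-cycle: 2-colorable, hence also 3-colorable.
A_cycle = np.array([[0, 1, 0, 1],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [1, 0, 1, 0]])
k = 3

def min_assignment_value(A, k):
    """Brute-force minimum of tr(Y Y^T A) over one-hot assignment matrices Y."""
    best = np.inf
    for labels in itertools.product(range(k), repeat=A.shape[0]):
        Y = np.eye(k)[list(labels)]
        best = min(best, np.trace(Y @ Y.T @ A))
    return best

print(min_assignment_value(A_cycle, k))                      # 0.0: the 4-cycle is 3-colorable
print(min_assignment_value(np.ones((4, 4)) - np.eye(4), k))  # 2.0 > 0: K4 is not 3-colorable
```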

An alternative relaxation

Bach and Harchaoui (2007) propose alternative relaxations of the labeling subproblem. Define \(\lambda _1 \le \lambda _2 \le \cdots \le \lambda _n\) to be the eigenvalues of the equivalence matrix M and let \(\lambda _0 > 0\). In Section 2.6 of their paper, Bach and Harchaoui suggest solving the problem

$$\begin{aligned} \min _{M \in \mathbb {R}^{n \times n}} \quad&{{\,\mathrm{tr}\,}}(M^T A) \nonumber \\ \text{ subject } \text{ to } \quad&M=M^T \nonumber \\&{{\,\mathrm{tr}\,}}(M) = n \nonumber \\&M \succeq 0 \nonumber \\&\sum _{i=1}^n \min \left\{ \frac{\lambda _i}{\lambda _0}, 1\right\} \ge k \end{aligned}$$
(6)

in the unsupervised setting.

1.1 Derivation of the solution

Note that the symmetric and positive semi-definite constraints imply that we can write \(M=U\Lambda U^T\) where U contains an orthonormal set of eigenvectors of M and \(\Lambda \ge 0\) is a diagonal matrix containing the corresponding eigenvalues. After rewriting \({{\,\mathrm{tr}\,}}(M^T A)=\sum _{i=1}^n \lambda _iu_i^TAu_i\), with \(u_i = U_{\cdot , i}\), we obtain the problem

$$\begin{aligned} \min _{U \in \mathbb {R}^{n \times n}, \lambda _1,\dots , \lambda _n\in \mathbb {R}} \quad&\sum _{i=1}^n \lambda _iu_i^TAu_i \\ \text{ subject } \text{ to } \quad&\sum _{i=1}^n \lambda _i = n\\&\sum _{i=1}^n \min \left\{ \frac{\lambda _i}{\lambda _0}, 1\right\} \ge k \\&u_i^Tu_i = 1 \quad \forall i \\&u_i^Tu_j = 0 \quad \forall i\ne j \\&\lambda _i \ge 0 \quad \forall i \,. \end{aligned}$$

Introducing Lagrange multipliers and defining the Lagrangian

$$\begin{aligned}&\mathcal {L}(U, \Lambda , \alpha , \beta , \gamma , \delta , \epsilon ) \\ =\,&\sum _{i=1}^n \lambda _iu_i^TAu_i + \alpha \left( \sum _{i=1}^n \lambda _i - n\right) \\&- \beta \left( \sum _{i=1}^n \min \left\{ \frac{\lambda _i}{\lambda _0}, 1\right\} - k\right) \\&+ \sum _{i=1}^n \gamma _i(u_i^Tu_i - 1) + \sum _{i\ne j} \delta _{ij} u_i^Tu_j - \sum _{i=1}^n \epsilon _i\lambda _i \,, \end{aligned}$$

we can rewrite the problem as

$$\begin{aligned} \max _{\alpha \in \mathbb {R}, \beta \in \mathbb {R}, \gamma \in \mathbb {R}^n, \delta \in \mathbb {R}^{n^2}, \epsilon \in \mathbb {R}^n} \min _{U \in \mathbb {R}^{n \times n}, \lambda _1,\dots , \lambda _n\in \mathbb {R}} \quad&\mathcal {L}(U, \Lambda , \alpha , \beta , \gamma , \delta , \epsilon ) \\ \text{ subject } \text{ to } \quad&\beta \ge 0, \,\, \epsilon _i \ge 0 \,\,\, \forall i\,, \end{aligned}$$

where \(\alpha \in \mathbb {R}, \beta \in \mathbb {R}, \gamma \in \mathbb {R}^n, \delta \in \mathbb {R}^{n^2}, \epsilon \in \mathbb {R}^n\) and we define \(\delta _{ii}=0\) for all i. The optimal parameter values must satisfy the first order conditions

$$\begin{aligned}&2\lambda _i^\star Au_i^\star + 2\gamma _i^\star u_i^\star + \sum _{i\ne j} \delta _{ij}^\star u_j^\star = 0 \quad \forall i \nonumber \\&{u_i^\star }^TAu_i^\star + \alpha ^\star - \beta ^\star \left[ \frac{1}{2\lambda _0}(1-\text {sign}(\lambda _i^\star -\lambda _0))\right] - \epsilon _i^\star \ni 0 \quad \forall i \,. \end{aligned}$$
(7)

From the first-order conditions (7) we can see that \(U^TAU\) is diagonal, and hence the columns of U form a set of eigenvectors of A. Defining \(0 \le a_1\le a_2\le \cdots \le a_n\) to be the eigenvalues of A, we can then rewrite the problem as

$$\begin{aligned} \min _{\lambda _1,\dots , \lambda _n\in \mathbb {R}} \quad&\sum _{i=1}^n \lambda _ia_i \end{aligned}$$
(8)
$$\begin{aligned} \text { subject } \text { to } \quad&\sum _{i=1}^n \lambda _i = n \nonumber \\&\sum _{i=1}^n \min \left\{ \frac{\lambda _i}{\lambda _0}, 1\right\} \ge k \nonumber \\&\lambda _i \ge 0 \quad \forall i \,. \end{aligned}$$
(9)

To solve this, consider a possible solution \({\tilde{\lambda }}_1,\dots , {\tilde{\lambda }}_n\). We will consider several cases. First, suppose there exists \(i < j\) such that \({\tilde{\lambda }}_i, {\tilde{\lambda }}_j > \lambda _0\). Then define \({\tilde{\lambda }}_i'= {\tilde{\lambda }}_i + {\tilde{\lambda }}_j-\lambda _0\), \({\tilde{\lambda }}_j'=\lambda _0\), and \({\tilde{\lambda }}_m'={\tilde{\lambda }}_m\) for \(m\notin \{i,j\}\). Since \({\tilde{\lambda }}_i', {\tilde{\lambda }}_j' \ge \lambda _0\) and \({\tilde{\lambda }}_i'+{\tilde{\lambda }}_j'={\tilde{\lambda }}_i+{\tilde{\lambda }}_j\) the constraints are still satisfied. Therefore, since \(a_i\le a_j\), \(\sum _{i=1}^n {\tilde{\lambda }}_i'a_i \le \sum _{i=1}^n {\tilde{\lambda }}_ia_i\), and so we know that there always exists an optimum with at most one i such that \(\lambda _i>\lambda _0\). Moreover, suppose that this index i is larger than 1. Then, we could set \({\tilde{\lambda }}_i'= {\tilde{\lambda }}_1\), \({\tilde{\lambda }}_1'={\tilde{\lambda }}_i\), and \(\tilde{\lambda }_m'={\tilde{\lambda }}_m\) for \(m\notin \{1,i\}\), thereby obtaining \(\sum _{i=1}^n {\tilde{\lambda }}_i'a_i \le \sum _{i=1}^n {\tilde{\lambda }}_ia_i\). Thus, there always exists an optimum \(\lambda _1^\star ,\dots , \lambda _n^\star \) with \(\lambda _2^\star ,\dots , \lambda _n^\star \le \lambda _0.\)

Next, suppose there exists \(i<j\) such that \(0<{\tilde{\lambda }}_i, {\tilde{\lambda }}_j < \lambda _0\). Then define \(\tilde{\lambda }_i'={\tilde{\lambda }}_i + \min \{\lambda _0-{\tilde{\lambda }}_i, {\tilde{\lambda }}_j\}\), \({\tilde{\lambda }}_j'={\tilde{\lambda }}_j-\min \{\lambda _0-{\tilde{\lambda }}_i, {\tilde{\lambda }}_j\}\), and \({\tilde{\lambda }}_m'={\tilde{\lambda }}_m\) for \(m\notin \{i,j\}\). Since \({\tilde{\lambda }}_i', {\tilde{\lambda }}_j' \le \lambda _0\) and \({\tilde{\lambda }}_i'+{\tilde{\lambda }}_j'={\tilde{\lambda }}_i+{\tilde{\lambda }}_j\) the constraints are still satisfied. Therefore, since \(a_i\le a_j\), \(\sum _{i=1}^n {\tilde{\lambda }}_i'a_i \le \sum _{i=1}^n {\tilde{\lambda }}_ia_i\), we know that there always exists an optimum with at most one i such that \(0<\lambda _i<\lambda _0\). Now suppose that this i is not the largest index such that \(\lambda _i>0\). Then there exists an optimum with a \(j>i\) such that \(\lambda _j=\lambda _0\). Then we could set \({\tilde{\lambda }}_i'= \tilde{\lambda }_j\), \({\tilde{\lambda }}_j'={\tilde{\lambda }}_i\), and \(\tilde{\lambda }_m'={\tilde{\lambda }}_m\) for \(m\notin \{i,j\}\), thereby obtaining \(\sum _{i=1}^n {\tilde{\lambda }}_i'a_i \le \sum _{i=1}^n {\tilde{\lambda }}_ia_i\). Thus, there always exists an optimum \(\lambda _1^\star ,\dots , \lambda _n^\star \) with \(\lambda _1^\star \ge \lambda _0\), \(\lambda _2^\star ,\dots , \lambda _{i-1}^\star =\lambda _0\), \(0\le \lambda _i^\star \le \lambda _0\) for some i, and, if \(i\ne n\), \(\lambda _{i+1}^\star ,\dots , \lambda _n^\star =0\).

Now from constraint (9) we can see that there must exist at least k non-zero \(\lambda _i\)’s in the solution. If \(n=k\), then we must have \(\lambda _0=1\) and hence the optimum is given by \(\lambda _1^\star ,\dots , \lambda _k^\star =1\). Now consider the case where \(n>k\). Suppose there exists a solution \({\tilde{\lambda }}_1,\dots , {\tilde{\lambda }}_n\) such that \({\tilde{\lambda }}_{k+1} \ne 0\). Then, since \({\tilde{\lambda }}_1,\dots , {\tilde{\lambda }}_k \ge \lambda _0\), we can set \({\tilde{\lambda }}_1'={\tilde{\lambda }}_1+{\tilde{\lambda }}_{k+1}\), \({\tilde{\lambda }}_{k+1}' = 0\), and \({\tilde{\lambda }}_j'={\tilde{\lambda }}_j\) for \(j\notin \{1,k+1\}\). This once again satisfies the constraints and \(\sum _{i=1}^n {\tilde{\lambda }}_i'a_i \le \sum _{i=1}^n {\tilde{\lambda }}_ia_i\). Therefore, there exists a solution such that \(\lambda _1\ge \lambda _0\) and \(\lambda _2,\dots , \lambda _k=\lambda _0\). In particular, a solution is \(\lambda _1^\star =n-(k-1)\lambda _0\), \(\lambda _2^\star ,\dots , \lambda _k^\star =\lambda _0\).

In summary, the optima of this problem depend on the values of k and n. In particular, we have:

  • If \(n>k\), an optimum is given by \(\lambda _1^\star =n-(k-1)\lambda _0\), \(\lambda _2^\star ,\dots , \lambda _k^\star =\lambda _0\), \(\lambda _{k+1}^\star ,\dots , \lambda _n^\star =0\).

  • If \(n=k\), the optimum is given by \(\lambda _1^\star ,\dots , \lambda _k^\star =1\).

Returning to the original problem (6), we therefore have that an optimal M is

$$\begin{aligned} M^\star =\sum _{i=1}^n \lambda _i^\star u_iu_i^T\,, \end{aligned}$$

where \(u_1,\dots , u_n\) are eigenvectors corresponding to the eigenvalues \(a_1\le a_2\le \dots \le a_n\) of A and where \(\lambda _1^\star ,\dots , \lambda _n^\star \) are as defined above.
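The closed-form solution can be written out directly. The sketch below (an illustration under our own choice of A, k, and \(\lambda _0\); it covers only the case \(n>k\)) assembles \(M^\star \) from the eigenvectors associated with the smallest eigenvalues of A and checks the trace constraint.

```python
import numpy as np

def relaxed_M(A, k, lam0):
    """Optimal M for the eigendecomposition-based relaxation (case n > k)."""
    n = A.shape[0]
    a, U = np.linalg.eigh(A)           # eigenvalues in ascending order; columns of U are eigenvectors
    lam_star = np.zeros(n)
    lam_star[0] = n - (k - 1) * lam0   # weight on the eigenvector of the smallest eigenvalue
    lam_star[1:k] = lam0
    return (U * lam_star) @ U.T        # sum_i lam_star_i u_i u_i^T

# Toy usage with a random symmetric matrix A.
rng = np.random.default_rng(0)
S = rng.normal(size=(6, 6))
A = (S + S.T) / 2
M = relaxed_M(A, k=3, lam0=1.0)
print(np.isclose(np.trace(M), 6.0))    # trace constraint tr(M) = n holds
```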

1.2 Comparison to the XSDC relaxation

We now compare the convex relaxation of the labeling subproblem presented in Sect. 4.2 to the relaxation proposed by Bach and Harchaoui (2007). As accommodating constraints on cluster labels is less natural in the latter relaxation, we compare the relaxations when training a LeNet-5 CKN on MNIST with no labeled data. Figure 12 compares our matrix balancing method, the eigendecomposition method from the previous subsection, and the eigendecomposition method followed by k-means clustering. Prior to clustering, the rows of the eigenvector matrix were normalized to have unit \(\ell _2\) norm. The value of \(\lambda _0\) was chosen from the set \(\{0.01n_b, 0.02n_b,\dots , 0.1n_b\}\), where \(n_b\) is the size of a mini-batch, based on the performance on the validation set.
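A minimal sketch of the eigendecomposition-followed-by-k-means variant used in this comparison is given below, assuming scikit-learn is available; the mini-batch handling, the role of \(\lambda _0\), and the validation-based model selection described above are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def eig_kmeans_labels(A, k):
    """Labels from the k eigenvectors of A with smallest eigenvalues, rows scaled to unit l2 norm."""
    _, U = np.linalg.eigh(A)                        # eigenvalues in ascending order
    V = U[:, :k]
    V = V / np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)
```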

Fig. 12: Performance of matrix balancing in comparison to eigendecomposition-based methods across 10 trials of training a LeNet-5 CKN on MNIST with no labeled data

Fig. 13: Evolution of the eigengap from matrix balancing in comparison to eigendecomposition-based methods across 10 trials of training a LeNet-5 CKN on MNIST with no labeled data. The error bands show one standard deviation from the mean

From Fig. 12 we can see that the convex relaxation used to derive the matrix balancing method is superior to the relaxations leading to the eigendecomposition-based methods. On average, matrix balancing performs 17% better than the eigendecomposition method and 12% better than the eigendecomposition method followed by k-means. This suggests that the constraint from the convex relaxation requiring the diagonal of M to consist of all 1’s and/or the constraint requiring all entries of M to be positive are important for the performance of the labeling method.

In Fig. 13 we examine the eigengap of A across iterations. The eigengap is defined as \(\lambda _{k+1}-\lambda _{k}\), where k is the number of classes and \(\lambda _1\le \dots \le \lambda _n\) are the eigenvalues of A. As noted by Meila et al. (2005), having a larger eigengap makes the subspace spanned by the first k eigenvectors of A more stable to perturbations. From the figure we can see that the eigendecomposition-based methods tend to increase the eigengap as the learning proceeds. For the eigendecomposition method, the eigengap increased from \(2\times 10^{-6}\) to \(6\times 10^{-5}\) on average after 50 iterations. Similarly, for the eigendecomposition method followed by k-means, the eigengap increased from \(5\times 10^{-6}\) to \(5\times 10^{-5}\) on average after 50 iterations. It is interesting to note that matrix balancing, which does not yield low-rank solutions \(M^\star \), leads to eigengaps that are extremely small (on the order of \(10^{-15}\)) across the iterations. Nevertheless, it outperforms the eigendecomposition-based methods.
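The eigengap reported in Fig. 13 can be computed directly from the spectrum of A, as in the short sketch below (using the ascending eigenvalue convention of the text).

```python
import numpy as np

def eigengap(A, k):
    """lambda_{k+1} - lambda_k for the eigenvalues of A sorted in ascending order."""
    lam = np.linalg.eigvalsh(A)   # ascending order
    return lam[k] - lam[k - 1]    # zero-based indexing: 1-based entries k+1 and k
```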

Additional experimental details

Here, we provide additional details related to the training and the additional constraints we consider.

1.1 Parameter tuning

The algorithm proposed in this paper and the models used require a large number of parameters to be set. Next we discuss the choices for these parameters.

1.1.1 Fixed parameters

The parameters that are fixed throughout the experiments and not validated are as follows. The number of filters in the networks is set to 32 and the network’s parameters V are initialized layer-wise with 32 feature maps drawn uniformly at random from the output of the previous layer. The networks use the Nyström method to approximate the kernel at each layer. The regularization in the Nyström approximation is set to 0.001, and 20 Newton iterations are used to compute the inverse square root of the Gram matrix on the parameters \(V_\ell \) at each layer \(\ell \), as done by Jones (2020). The bandwidth is set to the median pairwise distance between the first 1000 observations for the single-layer networks. It is set to 0.6 for the convolutional networks. The batch size for both the labeled and unlabeled data is set to 4096 for Gisette and MAGIC and 1024 for MNIST and CIFAR-10 (due to GPU memory constraints). The features output by the network \(\phi \) are centered and normalized so that on average they have unit \(\ell _2\) norm, as in Mairal et al. (2014). The initial training phase on just the labeled data is performed for 100 iterations, as the validation loss has typically started leveling off by 100 iterations. The entropic regularization parameter \(\nu \) in the matrix balancing is set to the median absolute value of the entries in A. If this value results in divergence of the algorithm, it is multiplied by a factor of two until the algorithm converges. The value \(n_\Delta \) is set to zero unless otherwise specified. The number of iterations of alternating minimization in the matrix balancing algorithm is set to 10. The number of nearest neighbors used for estimating the labels on the unlabeled data is set to 1.
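The heuristic for the entropic regularization parameter \(\nu \) described above can be sketched as follows. This is only an illustration of the heuristic wrapped around a generic Sinkhorn-style scaling loop; the paper's actual matrix balancing subproblem, its marginal constraints, and its stopping criteria are not reproduced here.

```python
import numpy as np

def balance_with_nu_heuristic(A, n_iter=10, max_doublings=20):
    """Sinkhorn-style scaling of exp(-A/nu); nu starts at median|A| and doubles on divergence."""
    nu = np.median(np.abs(A))
    for _ in range(max_doublings):
        K = np.exp(-A / nu)
        u = np.ones(A.shape[0])
        diverged = False
        for _ in range(n_iter):                      # alternating row/column scaling
            v = 1.0 / (K.T @ u)
            u = 1.0 / (K @ v)
            if not (np.all(np.isfinite(u)) and np.all(np.isfinite(v))):
                diverged = True
                break
        if not diverged:
            return u[:, None] * K * v[None, :]       # approximately balanced matrix
        nu *= 2.0                                    # increase the regularization and retry
    raise RuntimeError("matrix balancing did not converge")
```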

1.1.2 Hold-out validation

Due to the large number of hyperparameters, we tune them sequentially as follows when labeled data, and hence a labeled validation set, exists. First, we tune the penalty \(\lambda \) on the classifier weights over the values \(2^i\) for \(i=-40,-39,\dots ,0\). To do so, we train the classifier on only the labeled data using the initial random network parameters. We then re-validate this value every 100 iterations. Next, we tune the learning rate for the labeled data. For the modest value of \(\zeta =2^{-4}\), we validate the fixed learning rate for the labeled data over the values \(2^i\) for \(i=-10,-9,\dots , 5\). To evaluate the performance, the labels for the unlabeled data are estimated using 1-nearest neighbor. The labeled and unlabeled data are then used to train the classifier used to compute the performance. For the unbalanced experiments on MNIST only, we then tune the minimum and maximum size of the classes over the values \(0.01b, 0.02b,\dots , 0.2b\), where b is the batch size (fixing the semi-supervised learning rate to \(2^{-5}\)). For all other experiments, we fix these values to b/k, where k is the number of classes in the dataset. We then tune the semi-supervised learning rate, again over the values \(2^i\) for \(i=-10,-9,\dots , 5\). For the single-layer networks, we then tune \(\zeta \) over the values \(2^i\) for \(i=-10,-9,\dots , 10\). For the convolutional networks, we do not penalize the filters since they are constrained to lie on the sphere.
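The sequential (one-parameter-at-a-time) schedule above can be summarized programmatically. In the sketch below, `validation_score` is a hypothetical callable standing in for the train-and-evaluate step described above; only the search grids and the tuning order follow the text, and the MNIST-specific class-size step is omitted.

```python
# Schematic of the sequential, one-parameter-at-a-time tuning described above.
# `validation_score(params)` is a hypothetical callable (not from the paper) that
# trains with the given parameters and returns validation accuracy.

lambda_grid = [2.0 ** i for i in range(-40, 1)]   # classifier penalty lambda
lr_grid = [2.0 ** i for i in range(-10, 6)]       # learning rates (labeled and semi-supervised)
zeta_grid = [2.0 ** i for i in range(-10, 11)]    # filter penalty zeta (single-layer networks)

def tune(validation_score):
    params = {"zeta": 2.0 ** -4}                  # modest initial value of zeta used while tuning
    for name, grid in [("lam", lambda_grid),
                       ("lr_labeled", lr_grid),
                       ("lr_semisup", lr_grid),
                       ("zeta", zeta_grid)]:      # tuning order follows the text
        params[name] = max(grid, key=lambda v: validation_score({**params, name: v}))
    return params
```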

Fig. 14: Illustration of the kinds of additional constraints that were added. Green denotes the original constraints while purple denotes the constraints that were added. The numbers outside of the grids denote the true labels

When no labeled data exists, we tune the hyperparameters sequentially in the same manner as during the hold-out validation. First, we consider the values \(2^i\) for \(i=-10,-9,\dots , 5\) for the semi-supervised learning rate. Next, we consider the values \(2^i\) for \(i=-40,-39,\dots ,0\) for \(\lambda \). Finally, if applicable, we consider the values \(2^i\) for \(i=-10,-9,\dots , 10\) for \(\zeta \). We report the best performance observed on the test set. Developing a method for tuning the hyperparameters on an unlabeled validation set is left for future work.

1.1.3 Comparison details

In the comparisons we replace our matrix balancing method with alternative labeling methods and retain the remainder of the XSDC algorithm. The pseudo-labeling code is our own, but we used code from Caron et al. (2018) to implement the k-means version of deep clustering (Note 2). Two important details regarding the implementations are as follows. First, for pseudo-labeling, when some of the data is labeled we estimate W and b based on the labeled data in the current mini-batch, as that is what is done in XSDC. When labeled data is not present, we estimate W and b based on the cluster assignments for the entire dataset. Second, for deep clustering we modify the dimension of the dimensionality reduction. In the original implementation the authors performed PCA, reducing the dimensionality of the features output by the network to 256. As the features output by the networks we consider have dimension less than 256, we instead keep the smallest number of components that account for 95% of the variance.
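The dimensionality-reduction change mentioned above can be expressed with scikit-learn, which keeps the smallest number of components reaching the requested explained-variance fraction when `n_components` is given as a float between 0 and 1 (sketch below; `features` is a placeholder for the network outputs).

```python
import numpy as np
from sklearn.decomposition import PCA

features = np.random.randn(1000, 128)            # placeholder for the features output by the network
reduced = PCA(n_components=0.95).fit_transform(features)
print(reduced.shape[1])                          # smallest number of components explaining 95% of the variance
```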

We perform the parameter tuning as follows. First, we follow the tuning procedure as detailed in Sect. D.1. For pseudo-labeling there are no additional parameters to tune. However, for deep clustering there are two additional parameters to tune: the number of clusters in k-means and the number of iterations between cluster updates. During the initial parameter tuning stage these parameters are set to the true number of clusters k, and 50 iterations, respectively. Afterward we tune these two remaining parameters sequentially. We first tune the number of clusters over the values k, 2k, 4k, 8k, 16k, 32k where k is the true number of clusters. We then tune the number of iterations between cluster updates over the values 10, 25, 50, 100.

1.2 Additional constraints

In one set of experiments we examine the effect of adding additional constraints. We consider two types of constraints: (1) constraints based on knowledge of whether the label was in the set \(\{4,9\}\) or not; and (2) random correct must-link and must-not-link constraints among pairs of unlabeled observations and random correct must-not-link constraints between pairs of unlabeled and labeled observations.

The two types of constraints are illustrated in Fig. 14. Each grid point (ij), if filled, denotes whether observations i and j have the same label (1) or not (0). The true labels are the values outside of the grids. Green backgrounds correspond to knowing the labels corresponding to (ij). Purple backgrounds denote the additional known constraints. The left-most panel gives an example of an initial matrix M in which the labels corresponding to the first two observations are known (0 and 9). The second panel shows the entries we can fill in once we know whether each observation belongs to the set \(\{4,9\}\). Finally, the third panel shows random correct constraints. The constraint at entry (2, 3) is a must-not-link constraint, whereas the constraint at entry (3, 4) is a must-link constraint.
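A small sketch of how the first type of constraint fills entries of M is given below (our own illustration; `known_labels` and `in_49` are hypothetical inputs holding, respectively, the labels of the labeled observations and the binary \(\{4,9\}\) side information). An entry is forced to 0 whenever the available information alone rules out equal labels.

```python
import numpy as np

def constraint_matrix(n, known_labels, in_49):
    """Partially filled equivalence matrix: M[i, j] = 1 (same label), 0 (different), nan (unknown)."""
    M = np.full((n, n), np.nan)
    np.fill_diagonal(M, 1.0)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            li, lj = known_labels.get(i), known_labels.get(j)
            if li is not None and lj is not None:        # original (green) constraints
                M[i, j] = float(li == lj)
            elif in_49[i] != in_49[j]:                   # added (purple) constraints from the {4, 9} information
                M[i, j] = 0.0
    return M

# Toy usage mirroring Fig. 14: observations 0 and 1 are labeled 0 and 9.
M = constraint_matrix(4, known_labels={0: 0, 1: 9}, in_49=[False, True, True, False])
```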

Cite this article

Jones, C., Roulet, V. & Harchaoui, Z. Discriminative clustering with representation learning with any ratio of labeled to unlabeled data. Stat Comput 32, 17 (2022). https://doi.org/10.1007/s11222-021-10067-x
