Abstract
A least squares semi-supervised local clustering algorithm based on the idea of compressed sensing is proposed to extract clusters from a graph with known adjacency matrix. The algorithm follows a two-stage approach similar to the one proposed by Lai and Mckenzie (SIAM J Math Data Sci 2:368–395, 2020). However, under a weaker assumption and with lower computational complexity, our algorithms are shown to find a desired cluster with high probability. The “one cluster at a time” feature of our method distinguishes it from other global clustering methods. Numerical experiments are conducted on synthetic data such as the stochastic block model and on real data such as the MNIST, political blogs network, AT&T, and YaleB human faces data sets to demonstrate the effectiveness and efficiency of our algorithms.
Data availability
A sample demo program for reproducing Fig. 4 in this paper can be found at https://github.com/zzzzms/LeastSquareClustering. All other demonstration code and data are available upon request.
References
Abbe, E.: Community detection and stochastic block models: recent developments. J. Mach. Learn. Res. 18(177), 1–86 (2018)
Abbe, E., Sandon, C.: Recovering communities in the general stochastic block model without knowing the parameters. In: Advances in Neural Information Processing Systems, pp. 676–684 (2015)
Adamic, L.A., Glance, N.: The political blogosphere and the 2004 US election: divided they blog. In: Proceedings of the 3rd International Workshop on Link Discovery, pp. 36–43 (2005)
Andersen, R., Chung, F., Lang, K.: Using pagerank to locally partition a graph. Internet Math. 4(1), 35–64 (2007)
Camps-Valls, G., Bandos, T., Zhou, D.: Semisupervised graph-based hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 45(10), 3044–3054 (2007)
Chen, Y., Wang, J.Z., Krovetz, R.: Clue: cluster-based retrieval of images by unsupervised learning. IEEE Trans. Image Process. 14(8), 1187–1201 (2005)
Chung, F.: Spectral Graph Theory, vol. 92. American Mathematical Society, Providence (1997)
Chung, F., Lu, L.: Complex Graphs and Networks, vol. 107. American Mathematical Society, Providence (2006)
Dai, W., Milenkovic, O.: Subspace pursuit for compressive sensing signal reconstruction. IEEE Trans. Inform. Theory 55(5), 2230–2249 (2009)
Dhillon, S.: Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 269–274 (2001)
Dhillon, S., Guan, Y., Kulis, B.: Kernel k-means: spectral clustering and normalized cuts. In: Proceedings of the 10th ACM SIGKDD Conference (2004)
Ding, C., He, X., Zha, H., Gu, M., Simon, H.D.: A min–max cut algorithm for graph partitioning and data clustering. In: Proceedings of IEEE ICDM, vol. 2001, pp. 107–114 (2001)
Erdős, P., Rényi, A.: On random graphs, I. Publ. Math. Debrecen 6, 290–297 (1959)
Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3–5), 75–174 (2010)
Ha, W., Fountoulakis, K., Mahoney, M.W.: Statistical guarantees for local graph clustering. In: The 23rd International Conference on Artificial Intelligence and Statistics (2020)
Holland, P.W., Laskey, K., Leinhardt, S.: Stochastic blockmodels: first steps. Soc. Netw. 5(2), 109–137 (1983)
Hric, D., Darst, R.K., Fortunato, S.: Community detection in networks: structural clusters versus ground truth. Phys. Rev. E 9, 062805 (2014)
Jacobs, M., Merkurjev, E., Esedoglu, S.: Auction dynamics: a volume constrained MBO scheme. J. Comput. Phys. 354, 288–310 (2018)
Kingma, D. P., Mohamed, S., Rezende, D.J., Welling, M.: Semi-supervised learning with deep generative models. In: Advances in Neural Information Processing Systems 27 (NIPS 2014), pp. 3581–3589 (2014)
Kloster, K., Gleich, D.F.: Heat kernel based community detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1386–1395 (2014)
Kossinets, G., Watts, D.J.: Empirical analysis of an evolving social network. Science 311(5757), 88–90 (2006)
Kulis, B., Basu, S., Dhillon, I., Mooney, R.: Semi-supervised graph clustering: a kernel approach. In: Proceedings of ICML (2005)
Lai, M.-J., Mckenzie, D.: Compressive sensing approach to cut improvement and local clustering. SIAM J. Math. Data Sci. 2, 368–395 (2020)
Lee, D.-H.: Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In: Workshop on Challenges in Representation Learning, ICML (2013)
Luxburg, U.V.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
Mahoney, M.W., Orecchia, L., Vishnoi, N.K.: A local spectral method for graphs: with applications to improving graph partitions and exploring data graphs locally. J. Mach. Learn. Res. 13(1), 2339–2365 (2012)
Mihalcea, R., Radev, D.: Graph-Based Natural Language Processing and Information Retrieval. Cambridge University Press, Cambridge (2011)
Mossel, E., Neeman, J., Sly, A.: A proof of the block model threshold conjecture. Combinatorica 38(3), 665–708 (2018)
Needell, D., Tropp, J.A.: CoSaMP: iterative signal recovery from incomplete and inaccurate samples. Appl. Comput. Harmonic Anal. 26(3), 301–321 (2009)
Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: Advances in Neural Information Processing Systems, pp. 849–856 (2002)
Shi, P., He, K., Bindel, D., Hopcroft, J.E.: Locally-biased spectral approximation for community detection. Knowl.-Based Syst. 164, 459–472 (2019)
Pitelis, N., Russell, C., Agapito, L.: Semi-supervised learning using an unsupervised atlas. In: Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2014), pp. 565–580. Springer (2014)
Qin, T., Rohe, K.: Regularized spectral clustering under the degree-corrected stochastic block model. In: Burges, C.J.C. , Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26, pp. 3120–3128 (2013)
Rasmus, A., Berglund, M., Honkala, M., Valpola, H., Raiko, T.: Semi-supervised learning with ladder networks. In: Advances in Neural Information Processing Systems, pp. 3546–3554 (2015)
Santosa, F., Symes, W.: Linear inversion of band-limited reflection seismograms. SIAM J. Sci. Stat. Comput. 7(4), 1307–1330 (1986). https://doi.org/10.1137/0907087
Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 58(1), 267–288 (1996)
Veldt, N., Klymko, C., Gleich, D. F.: Flow-based local graph clustering with better seed set inclusion. In: Proceedings of the SIAM International Conference on Data Mining (2019)
Wang, D., Fountoulakis, K., Henzinger, M., Mahoney, M.W., Rao, S.: Capacity releasing diffusion for speed and locality. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 3598–3607 (2017)
Yan, Y., Bian, Y., Luo, D., Lee, D., Zhang, X.: Constrained local graph clustering by colored random walk. In: Proceedings of World Wide Web Conference, pp. 2137–2146 (2019)
Yin, H., Benson, A. R., Leskovec, J., Gleich, D. F.: Local higher-order graph clustering. In: Proceedings of the 23rd ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 555–564 (2017)
Yin, K., Tai, X.-C.: An effective region force for some variational models for learning and clustering. J. Sci. Comput. 74, 175–196 (2018)
Zelnik-Manor, L., Perona, P.: Self-tuning spectral clustering. In: Advances in neural information processing systems, pp. 1601–1608 (2004)
Zhang, A. Y., Zhou, H. H., Gao, C., Ma, Z.: Achieving optimal misclassification proportion in stochastic block model. arXiv:1505.03772 (2015)
Funding
The first author is supported by the Simons Foundation collaboration grant #864439.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors have no competing interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Hyperparameters for Numerical Experiments
For each cluster to be recovered, we sample the seed vertices \(\varGamma _i\) uniformly from \(C_i\) in all implementations. We fix the random walk depth at \(t=3\), and use the random walk threshold parameter \(\delta =0.8\) for the political blogs network and \(\delta =0.6\) for all other experiments. We vary the rejection parameter \(R\in (0,1)\) for each specific experiment based on the estimated sizes of the clusters. If neither the estimated cluster sizes nor the number of clusters is known, we may refer to the spectrum of the graph Laplacian: the largest gap between two consecutive eigenvalues estimates the number of clusters, and the average size then estimates the size of each cluster.
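The eigengap heuristic just described can be sketched as follows. This is a minimal Python illustration rather than the authors' MATLAB implementation; the function name, the `k_max` cap, and the use of the normalized Laplacian \(L = I - D^{-1/2}AD^{-1/2}\) are our assumptions.

```python
import numpy as np

def estimate_num_clusters(A, k_max=20):
    """Estimate the number of clusters as the index k at which the gap
    between consecutive eigenvalues of the normalized graph Laplacian
    L = I - D^{-1/2} A D^{-1/2} is largest."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))  # guard isolated vertices
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals = np.sort(np.linalg.eigvalsh(L))[:k_max]   # ascending eigenvalues
    gaps = np.diff(eigvals)
    # the largest gap after the k-th smallest eigenvalue suggests k clusters
    return int(np.argmax(gaps)) + 1

# two disjoint triangles should be detected as 2 clusters
A = np.zeros((6, 6))
A[:3, :3] = 1
A[3:, 3:] = 1
np.fill_diagonal(A, 0)
print(estimate_num_clusters(A))  # -> 2
```

Once the number of clusters k is estimated this way, dividing the total number of vertices by k gives the average size used to set \(R\).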
We fix the least squares threshold parameter at \(\gamma =0.2\) for all experiments; this choice is purely heuristic, but we have observed experimentally that the performance of the algorithms is not affected much by varying \(\gamma \in [0.15,0.4]\). The hyperparameter MaxIter is chosen according to the size of the initial seed set relative to the total number of vertices in the cluster. For comparison purposes, we keep \(MaxIter=1\) for the MNIST data. By experimenting with different choices of MaxIter, we find that the best performance for the AT&T data occurs at \(MaxIter=2\) for \(10\%\) seeds and at \(MaxIter=1\) for \(20\%\) and \(30\%\) seeds. For the YaleB data, the best performance occurs at \(MaxIter=2\) for \(5\%\), \(10\%\), and \(20\%\) seeds. All numerical experiments are implemented in MATLAB and can be run on a personal machine. For reproducibility, we provide a sample demo code at https://github.com/zzzzms/LeastSquareClustering.
Appendix B Image Data Preprocessing
For the YaleB human faces data, we performed some data preprocessing to remove poor-quality images. Specifically, we abandoned pictures that are too dark, and we cropped each image to size \(54\times 46\) to reduce the effect of background noise. From the remaining qualified pictures, we randomly selected 20 images for each person.
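The preprocessing steps above can be sketched as follows. This is a hypothetical illustration: the darkness cutoff (`dark_thresh`), the center placement of the crop, and the function name are our assumptions, since the text does not specify them.

```python
import numpy as np

def preprocess_faces(images, dark_thresh=40.0, crop=(54, 46), per_person=20, rng=None):
    """Sketch of the YaleB preprocessing: drop images whose mean intensity is
    below `dark_thresh` (assumed cutoff), center-crop to 54x46, and keep
    `per_person` randomly chosen images per person.
    `images` maps a person id to a list of 2-D uint8 arrays."""
    rng = rng or np.random.default_rng(0)
    kept = {}
    for person, imgs in images.items():
        # discard images that are too dark
        good = [im for im in imgs if im.mean() >= dark_thresh]
        # center-crop each remaining image to the target size
        cropped = []
        for im in good:
            h, w = im.shape
            top, left = (h - crop[0]) // 2, (w - crop[1]) // 2
            cropped.append(im[top:top + crop[0], left:left + crop[1]])
        # keep a random subset per person, without replacement
        idx = rng.choice(len(cropped), size=min(per_person, len(cropped)), replace=False)
        kept[person] = [cropped[i] for i in idx]
    return kept
```

For example, a person with 25 well-lit \(60\times 50\) images and one all-black image would end up with 20 images of size \(54\times 46\).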
All the image data in MNIST, AT&T, and YaleB must first be converted into an auxiliary graph before implementation. Let \(\varvec{x}_i\in \mathbb {R}^n\) be the vectorization of an image from the original data set. We define the following affinity matrix of the K-NN auxiliary graph based on a Gaussian kernel with local scaling, according to [18, 43]:
$$A_{ij} = {\left\{ \begin{array}{ll} \exp \left( -\Vert \varvec{x}_i-\varvec{x}_j\Vert ^2/(\sigma _i\sigma _j)\right) &{} \text {if } \varvec{x}_j\in NN(\varvec{x}_i,K), \\ 0 &{} \text {otherwise.} \end{array}\right. }$$
The notation \(NN(\varvec{x}_i,K)\) denotes the set of K-nearest neighbours of \(\varvec{x}_i\), and \(\sigma _i:=\Vert \varvec{x}_i-\varvec{x}^{(r)}_i\Vert \), where \(\varvec{x}^{(r)}_i\) is the r-th closest point to \(\varvec{x}_i\). Note that the above \(A_{ij}\) is not necessarily symmetric, so we consider \(\tilde{A}=A^T A\) for symmetrization. Alternatively, one may consider \(\tilde{A}_{ij}=\max \{A_{ij}, A_{ji}\}\) or \(\tilde{A}_{ij}=(A_{ij}+A_{ji})/2\). We use \(\tilde{A}\) as the input adjacency matrix for our algorithms.
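The affinity construction above can be sketched as follows. This is a minimal Python illustration (not the authors' MATLAB code); it uses the max-symmetrization variant mentioned above, and the function name is ours.

```python
import numpy as np

def knn_affinity(X, K=5, r=3):
    """Self-tuning Gaussian affinity for a K-NN auxiliary graph.
    X: (N, n) array, one vectorized image per row.
    sigma_i is the distance from x_i to its r-th nearest neighbour."""
    # pairwise Euclidean distances
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    order = np.argsort(D, axis=1)               # column 0 is the point itself
    sigma = D[np.arange(len(X)), order[:, r]]   # distance to r-th nearest neighbour
    A = np.zeros_like(D)
    for i in range(len(X)):
        for j in order[i, 1:K + 1]:             # K nearest neighbours, excluding self
            A[i, j] = np.exp(-D[i, j] ** 2 / (sigma[i] * sigma[j]))
    # symmetrize via entrywise maximum (one of the variants discussed above)
    return np.maximum(A, A.T)
```

The resulting matrix is symmetric with zero diagonal and entries in \([0,1]\), and is used directly as the input adjacency matrix.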
We fix the local scaling parameters at \(K=15\), \(r=10\) for the MNIST data, \(K=8\), \(r=5\) for the YaleB data, and \(K=5\), \(r=3\) for the AT&T data during implementation.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lai, MJ., Shen, Z. A Compressed Sensing Based Least Squares Approach to Semi-supervised Local Cluster Extraction. J Sci Comput 94, 63 (2023). https://doi.org/10.1007/s10915-022-02052-x
Keywords
- Graph clustering
- Semi-supervised clustering
- Local clustering
- Compressed sensing
- Least squares
- Graph Laplacian