
A Compressed Sensing Based Least Squares Approach to Semi-supervised Local Cluster Extraction

Published in: Journal of Scientific Computing

Abstract

A least squares semi-supervised local clustering algorithm based on the idea of compressed sensing is proposed to extract clusters from a graph with known adjacency matrix. The algorithm is based on a two-stage approach similar to the one proposed by Lai and Mckenzie (SIAM J Math Data Sci 2:368–395, 2020). However, under a weaker assumption and with lower computational complexity, our algorithms are shown to find a desired cluster with high probability. The "one cluster at a time" feature of our method distinguishes it from other, global clustering methods. Numerical experiments are conducted on synthetic data such as the stochastic block model and on real data such as the MNIST, political blogs network, AT&T, and YaleB human faces data sets to demonstrate the effectiveness and efficiency of our algorithms.



Data availability

A sample demo program for reproducing Fig. 4 in this paper can be found at https://github.com/zzzzms/LeastSquareClustering. All other demonstration codes or data are available upon request.

Notes

  1. https://git-disl.github.io/GTDLBench/datasets/att_face_dataset/.

  2. http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html.

  3. http://yann.lecun.com/exdb/mnist/.

References

  1. Abbe, E.: Community detection and stochastic block models: recent developments. J. Mach. Learn. Res. 18(177), 1–86 (2018)


  2. Abbe, E., Sandon, C.: Recovering communities in the general stochastic block model without knowing the parameters. In: Advances in Neural Information Processing Systems, pp. 676–684 (2015)

  3. Adamic, L.A., Glance, N.: The political blogosphere and the 2004 US election: divided they blog. In: Proceedings of the 3rd International Workshop on Link Discovery, pp. 36–43 (2005)

  4. Andersen, R., Chung, F., Lang, K.: Using pagerank to locally partition a graph. Internet Math. 4(1), 35–64 (2007)


  5. Camps-Valls, G., Bandos, T., Zhou, D.: Semisupervised graph-based hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 45(10), 3044–3054 (2007)


  6. Chen, Y., Wang, J.Z., Krovetz, R.: Clue: cluster-based retrieval of images by unsupervised learning. IEEE Trans. Image Process. 14(8), 1187–1201 (2005)


  7. Chung, F.: Spectral Graph Theory, vol. 92. American Mathematical Society, Providence (1997)


  8. Chung, F., Lu, L.: Complex Graphs and Networks, vol. 107. American Mathematical Society, Providence (2006)


  9. Dai, W., Milenkovic, O.: Subspace pursuit for compressive sensing signal reconstruction. IEEE Trans. Inform. Theory 55(5), 2230–2249 (2009)


  10. Dhillon, I.S.: Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 269–274 (2001)

  11. Dhillon, I.S., Guan, Y., Kulis, B.: Kernel k-means: spectral clustering and normalized cuts. In: Proceedings of the 10th ACM SIGKDD Conference (2004)

  12. Ding, C., He, X., Zha, H., Gu, M., Simon, H.D.: A min–max cut algorithm for graph partitioning and data clustering. In: Proceedings of IEEE ICDM, vol. 2001, pp. 107–114 (2001)

  13. Erdős, P., Rényi, A.: On random graphs, I. Publ. Math. Debrecen 6, 290–297 (1959)


  14. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3–5), 75–174 (2010)


  15. Ha, W., Fountoulakis, K., Mahoney, M.W.: Statistical guarantees for local graph clustering. In: The 23rd International Conference on Artificial Intelligence and Statistics (2020)

  16. Holland, P.W., Laskey, K., Leinhardt, S.: Stochastic blockmodels: first steps. Soc. Netw. 5(2), 109–137 (1983)


  17. Hric, D., Darst, R.K., Fortunato, S.: Community detection in networks: structural clusters versus ground truth. Phys. Rev. E 90, 062805 (2014)


  18. Jacobs, M., Merkurjev, E., Esedoglu, S.: Auction dynamics: a volume constrained MBO scheme. J. Comput. Phys. 354, 288–310 (2018)


  19. Kingma, D. P., Mohamed, S., Rezende, D.J., Welling, M.: Semi-supervised learning with deep generative models. In: Advances in Neural Information Processing Systems 27 (NIPS 2014), pp. 3581–3589 (2014)

  20. Kloster, K., Gleich, D.F.: Heat kernel based community detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1386–1395 (2014)

  21. Kossinets, G., Watts, D.J.: Empirical analysis of an evolving social network. Science 311(5757), 88–90 (2006)


  22. Kulis, B., Basu, S., Dhillon, I., Mooney, R.: Semi-supervised graph clustering: a kernel approach. In: Proceedings of ICML (2005)

  23. Lai, M.-J., Mckenzie, D.: Compressive sensing approach to cut improvement and local clustering. SIAM J. Math. Data Sci. 2, 368–395 (2020)


  24. Lee, D.-H.: Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In: Workshop on Challenges in Representation Learning, ICML (2013)

  25. Luxburg, U.V.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)


  26. Mahoney, M.W., Orecchia, L., Vishnoi, N.K.: A local spectral method for graphs: with applications to improving graph partitions and exploring data graphs locally. J. Mach. Learn. Res. 13(1), 2339–2365 (2012)


  27. Mihalcea, R., Radev, D.: Graph-Based Natural Language Processing and Information Retrieval. Cambridge University Press, Cambridge (2011)


  28. Mossel, E., Neeman, J., Sly, A.: A proof of the block model threshold conjecture. Combinatorica 38(3), 665–708 (2018)


  29. Needell, D., Tropp, J.A.: CoSaMP: iterative signal recovery from incomplete and inaccurate samples. Appl. Comput. Harmonic Anal. 26(3), 301–321 (2009)


  30. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: Advances in Neural Information Processing Systems, pp. 849–856 (2002)

  31. Shi, P., He, K., Bindel, D., Hopcroft, J.E.: Locally-biased spectral approximation for community detection. Knowl.-Based Syst. 164, 459–472 (2019)


  32. Pitelis, N., Russell, C., Agapito, L.: Semi-supervised learning using an unsupervised atlas. In: Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2014), pp. 565–580. Springer (2014)

  33. Qin, T., Rohe, K.: Regularized spectral clustering under the degree-corrected stochastic block model. In: Burges, C.J.C. , Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26, pp. 3120–3128 (2013)

  34. Rasmus, A., Berglund, M., Honkala, M., Valpola, H., Raiko, T.: Semi-supervised learning with ladder networks. In: Advances in Neural Information Processing Systems, pp. 3546–3554 (2015)

  35. Santosa, F., Symes, W.: Linear inversion of band-limited reflection seismograms. SIAM J. Sci. Stat. Comput. 7(4), 1307–1330 (1986). https://doi.org/10.1137/0907087


  36. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)


  37. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 58(1), 267–88 (1996)


  38. Veldt, N., Klymko, C., Gleich, D. F.: Flow-based local graph clustering with better seed set inclusion. In: Proceedings of the SIAM International Conference on Data Mining (2019)

  39. Wang, D., Fountoulakis, K., Henzinger, M., Mahoney, M.W., Rao, S.: Capacity releasing diffusion for speed and locality. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 3598–3607 (2017)

  40. Yan, Y., Bian, Y., Luo, D., Lee, D., Zhang, X.: Constrained local graph clustering by colored random walk. In: Proceedings of World Wide Web Conference, pp. 2137–2146 (2019)

  41. Yin, H., Benson, A. R., Leskovec, J., Gleich, D. F.: Local higher-order graph clustering. In: Proceedings of the 23rd ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 555–564 (2017)

  42. Yin, K., Tai, X.-C.: An effective region force for some variational models for learning and clustering. J. Sci. Comput. 74, 175–196 (2018)


  43. Zelnik-Manor, L., Perona, P.: Self-tuning spectral clustering. In: Advances in neural information processing systems, pp. 1601–1608 (2004)

  44. Zhang, A. Y., Zhou, H. H., Gao, C., Ma, Z.: Achieving optimal misclassification proportion in stochastic block model. arXiv:1505.03772 (2015)


Funding

The first author is supported by the Simons Foundation collaboration grant #864439.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ming-Jun Lai.

Ethics declarations

Conflict of interest

The authors have not disclosed any competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Hyperparameters for Numerical Experiments

For each cluster to be recovered, we sample the seed vertices \(\varGamma _i\) uniformly from \(C_i\) in all implementations. We fix the random walk depth at \(t=3\) and use the random walk threshold parameter \(\delta =0.8\) for the political blogs network and \(\delta =0.6\) for all other experiments. We vary the rejection parameter \(R\in (0,1)\) for each specific experiment based on the estimated sizes of the clusters. When neither the estimated cluster sizes nor the number of clusters is given, we may refer to the spectrum of the graph Laplacian: the large gap between two consecutive eigenvalues estimates the number of clusters, and the average size then estimates the size of each cluster.
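As a sketch, the spectral-gap heuristic for estimating the number of clusters can be implemented as follows. We write it in Python for illustration (our experiments themselves are in MATLAB); the helper name `estimate_num_clusters` and the eigenvalue cap `k_max` are illustrative choices, and \(L=I-D^{-1/2}AD^{-1/2}\) is the standard normalized graph Laplacian.

```python
import numpy as np

def estimate_num_clusters(A, k_max=20):
    """Estimate the cluster count from the spectrum of the normalized
    graph Laplacian L = I - D^{-1/2} A D^{-1/2}: the position of the
    largest gap between consecutive small eigenvalues is a common
    heuristic for the number of clusters."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                                  # vertex degrees
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))   # guard isolated vertices
    L = np.eye(len(A)) - (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    eigvals = np.sort(np.linalg.eigvalsh(L))[:k_max]   # smallest eigenvalues
    gaps = np.diff(eigvals)
    return int(np.argmax(gaps)) + 1                    # index of the largest gap
```

For a graph with three well-separated components, the Laplacian has (near-)zero eigenvalue with multiplicity three, so the largest gap sits after the third eigenvalue and the heuristic returns 3.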

We fix the least squares threshold parameter at \(\gamma =0.2\) for all experiments; this choice is purely heuristic. However, we have found experimentally that the performance of the algorithms is not affected much by varying \(\gamma \in [0.15,0.4]\). The hyperparameter MaxIter is chosen according to the size of the initial seed set relative to the total number of vertices in the cluster. For comparison purposes only, we keep \(MaxIter=1\) for the MNIST data. By experimenting with different choices of MaxIter, we find that the best performance for the AT&T data occurs at \(MaxIter=2\) for \(10\%\) seeds and \(MaxIter=1\) for \(20\%\) and \(30\%\) seeds. For the YaleB data, the best performance occurs at \(MaxIter=2\) for \(5\%\), \(10\%\), and \(20\%\) seeds. All numerical experiments are implemented in MATLAB and can be run on a personal machine; for reproducibility, we provide a sample demo code at https://github.com/zzzzms/LeastSquareClustering.

Appendix B Image Data Preprocessing

For the YaleB human faces data, we performed some data preprocessing to remove poor-quality images. Specifically, we discarded pictures that are too dark, and we cropped each image to size \(54\times 46\) to reduce the effect of background noise. From the remaining qualified pictures, we randomly selected 20 images for each person.
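A minimal sketch of this preprocessing step, written in Python for illustration: the mean-intensity darkness cutoff `dark_thresh` is a hypothetical criterion (the exact darkness test is not specified above), and `preprocess_subject` is an illustrative helper name.

```python
import numpy as np

def preprocess_subject(images, n_keep=20, dark_thresh=40.0, crop=(54, 46), seed=0):
    """Discard images whose mean intensity falls below dark_thresh
    (hypothetical cutoff), center-crop the rest to `crop`, and
    randomly keep n_keep of the qualified images."""
    rng = np.random.default_rng(seed)
    bright = [img for img in images if img.mean() >= dark_thresh]
    h, w = crop
    cropped = []
    for img in bright:
        top = (img.shape[0] - h) // 2     # center crop
        left = (img.shape[1] - w) // 2
        cropped.append(img[top:top + h, left:left + w])
    idx = rng.choice(len(cropped), size=min(n_keep, len(cropped)), replace=False)
    return [cropped[i] for i in idx]
```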

All the image data in MNIST, AT&T, and YaleB need to be constructed into an auxiliary graph before the implementation. Let \(\varvec{x}_i\in \mathbb {R}^n\) be the vectorization of an image from the original data set. Following [18, 43], we define the affinity matrix of the K-NN auxiliary graph based on the Gaussian kernel as

$$A_{ij} = \begin{cases} e^{-\Vert \varvec{x}_i-\varvec{x}_j\Vert ^2/(\sigma _i\sigma _j)} & \text{if } \varvec{x}_j\in NN(\varvec{x}_i,K), \\ 0 & \text{otherwise.} \end{cases}$$

The notation \(NN(\varvec{x}_i,K)\) denotes the set of K-nearest neighbours of \(\varvec{x}_i\), and \(\sigma _i:=\Vert \varvec{x}_i-\varvec{x}^{(r)}_i\Vert \), where \(\varvec{x}^{(r)}_i\) is the r-th closest point to \(\varvec{x}_i\). Note that the above \(A\) is not necessarily symmetric, so we consider \(\tilde{A}=A^T A\) for symmetrization. Alternatively, one may consider \(\tilde{A}_{ij}=\max \{A_{ij}, A_{ji}\}\) or \(\tilde{A}_{ij}=(A_{ij}+A_{ji})/2\). We use \(\tilde{A}\) as the input adjacency matrix for our algorithms.

We fix the local scaling parameters at \(K=15\), \(r=10\) for the MNIST data, \(K=8\), \(r=5\) for the YaleB data, and \(K=5\), \(r=3\) for the AT&T data during implementation.
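The affinity construction above can be sketched in Python (our code is in MATLAB); this uses the max-symmetrization variant, and the dense pairwise-distance computation is for illustration only and would not scale to the full MNIST data set.

```python
import numpy as np

def knn_affinity(X, K=15, r=10):
    """Self-tuning Gaussian-kernel affinity on a K-NN graph: sigma_i is
    the distance from x_i to its r-th closest point, and A_ij is nonzero
    only when x_j is among the K nearest neighbours of x_i."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    order = np.argsort(D, axis=1)              # column 0 is the point itself
    sigma = D[np.arange(len(X)), order[:, r]]  # distance to r-th closest point
    A = np.zeros_like(D)
    for i in range(len(X)):
        nn = order[i, 1:K + 1]                 # K nearest neighbours, excluding self
        A[i, nn] = np.exp(-D[i, nn] ** 2 / (sigma[i] * sigma[nn]))
    return np.maximum(A, A.T)                  # tilde{A}_ij = max{A_ij, A_ji}
```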

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Lai, MJ., Shen, Z. A Compressed Sensing Based Least Squares Approach to Semi-supervised Local Cluster Extraction. J Sci Comput 94, 63 (2023). https://doi.org/10.1007/s10915-022-02052-x

