Minimum spectral connectivity projection pursuit

Divisive clustering using optimal projections for spectral clustering

Abstract

We study the problem of determining the optimal low-dimensional projection for maximising the separability of a binary partition of an unlabelled dataset, as measured by spectral graph theory. This is achieved by finding projections which minimise the second eigenvalue of the graph Laplacian of the projected data; this corresponds to a non-convex, non-smooth optimisation problem. We show that the optimal univariate projection based on spectral connectivity converges to the vector normal to the maximum margin hyperplane through the data, as the scaling parameter is reduced to zero. This establishes a connection between connectivity as measured by spectral graph theory and maximal Euclidean separation. The computational cost associated with each eigen problem is quadratic in the number of data points. To mitigate this issue, we propose an approximation method using microclusters with provable approximation error bounds. Combining multiple binary partitions within a divisive hierarchical model allows us to construct clustering solutions admitting clusters with varying scales and lying within different subspaces. We evaluate the performance of the proposed method on a large collection of benchmark datasets and find that it compares favourably with existing methods for projection pursuit and dimension reduction for data clustering. Applying the proposed approach for a decreasing sequence of scaling parameters allows us to obtain large margin clustering solutions, which are found to be competitive with those from dedicated maximum margin clustering algorithms.


Notes

  1. An R implementation of the SCPP algorithm is available at https://github.com/DavidHofmeyr/SCPP.

  2. https://archive.ics.uci.edu/ml/datasets.html.

  3. http://genome-www.stanford.edu/cellcycle/.

  4. https://cervisia.org/machine_learning_data.php/.

  5. https://web.stanford.edu/~hastie/ElemStatLearn/.

  6. We used the implementation provided by the authors, taken from https://sites.google.com/site/binzhao02/.

References

  1. Bach, F.R., Jordan, M.I.: Learning spectral clustering, with application to speech separation. J. Mach. Learn. Res. 7, 1963–2001 (2006)


  2. Bache, K., Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml

  3. Boumal, N., Mishra, B., Absil, P.A., Sepulchre, R.: Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res. 15, 1455–1459 (2014)


  4. Burke, J.V., Lewis, A.S., Overton, M.L.: A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM J. Optim. 15(3), 751–779 (2006)


  5. Chi, Y., Song, X., Zhou, D., Hino, K., Tseng, B.L.: On evolutionary spectral clustering. ACM Trans. Knowl. Discov. Data 3(4), 17:1–17:30 (2009)


  6. Edelman, A., Arias, T., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20(2), 303–353 (1998)


  7. Fan, K.: On a theorem of Weyl concerning eigenvalues of linear transformations I. Proc. Natl. Acad. Sci. USA 35(11), 652 (1949)


  8. Hagen, L., Kahng, A.B.: New spectral methods for ratio cut partitioning and clustering. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 11(9), 1074–1085 (1992)


  9. Hartigan, J.A., Hartigan, P.M.: The dip test of unimodality. Ann. Stat. 13(1), 70–84 (1985)


  10. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Texts in Statistics, 2nd edn. Springer, New York (2009)


  11. Hofmeyr, D., Pavlidis, N.: Maximum clusterability divisive clustering. In: 2015 IEEE Symposium Series on Computational Intelligence, pp. 780–786. IEEE (2015)

  12. Hofmeyr, D.: Improving spectral clustering using the asymptotic value of the normalised cut. arXiv preprint arXiv:1703.09975 (2017)

  13. Joachims, T.: Transductive inference for text classification using support vector machines. In: Proceedings of International Conference on Machine Learning (ICML), Bled, Slovenia, vol. 99, pp. 200–209 (1999)

  14. Kaiser, H.F.: The application of electronic computers to factor analysis. Educ. Psychol. Meas. 20(1), 141–151 (1960)


  15. Krause, A., Liebscher, V.: Multimodal projection pursuit using the dip statistic. Preprint-Reihe Mathematik 13 (2005)

  16. Lewis, A.S., Overton, M.L.: Eigenvalue optimization. Acta Numer. 5, 149–190 (1996)


  17. Lewis, A., Overton, M.: Nonsmooth optimization via quasi-Newton methods. Math. Program. 141, 135–163 (2013)


  18. Magnus, J.R.: On differentiating eigenvalues and eigenvectors. Econ. Theory 1(02), 179–191 (1985)


  19. Ng, A., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: Dietterich, T., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems, vol. 14, pp. 849–856. MIT Press, Cambridge (2002)


  20. Ning, H., Xu, W., Chi, Y., Gong, Y., Huang, T.S.: Incremental spectral clustering by efficiently updating the eigen-system. Pattern Recogn. 43(1), 113–127 (2010)


  21. Niu, D., Dy, J.G., Jordan, M.I.: Dimensionality reduction for spectral clustering. In: International Conference on Artificial Intelligence and Statistics, pp. 552–560 (2011)

  22. Nocedal, J., Wright, S.: Numerical Optimization. Springer, Berlin (2006)


  23. Overton, M.L., Womersley, R.S.: Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Math. Program. 62(1–3), 321–357 (1993)


  24. Pavlidis, N.G., Hofmeyr, D.P., Tasoulis, S.K.: Minimum density hyperplanes. J. Mach. Learn. Res. 17(156), 1–33 (2016)


  25. Peña, D., Prieto, F.J.: Cluster identification using projections. J. Am. Stat. Assoc. 96(456), 1433–1445 (2001)


  26. Polak, E.: On the mathematical foundations of nondifferentiable optimization in engineering design. SIAM Rev. 29(1), 21–89 (1987)


  27. Rahimi, A., Recht, B.: Clustering with normalized cuts is clustering with a hyperplane. Stat. Learn. Comput. Vis. 56, 1 (2004)


  28. Schur, J.: Bemerkungen zur theorie der beschränkten bilinearformen mit unendlich vielen veränderlichen. J. für die reine und Angew. Math. 140, 1–28 (1911)


  29. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)


  30. Strehl, A., Ghosh, J.: Cluster ensembles–a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583–617 (2002)


  31. Tong, S., Koller, D.: Restricted Bayes optimal classifiers. In: AAAI/IAAI, pp. 658–664 (2000)

  32. Trillos, N.G., Slepčev, D., Von Brecht, J., Laurent, T., Bresson, X.: Consistency of Cheeger and ratio graph cuts. J. Mach. Learn. Res. 17(1), 6268–6313 (2016)


  33. Vapnik, V.N., Kotz, S.: Estimation of Dependences Based on Empirical Data, vol. 40. Springer, New York (1982)


  34. von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)


  35. Wagner, D., Wagner, F.: Between Min Cut and Graph Bisection. Springer, Berlin (1993)


  36. Wang, F., Zhao, B., Zhang, C.: Linear time maximum margin clustering. IEEE Trans. Neural Netw. 21(2), 319–332 (2010)


  37. Weiss, Y.: Segmentation using eigenvectors: a unifying view. In: Proceedings of the 7th IEEE International Conference on Computer Vision, vol. 2, pp. 975–982 (1999)

  38. Weyl, H.: Das asymptotische verteilungsgesetz der eigenwerte linearer partieller differentialgleichungen (mit einer anwendung auf die theorie der hohlraumstrahlung). Math. Ann. 71(4), 441–479 (1912)


  39. Wolfe, P.: On the convergence of gradient methods under constraint. IBM J. Res. Dev. 16(4), 407–411 (1972)


  40. Xu, L., Neufeld, J., Larson, B., Schuurmans, D.: Maximum margin clustering. In: Advances in Neural Information Processing Systems, pp. 1537–1544 (2004)

  41. Yan, D., Huang, L., Jordan, M.I.: Fast approximate spectral clustering. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 907–916. ACM (2009)

  42. Ye, Q.: Relative perturbation bounds for eigenvalues of symmetric positive definite diagonally dominant matrices. SIAM J. Matrix Anal. Appl. 31(1), 11–17 (2009)


  43. Zelnik-Manor, L., Perona, P.: Self-tuning spectral clustering. In: Advances in Neural Information Processing Systems, pp. 1601–1608 (2004)

  44. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. In: ACM SIGMOD Record, vol. 25, pp. 103–114. ACM (1996)

  45. Zhang, B.: Dependence of clustering algorithm performance on clustered-ness of data. Technical Report, 20010417. Hewlett-Packard Labs (2001)

  46. Zhang, K., Tsang, I.W., Kwok, J.T.: Maximum margin clustering made practical. IEEE Trans. Neural Netw. 20(4), 583–596 (2009)


  47. Zhao, Y., Karypis, G.: Empirical and theoretical comparisons of selected criterion functions for document clustering. Mach. Learn. 55(3), 311–331 (2004)



Acknowledgements

The authors would like to thank the anonymous reviewers for their insightful recommendations, which helped improve the quality of the paper. They would also like to thank Dr. Teemu Roos for his valuable comments on this work. Finally, they are very grateful to Dr. Kai Zhang for providing code to implement the iSVR algorithm.

Author information


Corresponding author

Correspondence to David P. Hofmeyr.

Additional information

David Hofmeyr acknowledges support from the EPSRC-funded EP/H023151/1 STOR-i centre for doctoral training as well as the Oppenheimer Memorial Trust. Idris Eckley was supported by EPSRC Grant EP/N031938/1 (StatScale).

Appendices

Avoiding outliers

It has been documented that spectral clustering can be sensitive to outliers (Rahimi and Recht 2004). Our experience has shown that this problem becomes more pronounced when performing dimension reduction based on the spectral clustering objective, especially in high-dimensional applications. Consider the extreme case where \(d>N\): since the linear system \(V^\top X = P\) is underdetermined, for any P there exists \(\pmb {\theta }\in \varTheta , c \in \mathbb {R}{\setminus }\{0\}\) s.t. \(V(\pmb {\theta })^\top X = cP\). The projected data can therefore be made to have any distribution (up to a scaling constant). In other words there will always be projections that contain outliers. We have found that even in problems of moderate dimensionality, there often exist projections which induce large separation of a small group of points from the remainder of the data. These projections frequently achieve the minimum spectral connectivity for both Ratio Cut and Normalised Cut.

We have found that by defining a metric which encourages the induced cluster boundaries to intersect a compact set, \(\pmb {\varDelta }(\pmb {\theta })\), around the mean of the projected data, the problem of outliers can be mitigated. This is achieved by reducing the distance, relative to the usual Euclidean metric, to points lying outside \(\pmb {\varDelta }(\pmb {\theta })\). Points lying outside \(\pmb {\varDelta }(\pmb {\theta })\), which may be outliers, therefore have increased similarity to all others. We define \(\pmb {\varDelta }(\pmb {\theta }) = \varDelta _1 \times \cdots \times \varDelta _l\), where \(\varDelta _i = [\mu _i-\beta \sigma _{i}, \mu _i + \beta \sigma _{i}]\); \(\mu _i\) and \(\sigma _{i}\) are the mean and standard deviation of the i-th component of the projected data; and \(\beta \geqslant 0\) controls the size of \(\pmb {\varDelta }(\pmb {\theta })\). The modified distance metric, \(d(\cdot , \cdot )\), is defined with respect to a continuously differentiable transformation, \(T_{\varDelta }\), of the projected data,

$$\begin{aligned} d(p_i, p_j)&= \Vert T_{\varDelta }(p_i) - T_{\varDelta }(p_j)\Vert _2, \end{aligned}$$
(19)
$$\begin{aligned} T_{\varDelta }(y)&= \left( t_{\varDelta _1}(y_1), \ldots , t_{\varDelta _l}(y_l)\right) , \end{aligned}$$
(20)
$$\begin{aligned} t_{\varDelta _i}(z)&:= \left\{ \begin{array}{ll} c_2 -\beta \sigma _i -\delta \left( c_1 -\beta \sigma _i - z \right) ^{1-\delta }, &{} z < -\beta \sigma _i\\ z, &{} z \in \varDelta _i\\ \beta \sigma _i + \delta \left( z - \beta \sigma _i + c_1 \right) ^{1-\delta } - c_2, &{} z > \beta \sigma _i, \end{array}\right. \end{aligned}$$
(21)

where \(\delta \in (0, 0.5]\) is the distance reducing parameter, and \(c_1\) and \(c_2\) are equal to \(\left( \delta \left( 1-\delta \right) \right) ^{1/\delta }\) and \(\delta c_1^{1-\delta }\), respectively. By construction \(\Vert T_{\varDelta }(p_i) - T_{\varDelta }(p_j)\Vert _2 \le \Vert p_i - p_j\Vert _2\) for any \(p_i,p_j \in \mathbb {R}^l\), with strict inequality when either or both \(p_i,p_j \notin \pmb {\varDelta }(\pmb {\theta })\).
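
To make the piecewise definition in Eq. (21) concrete, the following sketch implements the univariate transform for a centred interval \([-\beta \sigma , \beta \sigma ]\) (the multivariate \(T_{\varDelta }\) applies it coordinate-wise). This is an illustrative Python implementation of the formula above, not the authors' released R code; the function name and the numerical choices are ours.

```python
import numpy as np

def t_delta(z, half_width, delta):
    """Distance-reducing transform of Eq. (21) for the centred interval
    [-half_width, half_width]: identity inside, sublinear growth outside."""
    c1 = (delta * (1.0 - delta)) ** (1.0 / delta)
    c2 = delta * c1 ** (1.0 - delta)
    z = np.asarray(z, dtype=float)
    out = z.copy()
    below = z < -half_width
    above = z > half_width
    # below the interval: mirror image of the upper branch
    out[below] = c2 - half_width - delta * (c1 - half_width - z[below]) ** (1.0 - delta)
    # above the interval: grows like delta * (z - half_width)**(1 - delta)
    out[above] = half_width + delta * (z[above] - half_width + c1) ** (1.0 - delta) - c2
    return out

# points inside [-1, 1] are unchanged; points outside are pulled inward
x = np.array([-4.0, -0.5, 0.5, 4.0])
tx = t_delta(x, 1.0, 0.3)
```

The constants \(c_1\) and \(c_2\) are chosen exactly so that the map is continuously differentiable at \(\pm \beta \sigma \) (the one-sided derivative at the boundary equals 1), and since the outer derivative \(\delta (1-\delta )(x+c_1)^{-\delta }\) never exceeds 1, the transform never increases pairwise distances, consistent with the contraction property stated above.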

Figure 7 illustrates the impact of \(T_{\varDelta }\) on pairwise distances in the univariate case. As shown, distance increases linearly within the interval \(\varDelta \), but outside \(\varDelta \) it increases much more slowly, at a rate determined by \(\delta \). In the limit as \(\delta \) approaches zero, all points outside \(\varDelta \) are mapped to the boundary of \(\varDelta \). As a result, distances between points outside \(\varDelta \) and all other points are much smaller after transformation through \(T_\varDelta \), and points which can be characterised as outliers in terms of the original projections, \(\mathcal {P}\), do not appear as such in terms of \(T_\varDelta (\mathcal {P})\).

Fig. 7

Pairwise distances of points outside \(\varDelta \) are decreased through the transformation \(T_{\varDelta }\)

An illustration of the usefulness of this modified metric is provided in Fig. 8. The figure shows two-dimensional projections of the 64-dimensional optical recognition of handwritten digits dataset (Bache and Lichman 2013). The left plots show the true clusters, while the right plots show the clustering assignments based on spectral clustering using the normalised Laplacian (Shi and Malik 2000). Figure 8a shows the projection onto the first two principal components, which are also used as initialisation for our method. There are clearly a few points outlying from the remainder of the data, which are separated by the spectral clustering algorithm. Figure 8b shows the optimal projection from minimising \(\lambda _2(L_{\mathrm {N}}(\pmb {\theta }))\) using the Euclidean metric. The result is that the outlying points have been further separated from the remainder of the data, thereby exacerbating the outlier problem. Finally, Fig. 8c shows the same result but using the modified metric discussed above and with \(\beta = 3\). In this case the projection pursuit is able to find a projection which separates two of the true clusters clearly from the remainder.

Fig. 8

Two-dimensional projections of optical recognition of handwritten digits dataset. The left plots show the true clusters, while the right plots show the partitions made by spectral clustering. a PCA projection used for initialisation. b Optimal projection from minimising \(\lambda _2(L_{\mathrm {N}}(\pmb {\theta }))\) with the Euclidean metric. c Optimal projection from minimising \(\lambda _2(L_{\mathrm {N}}(\pmb {\theta }))\) with the modified metric (\(\beta = 3\))

Derivatives

Evaluating \(D_{P_i}\lambda _2(\cdot )\)

We first consider the standard Laplacian L and use \(\lambda \) and u to denote the second eigenvalue and corresponding eigenvector. By Eq. (11) we have \(d\lambda = u^\top d(L) u = u^\top d(D) u - u^\top d(A) u\).

Now,

$$\begin{aligned} \frac{\partial D_{ii}}{\partial P_{mn}}&= \sum _{j=1}^N \frac{\partial A_{ij}}{\partial P_{mn}} = \sum _{j=1}^N \frac{\partial s(P, i, j)}{\partial P_{mn}}, \\ \frac{\partial A_{ij}}{\partial P_{mn}}&= \frac{\partial s(P, i, j)}{\partial P_{mn}}, \end{aligned}$$

and so,

$$\begin{aligned} \frac{\partial {\lambda }}{\partial P_{mn}} = u^\top \frac{\partial L}{\partial P_{mn}} u = \frac{1}{2}\sum _{i, j}(u_i-u_j)^2\frac{\partial s(P, i, j)}{\partial P_{mn}}. \end{aligned}$$
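
As a sanity check on this expression, one can compare it against a finite-difference approximation in the simplest setting: univariate projected data, Gaussian similarities \(s(P,i,j) = \exp (-(p_i-p_j)^2/\sigma ^2)\), and no distance-reducing transform. The dataset, kernel, and tolerances below are illustrative choices, not taken from the paper.

```python
import numpy as np

p = np.array([-2.2, -1.9, -1.5, 0.2, 1.4, 1.8, 2.3])  # projected data
sigma = 0.8

def lambda2(p):
    # unnormalised Laplacian L = D - A with Gaussian similarities
    diff = p[:, None] - p[None, :]
    A = np.exp(-((diff / sigma) ** 2))
    np.fill_diagonal(A, 0.0)
    L = np.diag(A.sum(axis=1)) - A
    vals, vecs = np.linalg.eigh(L)
    return vals[1], vecs[:, 1]          # second eigenvalue and eigenvector

lam, u = lambda2(p)

# analytic gradient: (1/2) sum_{i,j} (u_i - u_j)^2 ds_ij/dp_m, which by
# symmetry collapses to a single sum over j for each m
diff = p[:, None] - p[None, :]
A = np.exp(-((diff / sigma) ** 2))
np.fill_diagonal(A, 0.0)
dS = -2.0 * diff / sigma**2 * A         # d s_mj / d p_m
grad = ((u[:, None] - u[None, :]) ** 2 * dS).sum(axis=1)

# central finite differences of lambda_2
h = 1e-6
fd = np.empty_like(p)
for m in range(len(p)):
    q1, q2 = p.copy(), p.copy()
    q1[m] += h
    q2[m] -= h
    fd[m] = (lambda2(q1)[0] - lambda2(q2)[0]) / (2 * h)
```

Note that the comparison is only valid when the second eigenvalue is simple, as assumed throughout the smooth part of the analysis; the hand-picked data above avoid the degenerate case.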

For the normalised Laplacian, \(L_{\mathrm {N}}\), consider first

$$\begin{aligned} d(L_{\mathrm {N}})&= d(D^{-1/2}LD^{-1/2})\\&= d(D^{-1/2})LD^{-1/2}+D^{-1/2}d(D)D^{-1/2} \\&\quad - D^{-1/2}d(A)D^{-1/2} + D^{-1/2}L d(D^{-1/2}). \end{aligned}$$

We again use \(\lambda \) and u to denote the second eigenvalue and corresponding eigenvector. Using \(LD^{-1/2}u = \lambda D^{1/2}u\),

$$\begin{aligned} d\lambda&= u^\top d(D^{-1/2})LD^{-1/2}u + u^\top D^{-1/2}d(D)D^{-1/2}u\\&\quad - u^\top D^{-1/2}d(A)D^{-1/2}u + u^\top D^{-1/2}L d(D^{-1/2})u\\&= \lambda u^\top d(D^{-1/2})D^{1/2}u + u^\top D^{-1/2}d(D)D^{-1/2}u \\&\quad - u^\top D^{-1/2}d(A)D^{-1/2}u + \lambda u^\top D^{1/2}d(D^{-1/2})u\\&= (1-\lambda )u^\top D^{-1/2}d(D)D^{-1/2}u - u^\top D^{-1/2}d(A)D^{-1/2}u \\&= u^\top D^{-1/2}d(L)D^{-1/2}u - \lambda u^\top D^{-1/2}d(D)D^{-1/2}u, \end{aligned}$$

where in the third step we made use of the fact that \(d(D^{-1/2})DD^{-1/2} + D^{-1/2}d(D)D^{-1/2} + D^{-1/2}Dd(D^{-1/2}) = d(D^{-1/2}DD^{-1/2}) = d(I) = \mathbf {0}\). Therefore,

$$\begin{aligned} \frac{\partial \lambda }{\partial P_{mn}}&= \frac{1}{2} \sum _{i, j} \left( \frac{u_i}{\sqrt{d_i}} - \frac{u_j}{\sqrt{d_j}}\right) ^2 \frac{\partial s(P, i, j)}{\partial P_{mn}} \\&\quad - \lambda \sum _{i, j} \frac{u_i^2}{d_i}\frac{\partial s(P, i, j)}{\partial P_{mn}}. \end{aligned}$$
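
The same finite-difference comparison can be run for the normalised Laplacian; the per-point gradient below groups the two sums of the expression above into a single sum over j. Again the data, Gaussian kernel, and tolerances are illustrative choices of ours, with no distance-reducing transform applied.

```python
import numpy as np

p = np.array([-2.2, -1.9, -1.5, 0.2, 1.4, 1.8, 2.3])
sigma = 0.8

def lambda2_norm(p):
    # normalised Laplacian L_N = I - D^{-1/2} A D^{-1/2}
    diff = p[:, None] - p[None, :]
    A = np.exp(-((diff / sigma) ** 2))
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    Ln = np.eye(len(p)) - A / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(Ln)
    return vals[1], vecs[:, 1], d

lam, u, d = lambda2_norm(p)

diff = p[:, None] - p[None, :]
A = np.exp(-((diff / sigma) ** 2))
np.fill_diagonal(A, 0.0)
dS = -2.0 * diff / sigma**2 * A         # d s_mj / d p_m
v = u / np.sqrt(d)                      # u_i / sqrt(d_i)
w = u**2 / d                            # u_i^2 / d_i
# coefficient (u_m/sqrt(d_m) - u_j/sqrt(d_j))^2 - lam*(u_m^2/d_m + u_j^2/d_j)
coef = (v[:, None] - v[None, :]) ** 2 - lam * (w[:, None] + w[None, :])
grad = (coef * dS).sum(axis=1)

h = 1e-6
fd = np.empty_like(p)
for m in range(len(p)):
    q1, q2 = p.copy(), p.copy()
    q1[m] += h
    q2[m] -= h
    fd[m] = (lambda2_norm(q1)[0] - lambda2_norm(q2)[0]) / (2 * h)
```

The coefficient in `coef` is exactly the one that reappears in the microcluster formulation below, with degrees in place of counts.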

Derivatives of the approximate eigenvalue functions based on microclusters

In the general case we may consider a set of m microclusters with centres \(c_1, \ldots , c_m\) and counts \(n_1, \ldots , n_m\). The derivations we provide are valid for \(n_i = 1 \ \forall i \in \{1, \dots , m\}\), and so apply to the exact formulation of the problem as well. Let \(\pmb {\theta }\in \varTheta \). We find it practically convenient to associate the transformation in Eq. (20), which incorporates the set \(\pmb {\varDelta }(\pmb {\theta })\), with the projection of the microclusters rather than with the computation of similarities. Specifically, we now let \({\mathcal {T}}\) be the transformed projected microcluster centres, i.e.

$$\begin{aligned} {\mathcal {T}}&= \{t_1, t_1, \dots , t_m, t_m \}\\&= \{T_{\pmb {\varDelta }(\pmb {\theta })}(V(\pmb {\theta })^\top c_1), T_{\pmb {\varDelta } (\pmb {\theta })}(V(\pmb {\theta })^\top c_1),\\&\ldots , T_{\pmb {\varDelta }(\pmb {\theta })}(V(\pmb {\theta })^\top c_m), T_{\pmb {\varDelta }(\pmb {\theta })}(V(\pmb {\theta })^\top c_m) \}, \end{aligned}$$

where each \(t_i\) is repeated \(n_i\) times. The reason for this is that with this formulation the majority of terms in the above sums corresponding to \(\partial \lambda \) (which are now partial derivatives w.r.t. the elements of \({\mathcal {T}}\), and not \(\mathcal {P}\) as before) are zero. Specifically, with this expression for \({\mathcal {T}}\), and letting T be the matrix with columns corresponding to elements in \({\mathcal {T}}\), we have

$$\begin{aligned} \frac{\partial \lambda }{\partial T_{mn}}&= \frac{1}{2} \sum _{i, j} (u_i-u_j)^2\frac{\partial k(\Vert t_i - t_j\Vert /\sigma )}{\partial T_{mn}}\nonumber \\&= \sum _{i \not = n } (u_i-u_n)^2\frac{\partial k(\Vert t_i - t_n\Vert /\sigma )}{\partial T_{mn}}, \end{aligned}$$
(22)

and similarly for the normalised Laplacian.

In Sect. 3 we expressed \(D_{\pmb {\theta }}\lambda \) via the chain rule decomposition \(D_P\lambda D_v PD_{\pmb {\theta }} v\), which we can now simply restructure as \(D_T\lambda D_v TD_{\pmb {\theta }} v\). The compression of \({\mathcal {T}}\) to the size m non-repeated set, \({\mathcal {T}}^C = \{t_1, \ldots , t_m \}\), requires a slight restructuring, as described in Sect. 5. We begin with the standard Laplacian, letting \(T^C\) be the matrix corresponding to \({\mathcal {T}}^C\), and define \(N(\pmb {\theta })\) and \(B(\pmb {\theta })\) as in Lemma 3. That is, \(N(\pmb {\theta })\) is the diagonal matrix with i-th diagonal element equal to \(\sum _{j=1}^m n_j k(\Vert t_i - t_j\Vert /\sigma )\) and \(B(\pmb {\theta })_{i,j} = \sqrt{n_i n_j} k(\Vert t_i - t_j\Vert /\sigma )\). The derivative of the second eigenvalue of the Laplacian relies on the corresponding eigenvector, u. However, this vector is not explicitly available as we only solve the \(m\times m\) eigen problem of \(N(\pmb {\theta }) - B(\pmb {\theta })\). Let \(u^C\) be the second eigenvector of \(N(\pmb {\theta }) - B(\pmb {\theta })\). As in the proof of Lemma 3 if ij are such that the i-th element of \({\mathcal {T}}\) corresponds to the j-th microcluster, then \(u^C_j = \sqrt{n_j}u_i\). The derivative of \(\lambda _2(N(\pmb {\theta })-B(\pmb {\theta }))\) with respect to the i-th column of \(\pmb {\theta }\), and thus equivalently of the second eigenvalue of the Laplacian, is therefore the vector with j-th entry given by

$$\begin{aligned} \sum _{k \not = j}\left( \frac{u^C_k}{\sqrt{n_k}}-\frac{u^C_j}{\sqrt{n_j}}\right) ^2n_kn_j\frac{\partial k\left( \frac{\Vert t_k-t_j\Vert }{\sigma }\right) }{\partial T^C_{kj}} D_{V_i} T^C_i D_{\pmb {\theta }_i}V_i, \end{aligned}$$

where \(D_{\pmb {\theta }_i}V_i\) is given in Eq. (12) and \(D_{V_i} T^C_i\) is expressed below. We provide expressions for the case where

$$\begin{aligned} \varDelta (\pmb {\theta }) = \prod _{i=1}^l[- \beta \sigma _{\pmb {\theta }_i}, \beta \sigma _{\pmb {\theta }_i}], \end{aligned}$$

as in our implementation, where we have again assumed that the data have been centred, i.e. have zero mean. Then \(D_{V_i} T^C_i\) is the \(m \times d\) matrix with j-th row equal to,

$$\begin{aligned} \frac{\delta (1-\delta )}{(-\beta \sigma _{\pmb {\theta }_i} - V_i^\top c_j + (\delta (1-\delta ))^{1/\delta })^\delta } \left( \frac{\beta }{\sigma _{\pmb {\theta }_i}}\varSigma V_i + c_j\right) , \end{aligned}$$

if \(V_i^\top c_j < -\beta \sigma _{\pmb {\theta }_i}\),

$$\begin{aligned} c_j, \end{aligned}$$

if \(-\beta \sigma _{\pmb {\theta }_i} \le V_i^\top c_j \le \beta \sigma _{\pmb {\theta }_i}\), and

$$\begin{aligned} \frac{\delta (1-\delta )}{(V_i^\top c_j - \beta \sigma _{\pmb {\theta }_i} + (\delta (1-\delta ))^{1/\delta })^\delta } \left( c_j - \frac{\beta }{\sigma _{\pmb {\theta }_i}}\varSigma V_i\right) + 2\frac{\beta }{\sigma _{\pmb {\theta }_i}} \varSigma V_i, \end{aligned}$$

if \(V_i^\top c_j>\beta \sigma _{\pmb {\theta }_i}\). Here \(\varSigma \) is the covariance matrix of the data.

For the normalised Laplacian, the reduced \(m\times m\) eigen problem has precisely the same form as the original \(N\times N\) problem, with the only difference being the introduction of the factors \(n_j n_k\). Specifically, with the derivation in Sect. 3 we can see that the corresponding derivative is as for the standard Laplacian above, except that the coefficients \((u_j^C/\sqrt{n_j} - u_k^C/\sqrt{n_k})^2n_j n_k\) in Eq. (23) are replaced with \((u_j^C/\sqrt{d_j} - u_k^C/\sqrt{d_k})^2 - \lambda ((u_j^C)^2/d_j + (u_k^C)^2/d_k)\), where \(\lambda \) is the second eigenvalue of the normalised Laplacian, \(u^C\) is the corresponding eigenvector, and \(d_j\) is the degree of the j-th element of \({\mathcal {T}}^C\).
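
The compression underlying Lemma 3 can be checked numerically: replicating each microcluster centre \(n_i\) times and building the full Laplacian yields the same second eigenvalue as the reduced \(m\times m\) matrix \(N(\pmb {\theta }) - B(\pmb {\theta })\). The toy centres, counts, and Gaussian kernel below are our own illustrative choices (zero-radius microclusters, so the correspondence is exact).

```python
import numpy as np

centres = np.array([-1.0, -0.6, 0.8, 1.2])   # microcluster centres (univariate)
counts = np.array([3, 2, 4, 2])              # points per microcluster
sigma = 0.8

K = np.exp(-((centres[:, None] - centres[None, :]) / sigma) ** 2)

# full Laplacian on the dataset with each centre replicated n_i times
full = np.repeat(centres, counts)
Af = np.exp(-((full[:, None] - full[None, :]) / sigma) ** 2)
np.fill_diagonal(Af, 0.0)
Lf = np.diag(Af.sum(axis=1)) - Af
lam_full = np.linalg.eigvalsh(Lf)[1]

# compressed m x m problem N - B, as in Lemma 3
Nmat = np.diag(K @ counts)                   # N_ii = sum_j n_j k(|c_i - c_j|/sigma)
Bmat = np.sqrt(np.outer(counts, counts)) * K  # B_ij = sqrt(n_i n_j) k(...)
lam_small = np.linalg.eigvalsh(Nmat - Bmat)[1]
```

The eigenvalues of \(N-B\) form a subset of those of the full Laplacian; the remaining eigenvalues of the full problem correspond to within-microcluster variation and equal the degrees, which are much larger than \(\lambda _2\) for well-clustered data such as this example.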

Computational complexity

Here we give a brief discussion of the computational complexity of the proposed method. At each iteration of the gradient descent, computing the projected data matrix, \(P(\pmb {\theta })\), requires \({\mathcal {O}}(Nld)\) operations. Computing all pairwise similarities from elements of the l-dimensional \(\mathcal {P}(\pmb {\theta })\) has computational complexity \({\mathcal {O}}(lN^2)\), while determining both Laplacian matrices and their associated eigenvalue/eigenvector pairs adds a further \({\mathcal {O}}(N^2)\) cost. Each evaluation of the objectives \(\lambda _2(L(\pmb {\theta }))\) or \(\lambda _2(L_{\mathrm {N}}(\pmb {\theta }))\) therefore requires \({\mathcal {O}}(lN(N+d))\) operations. In order to compute the gradients of these objectives, the partial derivatives with respect to each element of the projected data matrix need to be calculated. As discussed in relation to the derivatives above, the majority of the terms in the sums in Eqs. (13) and (14) are zero; in fact each partial derivative can be computed in \({\mathcal {O}}(N)\) time, and so all such partial derivatives can be computed in \({\mathcal {O}}(lN^2)\) time. The matrix derivatives \(D_{\pmb {\theta }_i} V_i ,i=1,\ldots ,l\), in (12) can each be computed with \({\mathcal {O}}(d(d-1))\) operations. Finally, determining the gradients with respect to each column of \(\pmb {\theta }\) involves computing the matrix product \(D_{\pmb {\theta }_i} \lambda = D_{P_i} \lambda D_{V_i} P_i D_{\pmb {\theta }_i} V_i\), where \(D_{P_i} \lambda \in \mathbb {R}^{1\times N}, D_{V_i} P_i \in \mathbb {R}^{N \times d}\) and \(D_{\pmb {\theta }_i} V_i \in \mathbb {R}^{d\times (d-1)}\). This has complexity \({\mathcal {O}}(Nd(d-1))\). The complete gradient calculation therefore requires \({\mathcal {O}}(lN(N+d(d-1)))\) operations.
We have found that the optimality conditions based on directional derivatives and the gradient sampling steps are seldom, if ever, required, and moreover that they do not constitute the bottleneck in the running time of the method in practice. The complexity of the optimality condition check can be derived along similar lines and is \({\mathcal {O}}(t^2lN(N+d(d-1)))\), where t is the multiplicity of the eigenvalue \(\lambda = \lambda _2(L(\pmb {\theta }))\). The gradient sampling is simply \({\mathcal {O}}(d)\) times the cost of computing a single gradient. The total complexity of the projection pursuit optimisation depends on the number of iterations of the gradient descent method, where in general this number is bounded for a given accuracy level. For our experiments we use the BFGS (Broyden–Fletcher–Goldfarb–Shanno) algorithm, as it has been found to perform well on non-smooth functions (Lewis and Overton 2013).

Table 3 shows the observed running times for SCPP and DRSC when applied to six datasets used in the experiments. To keep the comparison fair, we did not use the microcluster approach to speed up the SCPP algorithm. We considered only subsets of the Opt. Digits and Pen Digits datasets so that run times for DRSC could be obtained in a reasonable amount of time; we used the same subsets as in the experiments for maximum margin clustering. The SCPP algorithm converged in a reasonable amount of time in all cases, despite the absence of the microcluster speedup. DRSC, on the other hand, took as much as three orders of magnitude longer to run on some datasets, and failed to converge in half of the cases considered.

Table 3 Running time (in seconds) of SCPP and DRSC on six datasets

Proofs

Proof of Theorem 2

Before proving Theorem 2, we require some supporting theory which we present below. We will use the notation \(v^\top \mathcal {X} = \{v^\top x_1, \ldots , v^\top x_N\}\), and for a set \({\mathcal {P}} \subset \mathbb {R}\) and \(y \in \mathbb {R}\) we write, for example, \({\mathcal {P}}_{>y}\) for \({\mathcal {P}} \cap (y, \infty )\). Recall that for scaling parameter \(\sigma >0\) we define \(\pmb {\theta }_{\sigma }: = \text{ argmin }_{\pmb {\theta } \in \varTheta } \lambda _2(L(\pmb {\theta }, \sigma ))\), where \(L(\pmb {\theta }, \sigma )\) is as \(L(\pmb {\theta })\) from before, but with an explicit dependence on the scaling parameter. That is, \(\pmb {\theta }_{\sigma }\) defines the projection generating the minimal spectral connectivity of \(\mathcal {X}\) for a given value of \(\sigma \). We define \(\pmb {\theta }_{\sigma }^N\) similarly for the normalised Laplacian.

Recall that we are interested in those hyperplanes which intersect an arbitrary convex set \(\pmb {\varDelta }\). This is because very often the maximum margin hyperplane will separate only a few points from the remainder, as data tend to be more sparse in the tails of the underlying distribution. To account for the potential for hyperplanes with very large margins lying in the tails of the distribution, we make the additional assumption that the distance reducing parameter, \(\delta \), tends to zero along with \(\sigma \).

Lemmas 4 and 5 provide lower bounds on the second eigenvalue of the graph Laplacians of a one-dimensional dataset in terms of the largest Euclidean separation of adjacent points which lie within the interval \(\varDelta \), used to represent \(\pmb {\varDelta }(\pmb {\theta })\) in the context of a projection of \(\mathcal {X}\). These lemmas also show how we construct the set \(\pmb {\varDelta }^\prime \). Lemmas 6 and 7 use these results to show that a projection angle \(\pmb {\theta }\in \varTheta \) leads to lower spectral connectivity than all projections admitting smaller maximal margin hyperplanes intersecting \(\pmb {\varDelta }^\prime \) for all pairs \(\sigma , \delta \) sufficiently close to zero.

Lemma 4

Let \(k:{\mathbb {R}}_+ \rightarrow {\mathbb {R}}_+\) be a non-increasing, positive function and let \(\sigma > 0, \delta \in (0, 0.5]\). Let \(\mathcal {P}= \{p_1, \ldots , p_N\}\) be a univariate dataset and let \(\varDelta =[a, b]\) for \(a<b \in \mathbb {R}\). Suppose that \(\vert \mathcal {P}\cap \varDelta \vert \ge 2\) and \(a\ge \min \{\mathcal {P}\}, b\le \max \{\mathcal {P}\}\). Define \(\varDelta ^\prime = [a^\prime , b^\prime ]\), where \(a^\prime = (a+\min \{\mathcal {P}\cap \varDelta \})/2\) and \(b^\prime = (b+\max \{\mathcal {P}\cap \varDelta \})/2\). Let \(M = \max _{x \in \varDelta ^\prime }\{\min _{i=1\dots N}\vert x-p_i \vert \}\). Define \(L(\mathcal {P})\) to be the Laplacian of the graph with vertices \(\mathcal {P}\) and similarities according to \(s(P, i, j) = k(\vert T_{\varDelta }(p_i) - T_{\varDelta }(p_j)\vert /\sigma )\), where \(P \in \mathbb {R}^{1 \times N}\) is the matrix with i-th column equal to \(p_i\). Then \(\lambda _2(L(\mathcal {P})) \ge \frac{1}{\vert \mathcal {P}\vert ^3} k((2M+\delta C)/\sigma )\), where \(C = \max \{D, D^{1-\delta }\}\) and \(D = \max \{a-\min \{\mathcal {P}\}, \max \{\mathcal {P}\} - b\}\).

Proof

We can assume that \(\mathcal {P}\) is sorted in increasing order, i.e. \(p_i \le p_{i+1}\), since this does not affect the eigenvalues of \(L(\mathcal {P})\). We first show that \(s(P, i, i+1) \ge k((2M+\delta C)/\sigma )\) for all \(i = 1, \ldots , N-1\). To this end observe that \(\delta \left( x + \left( \delta \left( 1-\delta \right) \right) ^{\frac{1}{\delta }}\right) ^{1-\delta }-\delta \left( \delta \left( 1-\delta \right) \right) ^{\frac{1-\delta }{\delta }} \le \delta \max \{x, x^{1-\delta }\}\) for \(x\ge 0\).

  • If \(p_i, p_{i+1} \le a\) then \(s(P, i, i+1) = k((T_{\varDelta }(p_{i+1})- T_{\varDelta }(p_i))/\sigma ) \ge k((T_{\varDelta }(a) - T_{\varDelta }(p_i))/\sigma ) \ge k((2M+\delta C)/\sigma )\), by the definition of C and the above inequality, since k is non-increasing. The case \(p_i, p_{i+1}\ge b\) is similar.

  • If \(p_i, p_{i+1} \in \varDelta \) then \(p_i, p_{i+1} \in \varDelta ^\prime \), and since M is the largest margin in \(\varDelta ^\prime \) we have \(\vert p_i - p_{i+1}\vert \le 2M\), hence \(s(P, i, i+1) \ge k(2M/\sigma ) \ge k((2M+\delta C)/\sigma )\).

  • If none of the above hold, then we lose no generality in assuming \(p_i < a\), \(a<p_{i+1}<b\), since the case \(a<p_i<b\), \(p_{i+1}>b\) is analogous. We must have \(p_{i+1} = \min \{\mathcal {P}\cap \varDelta \}\) and so \(a^\prime = (a+p_{i+1})/2\). If \(p_{i+1}-a > 2M\) then \(\min _{j=1 \dots N} \vert a^\prime - p_j \vert >M\), a contradiction since \(a^\prime \in \varDelta ^\prime \) and M is the largest margin in \(\varDelta ^\prime \). Therefore \(p_{i+1}-a \le 2M\). Altogether,

    $$\begin{aligned} T_{\varDelta }(p_{i+1}) - T_{\varDelta }(p_i)&= (p_{i+1}-a) + \delta (a-p_i+(\delta (1-\delta ))^{\frac{1}{\delta }})^{1-\delta }\\&\quad - \delta (\delta (1-\delta ))^{\frac{1-\delta }{\delta }}\\&\le 2M + \delta C\\ \Rightarrow s(P, i, i+1)&\ge k((2M+\delta C)/\sigma ). \end{aligned}$$

Now, let u be the second eigenvector of \(L(\mathcal {P})\). Then \(\Vert u\Vert = 1\) and \(u\perp \mathbf {1}\), and therefore \(\exists i, j\) s.t. \(u_i - u_j \ge \frac{1}{\sqrt{N}}\). Since this difference is the sum of at most \(N-1\) consecutive differences, there exists m s.t. \(\vert u_m - u_{m+1}\vert \ge \frac{1}{N^{3/2}}\). By von Luxburg (2007, Proposition 1), we know that

$$\begin{aligned} u^\top L(\mathcal {P})u&= \frac{1}{2}\sum _{i, j}s(P, i, j)(u_i-u_j)^2\\&\ge s(P, m, m+1)(u_m-u_{m+1})^2\\&\ge \frac{1}{N ^3} k((2M+\delta C)/\sigma ), \end{aligned}$$

since all consecutive pairs \(p_m, p_{m+1}\) have similarity at least \(k((2M+\delta C)/\sigma )\), by the above. Therefore \(\lambda _2(L(\mathcal {P})) \ge \frac{1}{N^3}k((2M+\delta C)/\sigma )\) as required. \(\square \)
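The bound of Lemma 4 can be checked numerically. The sketch below is ours, not part of the paper's implementation: it uses the Laplace kernel \(k(x) = e^{-x}\) (positive and non-increasing) and takes \(\varDelta \) to be the convex hull of the data, so that \(T_{\varDelta }\) acts as the identity, the \(\delta C\) term vanishes, and M is half the widest gap between consecutive points.

```python
import numpy as np

def lambda2_laplacian(p, sigma):
    """Second-smallest eigenvalue of the graph Laplacian of a univariate
    dataset p, with similarities k(|p_i - p_j| / sigma), k(x) = exp(-x)."""
    S = np.exp(-np.abs(p[:, None] - p[None, :]) / sigma)
    L = np.diag(S.sum(axis=1)) - S
    return np.sort(np.linalg.eigvalsh(L))[1]

p = np.sort(np.random.default_rng(0).normal(size=20))
sigma = 0.5
M = np.max(np.diff(p)) / 2                      # largest margin within the hull
lam2 = lambda2_laplacian(p, sigma)
bound = np.exp(-2 * M / sigma) / len(p) ** 3    # (1/N^3) k(2M / sigma)
assert lam2 >= bound
```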

Lemma 5

Let the conditions of Lemma 4 hold and let \(L_\mathrm {N}(\mathcal {P})\) be the normalised Laplacian of the graph with vertices \(\mathcal {P}\) and similarities \(s(P, i, j) = k(\vert T_{\varDelta }(p_i) - T_{\varDelta }(p_j)\vert /\sigma )\). Then

$$\begin{aligned} \lambda _2(L_{\mathrm {N}}(\mathcal {P})) \ge \frac{1}{\vert \mathcal {P}\vert ^4} k((2M+\delta C)/\sigma ). \end{aligned}$$

Proof

The proof is similar to that of Lemma 4, but requires a few simple modifications. Let u be the second eigenvector of \(L_{\mathrm {N}}(\mathcal {P})\). Since \(\Vert u\Vert = 1\), \(\exists i \in \{1, \ldots , N\}\) s.t. \(\vert u_i \vert \ge \frac{1}{\sqrt{N}}\). Suppose without loss of generality that \(u_i \le -\frac{1}{\sqrt{N}}\). Now consider that for all \(j, k \in \{1, \ldots , N\}\) we have \(0 < s(P,j,k) \le 1\) and \(s(P,j,j) = 1\), and so \(1 < \sqrt{d_j} \le \sqrt{N}\) for all \(j \in \{1, \ldots , N\}\). Therefore we have \(u_i/\sqrt{d_i} \le -\frac{1}{N}\). Furthermore, since \(u \perp D^{1/2}\mathbf {1}\) we have \(u_j > 0\) for some \(j \in \{1, \ldots , N\}\), and hence \(u_j/\sqrt{d_j} > 0\). Therefore, \(u_j/\sqrt{d_j} - u_i/\sqrt{d_i} > \frac{1}{N}\). We thus know that \(\exists m \in \{1, \ldots , N\}\) s.t. \( \left| u_m/\sqrt{d_m} - u_{m+1}/\sqrt{d_{m+1}}\right| > \frac{1}{N^2}. \) By von Luxburg (2007, Proposition 3), we know that

$$\begin{aligned} u^\top L_{\mathrm {N}}(\mathcal {P})u&= \frac{1}{2} \sum _{i \not = j} s(P,i,j) (u_i/\sqrt{d_i} - u_j/\sqrt{d_j})^2\\&\ge s(P,m,m+1)(u_m/\sqrt{d_m} - u_{m+1}/\sqrt{d_{m+1}})^2\\&> \frac{1}{N ^4} k((2M+\delta C)/\sigma ), \end{aligned}$$

where the bound on \(s(P, m, m+1)\) is taken from the proof of Lemma 4. Therefore \(\lambda _2(L_{\mathrm {N}}(\mathcal {P})) \ge \frac{1}{N^4}k((2M+\delta C)/\sigma )\) as required. \(\square \)
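Under the same simplifying assumptions as before (a sketch of ours: Laplace kernel \(k(x)=e^{-x}\), \(\varDelta \) equal to the convex hull so that T acts as the identity and \(\delta C = 0\)), the normalised-Laplacian bound of Lemma 5 can be checked in the same way:

```python
import numpy as np

def lambda2_normalised(p, sigma):
    """Second-smallest eigenvalue of the normalised Laplacian
    L_N = I - D^{-1/2} S D^{-1/2}, Laplace kernel similarities."""
    S = np.exp(-np.abs(p[:, None] - p[None, :]) / sigma)
    d = S.sum(axis=1)
    Ln = np.eye(len(p)) - S / np.sqrt(d[:, None] * d[None, :])
    return np.sort(np.linalg.eigvalsh(Ln))[1]

p = np.sort(np.random.default_rng(1).uniform(0, 3, size=25))
sigma, N = 0.4, 25
M = np.max(np.diff(p)) / 2                      # largest margin within the hull
lam2 = lambda2_normalised(p, sigma)
bound = np.exp(-2 * M / sigma) / N ** 4         # (1/N^4) k(2M / sigma)
assert lam2 >= bound
```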

In the above we have assumed that \(\varDelta \) is contained within the convex hull of the points \(\mathcal {P}\); however, the results of this section can easily be modified to allow for cases where this does not hold. In particular, if an unconstrained large margin hyperplane is sought, then setting \(\pmb {\varDelta }\) to be arbitrarily large allows for this. We have merely stated the results in the most convenient context for our practical implementation.

The set \(\varDelta ^\prime \) in the above is defined in terms of the one-dimensional interval \([a, b]\). We define the full-dimensional set \(\pmb {\varDelta }^\prime \) along the same lines by,

$$\begin{aligned} \pmb {\varDelta }^\prime&= \{x \in \mathbb {R}^d \mid v(\pmb {\theta })^\top x \in \varDelta (\pmb {\theta })^\prime \ \forall \pmb {\theta }\in \varTheta \},\\ \varDelta (\pmb {\theta })^\prime&:= \Bigg [\frac{\min \varDelta (\pmb {\theta }) + \min \{v(\pmb {\theta })^\top \mathcal {X}\cap \varDelta (\pmb {\theta })\}}{2}, \frac{\max \varDelta (\pmb {\theta }) + \max \{v(\pmb {\theta })^\top \mathcal {X}\cap \varDelta (\pmb {\theta })\}}{2}\Bigg ]. \end{aligned}$$
(23)

Here we assume that \(\pmb {\varDelta }\) is contained within the convex hull of the d-dimensional dataset \(\mathcal {X}\). Notice that since \(\pmb {\varDelta }\) is convex, we have \(v(\pmb {\theta })^\top \pmb {\varDelta }^\prime = \varDelta (\pmb {\theta })^\prime \). In what follows we show that as \(\sigma \) is reduced to zero, the optimal projection for spectral partitioning converges to the projection admitting the largest margin hyperplane intersecting \(\pmb {\varDelta }^\prime \). If the largest margin hyperplane intersecting \(\pmb {\varDelta }\) also intersects \(\pmb {\varDelta }^\prime \), which is often the case although it cannot be verified in advance, then it is not in fact necessary that \(\delta \) tend to zero. In such cases it suffices that \(\delta \le 2M/C\) for the corresponding values of M and C over all possible projections. In particular, choosing \(\max \{\text{ Diam }(\mathcal {X}), \text{ Diam }(\mathcal {X})^{1-\delta }\}\) in place of C is appropriate for all projections.
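For a concrete picture of the quantities involved, the maximiser of \(\text{margin}(v(\pmb {\theta }), b)\) over b is attained at the midpoint of the widest gap in the projected data. The following sketch is purely illustrative (the function name and example data are ours):

```python
import numpy as np

def best_margin_split(X, v):
    """Largest-margin split point along the unit projection v:
    the midpoint of the widest gap in the projected data."""
    p = np.sort(X @ (v / np.linalg.norm(v)))
    gaps = np.diff(p)
    i = int(np.argmax(gaps))
    return (p[i] + p[i + 1]) / 2, gaps[i] / 2   # split point b, margin

X = np.array([[0.0, 0.0], [0.2, 1.0], [1.8, -1.0], [2.0, 0.0]])
b, m = best_margin_split(X, np.array([1.0, 0.0]))
# projections are 0, 0.2, 1.8, 2; the widest gap (0.2, 1.8) gives b = 1, margin 0.8
```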

Lemma 6

Let \(\pmb {\theta } \in \varTheta \) and let \(k:\mathbb {R}_+ \rightarrow {\mathbb {R}}_+\) be non-increasing, positive, and satisfy

$$\begin{aligned} \lim _{x \rightarrow \infty } k(x(1+\epsilon ))/k(x) = 0 \end{aligned}$$

for all \(\epsilon > 0\). Then for any \(0< m < \max \limits _{b \in \varDelta (\pmb {\theta })^\prime }\text{ margin }(v(\pmb {\theta }), b)\) there exists \(\sigma ^\prime > 0\) s.t. if \(0< \sigma < \sigma ^\prime \) and \(\pmb {\theta }^\prime \in \varTheta \) satisfies

$$\begin{aligned} \max \limits _{c \in \varDelta (\pmb {\theta }^\prime )^\prime }\text{ margin }(v(\pmb {\theta }^\prime ), c) < \max \limits _{b \in \varDelta (\pmb {\theta })^\prime }\text{ margin }(v(\pmb {\theta }), b) - m \end{aligned}$$

then \(\lambda _2(L(\pmb {\theta }, \sigma )) < \lambda _2(L(\pmb {\theta }^\prime , \sigma ))\).

Proof

Let \(B = \text{ argmax }_{b \in \varDelta (\pmb {\theta })^\prime }\text{ margin }(v(\pmb {\theta }), b)\) and let M be the corresponding margin, i.e. \(M = \text{ margin }(v(\pmb {\theta }), B)\). We assume that \(M \not = 0\), since otherwise there is nothing to show. Now, since spectral clustering with the standard Laplacian solves a relaxation of the minimum RatioCut problem, we have,

$$\begin{aligned}&\lambda _2(L(\pmb {\theta }, \sigma )) \le \frac{1}{\vert \mathcal {X}\vert } \min _{\mathcal {C}\subset \mathcal {X}} \sum _{\begin{array}{c} i, j: x_i \in \mathcal {C}\\ x_j \not \in \mathcal {C} \end{array}} s(P(\pmb {\theta }), i, j)\left( \frac{1}{\vert \mathcal {C}\vert } + \frac{1}{\vert \mathcal {X}{\setminus } \mathcal {C}\vert }\right) \\&\le \frac{1}{\vert \mathcal {X}\vert }\sum _{\begin{array}{c} i, j : v(\pmb {\theta })^ \top x_i< B\\ v(\pmb {\theta })^\top x_j> B \end{array}} s(P(\pmb {\theta }), i, j) \Bigg (\frac{1}{\vert (v(\pmb {\theta })^\top \mathcal {X})_{< B}\vert }\\&\quad + \frac{1}{\vert (v(\pmb {\theta })^\top \mathcal {X})_{>B}\vert } \Bigg )\\&= \frac{1}{\vert \mathcal {X}\vert }\sum _{\begin{array}{c} i, j : v(\pmb {\theta })^\top x_i< B\\ v(\pmb {\theta })^\top x_j> B \end{array}} k\left( \frac{T_{\varDelta (\pmb {\theta })} (v(\pmb {\theta })^\top x_j) - T_{\varDelta (\pmb {\theta })}(v(\pmb {\theta })^\top x_i)}{\sigma }\right) \\&\quad \times \left( \frac{\vert \mathcal {X}\vert }{\vert (v(\pmb {\theta })^\top \mathcal {X})_{< B}\vert \vert (v(\pmb {\theta })^\top \mathcal {X})_{>B}\vert }\right) \\&\le \big \vert (v(\pmb {\theta })^\top \mathcal {X})_{< B}\big \vert \big \vert (v(\pmb {\theta })^\top \mathcal {X})_{> B}\big \vert k\left( \frac{2M}{\sigma }\right) \\&\quad \times \left( \frac{1}{\vert (v(\pmb {\theta })^\top \mathcal {X})_{< B} \vert \vert (v(\pmb {\theta })^\top \mathcal {X})_{>B}\vert }\right) \\&= k(2M/\sigma ). \end{aligned}$$

The final inequality holds since for any ij s.t. \(v(\pmb {\theta })^\top x_i < B\) and \(v(\pmb {\theta })^\top x_j >B\), we must have \(T_{\varDelta (\pmb {\theta })}(v(\pmb {\theta })^\top x_j) - T_{\varDelta (\pmb {\theta })}(v(\pmb {\theta })^\top x_i) \ge 2M\). Now, for any \(\pmb {\theta }^\prime \in \varTheta \), let \(M_{\pmb {\theta }^\prime } = \max _{c \in \varDelta (\pmb {\theta }^\prime )^\prime }\text{ margin }(v(\pmb {\theta }^\prime ), c)\). By Lemma 4 we know that \(\lambda _2(L(\pmb {\theta }^\prime , \sigma )) \ge \frac{1}{\vert \mathcal {X}\vert ^3} k((2M_{\pmb {\theta }^\prime }+\delta C)/\sigma )\), where \(C = \max \{\text{ Diam }(\mathcal {X}), \text{ Diam }(\mathcal {X})^{1-\delta }\}\). Therefore,

$$\begin{aligned} \lim _{\sigma \rightarrow 0^+}&\frac{\lambda _2(L(\pmb {\theta }, \sigma ))}{\inf _{\pmb {\theta }^\prime \in \varTheta }\{\lambda _2(L(\pmb {\theta }^\prime , \sigma )) \big \vert M_{\pmb {\theta }^\prime } < M - m\}}\\&\le \lim _{\sigma \rightarrow 0^+}\frac{\vert \mathcal {X}\vert ^3 k(2M/\sigma )}{ k((2(M-m)+\delta C)/\sigma )}\\&=0. \end{aligned}$$

Since \(\delta \rightarrow 0\) as \(\sigma \rightarrow 0\), this gives the result. \(\square \)
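The behaviour established by Lemma 6 is easy to observe numerically: for small \(\sigma \), the configuration admitting the larger margin attains the smaller value of \(\lambda _2\). A minimal sketch of ours, with the Laplace kernel \(k(x)=e^{-x}\) (which satisfies the ratio condition above) and T taken as the identity:

```python
import numpy as np

def lambda2_laplacian(p, sigma):
    """lambda_2 of the graph Laplacian with Laplace-kernel similarities."""
    S = np.exp(-np.abs(p[:, None] - p[None, :]) / sigma)
    L = np.diag(S.sum(axis=1)) - S
    return np.sort(np.linalg.eigvalsh(L))[1]

wide   = np.array([0.0, 0.1, 0.2, 1.8, 1.9, 2.0])   # largest margin 0.8
narrow = np.array([0.0, 0.4, 0.8, 1.2, 1.6, 2.0])   # largest margin 0.2
# for small sigma the large-margin configuration has lower spectral connectivity
assert lambda2_laplacian(wide, 0.1) < lambda2_laplacian(narrow, 0.1)
```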

Lemma 7

Let the conditions of Lemma 6 hold. For any \(0< m < \max _{b \in \varDelta (\pmb {\theta })^\prime }\text{ margin }(v(\pmb {\theta }), b)\) there exists \(\sigma ^\prime > 0\) s.t. if \(0< \sigma < \sigma ^\prime \) and \(\pmb {\theta }^\prime \in \varTheta \) satisfies

$$\begin{aligned} \max _{c \in \varDelta (\pmb {\theta }^\prime )^\prime }\text{ margin }(v(\pmb {\theta }^\prime ), c) < \max _{b \in \varDelta (\pmb {\theta })^\prime }\text{ margin }(v(\pmb {\theta }), b) - m \end{aligned}$$

then \(\lambda _2(L_{\mathrm {N}}(\pmb {\theta }, \sigma )) < \lambda _2(L_{\mathrm {N}}(\pmb {\theta }^\prime , \sigma ))\).

Proof

Using a similar approach to that in the proof of Lemma 6, we can arrive at the following.

$$\begin{aligned} \lambda _2&(L_{\mathrm {N}}(\pmb {\theta }, \sigma ))\le \frac{\sum \limits _{\begin{array}{c} i, j : v(\pmb {\theta })^\top x_i< B\\ v(\pmb {\theta })^\top x_j> B \end{array}} k\left( \frac{T_{\varDelta (\pmb {\theta })}(v(\pmb {\theta })^\top x_j) - T_{\varDelta (\pmb {\theta })}(v(\pmb {\theta })^\top x_i)}{\sigma }\right) }{\mathrm {vol}((v(\pmb {\theta })^\top \mathcal {X})_{< B}) \mathrm {vol}((v(\pmb {\theta })^\top \mathcal {X})_{>B})}\\&\le k\left( \frac{2M}{\sigma }\right) \frac{\big \vert (v(\pmb {\theta })^\top \mathcal {X})_{< B}\big \vert \big \vert (v(\pmb {\theta })^\top \mathcal {X})_{> B}\big \vert }{\mathrm {vol}((v(\pmb {\theta })^\top \mathcal {X})_{< B}) \mathrm {vol}((v(\pmb {\theta })^\top \mathcal {X})_{>B})}\\&\le k(2M/\sigma ) \end{aligned}$$

where the final inequality comes from the fact that \(1 < d_i\) for all \(i \in \{1, \ldots , N\}\), and hence vol\(((v(\pmb {\theta })^\top \mathcal {X})_{>B}) \ge \vert (v(\pmb {\theta })^\top \mathcal {X})_{>B}\vert \), and similarly for \((v(\pmb {\theta })^\top \mathcal {X})_{<B}\). The final step in the proof is equivalent to that of Lemma 6, except that \(\vert \mathcal {X}\vert ^3\) is replaced with \(\vert \mathcal {X}\vert ^4\). \(\square \)

Lemmas 6 and 7 show almost immediately that the margin admitted by the optimal projection for spectral bi-partitioning converges to the largest margin through \(\pmb {\varDelta }^\prime \) as \(\sigma \) goes to zero. Theorem 2, which we are now in a position to prove, shows the stronger result that the optimal projection itself converges to the projection admitting the largest margin.

Proof of Theorem 2:

Take any \(\epsilon > 0\). Pavlidis et al. (2016) have shown that \(\exists m_\epsilon > 0\) s.t. for \(w \in \mathbb {R}^d, c \in \mathbb {R}\), \(\Vert (w, c)/\Vert w\Vert - (v(\pmb {\theta }^\star ), b^\star ) \Vert > \epsilon \Rightarrow \text{ margin }(w/\Vert w\Vert , c/\Vert w\Vert ) < \text{ margin }(v(\pmb {\theta }^\star ), b^\star ) - m_\epsilon \). By Lemma 6 we know \(\exists \sigma ^\prime > 0\) s.t. if \(0< \sigma < \sigma ^\prime \) then \(\exists c \in \varDelta (\pmb {\theta }_{\sigma })^\prime \) s.t. \(\text{ margin }(v(\pmb {\theta }_{\sigma }), c) \ge \text{ margin }(v(\pmb {\theta }^\star ), b^\star ) - m_\epsilon \), since \(\pmb {\theta }_{\sigma }\) is optimal for \(\sigma \). Thus, by the above, \(\Vert (v(\pmb {\theta }_{\sigma }), c) - (v(\pmb {\theta }^\star ), b^\star )\Vert \le \epsilon \). But \(\Vert (v(\pmb {\theta }_{\sigma }), c) - (v(\pmb {\theta }^\star ), b^\star )\Vert \ge \Vert v(\pmb {\theta }_{\sigma }) - v(\pmb {\theta }^\star )\Vert \) for any \(c \in {\mathbb {R}}\). Since \(\epsilon > 0\) was arbitrary, we therefore have \(v(\pmb {\theta }_{\sigma }) \rightarrow v(\pmb {\theta }^\star )\) as \(\sigma \rightarrow 0^+\). The proof for \(\pmb {\theta }^N_{\sigma }\) is analogous. \(\square \)

Proof of Lemma 3

The proof of Lemma 3 uses the following result from matrix perturbation theory.

Theorem 8

(Ye 2009) Let \(A = [a_{ij}]\) and \(\tilde{A} = [\tilde{a}_{ij}]\) be two symmetric positive semidefinite diagonally dominant matrices, and let \(\lambda _1 \le \lambda _2 \le \cdots \le \lambda _n\) and \(\tilde{\lambda }_1 \le \tilde{\lambda }_2 \le \cdots \le \tilde{\lambda }_n\) be their respective eigenvalues. If, for some \(0 \le \epsilon < 1\), \(\vert a_{ij} - \tilde{a}_{ij} \vert \le \epsilon \vert a_{ij} \vert \ \forall i \not = j\), and \( \vert v_i - \tilde{v}_i \vert \le \epsilon v_i \ \forall i,\) where \(v_i = a_{ii} - \sum _{j \not = i} \vert a_{ij} \vert \), and similarly for \(\tilde{v}_i\), then

$$\begin{aligned} \vert \lambda _i - \tilde{\lambda }_i \vert \le \epsilon \lambda _i \ \forall i. \end{aligned}$$

An inspection of the proof of Theorem 8 reveals that \(\epsilon < 1\) is necessary only to ensure that the signs of \(a_{ij}\) are the same as those of \(\tilde{a}_{ij}\). In the case of Laplacian matrices this equivalence of signs holds by design, and so in this context the requirement that \(\epsilon < 1\) can be relaxed.
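Theorem 8 can be verified directly on Laplacian matrices: perturbing every off-diagonal similarity by a relative factor of at most \(\epsilon \) moves every eigenvalue by at most the same relative amount. The following sketch is ours, using Laplace-kernel similarities:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.sort(rng.uniform(size=15))
S = np.exp(-np.abs(p[:, None] - p[None, :]) / 0.2)
L = np.diag(S.sum(axis=1)) - S

eps = 0.05
# multiply each off-diagonal similarity by a symmetric factor in [1 - eps, 1 + eps]
F = 1 + eps * (2 * rng.random(S.shape) - 1)
F = (F + F.T) / 2
St = S * F
np.fill_diagonal(St, 1.0)
Lt = np.diag(St.sum(axis=1)) - St

lam = np.sort(np.linalg.eigvalsh(L))
lamt = np.sort(np.linalg.eigvalsh(Lt))
# both Laplacians are PSD and diagonally dominant with v_i = 0, so
# |lambda_i - lambda_i~| <= eps * lambda_i for every i (Ye 2009)
assert np.all(np.abs(lam - lamt) <= eps * lam + 1e-9)
```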

Now, for brevity we drop the notational dependence on \(\pmb {\theta }\). Let \(\mathcal {P}^{c\prime } = \{V^\top c_1, V^\top c_1, \ldots , V^\top c_m, V^\top c_m\}\), where each \(V^\top c_i\) is repeated \(n_i\) times, and let \(P^{c \prime }\) be the corresponding matrix of repeated projected centroids. Let \(L^{c\prime }\) be the Laplacian of the graph with vertices \(\mathcal {P}^{c\prime }\) and similarities given by \(s(P^{c\prime }, i, j)\). We begin by showing that \(\lambda _2(L^{c\prime }) = \lambda _2(N-B)\). Take \(v \in \mathbb {R}^m\), then,

$$\begin{aligned} v^\top (N-B)v&= \sum _{i,j}s(P^c, i, j)(v_i^2n_j - v_iv_j\sqrt{n_in_j})\\&= \frac{1}{2}\sum _{i,j}s(P^c,i,j)(v_i^2n_j+v_j^2n_i-2v_iv_j\sqrt{n_in_j})\\&\ge 0, \end{aligned}$$

and so \(N-B\) is positive semidefinite. In addition, it is straightforward to verify that \((N-B)(\sqrt{n_1} \ \dots \ \sqrt{n_m}) = \mathbf {0}\), and hence 0 is the smallest eigenvalue of \(N-B\) with corresponding eigenvector \((\sqrt{n_1} \ \dots \ \sqrt{n_m})\). Now, let u be the second eigenvector of \(L^{c\prime }\). Then \(u_j = u_k\) for pairs of indices jk aligned with the same \(V^\top c_i\) in \(P^{c\prime }\). Define \(u^c \in \mathbb {R}^m\) s.t. \(u^c_i = \sqrt{n_i}u_j\), where j is any index aligned with \(V^\top c_i\) in \(P^{c\prime }\). Then \((u^c)^\top (\sqrt{n_1} \ \dots \ \sqrt{n_m}) = \sum _{i=1}^m u^c_i \sqrt{n_i} = \sum _{i=1}^m n_i u_{j_i}\), where \(j_i\) is an index aligned with \(V^\top c_i\) in \(P^{c\prime }\) for each i. Therefore \(n_i u_{j_i} = \sum _{j:P^{c\prime }_j = V^\top c_i}u_j\) and hence \((u^c)^\top (\sqrt{n_1} \ \dots \ \sqrt{n_m}) = \sum _{i=1}^m\sum _{j: P^{c\prime }_j = V^\top c_i} u_j = \sum _{i=1}^N u_i = 0\), since \(\mathbf {1}\) is the eigenvector of \(L^{c\prime }\) associated with its smallest eigenvalue and so \(u \perp \mathbf {1}\). Similarly \(\Vert u^c\Vert ^2 = \sum _{i=1}^m n_i u_{j_i}^2 = \sum _{i=1}^N u_i^2 = 1\). Thus \(u^c \perp (\sqrt{n_1} \ \dots \ \sqrt{n_m})\) and \(\Vert u^c\Vert = 1\), and so \(u^c\) is a candidate for the second eigenvector of \(N-B\). In addition it is straightforward to show that \((u^c)^\top (N-B)u^c = u^\top L^{c\prime } u\).

Now, suppose by way of contradiction that \(\exists w \perp (\sqrt{n_1} \ \dots \ \sqrt{n_m})\) with \(\Vert w\Vert =1\) s.t. \(w^\top (N-B)w < (u^c)^\top (N-B)u^c\). Then let \(w^\prime = (w_1/\sqrt{n_1} \ w_1/\sqrt{n_1} \ \dots \ w_m/\sqrt{n_m})\), where each \(w_i/\sqrt{n_i}\) is repeated \(n_i\) times. Then \(\Vert w^\prime \Vert = 1\), \((w^\prime )^\top \mathbf {1} = w^\top (\sqrt{n_1} \ \dots \ \sqrt{n_m}) = 0\) and \((w^\prime )^\top L^{c\prime }w^\prime < u^\top L^{c\prime } u\), a contradiction since u is the second eigenvector of \(L^{c\prime }\).

Now, let ijqr be such that \(x_q \in C_i\) and \(x_r \in C_j\). We temporarily drop the notational dependence on \(\varDelta \). Then,

$$\begin{aligned} \Vert T (V^\top x_q) - T (V^\top x_r)\Vert&= \Vert T (V^\top x_q) - T (V^\top c_i) +T (V^\top c_i)\\&\quad -T (V^\top c_j) +T (V^\top c_j)-T (V^\top x_r)\Vert \\&\le \Vert T (V^\top x_q) - T (V^\top c_i)\Vert \\&\quad +\,\Vert T (V^\top c_i)-T (V^\top c_j)\Vert \\&\quad +\,\Vert T (V^\top c_j)-T(V^\top x_r)\Vert \\&\le \rho _i + \rho _j + D_{ij}, \end{aligned}$$

since T contracts distances and \(\rho _i\) and \(\rho _j\) are the radii of \(C_i\) and \(C_j\). Since k is non-increasing, we therefore have,

$$\begin{aligned}&\frac{k(D_{ij}/\sigma )}{k((D_{ij}-\rho _i-\rho _j)^+/\sigma )}\le \frac{k(D_{ij}/\sigma )}{k(\Vert T(V^\top x_q) - T(V^\top x_r)\Vert /\sigma )}\\&\quad \le \frac{k(D_{ij}/\sigma )}{k((D_{ij}+\rho _i+\rho _j)/\sigma )}\\ \Rightarrow&1-\frac{k(D_{ij}/\sigma )}{k(\Vert T(V^\top x_q) - T(V^\top x_r)\Vert /\sigma )} \le 1-\frac{k(D_{ij}/\sigma )}{k((D_{ij}-\rho _i-\rho _j)^ +/\sigma )}\\&\quad \text{ and }\\&\quad \frac{k(D_{ij}/\sigma )}{k(\Vert T(V^\top x_q) - T(V^\top x_r)\Vert /\sigma )}-1\le \frac{k(D_{ij}/\sigma )}{k((D_{ij}+\rho _i+\rho _j)/\sigma )}-1. \end{aligned}$$

Therefore

$$\begin{aligned}&\left| \frac{ k(D_{ij}/\sigma )}{ k(\Vert T(V^\top x_q) - T(V^\top x_r) \Vert /\sigma )}-1\right| \le \\&\quad \max \left\{ 1- \frac{k(D_{ij}/\sigma )}{k((D_{ij}-\rho _i-\rho _j)^+/\sigma )},\frac{k(D_{ij}/\sigma )}{k((D_{ij}+\rho _i+\rho _j)/\sigma )} - 1\right\} . \end{aligned}$$

Now, we lose no generality by assuming that \(\mathcal {X}\) is ordered such that for each i the elements of cluster \(C_i\) are aligned with \(V^\top c_i\) in \(P^{c\prime }\), since this does not affect the eigenvalues of L, the Laplacian of \(V^\top \mathcal {X}\). By the design of the Laplacian matrix, the \(v_i\) of Theorem 8 are exactly zero. For off-diagonal terms qr, with corresponding ij as above, consider

$$\begin{aligned} \frac{\vert L_{qr} - L^{c\prime }_{qr} \vert }{\vert L_{qr}\vert }&= \frac{\vert k(D_{ij}/\sigma ) - k(\Vert T(V^\top x_q) - T(V^\top x_r) \Vert /\sigma )\vert }{ k(\Vert T(V^\top x_q) - T(V^\top x_r) \Vert /\sigma )}\\&= \left| \frac{ k(D_{ij}/\sigma )}{ k(\Vert T(V^\top x_q) - T(V^\top x_r) \Vert /\sigma )}-1\right| . \end{aligned}$$

Theorem 8 thus gives the result. \(\square \)
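The microcluster approximation of Lemma 3 can be illustrated numerically. With the Laplace kernel \(k(x)=e^{-x}\) and T taken as the identity, the relative perturbation factor above reduces to \(e^{(\rho _i+\rho _j)/\sigma }-1\). The toy microclusters below are our own construction, not drawn from the paper:

```python
import numpy as np

def lambda2_laplacian(p, sigma):
    """lambda_2 of the graph Laplacian with Laplace-kernel similarities."""
    S = np.exp(-np.abs(p[:, None] - p[None, :]) / sigma)
    L = np.diag(S.sum(axis=1)) - S
    return np.sort(np.linalg.eigvalsh(L))[1]

# three microclusters of radius rho = 0.05 around centroids 0.05, 2.05, 4.05
p = np.array([0.0, 0.05, 0.1, 2.0, 2.05, 2.1, 4.0, 4.05, 4.1])
cents, counts = np.array([0.05, 2.05, 4.05]), np.array([3, 3, 3])
sigma = 1.0

exact = lambda2_laplacian(p, sigma)
approx = lambda2_laplacian(np.repeat(cents, counts), sigma)

# eps = exp((rho_i + rho_j)/sigma) - 1 = exp(0.1) - 1 < 0.106
assert abs(exact - approx) <= 0.106 * exact
```

In the implementation itself this eigenvalue need not be computed from the repeated N-by-N Laplacian: the proof above shows it equals \(\lambda _2(N-B)\), an m-by-m problem.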


Cite this article

Hofmeyr, D.P., Pavlidis, N.G. & Eckley, I.A. Minimum spectral connectivity projection pursuit. Stat Comput 29, 391–414 (2019). https://doi.org/10.1007/s11222-018-9814-6


Keywords

  • Spectral clustering
  • Dimension reduction
  • Projection pursuit
  • Maximum margin