Abstract
We study the problem of determining the optimal low-dimensional projection for maximising the separability of a binary partition of an unlabelled dataset, as measured by spectral graph theory. This is achieved by finding projections which minimise the second eigenvalue of the graph Laplacian of the projected data, which corresponds to a non-convex, non-smooth optimisation problem. We show that the optimal univariate projection based on spectral connectivity converges to the vector normal to the maximum margin hyperplane through the data as the scaling parameter is reduced to zero. This establishes a connection between connectivity as measured by spectral graph theory and maximal Euclidean separation. The computational cost associated with each eigen problem is quadratic in the number of data. To mitigate this issue, we propose an approximation method using microclusters with provable approximation error bounds. Combining multiple binary partitions within a divisive hierarchical model allows us to construct clustering solutions admitting clusters with varying scales and lying within different subspaces. We evaluate the performance of the proposed method on a large collection of benchmark datasets and find that it compares favourably with existing methods for projection pursuit and dimension reduction for data clustering. Applying the proposed approach for a decreasing sequence of scaling parameters allows us to obtain large margin clustering solutions, which are found to be competitive with those from dedicated maximum margin clustering algorithms.
Notes
 1. An R implementation of the SCPP algorithm is available at https://github.com/DavidHofmeyr/SCPP.
 2.
 3.
 4.
 5.
 6. We used the implementation provided by the authors, taken from https://sites.google.com/site/binzhao02/.
References
Bach, F.R., Jordan, M.I.: Learning spectral clustering, with application to speech separation. J. Mach. Learn. Res. 7, 1963–2001 (2006)
Bache, K., Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml
Boumal, N., Mishra, B., Absil, P.A., Sepulchre, R.: Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res. 15, 1455–1459 (2014)
Burke, J.V., Lewis, A.S., Overton, M.L.: A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM J. Optim. 15(3), 751–779 (2006)
Chi, Y., Song, X., Zhou, D., Hino, K., Tseng, B.L.: On evolutionary spectral clustering. ACM Trans. Knowl. Discov. Data 3(4), 17:1–17:30 (2009)
Edelman, A., Arias, T., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20(2), 303–353 (1998)
Fan, K.: On a theorem of Weyl concerning eigenvalues of linear transformations I. Proc. Natl. Acad. Sci. USA 35(11), 652 (1949)
Hagen, L., Kahng, A.B.: New spectral methods for ratio cut partitioning and clustering. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 11(9), 1074–1085 (1992)
Hartigan, J.A., Hartigan, P.M.: The dip test of unimodality. Ann. Stat. 13(1), 70–84 (1985)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Texts in Statistics, 2nd edn. Springer, New York (2009)
Hofmeyr, D., Pavlidis, N.: Maximum clusterability divisive clustering. In: 2015 IEEE Symposium Series on Computational Intelligence, pp. 780–786. IEEE (2015)
Hofmeyr, D.: Improving spectral clustering using the asymptotic value of the normalised cut. arXiv preprint arXiv:1703.09975 (2017)
Joachims, T.: Transductive inference for text classification using support vector machines. In: Proceedings of International Conference on Machine Learning (ICML), Bled, Slovenia, vol. 99, pp. 200–209 (1999)
Kaiser, H.F.: The application of electronic computers to factor analysis. Educ. Psychol. Meas. 20(1), 141–151 (1960)
Krause, A., Liebscher, V.: Multimodal projection pursuit using the dip statistic. PreprintReihe Mathematik 13 (2005)
Lewis, A.S., Overton, M.L.: Eigenvalue optimization. Acta Numer. 5, 149–190 (1996)
Lewis, A., Overton, M.: Nonsmooth optimization via quasi-Newton methods. Math. Program. 141, 135–163 (2013)
Magnus, J.R.: On differentiating eigenvalues and eigenvectors. Econ. Theory 1(02), 179–191 (1985)
Ng, A., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: Dietterich, T., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems, vol. 14, pp. 849–856. MIT Press, Cambridge (2002)
Ning, H., Xu, W., Chi, Y., Gong, Y., Huang, T.S.: Incremental spectral clustering by efficiently updating the eigensystem. Pattern Recogn. 43(1), 113–127 (2010)
Niu, D., Dy, J.G., Jordan, M.I.: Dimensionality reduction for spectral clustering. In: International Conference on Artificial Intelligence and Statistics, pp. 552–560 (2011)
Nocedal, J., Wright, S.: Numerical Optimization. Springer, Berlin (2006)
Overton, M.L., Womersley, R.S.: Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Math. Program. 62(1–3), 321–357 (1993)
Pavlidis, N.G., Hofmeyr, D.P., Tasoulis, S.K.: Minimum density hyperplanes. J. Mach. Learn. Res. 17(156), 1–33 (2016)
Peña, D., Prieto, F.J.: Cluster identification using projections. J. Am. Stat. Assoc. 147, 389 (2001)
Polak, E.: On the mathematical foundations of nondifferentiable optimization in engineering design. SIAM Rev. 29(1), 21–89 (1987)
Rahimi, A., Recht, B.: Clustering with normalized cuts is clustering with a hyperplane. Stat. Learn. Comput. Vis. 56, 1 (2004)
Schur, J.: Bemerkungen zur Theorie der beschränkten Bilinearformen mit unendlich vielen Veränderlichen. J. für die reine und angew. Math. 140, 1–28 (1911)
Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
Strehl, A., Ghosh, J.: Cluster ensembles–a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583–617 (2002)
Tong, S., Koller, D.: Restricted Bayes optimal classifiers. In: AAAI/IAAI, pp. 658–664 (2000)
Trillos, N.G., Slepčev, D., Von Brecht, J., Laurent, T., Bresson, X.: Consistency of Cheeger and ratio graph cuts. J. Mach. Learn. Res. 17(1), 6268–6313 (2016)
Vapnik, V.N., Kotz, S.: Estimation of Dependences Based on Empirical Data, vol. 40. Springer, New York (1982)
von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
Wagner, D., Wagner, F.: Between Min Cut and Graph Bisection. Springer, Berlin (1993)
Wang, F., Zhao, B., Zhang, C.: Linear time maximum margin clustering. IEEE Trans. Neural Netw. 21(2), 319–332 (2010)
Weiss, Y.: Segmentation using eigenvectors: a unifying view. In: Proceedings of the 7th IEEE International Conference on Computer Vision, vol. 2, pp. 975–982 (1999)
Weyl, H.: Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Math. Ann. 71(4), 441–479 (1912)
Wolfe, P.: On the convergence of gradient methods under constraint. IBM J. Res. Dev. 16(4), 407–411 (1972)
Xu, L., Neufeld, J., Larson, B., Schuurmans, D.: Maximum margin clustering. In: Advances in Neural Information Processing Systems, pp. 1537–1544 (2004)
Yan, D., Huang, L., Jordan, M.I.: Fast approximate spectral clustering. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 907–916. ACM (2009)
Ye, Q.: Relative perturbation bounds for eigenvalues of symmetric positive definite diagonally dominant matrices. SIAM J. Matrix Anal. Appl. 31(1), 11–17 (2009)
Zelnik-Manor, L., Perona, P.: Self-tuning spectral clustering. In: Advances in Neural Information Processing Systems, pp. 1601–1608 (2004)
Zhang, T., Ramakrishnan, R., Livny, M.: Birch: an efficient data clustering method for very large databases. In: ACM SIGMOD Record, vol. 25, pp. 103–114. ACM (1996)
Zhang, B.: Dependence of clustering algorithm performance on clusteredness of data. Technical Report, 2001-04-17. Hewlett-Packard Labs (2001)
Zhang, K., Tsang, I.W., Kwok, J.T.: Maximum margin clustering made practical. IEEE Trans. Neural Netw. 20(4), 583–596 (2009)
Zhao, Y., Karypis, G.: Empirical and theoretical comparisons of selected criterion functions for document clustering. Mach. Learn. 55(3), 311–331 (2004)
Acknowledgements
The authors would like to thank the anonymous reviewers for their insightful recommendations, which helped improve the quality of the paper. They would also like to thank Dr. Teemu Roos for his valuable comments on this work. Finally, they are very grateful to Dr. Kai Zhang for providing code to implement the iSVR algorithm.
Additional information
David Hofmeyr acknowledges support from the EPSRC-funded EP/H023151/1 STOR-i centre for doctoral training as well as the Oppenheimer Memorial Trust. Idris Eckley was supported by EPSRC Grant EP/N031938/1 (StatScale).
Appendices
Avoiding outliers
It has been documented that spectral clustering can be sensitive to outliers (Rahimi and Recht 2004). Our experience has shown that this problem becomes more pronounced when performing dimension reduction based on the spectral clustering objective, especially in high-dimensional applications. Consider the extreme case where \(d>N\): since the linear system \(V^\top X = P\) is underdetermined, for any P there exists \(\pmb {\theta }\in \varTheta , c \in \mathbb {R}{\setminus }\{0\}\) s.t. \(V(\pmb {\theta })^\top X = cP\). The projected data can therefore be made to have any distribution (up to a scaling constant). In other words, there will always be projections that contain outliers. We have found that even in problems of moderate dimensionality, there often exist projections which induce large separation of a small group of points from the remainder of the data. These projections frequently achieve the minimum spectral connectivity for both Ratio Cut and Normalised Cut.
We have found that the problem of outliers can be mitigated by defining a metric which encourages the induced cluster boundaries to intersect a compact set, \(\pmb {\varDelta }(\pmb {\theta })\), around the mean of the projected data. This is achieved by reducing the distance, relative to the usual Euclidean metric, to points lying outside \(\pmb {\varDelta }(\pmb {\theta })\). Such points, which may be outliers, therefore have increased similarity to all others. We define \(\pmb {\varDelta }(\pmb {\theta }) = \varDelta _1 \times \cdots \times \varDelta _l\), where \(\varDelta _i = [\mu _i - \beta \sigma _{i}, \mu _i + \beta \sigma _{i}]\); \(\mu _i\) and \(\sigma _{i}\) are the mean and standard deviation of the ith component of the projected data; and \(\beta \geqslant 0\) controls the size of \(\pmb {\varDelta }(\pmb {\theta })\). The modified distance metric, \(d(\cdot , \cdot )\), is defined with respect to a continuously differentiable transformation, \(T_{\varDelta }\), of the projected data,
where \(\delta \in (0, 0.5]\) is the distance reducing parameter, and \(c_1\) and \(c_2\) are equal to \(\left( \delta \left( 1-\delta \right) \right) ^{1/\delta }\) and \(\delta c_1^{1-\delta }\), respectively. By construction \(\Vert T_{\varDelta }(p_i) - T_{\varDelta }(p_j)\Vert _2 \le \Vert p_i - p_j\Vert _2\) for any \(p_i,p_j \in \mathbb {R}^l\), with strict inequality when either or both of \(p_i,p_j \notin \pmb {\varDelta }(\pmb {\theta })\).
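The univariate version of the transformation can be sketched directly from the constants \(c_1\) and \(c_2\) above; the following is a reconstruction under those definitions (the function name `t_delta` and the default value of \(\delta\) are our own illustrative choices). The map is the identity on \([a,b]\) and compresses distances outside it, with \(c_1\) and \(c_2\) chosen so that the map is continuously differentiable at the boundary.

```python
import numpy as np

def t_delta(p, a, b, delta=0.3):
    """Distance-reducing transform for univariate projections.

    Sketch reconstructed from the definitions in the text: identity on
    [a, b]; outside that interval distances grow at a reduced rate
    controlled by delta in (0, 0.5]. The constants c1, c2 make the map
    continuously differentiable (slope 1) at the boundary.
    """
    p = np.asarray(p, dtype=float)
    c1 = (delta * (1.0 - delta)) ** (1.0 / delta)
    c2 = delta * c1 ** (1.0 - delta)
    out = p.copy()
    below = p < a
    above = p > b
    # compressed branches below a and above b, continuous at a and b
    out[below] = a - delta * ((a - p[below]) + c1) ** (1.0 - delta) + c2
    out[above] = b + delta * ((p[above] - b) + c1) ** (1.0 - delta) - c2
    return out
```

One can check numerically that the map is the identity inside \([a,b]\), is non-expansive overall, and has slope approaching one at the boundary, consistent with the stated properties.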
Figure 7 illustrates the impact of \(T_{\varDelta }\) on pairwise distances in the univariate case. As shown, distance increases linearly in the interval \(\varDelta \), but outside \(\varDelta \) it increases much more slowly, with the rate being determined by \(\delta \). In the limit as \(\delta \) approaches zero, all points outside \(\varDelta \) are mapped to the boundary of \(\varDelta \). As a result distances between points outside \(\varDelta \) and all other points are much smaller after being transformed through \(T_\varDelta \), and points which can be characterised as outliers in terms of the original projections, \(\mathcal {P}\), do not appear as such in terms of \(T_\varDelta (\mathcal {P})\).
An illustration of the usefulness of this modified metric is provided in Fig. 8. The figure shows two-dimensional projections of the 64-dimensional optical recognition of handwritten digits dataset (Bache and Lichman 2013). The left plots show the true clusters, while the right plots show the clustering assignments based on spectral clustering using the normalised Laplacian (Shi and Malik 2000). Figure 8a shows the projection onto the first two principal components, which are also used as initialisation for our method. There are clearly a few points outlying from the remainder of the data, which are separated by the spectral clustering algorithm. Figure 8b shows the optimal projection from minimising \(\lambda _2(L_{\mathrm {N}}(\pmb {\theta }))\) using the Euclidean metric. The result is that the outlying points have been further separated from the remainder of the data, thereby exacerbating the outlier problem. Finally, Fig. 8c shows the same result but using the modified metric discussed above, with \(\beta = 3\). In this case the projection pursuit is able to find a projection which separates two of the true clusters clearly from the remainder.
Derivatives
Evaluating \(D_{P_i}\lambda _2(\cdot )\)
We first consider the standard Laplacian L and use \(\lambda \) and u to denote the second eigenvalue and corresponding eigenvector. By Eq. (11) we have \(d\lambda = u^\top d(L) u = u^\top d(D) u - u^\top d(A) u\).
Now,
and so,
For the normalised Laplacian, \(L_{\mathrm {N}}\), consider first
We again use \(\lambda \) and u to denote the second eigenvalue and corresponding eigenvector. Using \(LD^{-1/2}u = \lambda D^{1/2}u\),
where in the third step we made use of the fact that \(d(D^{-1/2})DD^{-1/2} + D^{-1/2}d(D)D^{-1/2} + D^{-1/2}Dd(D^{-1/2}) = d(D^{-1/2}DD^{-1/2}) = d(I) = \mathbf {0}\). Therefore,
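The first-order identity \(d\lambda = u^\top d(L) u\) used in these derivations can be checked numerically by finite differences; the following is a minimal sketch for a simple eigenvalue of a symmetric matrix (the particular test matrix and perturbation are arbitrary illustrations, not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

# symmetric test matrix with well-separated eigenvalues
S = rng.standard_normal((6, 6))
L = np.diag(np.arange(6.0)) + 0.05 * (S + S.T)

# symmetric perturbation direction, playing the role of d(L)
E = rng.standard_normal((6, 6))
E = E + E.T

vals, vecs = np.linalg.eigh(L)
lam, u = vals[1], vecs[:, 1]          # second eigenvalue and its eigenvector

eps = 1e-6
lam_eps = np.linalg.eigh(L + eps * E)[0][1]
fd = (lam_eps - lam) / eps            # finite-difference directional derivative
analytic = u @ E @ u                  # the identity d(lambda) = u^T d(L) u
assert abs(fd - analytic) < 1e-3
```

The agreement degrades when the eigenvalue is (nearly) repeated, which is exactly the non-smooth situation the paper's optimality conditions and gradient sampling steps are designed to handle.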
Derivatives of the approximate eigenvalue functions based on microclusters
In the general case we may consider a set of m microclusters with centres \(c_1, \ldots , c_m\) and counts \(n_1, \ldots , n_m\). The derivations we provide are valid for \(n_i = 1 \ \forall i \in \{1, \dots , m\}\), and so apply to the exact formulation of the problem as well. Let \(\pmb {\theta }\in \varTheta \). We find it practically convenient to associate the transformation in Eq. (20), which incorporates the set \(\pmb {\varDelta }(\pmb {\theta })\), with the projection of the microclusters rather than with the computation of similarities. Specifically, we now let \({\mathcal {T}}\) be the transformed projected microcluster centres, i.e.
where each \(t_i\) is repeated \(n_i\) times. The reason for this is that with this formulation the majority of terms in the above sums corresponding to \(\partial \lambda \) (which are now partial derivatives w.r.t. the elements of \({\mathcal {T}}\), and not \(\mathcal {P}\) as before) are zero. Specifically, with this expression for \({\mathcal {T}}\), and letting T be the matrix with columns corresponding to elements in \({\mathcal {T}}\), we have
and similarly for the normalised Laplacian.
In Sect. 3 we expressed \(D_{\pmb {\theta }}\lambda \) via the chain rule decomposition \(D_P\lambda \, D_v P \, D_{\pmb {\theta }} v\), which we can now simply restructure as \(D_T\lambda \, D_v T \, D_{\pmb {\theta }} v\). The compression of \({\mathcal {T}}\) to the size m non-repeated set, \({\mathcal {T}}^C = \{t_1, \ldots , t_m \}\), requires a slight restructuring, as described in Sect. 5. We begin with the standard Laplacian, letting \(T^C\) be the matrix corresponding to \({\mathcal {T}}^C\), and define \(N(\pmb {\theta })\) and \(B(\pmb {\theta })\) as in Lemma 3. That is, \(N(\pmb {\theta })\) is the diagonal matrix with ith diagonal element equal to \(\sum _{j=1}^m n_j k(\Vert t_i - t_j\Vert /\sigma )\) and \(B(\pmb {\theta })_{i,j} = \sqrt{n_i n_j} k(\Vert t_i - t_j\Vert /\sigma )\). The derivative of the second eigenvalue of the Laplacian relies on the corresponding eigenvector, u. However, this vector is not explicitly available, as we only solve the \(m\times m\) eigen problem of \(N(\pmb {\theta }) - B(\pmb {\theta })\). Let \(u^C\) be the second eigenvector of \(N(\pmb {\theta }) - B(\pmb {\theta })\). As in the proof of Lemma 3, if i, j are such that the ith element of \({\mathcal {T}}\) corresponds to the jth microcluster, then \(u^C_j = \sqrt{n_j}u_i\). The derivative of \(\lambda _2(N(\pmb {\theta })-B(\pmb {\theta }))\) with respect to the ith column of \(\pmb {\theta }\), and thus equivalently of the second eigenvalue of the Laplacian, is therefore the vector with jth entry given by
where \(D_{\pmb {\theta }_i}V_i\) is given in Eq. (12) and \(D_{V_i} T^C_i\) is expressed below. We provide expressions for the case where
as in our implementation, where we have again assumed that the data have been centred, i.e. have zero mean. Then \(D_{V_i} T^C_i\) is the \(m \times d\) matrix with jth row equal to,
if \(V_i^\top c_j < -\beta \sigma _{\pmb {\theta }_i}\),
if \(-\beta \sigma _{\pmb {\theta }_i} \le V_i^\top c_j \le \beta \sigma _{\pmb {\theta }_i}\), and
if \(V_i^\top c_j > \beta \sigma _{\pmb {\theta }_i}\). Here \(\varSigma \) is the covariance matrix of the data.
For the normalised Laplacian, the reduced \(m\times m\) eigen problem has precisely the same form as the original \(N\times N\) problem, with the only difference being the introduction of the factors \(n_j n_k\). Specifically, with the derivation in Sect. 3 we can see that the corresponding derivative is as for the standard Laplacian above, except that the coefficients \((u_j^C/\sqrt{n_j} - u_k^C/\sqrt{n_k})^2 n_j n_k\) in Eq. (23) are replaced with \((u_j^C/\sqrt{d_j} - u_k^C/\sqrt{d_k})^2 - \lambda ((u_j^C)^2/d_j + (u_k^C)^2/d_k)\), where \(\lambda \) is the second eigenvalue of the normalised Laplacian, \(u^C\) is the corresponding eigenvector, and \(d_j\) is the degree of the jth element of \({\mathcal {T}}^C\).
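The reduced eigen problem itself can be sketched in a few lines. The construction of \(N(\pmb\theta)\) and \(B(\pmb\theta)\) below follows Lemma 3 as restated above; the Gaussian kernel and the function name `second_eig_micro` are our own illustrative choices. Note that with all counts equal to one, the reduced problem coincides with the exact one, matching the remark that the derivations remain valid when \(n_i = 1\) for all i.

```python
import numpy as np

def second_eig_micro(centres, counts, sigma=1.0):
    """lambda_2 of the reduced m x m problem N - B built from microclusters.

    Sketch following the construction described in the text, with a
    Gaussian kernel k(x) = exp(-x^2 / 2) assumed for illustration.
    """
    t = np.asarray(centres, dtype=float)
    n = np.asarray(counts, dtype=float)
    if t.ndim == 1:
        t = t[:, None]
    dist = np.linalg.norm(t[:, None, :] - t[None, :, :], axis=-1)
    K = np.exp(-((dist / sigma) ** 2) / 2)
    B = np.sqrt(np.outer(n, n)) * K        # B_ij = sqrt(n_i n_j) k(||t_i - t_j|| / sigma)
    N = np.diag(K @ n)                     # i-th entry: sum_j n_j k(||t_i - t_j|| / sigma)
    return np.linalg.eigvalsh(N - B)[1]    # second-smallest eigenvalue
```

With one point per microcluster this reproduces the second eigenvalue of the exact Laplacian \(L = D - A\) exactly; with genuine microclusters it solves an \(m \times m\) problem instead of the full \(N \times N\) one.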
Computational complexity
Here we give a brief discussion of the computational complexity of the proposed method. At each iteration of the gradient descent, computing the projected data matrix, \(P(\pmb {\theta })\), requires \({\mathcal {O}}(Nld)\) operations. Computing all pairwise similarities from elements of the l-dimensional \(\mathcal {P}(\pmb {\theta })\) has computational complexity \({\mathcal {O}}(lN^2)\), and determining both Laplacian matrices, and their associated eigenvalue/eigenvector pairs, adds a further \({\mathcal {O}}(N^2)\) cost. Each evaluation of the objectives \(\lambda _2(L(\pmb {\theta }))\) or \(\lambda _2(L_{\mathrm {N}}(\pmb {\theta }))\) therefore requires \({\mathcal {O}}(lN(N+d))\) operations. In order to compute the gradients of these objectives, the partial derivatives with respect to each element of the projected data matrix need to be calculated. As discussed in relation to the derivatives above, the majority of the terms in the sums in Eqs. (13) and (14) are zero, and in fact each partial derivative can be computed in \({\mathcal {O}}(N)\) time; all such partial derivatives can therefore be computed in \({\mathcal {O}}(lN^2)\) time. The matrix derivatives \(D_{\pmb {\theta }_i} V_i, i=1,\ldots ,l\), in (12) can each be computed with \({\mathcal {O}}(d(d-1))\) operations. Finally, determining the gradients with respect to each column of \(\pmb {\theta }\) involves computing the matrix product \(D_{\pmb {\theta }_i} \lambda = D_{P_i} \lambda D_{V_i} P_i D_{\pmb {\theta }_i} V_i\), where \(D_{P_i} \lambda \in \mathbb {R}^{1\times N}, D_{V_i} P_i \in \mathbb {R}^{N \times d}\) and \(D_{\pmb {\theta }_i} V_i \in \mathbb {R}^{d\times (d-1)}\). This has complexity \({\mathcal {O}}(Nd(d-1))\). The complete gradient calculation therefore requires \({\mathcal {O}}(lN(N+d(d-1)))\) operations.
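The accounting above can be traced through a direct implementation of a single objective evaluation; the following sketch assumes a Gaussian kernel and a single univariate projection (both illustrative choices, not the paper's exact implementation). Note that the dense eigen-decomposition used here costs \({\mathcal {O}}(N^3)\); the \({\mathcal {O}}(N^2)\) figure in the text presumes an iterative solver for a single eigenpair.

```python
import numpy as np

def spectral_objective(X, v, sigma=1.0):
    """One evaluation of lambda_2(L(theta)) for a univariate projection.

    The steps mirror the cost accounting in the text:
    projection O(Nd), pairwise similarities O(N^2), Laplacian O(N^2);
    a dense eigensolver is used here purely for simplicity.
    """
    v = v / np.linalg.norm(v)                   # projection direction on the unit sphere
    p = X @ v                                   # O(Nd) projection
    d2 = (p[:, None] - p[None, :]) ** 2         # O(N^2) pairwise squared distances
    A = np.exp(-d2 / (2 * sigma ** 2))          # Gaussian similarities (assumed kernel)
    L = np.diag(A.sum(axis=1)) - A              # standard graph Laplacian
    return np.linalg.eigvalsh(L)[1]             # second-smallest eigenvalue
```

As expected, a direction along which the data separate into two distant groups yields a far smaller objective value than a direction along which the projections are unimodal.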
We have found that the optimality conditions based on directional derivatives and the gradient sampling steps are seldom, if ever, required, and moreover that these do not constitute the bottleneck in the running time of the method in practice. The complexity of the optimality condition check can be derived along similar lines and is \({\mathcal {O}}(t^2lN(N+d(d-1)))\), where t is the multiplicity of the eigenvalue \(\lambda = \lambda _2(L(\pmb {\theta }))\). The gradient sampling costs simply \({\mathcal {O}}(d)\) times the cost of computing a single gradient. The total complexity of the projection pursuit optimisation depends on the number of iterations of the gradient descent method, where in general this number is bounded for a given accuracy level. For our experiments we use the BFGS (Broyden–Fletcher–Goldfarb–Shanno) algorithm, as this has been found to perform well on non-smooth functions (Lewis and Overton 2013).
Table 3 shows the observed running times for SCPP and DRSC when applied to six datasets which were used in the experiments. To render the comparison relevant, we did not use the microcluster approach to speed up the SCPP algorithm. We considered only subsets of the Opt. Digits and Pen Digits datasets so that run times for DRSC could be obtained in a reasonable amount of time. We used the same subsets as in the experiments for maximum margin clustering. The SCPP algorithm converged in a reasonable amount of time in all cases, despite the absence of the microcluster speedup. DRSC on the other hand took as many as three orders of magnitude longer to run on some datasets. Moreover, it failed to converge in half of the cases considered.
Proofs
Proof of Theorem 2
Before proving Theorem 2, we require some supporting theory which we present below. We will use the notation \(v^\top \mathcal {X} = \{v^\top x_1, \ldots , v^\top x_N\}\), and for a set \({\mathcal {P}} \subset \mathbb {R}\) and \(y \in \mathbb {R}\) we write, for example, \({\mathcal {P}}_{>y}\) for \({\mathcal {P}} \cap (y, \infty )\). Recall that for scaling parameter \(\sigma >0\) we define \(\pmb {\theta }_{\sigma }: = \text{ argmin }_{\pmb {\theta } \in \varTheta } \lambda _2(L(\pmb {\theta }, \sigma ))\), where \(L(\pmb {\theta }, \sigma )\) is as \(L(\pmb {\theta })\) from before, but with an explicit dependence on the scaling parameter. That is, \(\pmb {\theta }_{\sigma }\) defines the projection generating the minimal spectral connectivity of \(\mathcal {X}\) for a given value of \(\sigma \). We define \(\pmb {\theta }_{\sigma }^N\) similarly for the normalised Laplacian.
Recall that we are interested in those hyperplanes which intersect an arbitrary convex set \(\pmb {\varDelta }\). This is because very often the maximum margin hyperplane will separate only a few points from the remainder, as data tend to be more sparse in the tails of the underlying distribution. To account for the potential for hyperplanes with very large margins lying in the tails of the distribution, we make the additional assumption that the distance reducing parameter, \(\delta \), tends to zero along with \(\sigma \).
Lemmas 4 and 5 provide lower bounds on the second eigenvalue of the graph Laplacians of a onedimensional dataset in terms of the largest Euclidean separation of adjacent points which lie within the interval \(\varDelta \), used to represent \(\pmb {\varDelta }(\pmb {\theta })\) in the context of a projection of \(\mathcal {X}\). These lemmas also show how we construct the set \(\pmb {\varDelta }^\prime \). Lemmas 6 and 7 use these results to show that a projection angle \(\pmb {\theta }\in \varTheta \) leads to lower spectral connectivity than all projections admitting smaller maximal margin hyperplanes intersecting \(\pmb {\varDelta }^\prime \) for all pairs \(\sigma , \delta \) sufficiently close to zero.
Lemma 4
Let \(k:{\mathbb {R}}_+ \rightarrow {\mathbb {R}}_+\) be a non-increasing, positive function and let \(\sigma > 0, \delta \in (0, 0.5]\). Let \(\mathcal {P}= \{p_1, \ldots , p_N\}\) be a univariate dataset and let \(\varDelta =[a, b]\) for \(a<b \in \mathbb {R}\). Suppose that \(\vert \mathcal {P}\cap \varDelta \vert \ge 2\) and \(a\ge \min \{\mathcal {P}\}, b\le \max \{\mathcal {P}\}\). Define \(\varDelta ^\prime = [a^\prime , b^\prime ]\), where \(a^\prime = (a+\min \{\mathcal {P}\cap \varDelta \})/2\) and \(b^\prime = (b+\max \{\mathcal {P}\cap \varDelta \})/2\). Let \(M = \max _{x \in \varDelta ^\prime }\{\min _{i=1\dots N}\vert x-p_i \vert \}\). Define \(L(\mathcal {P})\) to be the Laplacian of the graph with vertices \(\mathcal {P}\) and similarities according to \(s(P, i, j) = k(\vert T_{\varDelta }(p_i) - T_{\varDelta }(p_j)\vert /\sigma )\), where \(P \in \mathbb {R}^{1 \times N}\) is the matrix with ith column equal to \(p_i\). Then \(\lambda _2(L(\mathcal {P})) \ge \frac{1}{\vert \mathcal {P}\vert ^3} k((2M+\delta C)/\sigma )\), where \(C = \max \{D, D^{1-\delta }\}\) and \(D = \max \{a-\min \{\mathcal {P}\}, \max \{\mathcal {P}\} - b\}\).
Proof
We can assume that \(\mathcal {P}\) is sorted in increasing order, i.e. \(p_i \le p_{i+1}\), since this does not affect the eigenvalues of \(L(\mathcal {P})\). We first show that \(s(P, i, i+1) \ge k((2M+\delta C)/\sigma )\) for all \(i = 1, \ldots , N-1\). To this end observe that \(\delta \left( x + \left( \delta \left( 1-\delta \right) \right) ^{\frac{1}{\delta }}\right) ^{1-\delta } - \delta \left( \delta \left( 1-\delta \right) \right) ^{\frac{1-\delta }{\delta }} \le \delta \max \{x, x^{1-\delta }\}\) for \(x\ge 0\).

If \(p_i, p_{i+1} \le a\) then \(s(P, i, i+1) = k((T_{\varDelta }(p_{i+1}) - T_{\varDelta }(p_i))/\sigma ) \ge k((T_{\varDelta }(a) - T_{\varDelta }(p_i))/\sigma ) \ge k((2M+\delta C)/\sigma )\) by the definition of C and using the above inequality, since k is non-increasing. The case \(p_i, p_{i+1}\ge b\) is similar.

If \(p_i, p_{i+1} \in \varDelta \) then \(p_i, p_{i+1} \in \varDelta ^\prime \Rightarrow \vert p_i - p_{i+1}\vert \le 2M \Rightarrow s(P, i, i+1) \ge k(2M/\sigma ) \ge k((2M+\delta C)/\sigma )\), since M is the largest margin in \(\varDelta ^\prime \).

If none of the above hold, then we lose no generality in assuming \(p_i < a\), \(a<p_{i+1}<b\), since the case \(a<p_i<b\), \(p_{i+1}>b\) is analogous. We must have \(p_{i+1} = \min \{\mathcal {P}\cap \varDelta \}\) and so \(a^\prime = (a+p_{i+1})/2\). If \(p_{i+1}-a > 2M\) then \(\min _{j=1 \dots N} \vert a^\prime - p_j \vert >M\), a contradiction since \(a^\prime \in \varDelta ^\prime \) and M is the largest margin in \(\varDelta ^\prime \). Therefore \(p_{i+1}-a \le 2M\). In all
$$\begin{aligned} T_{\varDelta }(p_{i+1}) - T_{\varDelta }(p_i)&= (p_{i+1}-a) + \delta (a-p_i+(\delta (1-\delta ))^{\frac{1}{\delta }})^{1-\delta }\\&\quad - \delta (\delta (1-\delta ))^{\frac{1-\delta }{\delta }}\\&\le 2M + \delta C\\ \Rightarrow s(P, i, i+1)&\ge k((2M+\delta C)/\sigma ). \end{aligned}$$
Now, let u be the second eigenvector of \(L(\mathcal {P})\). Then \(\Vert u\Vert = 1\) and \(u\perp \mathbf {1}\), and therefore \(\exists i, j\) s.t. \(u_i - u_j \ge \frac{1}{\sqrt{N}}\). We thus know that there exists m s.t. \(\vert u_m - u_{m+1}\vert \ge \frac{1}{N^{3/2}}\). By von Luxburg (2007, Proposition 1), we know that
since all consecutive pairs \(p_m,\)\(p_{m+1}\) have similarity at least \(k((2M+\delta C)/\sigma )\), by above. Therefore \(\lambda _2(L(\mathcal {P})) \ge \frac{1}{N^3}k((2M+\delta C)/\sigma )\) as required. \(\square \)
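The bound of Lemma 4 can be verified numerically on a toy example. The sketch below assumes a Gaussian kernel (the choice of k is illustrative) and takes \(\varDelta\) equal to the convex hull of \(\mathcal{P}\), so that \(T_{\varDelta}\) acts as the identity on the data and the constant C vanishes.

```python
import numpy as np

# univariate data with a large interior gap
P = np.array([0.0, 0.5, 1.0, 9.0, 9.5, 10.0])
N = len(P)
sigma = 2.0
k = lambda x: np.exp(-x ** 2 / 2)   # an assumed non-increasing, positive kernel

# Delta = [min P, max P], so T_Delta is the identity on P, D = 0 and C = 0.
# Here Delta' = Delta, and the widest gap is (1.0, 9.0), so the largest
# margin is attained at its midpoint: M = 4.
M = 4.0

A = k(np.abs(P[:, None] - P[None, :]) / sigma)
L = np.diag(A.sum(axis=1)) - A
lam2 = np.linalg.eigvalsh(L)[1]

assert lam2 >= k(2 * M / sigma) / N ** 3   # Lemma 4's lower bound
```

The bound is loose (it scales with \(1/N^3\)), but it is exactly what is needed in Lemmas 6 and 7: it prevents \(\lambda_2\) from decaying faster than \(k(2M/\sigma)\) as \(\sigma \rightarrow 0\).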
Lemma 5
Let the conditions of Lemma 4 hold and let \(L_\mathrm {N}(\mathcal {P})\) be the normalised Laplacian of the graph with vertices \(\mathcal {P}\) and similarities \(s(P, i, j) = k(\vert T_{\varDelta }(p_i) - T_{\varDelta }(p_j)\vert /\sigma )\). Then
Proof
The proof is similar to that of Lemma 4, but requires a few simple modifications. Let u be the second eigenvector of \(L_{\mathrm {N}}(\mathcal {P})\). Since \(\Vert u\Vert = 1, \exists i \in \{1, \ldots , N\}\) s.t. \(\vert u_i \vert \ge \frac{1}{\sqrt{N}}\). Suppose without loss of generality that \(u_i \le -\frac{1}{\sqrt{N}}\). Now consider that for all \(j, k \in \{1, \ldots , N\}\) we have \(0 < s(P,j,k) \le 1\) and \(s(P,j,j) = 1\), and so \(1 < \sqrt{d_j} \le \sqrt{N}\) for all \(j \in \{1, \ldots , N\}\). Therefore we have \(u_i/\sqrt{d_i} \le -\frac{1}{N}\). Furthermore, since \(D^{1/2}u \perp \mathbf {1}\) we have \(u_j > 0\) for some \(j \in \{1, \ldots , N\} \Rightarrow u_j/\sqrt{d_j} > 0\). Therefore, \(u_j/\sqrt{d_j} - u_i/\sqrt{d_i} > \frac{1}{N}\). We thus know that \(\exists m \in \{1, \ldots , N\}\) s.t. \( \left| u_m/\sqrt{d_m} - u_{m+1}/\sqrt{d_{m+1}}\right| > \frac{1}{N^2}. \) By von Luxburg (2007, Proposition 3), we know that

where the bound on \(s(P, m, m+1)\) is taken from the proof of Lemma 4. Therefore \(\lambda _2(L_{\mathrm {N}}(\mathcal {P})) \ge \frac{1}{N^4}k((2M+\delta C)/\sigma )\) as required. \(\square \)
In the above we have assumed that \(\varDelta \) is contained within the convex hull of the points \(\mathcal {P}\); however, the results of this section can easily be modified to allow for cases where this does not hold. In particular, if an unconstrained large margin hyperplane is sought, then setting \(\pmb {\varDelta }\) to be arbitrarily large allows for this. We have merely stated the results in the most convenient context for our practical implementation.
The set \(\varDelta ^\prime \) in the above is defined in terms of the onedimensional interval [a, b]. We define the fulldimensional set \(\pmb {\varDelta }^\prime \) along the same lines by,
Here we assume that \(\pmb {\varDelta }\) is contained within the convex hull of the d-dimensional dataset X. Notice that since \(\pmb {\varDelta }\) is convex, we have \(v(\pmb {\theta })^\top \pmb {\varDelta }^\prime = \varDelta (\pmb {\theta })^\prime \). In what follows we show that as \(\sigma \) is reduced to zero, the optimal projection for spectral partitioning converges to the projection admitting the largest margin hyperplane intersecting \(\pmb {\varDelta }^\prime \). If the largest margin hyperplane intersecting \(\pmb {\varDelta }\) also intersects \(\pmb {\varDelta }^\prime \), which is frequently the case although it cannot be verified in advance, then it is not in fact necessary that \(\delta \) tend towards zero. In such cases it only needs to satisfy \(\delta \le 2M/C\) for the corresponding values of M and C over all possible projections. In particular, choosing \(\max \{\text{ Diam }(\mathcal {X}), \text{ Diam }(\mathcal {X})^{1-\delta }\}\) instead of C is appropriate for all projections.
Lemma 6
Let \(\pmb {\theta } \in \varTheta \) and let \(k:\mathbb {R}_+ \rightarrow {\mathbb {R}}_+\) be non-increasing, positive, and satisfy
for all \(\epsilon > 0\). Then for any \(0< m < \max \limits _{b \in \varDelta (\pmb {\theta })^\prime }\text{ margin }(v(\pmb {\theta }), b)\) there exists \(\sigma ^\prime > 0\) s.t. if \(0< \sigma < \sigma ^\prime \) and
then \(\lambda _2(L(\pmb {\theta }, \sigma )) < \lambda _2(L(\pmb {\theta }^\prime , \sigma ))\).
Proof
Let \(B = \text{ argmax }_{b \in \varDelta (\pmb {\theta })^\prime }\text{ margin }(v(\pmb {\theta }), b)\) and let M be the corresponding margin, i.e. \(M = \text{ margin }(v(\pmb {\theta }), B)\). We assume that \(M \not = 0\), since otherwise there is nothing to show. Now, since spectral clustering solves a relaxation of the minimum Normalised Cut problem, we have,
The final inequality holds since for any i, j s.t. \(v(\pmb {\theta })^\top x_i < B\) and \(v(\pmb {\theta })^\top x_j >B\), we must have \(T_{\varDelta (\pmb {\theta })}(v(\pmb {\theta })^\top x_j) - T_{\varDelta (\pmb {\theta })}(v(\pmb {\theta })^\top x_i) \ge 2M\). Now, for any \(\pmb {\theta }^\prime \in \varTheta \), let \(M_{\pmb {\theta }^\prime } = \max _{c \in \varDelta (\pmb {\theta }^\prime )^\prime }\text{ margin }(v(\pmb {\theta }^\prime ), c)\). By Lemma 4 we know that \(\lambda _2(L(\pmb {\theta }^\prime , \sigma )) \ge \frac{1}{\vert \mathcal {X}\vert ^3} k((2M_{\pmb {\theta }^\prime }+\delta C)/\sigma )\), where \(C = \max \{\text{ Diam }(X), \text{ Diam }(X)^{1-\delta }\}\). Therefore,
Since \(\delta \rightarrow 0\) as \(\sigma \rightarrow 0\), this gives the result. \(\square \)
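The mechanism behind Lemma 6 can be illustrated numerically: for a fixed small \(\sigma \), a projection exposing a large margin yields a far smaller second Laplacian eigenvalue than one that mixes the clusters. Below is a minimal sketch with a Gaussian similarity; the dataset, the value of \(\sigma \), and the two candidate projections are illustrative choices of ours, not prescribed by the lemma.

```python
import numpy as np

# Two well-separated clusters in 2D. Projecting onto the x-axis exposes a
# large margin; projecting onto the y-axis interleaves the two clusters.
X = np.array([[-2.2,  0.3], [-2.0, -0.1], [-1.8,  0.2],
              [ 1.8, -0.3], [ 2.0,  0.1], [ 2.2, -0.2]])

def lambda2(p, sigma):
    """Second-smallest eigenvalue of the graph Laplacian of the projected
    points p, with Gaussian similarity k(d) = exp(-d^2 / (2 sigma^2))."""
    d = np.abs(p[:, None] - p[None, :])
    W = np.exp(-d**2 / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)              # no self-loops
    L = np.diag(W.sum(axis=1)) - W
    return np.sort(np.linalg.eigvalsh(L))[1]

sigma = 0.3
lam_x = lambda2(X @ np.array([1.0, 0.0]), sigma)  # large-margin projection
lam_y = lambda2(X @ np.array([0.0, 1.0]), sigma)  # no-margin projection
print(lam_x, lam_y)
```

For this small \(\sigma \) the between-cluster similarities under the x-axis projection are vanishingly small, so its \(\lambda _2\) is essentially zero, while the margin-free y-axis projection retains an order-one value.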
Lemma 7
Let the conditions of Lemma 6 hold. For any \(0< m < \max _{b \in \varDelta (\pmb {\theta })^\prime }\text{ margin }(v(\pmb {\theta }), b)\) there exists \(\sigma ^\prime > 0\) s.t. if \(0< \sigma < \sigma ^\prime \) and
then \(\lambda _2(L_{\mathrm {N}}(\pmb {\theta }, \sigma )) < \lambda _2(L_{\mathrm {N}}(\pmb {\theta }^\prime , \sigma ))\).
Proof
Using a similar approach to that in the proof of Lemma 6, we can arrive at the following.
where the final inequality comes from the fact that \(1 < d_i\) for all \(i \in \{1, \ldots , N\}\), and hence vol\(((v(\pmb {\theta })^\top \mathcal {X})_{>B}) \ge \vert (v(\pmb {\theta })^\top \mathcal {X})_{>B}\vert \), and similarly for \((v(\pmb {\theta })^\top \mathcal {X})_{<B}\). The final step in the proof is equivalent to that of Lemma 6, except that \(\vert \mathcal {X}\vert ^3\) is replaced with \(\vert \mathcal {X}\vert ^4\). \(\square \)
Lemmas 6 and 7 show almost immediately that the margin admitted by the optimal projection for spectral bipartitioning converges to the largest margin through \(\pmb {\varDelta }^\prime \) as \(\sigma \) goes to zero. Theorem 2, which we are now in a position to prove, shows the stronger result that the optimal projection itself converges to the projection admitting the largest margin.
Proof of Theorem 2:
Take any \(\epsilon > 0\). Pavlidis et al. (2016) have shown that \(\exists m_\epsilon > 0\) s.t. for \(w \in \mathbb {R}^d, c \in \mathbb {R}\), \(\Vert (w, c)/\Vert w\Vert - (v(\pmb {\theta }^\star ), b^\star ) \Vert > \epsilon \Rightarrow \)margin\((w/\Vert w\Vert , c/\Vert w\Vert ) < \) margin\((v(\pmb {\theta }^\star ), b^\star ) - m_\epsilon \). By Lemma 6 we know \(\exists \sigma ^\prime > 0\), \(\delta ^\prime >0\) s.t. if \(0< \sigma < \sigma ^\prime \) then \(\exists c \in \varDelta (\pmb {\theta })\) s.t. margin\((v(\pmb {\theta }_{\sigma }), c) \ge \) margin\((v(\pmb {\theta }^\star ), b^\star ) - m_\epsilon \), since \(\pmb {\theta }_{\sigma }\) is optimal for \(\sigma \). Thus, by the above, \(\Vert (v(\pmb {\theta }_{\sigma }), c) - (v(\pmb {\theta }^\star ), b^\star )\Vert \le \epsilon \). But \(\Vert (v(\pmb {\theta }_{\sigma }), c) - (v(\pmb {\theta }^\star ), b^\star )\Vert \ge \Vert v(\pmb {\theta }_{\sigma }) - v(\pmb {\theta }^\star )\Vert \) for any \(c \in {\mathbb {R}}\). Since \(\epsilon > 0\) was arbitrary, we therefore have \(v(\pmb {\theta }_{\sigma }) \rightarrow v(\pmb {\theta }^\star )\) as \(\sigma \rightarrow 0^+\). The proof for \(\pmb {\theta }^N_{\sigma }\) is analogous. \(\square \)
Proof of Lemma 3
The proof of Lemma 3 uses the following result from matrix perturbation theory.
Theorem 8
(Ye 2009) Let \(A = [a_{ij}]\) and \(\tilde{A} = [\tilde{a}_{ij}]\) be two symmetric positive semidefinite diagonally dominant matrices, and let \(\lambda _1 \le \lambda _2 \le \cdots \le \lambda _n\) and \(\tilde{\lambda }_1 \le \tilde{\lambda }_2 \le \cdots \le \tilde{\lambda }_n\) be their respective eigenvalues. If, for some \(0 \le \epsilon < 1\), \(\vert a_{ij} - \tilde{a}_{ij} \vert \le \epsilon \vert a_{ij} \vert \ \forall i \not = j\), and \( \vert v_i - \tilde{v}_i \vert \le \epsilon v_i \ \forall i,\) where \(v_i = a_{ii} - \sum _{j \not = i} \vert a_{ij} \vert \), and similarly for \(\tilde{v}_i\), then
An inspection of the proof of Theorem 8 reveals that \(\epsilon < 1\) is necessary only to ensure that the signs of \(a_{ij}\) are the same as those of \(\tilde{a}_{ij}\). In the case of Laplacian matrices this equivalence of signs holds by design, and so in this context the requirement that \(\epsilon < 1\) can be relaxed.
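In the Laplacian setting this relative-perturbation behaviour can be checked directly: if every edge weight is perturbed by at most a relative factor \(\epsilon \), then by monotonicity of the Laplacian quadratic form in the weights, \((1-\epsilon )L \preceq \tilde{L} \preceq (1+\epsilon )L\), so every eigenvalue moves by at most a factor \(\epsilon \). The sketch below verifies this implied bound \(\vert \lambda _i - \tilde{\lambda }_i\vert \le \epsilon \lambda _i\) on a random example; the weights, \(\epsilon \), and the sandwich argument itself are ours and do not reproduce Ye's proof.

```python
import numpy as np

rng = np.random.default_rng(0)

# A symmetric positive weight matrix and an entrywise relative perturbation.
n, eps = 8, 0.3
W = rng.uniform(0.1, 1.0, size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
S = np.sign(rng.standard_normal((n, n)))
S = np.triu(S, 1) + np.triu(S, 1).T          # symmetric signs
Wt = W * (1 + eps * S)                       # |Wt - W| <= eps * W entrywise

def laplacian(W):
    return np.diag(W.sum(axis=1)) - W

lam = np.sort(np.linalg.eigvalsh(laplacian(W)))
lamt = np.sort(np.linalg.eigvalsh(laplacian(Wt)))

# Sandwich (1 - eps) L <= Lt <= (1 + eps) L in the PSD order gives
# (1 - eps) lam_i <= lamt_i <= (1 + eps) lam_i for every sorted index i.
ok = np.all(lamt <= (1 + eps) * lam + 1e-10) and np.all(lamt >= (1 - eps) * lam - 1e-10)
print(ok)
```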
Now, for brevity we drop the notational dependence on \(\pmb {\theta }\). Let \(\mathcal {P}^{c\prime } = \{V^\top c_1, V^\top c_1, \ldots , V^\top c_m, V^\top c_m\}\), where each \(V^\top c_i\) is repeated \(n_i\) times, and let \(P^{c \prime }\) be the corresponding matrix of repeated projected centroids. Let \(L^{c\prime }\) be the Laplacian of the graph with vertices \(\mathcal {P}^{c\prime }\) and edges given by \(s(P^{c\prime }, i, j)\). We begin by showing that \(\lambda _2(L^{c\prime }) = \lambda _2(N-B)\). Take \(v \in \mathbb {R}^m\), then,
and so \(N-B\) is positive semidefinite. In addition, it is straightforward to verify that \((N-B)(\sqrt{n_1} \ \dots \ \sqrt{n_m}) = \mathbf {0}\), and hence 0 is the smallest eigenvalue of \(N-B\), with corresponding eigenvector \((\sqrt{n_1} \ \dots \ \sqrt{n_m})\). Now, let u be the second eigenvector of \(L^{c\prime }\). Then \(u_j = u_k\) for pairs of indices j, k aligned with the same \(V^\top c_i\) in \(P^{c\prime }\). Define \(u^c \in \mathbb {R}^m\) s.t. \(u^c_i = \sqrt{n_i}u_j\), where index j is aligned with \(V^\top c_i\) in \(P^{c\prime }\). Then \((u^c)^\top (\sqrt{n_1} \ \dots \ \sqrt{n_m}) = \sum _{i=1}^m u^c_i \sqrt{n_i} = \sum _{i=1}^m n_i u_{j_i}\), where index \(j_i\) is aligned with \(V^\top c_i\) in \(P^{c\prime }\) for each i. Therefore \(n_i u_{j_i} = \sum _{j:P^{c\prime }_j = V^\top c_i}u_j\) and hence \((u^c)^\top (\sqrt{n_1} \ \dots \ \sqrt{n_m}) = \sum _{i=1}^m\sum _{j: P^{c\prime }_j = V^\top c_i} u_j = \sum _{i=1}^N u_i = 0\), since \(\mathbf {1}\) is the eigenvector corresponding to the smallest eigenvalue of \(L^{c\prime }\) and so \(u \perp \mathbf {1}\). Similarly \(\Vert u^c\Vert ^2 = \sum _{i=1}^m n_i u_{j_i}^2 = \sum _{i=1}^N u_i^2 = 1\). Thus \(u^c \perp (\sqrt{n_1} \ \dots \ \sqrt{n_m})\) and \(\Vert u^c\Vert = 1\), and so \(u^c\) is a candidate for the second eigenvector of \(N-B\). In addition it is straightforward to show that \((u^c)^\top (N-B)u^c = u^\top L^{c\prime } u\). Now, suppose by way of contradiction that \(\exists w \perp (\sqrt{n_1} \ \dots \ \sqrt{n_m})\) with \(\Vert w\Vert =1\) s.t. \(w^\top (N-B)w < (u^c)^\top (N-B)u^c\). Then let \(w^\prime = (w_1/\sqrt{n_1} \ w_1/\sqrt{n_1} \ \dots \ w_m/\sqrt{n_m})\), where each \(w_i/\sqrt{n_i}\) is repeated \(n_i\) times. Then \(\Vert w^\prime \Vert = 1\), \((w^\prime )^\top \mathbf {1} = w^\top (\sqrt{n_1} \ \dots \ \sqrt{n_m}) = 0\) and \((w^\prime )^\top L^{c\prime }w^\prime < u^\top L^{c\prime } u\), a contradiction since u is the second eigenvector of \(L^{c\prime }\).
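The reduction above can be checked numerically. Taking \(N_{ii} = \sum _j n_j k(\Vert V^\top c_i - V^\top c_j\Vert )\) and \(B_{ij} = \sqrt{n_i n_j}\, k(\Vert V^\top c_i - V^\top c_j\Vert )\) (our reading of the construction, with a Gaussian kernel and illustrative centroids), the subspace of vectors constant on each group of repeated centroids is invariant under \(L^{c\prime }\), so the spectrum of the \(m \times m\) matrix \(N - B\) is contained in that of the \(N \times N\) Laplacian:

```python
import numpy as np

# Projected micro-cluster centroids and their sizes (illustrative values).
c = np.array([-2.0, 0.5, 3.0])      # univariate projected centroids V^T c_i
n = np.array([3, 2, 4])             # n_i points per micro-cluster
k = lambda d: np.exp(-d**2)         # nonincreasing positive kernel (our choice)

# Full Laplacian on the repeated centroids (no self-loops).
p = np.repeat(c, n)
W = k(np.abs(p[:, None] - p[None, :]))
np.fill_diagonal(W, 0.0)
L_full = np.diag(W.sum(axis=1)) - W

# m x m reduction: N_mat = diag(sum_j n_j k_ij), B_ij = sqrt(n_i n_j) k_ij.
K = k(np.abs(c[:, None] - c[None, :]))
N_mat = np.diag(K @ n)
B = np.sqrt(np.outer(n, n)) * K
R = N_mat - B

lam_full = np.sort(np.linalg.eigvalsh(L_full))
lam_red = np.sort(np.linalg.eigvalsh(R))

# (sqrt(n_1), ..., sqrt(n_m)) is a null vector of N - B, and every
# eigenvalue of the reduced matrix appears in the full spectrum.
print(np.abs(R @ np.sqrt(n)).max())
print(all(np.min(np.abs(lam_full - l)) < 1e-8 for l in lam_red))
```

In this example the second eigenvalue of the full Laplacian is attained by a group-constant eigenvector, so it coincides with the second eigenvalue of the reduced matrix.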
Now, let i, j, q, r be such that \(x_q \in C_i\) and \(x_r \in C_j\). We temporarily drop the notational dependence on \(\varDelta \). Then,
since T contracts distances and \(\rho _i\) and \(\rho _j\) are the radii of \(C_i\) and \(C_j\). Since k is nonincreasing, we therefore have,
Therefore
Now, we lose no generality by assuming that \(\mathcal {X}\) is ordered such that for each i the elements of cluster \(C_i\) are aligned with \(V^\top c_i\) in \(P^{c\prime }\), since this does not affect the eigenvalues of the Laplacian of \(V^\top \mathcal {X}\), L. By the design of the Laplacian matrix the “\(v_i\)” of Theorem 8 are exactly zero. For off-diagonal terms q, r with corresponding i, j as above, consider
Theorem 8 thus gives the result.
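Putting the ingredients of Lemma 3 together numerically: replacing each point by its micro-cluster centroid perturbs every edge weight by some relative factor \(\epsilon \) (controlled by the radii via the kernel bounds above), and the perturbation result then bounds the relative change in \(\lambda _2\). The sketch below computes the empirical \(\epsilon \) and checks the implied eigenvalue bound; the data, kernel, and clustering are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)

# Univariate projected data in three tight micro-clusters (illustrative).
centres = np.array([-2.0, 0.0, 2.5])
X = np.concatenate([ctr + 0.05 * rng.standard_normal(30) for ctr in centres])
labels = np.repeat(np.arange(3), 30)
cent = np.array([X[labels == i].mean() for i in range(3)])

k = lambda d: np.exp(-d**2 / 2)

def lap(points):
    W = k(np.abs(points[:, None] - points[None, :]))
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W, W

L, W = lap(X)
Lc, Wc = lap(cent[labels])   # every point replaced by its micro-cluster centroid

# Empirical relative weight perturbation eps = max |w~ - w| / w.
mask = ~np.eye(len(X), dtype=bool)
eps = np.max(np.abs(Wc[mask] - W[mask]) / W[mask])

lam2 = np.sort(np.linalg.eigvalsh(L))[1]
lam2c = np.sort(np.linalg.eigvalsh(Lc))[1]

# Relative eigenvalue bound implied by the entrywise weight bound.
print(eps, abs(lam2c - lam2) <= eps * lam2 + 1e-10)
```

Shrinking the micro-cluster radii shrinks \(\epsilon \), which is how the approximation error bound improves as the micro-clusters become finer.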
Hofmeyr, D.P., Pavlidis, N.G. & Eckley, I.A. Minimum spectral connectivity projection pursuit. Stat Comput 29, 391–414 (2019). https://doi.org/10.1007/s11222-018-9814-6
Keywords
 Spectral clustering
 Dimension reduction
 Projection pursuit
 Maximum margin