Fast support vector clustering
Abstract
Support-based clustering has recently attracted a great deal of attention because of its applications to difficult and diverse clustering and outlier detection problems. Support-based clustering methods proceed in two phases: finding the domain of novelty and performing the clustering assignment. To find the domain of novelty, the training time required by current solvers is typically over-quadratic in the training size, which impedes the application of support-based clustering to large-scale datasets. In this paper, we propose applying the stochastic gradient descent framework to the first phase of support-based clustering, finding the domain of novelty in the form of a half-space, together with a new strategy for performing the clustering assignment. We validate our proposed method on several well-known clustering datasets and show that it renders clustering quality comparable to the baselines while being faster than them.
Keywords
Support vector clustering · Cluster analysis · Kernel method
1 Introduction
Cluster analysis is a fundamental problem in pattern recognition where objects are categorized into groups or clusters based on pairwise similarities between those objects such that two criteria, homogeneity and separation, are achieved [21]. Two challenges in cluster analysis are (1) dealing with complicated data containing nested or hierarchical structures; and (2) automatically detecting the number of clusters. Recently, support-based clustering, e.g., support vector clustering (SVC) [1], has drawn significant research attention because of its applications to difficult and diverse clustering and outlier detection problems [1, 2, 8, 10, 11, 15, 23]. These clustering methods have two main advantages compared with other clustering methods: (1) the ability to generate clustering boundaries of arbitrary shape and to automatically discover the number of clusters; and (2) the capability to handle outliers well.
Support-based clustering methods always proceed in two phases. In the first phase, the domain of novelty, e.g., an optimal hypersphere [1, 9, 22] or hyperplane [18], is discovered in the feature space. When mapped back into the input space, the domain of novelty becomes a set of contours tightly enclosing the data, which can be interpreted as cluster boundaries. However, this set of contours does not specify how to assign a data sample to its cluster. In addition, the computational complexity of the current solvers [3, 7] for finding the domain of novelty is often over-quadratic [4]. Such a computational complexity impedes the usage of support-based clustering methods on real-world datasets. In the second phase, namely clustering assignment, data samples are assigned to their clusters based on the geometric information carried in the set of contours harvested from the first phase. Several works have been proposed to improve the cluster assignment procedure [2, 8, 11, 15, 23].
Recently, stochastic gradient descent (SGD) frameworks [6, 19, 20] have emerged as building blocks for developing learning methods that efficiently handle large-scale datasets. SGD-based algorithms have the following advantages: (1) they are very fast; (2) they can run in online mode; and (3) they do not require loading the entire dataset into main memory during training. In this paper, we conjoin the advantages of SGD with support-based clustering. In particular, we propose to use the optimal hyperplane as the domain of novelty. The margin, i.e., the distance from the origin to the optimal hyperplane, is maximized to make the contours enclose the data as tightly as possible. We subsequently apply the stochastic gradient descent framework proposed in [19] to the first phase of support-based clustering to obtain the domain of novelty. Finally, we propose a new strategy for clustering assignment in which each data sample in the extended decision boundary follows its own trajectory to an equilibrium point, and clustering assignment is then reduced to the same task for those equilibrium points. Our clustering assignment strategy differs from the existing works of [8, 11, 12, 13] in how each trajectory is initialized and in the initial set of data samples that must follow a trajectory to their corresponding equilibrium points. Experiments on real-world datasets show that our proposed method produces clustering quality comparable with other support-based clustering methods while achieving a substantial computational speedup.
The contributions of this paper can be summarized as follows.
- Different from the works of [1, 2, 11, 15, 23], which employ a hypersphere to characterize the domain of novelty, we propose using a hyperplane. This allows us to introduce an SGD-based solution for finding the domain of novelty.
- We propose an SGD-based solution for finding the domain of novelty and perform a rigorous convergence analysis for it. We note that the works of [1, 2, 11, 15, 23] utilized the Sequential-Minimal-Optimization-based approach [17] to find the domain of novelty, whose computational complexity is over-quadratic and which requires loading the entire Gram matrix into main memory.
- We propose a new clustering assignment strategy which reduces the clustering assignment for the N samples in the entire training set to the same task for M equilibrium points, where M is usually very small compared with N.
- Compared with the conference version [16], this paper presents a more rigorous convergence analysis with full proofs and explanations. In addition, it introduces a new strategy for clustering assignment. Regarding the experiments, it compares with more baselines and reports more experimental results.
2 Stochastic gradient descent large margin one-class support vector machine
2.1 Large margin one-class support vector machine
2.2 SGD-based solution in the primal form
To efficiently solve the optimization problem in Eq. (1), we use the stochastic gradient descent method. We name the resulting method stochastic gradient descent large margin one-class support vector machine (SGD-LMSVC).
At the \(t\)th round, we sample a data point \(x_{n_{t}}\) from the dataset \(\mathcal {D}\). Let us define the instantaneous function \(g_{t}\left( \mathbf {w}\right) \triangleq \frac{1}{2}\left\| \mathbf {w}\right\| ^{2}+C\max \left\{ 0,1-\mathbf {w}^{\mathsf {T}}\phi \left( x_{n_{t}}\right) \right\} \). It is obvious that \(g_{t}(\mathbf {w})\) is 1-strongly convex w.r.t. the norm \(\left\| \cdot \right\| _{2}\) over the feature space.
Algorithm 1 is proposed to find the optimal hyperplane, which defines the domain of novelty. At each round, one data sample is uniformly sampled from the training set and the update rule in Eq. (2) is applied to determine the next hyperplane, i.e., \(\mathbf {w}_{t+1}\). Finally, the last hyperplane, i.e., \(\mathbf {w}_{T+1}\), is output as the optimal hyperplane. According to the theory presented in the next section, we could output any intermediate hyperplane and an approximately optimal solution would still be guaranteed after sufficiently long training. Nonetheless, in Algorithm 1 we output the last hyperplane so as to exploit as much as possible the information accumulated through the iterations. It is worth noting that in Algorithm 1 we store \(\mathbf {w}_{t}\) as \(\mathbf {w}_{t}=\sum _{i}\alpha _{i}\phi \left( x_{i}\right) \).
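To make the per-round step concrete, here is a minimal sketch assuming a linear feature map \(\phi(x)=x\) for illustration (the paper works in a kernel-induced feature space, storing \(\mathbf {w}_{t}\) as \(\sum _{i}\alpha _{i}\phi \left( x_{i}\right)\)); the step size \(\eta_t = 1/t\) follows from the 1-strong convexity of \(g_t\), and the exact form of Eq. (2) may differ:

```python
import numpy as np

def sgd_lmsvc(X, C=1.0, T=1000, rng=None):
    """Per-round SGD on g_t(w) = 1/2 ||w||^2 + C max(0, 1 - w.phi(x)),
    with a linear feature map phi(x) = x for illustration.  The step
    size eta_t = 1/t comes from the 1-strong convexity of g_t."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        x = X[rng.integers(n)]      # uniformly sample one training point
        eta = 1.0 / t
        if w @ x < 1:               # hinge term active: margin violated
            grad = w - C * x
        else:
            grad = w
        w = w - eta * grad          # w_{t+1} = w_t - eta_t * grad
    return w
```

In the kernelized algorithm the same update is carried out on the coefficients \(\alpha_i\) rather than on an explicit weight vector.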
2.3 Convergence analysis
In this section, we present the convergence analysis of Algorithm 1. We assume that the data are bounded in the feature space, that is, \(\left\| \phi \left( x\right) \right\| \le R,\;\forall x\in \mathcal {X}\). We denote by \(\mathbf {w}^{*}\) the optimal solution, that is, the minimizer of the objective function in Eq. (1). We derive as follows.
Lemma 1 establishes a bound on \(\left\| \mathbf {w}_{T}\right\| \), followed by Lemma 2 which establishes a bound on \(\left\| \lambda _{T}\right\| \).
Lemma 1
Proof
Lemma 2
Proof
Theorem 1 establishes a bound on regret and shows that Algorithm 1 has the convergence rate \(\text {O}\left( \frac{\log \,T}{T}\right) \).
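For 1-strongly convex instantaneous objectives whose subgradients are bounded by some constant \(G\) (which here scales with \(C\) and \(R\)), logarithmic-regret bounds of the following general shape, in the spirit of [19], underlie the stated rate; the constants are illustrative rather than the theorem's exact ones:

```latex
\frac{1}{T}\sum_{t=1}^{T}g_{t}\left(\mathbf{w}_{t}\right)
-\min_{\mathbf{w}}\frac{1}{T}\sum_{t=1}^{T}g_{t}\left(\mathbf{w}\right)
\le \frac{G^{2}\left(1+\log T\right)}{2T}
= \mathrm{O}\!\left(\frac{\log T}{T}\right)
```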
Theorem 1
Proof
Theorem 1 states the inequality for the average solution in expectation. In the following theorem, we prove that if a single-point solution is output, the corresponding inequality holds with high probability.
Theorem 2
Proof
3 Clustering assignment
Visual comparison of SGD-LMSVC (the orange region is the domain of novelty) with C-Means and Fuzzy C-Means on the two-ring dataset
Our proposed clustering assignment procedure differs from the existing procedure proposed in [1]. The procedure of [1] requires running an \(m=20\) sample-point test for every edge connecting \(x_{i}\) and \(x_{j}\) (\(i\ne j\)) in the training set. Consequently, the computational cost incurred is \(\text {O}\left( N\left( N-1\right) ms\right) \), where s is the sparsity level of the decision function (i.e., the number of vectors in the model). Our proposed procedure needs to perform the \(m=20\) sample-point test only for a reduced set of M data samples (i.e., the set of equilibrium points \(\left\{ e_{1},e_{2},\ldots ,e_{M}\right\} \)), where M is typically very small compared with N. The reason is that many data points in the training set may converge to a common equilibrium point, which significantly reduces the size from N to M. The computational cost incurred is therefore \(\text {O}\left( M\left( M-1\right) ms\right) \).
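The m sample-point test between two equilibrium points can be sketched as follows; `decision` is a hypothetical handle for the learned decision function, with \(decision(x) \ge 0\) meaning x lies inside the domain of novelty:

```python
import numpy as np

def connected(e_i, e_j, decision, m=20):
    """m sample-point test: e_i and e_j are placed in the same cluster
    iff all m points sampled on the segment between them stay inside
    the domain of novelty, i.e., decision(x) >= 0."""
    for k in range(1, m + 1):
        # k-th interior point on the segment from e_i to e_j
        x = e_i + (k / (m + 1)) * (e_j - e_i)
        if decision(x) < 0:
            return False
    return True
```

Running this test over the M(M-1)/2 pairs of equilibrium points and taking connected components of the resulting graph then labels every training point through the equilibrium point it converges to.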
4 Experiments
4.1 Visual experiment
Visual comparison of SGD-LMSVC (the orange region is the domain of novelty) with C-Means and Fuzzy C-Means on two-moon dataset
4.2 Experiments on real datasets
To further demonstrate the performance of the proposed algorithm, we conduct experiments on real datasets. Clustering is fundamentally an unsupervised learning task and, therefore, there is no single perfect measure for comparing two clustering algorithms. We examine five typical clustering validity indices (CVIs): compactness, purity, rand index, Davies–Bouldin index (DB index), and normalized mutual information (NMI). A good clustering algorithm should produce a solution with high purity, rand index, and NMI, and with low compactness and DB index.
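As an illustration, purity, one of the label-based CVIs above, can be computed as follows; this is the standard definition, though the paper's exact implementation is not shown:

```python
import numpy as np

def purity(labels_pred, labels_true):
    """Purity: assign each predicted cluster its majority ground-truth
    class and report the fraction of correctly matched points."""
    matched = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        matched += np.bincount(members).max()  # size of the majority class
    return matched / len(labels_true)
```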
4.2.1 Clustering validity index
SGD-LMSVC (the orange region is the domain of novelty) can recognize clusters drawn from a mixture of four Gaussian distributions
A clustering with small compactness is preferred: it means the average intra-cluster distance is small and homogeneity is thereby good, i.e., two objects in the same cluster are highly similar to each other.
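A minimal sketch of a centroid-based compactness computation; the paper does not spell out its exact formula, so this illustrative variant averages each point's distance to its cluster centroid (lower is better):

```python
import numpy as np

def compactness(X, labels):
    """Average, over clusters, of the mean distance from each member
    to its cluster centroid (lower is better)."""
    per_cluster = []
    for c in np.unique(labels):
        pts = X[labels == c]
        centroid = pts.mean(axis=0)
        per_cluster.append(np.linalg.norm(pts - centroid, axis=1).mean())
    return float(np.mean(per_cluster))
```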
The statistics of the experimental datasets
| Datasets | Size | Dimension | #Classes |
|---|---|---|---|
| Aggregation | 788 | 2 | 7 |
| Breast cancer | 699 | 9 | 2 |
| Compound | 399 | 2 | 6 |
| D31 | 3100 | 2 | 31 |
| Flame | 240 | 2 | 2 |
| Glass | 214 | 9 | 7 |
| Iris | 150 | 4 | 3 |
| Jain | 373 | 2 | 2 |
| Pathbased | 300 | 2 | 3 |
| R15 | 600 | 2 | 15 |
| Spiral | 312 | 2 | 3 |
| Abalone | 4177 | 8 | 28 |
| Car | 1728 | 6 | 4 |
| Musk | 6598 | 198 | 2 |
| Shuttle | 43,500 | 9 | 5 |
The purity, rand index, and NMI of the clustering methods on the experimental datasets
| Datasets | Purity | | | Rand index | | | NMI | | |
|---|---|---|---|---|---|---|---|---|---|
| | SVC | SGD | FSVC | SVC | SGD | FSVC | SVC | SGD | FSVC |
| Aggregation | 1.00 | 1.00 | 0.22 | 1.00 | 1.00 | 0.22 | 0.69 | 0.75 | 0.60 |
| Breast cancer | 0.98 | 0.99 | 0.99 | 0.82 | 0.85 | 0.81 | 0.22 | 0.55 | 0.45 |
| Compound | 0.66 | 0.62 | 0.13 | 0.92 | 0.88 | 0.25 | 0.51 | 0.81 | 0.45 |
| Flame | 0.86 | 0.87 | 0.03 | 0.75 | 0.76 | 0.03 | 0.55 | 0.51 | 0.05 |
| Glass | 0.5 | 0.71 | 0.65 | 0.77 | 0.91 | 0.54 | 0.60 | 0.44 | 0.53 |
| Iris | 1.00 | 1.00 | 0.68 | 0.97 | 0.96 | 0.69 | 0.63 | 0.75 | 0.71 |
| Jain | 0.37 | 0.46 | 0.69 | 0.7 | 0.71 | 0.77 | 0.53 | 0.31 | 1.00 |
| Pathbased | 0.6 | 0.5 | 1.00 | 0.81 | 0.94 | 1.00 | 0.48 | 0.43 | 0.12 |
| R15 | 0.88 | 0.9 | 0.37 | 0.74 | 0.71 | 0.37 | 0.67 | 0.77 | 0.77 |
| Spiral | 0.09 | 0.33 | 0.53 | 0.15 | 0.94 | 0.75 | 0.52 | 0.34 | 0.16 |
| D31 | 0.94 | 0.99 | 0.42 | 0.88 | 0.81 | 0.54 | 0.45 | 0.50 | 0.38 |
| Abalone | 0.22 | 0.44 | 0.03 | 0.43 | 0.86 | 0.12 | 0.22 | 0.34 | 0.07 |
| Car | 0.94 | 0.95 | 0.70 | 0.46 | 0.46 | 0.54 | 0.32 | 0.32 | 0.24 |
| Musk | 0.87 | 0.68 | 0.88 | 0.26 | 0.28 | 0.26 | 0.21 | 0.16 | 0.23 |
| Shuttle | 0.06 | 0.05 | 0.06 | 0.84 | 0.83 | 0.75 | 0.26 | 0.41 | 0.50 |
We perform experiments on 15 well-known clustering datasets. The statistics of the experimental datasets are given in Table 1. These datasets are fully labeled; consequently, the CVIs purity, rand index, and NMI can be computed exactly. We compare our proposed SGD-LMSVC with the following baselines.
4.2.2 Baselines
The compactness and DB index of the clustering methods on the experimental datasets
| Datasets | Compactness | | | DB index | | |
|---|---|---|---|---|---|---|
| | SVC | SGD | FSVC | SVC | SGD | FSVC |
| Aggregation | 0.29 | 0.29 | 2.84 | 0.68 | 0.67 | 0.63 |
| Breast cancer | 1.26 | 0.68 | 0.71 | 1.58 | 1.38 | 0.53 |
| Compound | 0.5 | 0.21 | 2.43 | 2.45 | 0.86 | 0.67 |
| Flame | 0.58 | 0.44 | 2.28 | 1.3 | 0.65 | 3.56 |
| Glass | 0.72 | 0.68 | 1.85 | 0.53 | 0.56 | 0.93 |
| Iris | 0.98 | 0.25 | 0.99 | 1.95 | 1.17 | 0.77 |
| Jain | 0.96 | 0.36 | 1.16 | 1.23 | 1.08 | 0.71 |
| Pathbased | 0.18 | 0.3 | 1.04 | 0.36 | 0.73 | 1.07 |
| R15 | 0.61 | 0.13 | 1.84 | 2.96 | 1.42 | 1.37 |
| Spiral | 2 | 0.17 | 0.18 | 1.41 | 0.98 | 0.36 |
| D31 | 1.41 | 0.26 | 1.78 | 2.33 | 1.35 | 1.21 |
| Abalone | 3.88 | 0.40 | 4.97 | 3.78 | 3.91 | 1.29 |
| Car | 0.75 | 0.74 | 14.68 | 1.76 | 1.76 | 1.57 |
| Musk | 9.89 | 30.05 | 20.00 | 2.27 | 2.83 | 0.01 |
| Shuttle | 0.50 | 0.46 | 0.26 | 1.86 | 1.84 | 1.32 |
Training time in second (i.e., the time for finding domain of novelty) and clustering time in second (i.e., the time for clustering assignment) of the clustering methods on the experimental datasets
| Datasets | Training time | | | Clustering time | | |
|---|---|---|---|---|---|---|
| | SVC | SGD | FSVC | SVC | SGD | FSVC |
| Aggregation | 0.05 | 0.03 | 0.05 | 31.42 | 2.83 | 7.51 |
| Breast cancer | 0.18 | 0.02 | 0.05 | 19.80 | 2.14 | 22.86 |
| Compound | 0.03 | 0.02 | 0.10 | 6.82 | 1.17 | 7.24 |
| Flame | 0.02 | 0.02 | 15.16 | 1.81 | 0.67 | 4.31 |
| Glass | 0.03 | 0.03 | 0.02 | 2.30 | 0.53 | 10.67 |
| Iris | 0.02 | 0.02 | 0.04 | 1.03 | 0.34 | 4.33 |
| Jain | 0.02 | 0.02 | 0.03 | 5.80 | 0.81 | 4.59 |
| Pathbased | 0.02 | 0.02 | 0.05 | 4.02 | 0.54 | 4.22 |
| R15 | 0.02 | 0.02 | 0.02 | 4.14 | 3.68 | 10.43 |
| Spiral | 0.02 | 0.03 | 0.02 | 1.60 | 0.99 | 7.78 |
| D31 | 0.17 | 0.09 | 0.09 | 467.72 | 6.56 | 33.08 |
| Abalone | 2.26 | 0.81 | 10.94 | 653.65 | 26.58 | 242.97 |
| Car | 5.62 | 0.64 | 8.15 | 67.66 | 7.05 | 84.47 |
| Musk | 55.93 | 5.79 | 58.49 | 602.09 | 432.58 | 510.25 |
| Shuttle | 10.03 | 0.46 | 68.43 | 1,972.61 | 925 | 1,125.46 |
4.2.3 Hyperparameter setting
The RBF kernel, given by \(K\left( x,x'\right) =e^{-\gamma \left\| x-x'\right\| ^{2}}\), is employed. The kernel width \(\gamma \) is searched over the grid \(\left\{ 2^{-5},\,2^{-3},\,\ldots ,\,2^{3},\,2^{5}\right\} \). The trade-off parameter C is searched over the same grid. In addition, the parameters p and \(\varepsilon \) in FSVC are searched over the common grid \(\left\{ 0.1,0.2,\ldots ,0.9,1\right\} \), as in [8]. Determining the number of iterations in Algorithm 1 is challenging in practice. To resolve this, we use the stopping criterion \(\left\| \mathbf {w}_{t+1}-\mathbf {w}_{t}\right\| \le \theta =0.01\), i.e., we stop when the hyperplane changes only slightly.
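The kernel and the search grids described above can be written down directly; `gamma_grid`, `C_grid`, and `configs` are our own names for the quantities in the text:

```python
import numpy as np
from itertools import product

def rbf_kernel(x, xp, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    x, xp = np.asarray(x, float), np.asarray(xp, float)
    return float(np.exp(-gamma * np.sum((x - xp) ** 2)))

# Grids from the text: {2^-5, 2^-3, ..., 2^3, 2^5} for both gamma and C.
gamma_grid = [2.0 ** k for k in range(-5, 6, 2)]
C_grid = list(gamma_grid)
configs = list(product(gamma_grid, C_grid))  # 6 x 6 = 36 candidate pairs
```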
We report the experimental results for purity, rand index, and NMI in Table 2, for compactness and DB index in Table 3, and for the training time (i.e., the time for finding the domain of novelty) and the clustering time (i.e., the time for clustering assignment) in Table 4. For each CVI, we boldface the method that yields the best outcome, i.e., the highest value for purity, rand index, and NMI and the lowest value for compactness and DB index. As shown in Tables 2 and 3, our proposed SGD-LMSVC is generally comparable with the baselines on the CVIs. In particular, SGD-LMSVC is slightly better than the others on purity, rand index, and NMI, and clearly surpasses them on compactness, while being slightly worse than SVC on DB index. Regarding the time taken for training and clustering assignment, SGD-LMSVC is clearly superior to the others. For the training time, the speedup is significant on the medium- and large-scale datasets, including Shuttle, Musk, and Abalone. The speedup is especially significant for the clustering time.
5 Conclusion
In this paper, we have proposed a fast support-based clustering method which conjoins the advantages of SGD-based and kernel-based methods. Furthermore, we have proposed a new strategy for clustering assignment. We validated our proposed method on 15 well-known clustering datasets. The experiments show that our proposed method achieves clustering quality comparable to the baselines while being significantly faster than them.
References
- 1. Ben-Hur, A., Horn, D., Siegelmann, H.T., Vapnik, V.: Support vector clustering. J. Mach. Learn. Res. 2, 125–137 (2001)
- 2. Camastra, F., Verri, A.: A novel kernel method for clustering. IEEE Trans. Pattern Anal. Mach. Intell. 27(5), 801–804 (2005)
- 3. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27:1–27:27 (2011)
- 4. Chu, C.S., Tsang, I.W., Kwok, J.T.: Scaling up support vector data description by using core-sets. In: Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, vol. 1. IEEE (2004)
- 5. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: Clustering validity checking methods: part II. SIGMOD Rec. 31(3), 19–27 (2002)
- 6. Hazan, E., Kale, S.: Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. J. Mach. Learn. Res. 15(1), 2489–2512 (2014)
- 7. Joachims, T.: Making large-scale support vector machine learning practical. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods, pp. 169–184. The MIT Press, Cambridge (1999)
- 8. Jung, K.-H., Lee, D., Lee, J.: Fast support-based clustering method for large-scale problems. Pattern Recognit. 43(5), 1975–1983 (2010)
- 9. Le, T., Tran, D., Ma, W., Sharma, D.: An optimal sphere and two large margins approach for novelty detection. In: The 2010 International Joint Conference on Neural Networks (IJCNN), pp. 1–6. IEEE (2010)
- 10. Le, T., Tran, D., Nguyen, P., Ma, W., Sharma, D.: Proximity multisphere support vector clustering. Neural Comput. Appl. 22(7–8), 1309–1319 (2013)
- 11. Lee, J., Lee, D.: An improved cluster labeling method for support vector clustering. IEEE Trans. Pattern Anal. Mach. Intell. 27(3), 461–464 (2005)
- 12. Lee, J., Lee, D.: Dynamic characterization of cluster structures for robust and inductive support vector clustering. IEEE Trans. Pattern Anal. Mach. Intell. 28(11), 1869–1874 (2006)
- 13. Li, H.: A fast and stable cluster labeling method for support vector clustering. J. Comput. 8(12), 3251–3256 (2013)
- 14. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge (2012)
- 15. Park, J.H., Ji, X., Zha, H., Kasturi, R.: Support vector clustering combined with spectral graph partitioning. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 4, pp. 581–584. IEEE (2004)
- 16. Pham, T., Dang, H., Le, T., Le, H.-T.: Stochastic gradient descent support vector clustering. In: 2015 2nd National Foundation for Science and Technology Development Conference on Information and Computer Science (NICS), pp. 88–93 (2015)
- 17. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods, pp. 185–208. The MIT Press, Cambridge (1999)
- 18. Schölkopf, B., Platt, J.C., Shawe-Taylor, J.C., Smola, A.J., Williamson, R.C.: Estimating the support of a high-dimensional distribution. Neural Comput. 13(7), 1443–1471 (2001)
- 19. Shalev-Shwartz, S., Singer, Y.: Logarithmic regret algorithms for strongly convex repeated games. Technical report, The Hebrew University, Jerusalem (2007)
- 20. Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: primal estimated sub-gradient solver for SVM. In: Ghahramani, Z. (ed.) ICML, pp. 807–814 (2007)
- 21. Shamir, R., Sharan, R.: Algorithmic approaches to clustering gene expression data. In: Current Topics in Computational Biology, pp. 269–300. MIT Press, Cambridge (2001)
- 22. Tax, D.M.J., Duin, R.P.W.: Support vector data description. Mach. Learn. 54(1), 45–66 (2004)
- 23. Yang, J., Estivill-Castro, V., Chalup, S.K.: Support vector clustering through proximity graph modelling. In: Neural Information Processing (ICONIP'02), vol. 2, pp. 898–903 (2002)
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.