Fast support vector clustering

Support-based clustering has recently attracted considerable attention because of its applications to difficult and diverse clustering and outlier detection problems. A support-based clustering method proceeds in two phases: finding the domain of novelty and performing the clustering assignment. For finding the domain of novelty, the training time of current solvers is typically over-quadratic in the training size, which impedes the application of support-based clustering to large-scale datasets. In this paper, we propose applying the stochastic gradient descent framework to the first phase of support-based clustering to find the domain of novelty in the form of a half-space, together with a new strategy for performing the clustering assignment. We validate the proposed method on several well-known clustering datasets and show that it yields clustering quality comparable to the baselines while being faster than them.


Hang Dang (dthang@hcmus.edu.vn)
1 Faculty of Information Technology, VNUHCM-University of Science, Ho Chi Minh City, Vietnam
2 Faculty of Information Technology, HCMc University of Pedagogy, Ho Chi Minh City, Vietnam

Introduction
Cluster analysis is a fundamental problem in pattern recognition in which objects are categorized into groups, or clusters, based on pairwise similarities such that two criteria, homogeneity and separation, are achieved [21]. Two challenges in cluster analysis are (1) dealing with complicated data with nested or hierarchical structures; and (2) automatically detecting the number of clusters. Recently, support-based clustering, e.g., support vector clustering (SVC) [1], has drawn significant research interest because of its applications to difficult and diverse clustering and outlier detection problems [1,2,8,10,11,15,23]. These methods have two main advantages over other clustering methods: (1) the ability to generate cluster boundaries of arbitrary shape and to discover the number of clusters automatically; and (2) the capability to handle outliers well.
Support-based clustering methods proceed in two phases. In the first phase, the domain of novelty, e.g., an optimal hypersphere [1,9,22] or hyperplane [18], is discovered in the feature space. When mapped back into the input space, the domain of novelty becomes a set of contours tightly enclosing the data, which can be interpreted as cluster boundaries. However, this set of contours does not specify how to assign a data sample to its cluster. In addition, the computational complexity of the current solvers [3,7] for finding the domain of novelty is often over-quadratic [4], which impedes the use of support-based clustering methods on real-world datasets. In the second phase, namely clustering assignment, data samples are assigned to their clusters based on the geometric information carried by the set of contours obtained in the first phase. Several works have been proposed to improve the cluster assignment procedure [2,8,11,15,23].
Recently, stochastic gradient descent (SGD) frameworks [6,19,20] have emerged as building blocks for learning methods that efficiently handle large-scale datasets. SGD-based algorithms have the following advantages: (1) they are very fast; (2) they can run in online mode; and (3) they do not require loading the entire dataset into main memory during training. In this paper, we combine the advantages of SGD with support-based clustering. In particular, we propose to use the optimal hyperplane as the domain of novelty. The margin, i.e., the distance from the origin to the optimal hyperplane, is maximized to make the contours enclose the data as tightly as possible. We subsequently apply the stochastic gradient descent framework proposed in [19] to the first phase of support-based clustering to obtain the domain of novelty. Finally, we propose a new strategy for clustering assignment in which each data sample in the extended decision boundary follows its own trajectory to converge to an equilibrium point, and clustering assignment is then reduced to the same task for those equilibrium points. Our clustering assignment strategy differs from the existing works [8,11,12,13] in how the trajectory from a given start is found and in the initial set of data samples for which a trajectory must be computed to find the corresponding equilibrium point. Experiments on real-world datasets show that our proposed method produces clustering quality comparable to other support-based clustering methods while achieving a computational speedup.
To summarize, the contributions of the paper are as follows:
- Different from the works of [1,2,11,15,23], which employ a hypersphere to characterize the domain of novelty, we propose using a hyperplane. This allows us to introduce an SGD-based solution for finding the domain of novelty.
- We propose an SGD-based solution for finding the domain of novelty and perform a rigorous convergence analysis for it. We note that the works of [1,2,11,15,23] utilized the Sequential Minimal Optimization-based approach [17] to find the domain of novelty, whose computational complexity is over-quadratic and which requires loading the entire Gram matrix into main memory.
- We propose a new clustering assignment strategy, which reduces the clustering assignment for N samples in the entire training set to the same task for M equilibrium points, where M is usually very small compared with N.
- Compared with the conference version [16], this paper presents a more rigorous convergence analysis with full proofs and explanations. In addition, it introduces a new strategy for clustering assignment. Regarding the experiments, it compares with more baselines and reports more experimental results.
Stochastic gradient descent large margin one-class support vector machine

Large margin one-class support vector machine
Given the dataset $D = \{x_1, x_2, \ldots, x_N\}$, to define the domain of novelty, we construct an optimal hyperplane that separates the data samples from the origin such that the margin, i.e., the distance from the origin to the hyperplane, is maximized.
The optimization problem is formulated as

$$\max_{w, \rho} \frac{|\rho|}{\|w\|}$$

subject to

$$w^T \phi(x_i) - \rho \geq 0, \quad i = 1, \ldots, N, \qquad \rho > 0$$

where $\phi$ is a transformation from the input space to the feature space and $w^T \phi(x) - \rho = 0$ is the equation of the hyperplane. The margin is invariant if we scale $(w, \rho)$ by a factor $k$. Hence, without loss of generality, we can assume that $\rho = 1$, and we obtain the following optimization problem

$$\min_{w} \frac{1}{2}\|w\|^2$$

subject to

$$w^T \phi(x_i) - 1 \geq 0, \quad i = 1, \ldots, N$$

Using slack variables, we can extend the above optimization problem to form the soft model of the large margin one-class support vector machine (LMOCSVM)

$$\min_{w, \xi} \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N} \xi_i$$

subject to

$$w^T \phi(x_i) - 1 \geq -\xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, N$$

where $C > 0$ is the trade-off parameter and $\xi = [\xi_1, \ldots, \xi_N]$ is the vector of slack variables. We can rewrite the above optimization problem in the primal form as follows

$$\min_{w} J(w) \triangleq \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N} \max\{0, 1 - w^T \phi(x_i)\} \qquad (1)$$

SGD-based solution in the primal form
To efficiently solve the optimization problem in Eq. (1), we use the stochastic gradient descent method. We name the resulting method stochastic gradient descent-based large margin one-class support vector machine (SGD-LMSVC).
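At the $t$-th round, we sample a data point $x_{n_t}$ from the dataset $D$. Let us define the instantaneous function

$$g_t(w) \triangleq \frac{1}{2}\|w\|^2 + C \max\{0, 1 - w^T \phi(x_{n_t})\}$$

The learning rate is $\eta_t = \frac{1}{t}$ and the subgradient is $\lambda_t = g_t'(w_t) = w_t - C\, I_{[w_t^T \phi(x_{n_t}) < 1]}\, \phi(x_{n_t})$, where $I_A(.)$ is the indicator function. Therefore, the update rule is

$$w_{t+1} = w_t - \eta_t \lambda_t = \left(1 - \frac{1}{t}\right) w_t + \frac{C}{t}\, I_{[w_t^T \phi(x_{n_t}) < 1]}\, \phi(x_{n_t}) \qquad (2)$$

Algorithm 1 Algorithm for solving SGD-LMSVC in the primal form.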
Algorithm 1 finds the optimal hyperplane that defines the domain of novelty. At each round, one data sample is uniformly sampled from the training set and the update rule in Eq. (2) is applied to determine the next hyperplane, i.e., $w_{t+1}$. Finally, the last hyperplane, i.e., $w_{T+1}$, is output as the optimal hyperplane. According to the theory presented in the next section, we can output any randomly chosen intermediate hyperplane and an approximately accurate solution is still guaranteed when training runs long enough. Nonetheless, in Algorithm 1, we output the last hyperplane to exploit as much as possible the information accumulated through the iterations. It is worth noting that in Algorithm 1 we store $w_t$ as $w_t = \sum_i \alpha_i \phi(x_i)$.
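The procedure described above can be sketched in kernel form. This is a minimal illustration rather than the authors' implementation: it assumes the hinge-loss objective with learning rate 1/t, a Gaussian kernel, and the representation of the hyperplane by coefficients α over the training samples. The function names and default values are hypothetical.

```python
import math
import random


def gaussian_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_value(alpha, data, x, gamma):
    """w^T phi(x) for w = sum_i alpha_i phi(x_i); alpha maps index -> coefficient."""
    return sum(a * gaussian_kernel(data[i], x, gamma) for i, a in alpha.items())


def sgd_lmocsvm(data, C=1.0, gamma=1.0, T=500, seed=0):
    """Run T SGD rounds and return the sparse coefficients of the last hyperplane."""
    rng = random.Random(seed)
    alpha = {}  # w_1 = 0
    for t in range(1, T + 1):
        n_t = rng.randrange(len(data))  # uniform sampling of one data point
        # Margin violation is checked with the current hyperplane w_t.
        violated = decision_value(alpha, data, data[n_t], gamma) < 1.0
        # w_{t+1} = (1 - 1/t) w_t + (C/t) * I[violated] * phi(x_{n_t})
        shrink = 1.0 - 1.0 / t
        alpha = {i: shrink * a for i, a in alpha.items() if shrink * a > 0.0}
        if violated:
            alpha[n_t] = alpha.get(n_t, 0.0) + C / t
    return alpha
```

In this sketch the coefficients stay non-negative and their sum never exceeds C, which mirrors the norm bound used in the convergence analysis. Rescaling every coefficient at each round costs time linear in the model size; a production implementation would typically keep a global scale factor instead.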

Convergence analysis
In this section, we present the convergence analysis of Algorithm 1. We assume that the data are bounded in the feature space, that is, $\|\phi(x)\| \leq R, \forall x \in X$. We denote the optimal solution by $w^*$, that is, $w^* = \text{argmin}_w J(w)$. We derive as follows.
Lemma 1 establishes a bound on $\|w_T\|$, followed by Lemma 2, which establishes a bound on $\|\lambda_T\|$.

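Lemma 1
The following statement holds

$$\|w_T\| \leq CR$$

Proof By the update rule in Eq. (2), $t\, w_{t+1} = (t-1)\, w_t + C\, I_{[w_t^T \phi(x_{n_t}) < 1]}\, \phi(x_{n_t})$, and hence

$$t\|w_{t+1}\| - (t-1)\|w_t\| \leq CR$$

Taking the sum of the above for $t = 1, 2, \ldots, T-1$, we gain $(T-1)\|w_T\| \leq (T-1)CR$.

Lemma 2
The following statement holds

$$\|\lambda_T\| \leq \|w_T\| + C\|\phi(x_{n_T})\| \leq 2CR$$

Theorem 1 establishes a bound on the regret and shows that Algorithm 1 has the convergence rate $O\left(\frac{\log T}{T}\right)$.

Theorem 1 Let us consider the running of Algorithm 1. The following statement holds

$$\mathbb{E}[J(\bar{w}_T)] - J(w^*) \leq \frac{H(\log T + 1)}{2T}$$

where $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$ and $H = 4C^2R^2$.

Proof It is apparent that $\mathbb{E}[\lambda_t \mid w_t] = J'(w_t)$, a subgradient of $J$ at $w_t$. We have the following

$$\|w_{t+1} - w^*\|^2 = \|w_t - \eta_t \lambda_t - w^*\|^2 = \|w_t - w^*\|^2 - 2\eta_t \lambda_t^T (w_t - w^*) + \eta_t^2 \|\lambda_t\|^2$$

Taking the conditional expectation of the above w.r.t. $w_t$, and using the 1-strong convexity of $J$ together with Lemma 2, we gain

$$J(w_t) - J(w^*) \leq \frac{t-1}{2}\|w_t - w^*\|^2 - \frac{t}{2}\,\mathbb{E}\left[\|w_{t+1} - w^*\|^2 \mid w_t\right] + \frac{H}{2t}$$

Taking expectation again, we achieve

$$\mathbb{E}[J(w_t)] - J(w^*) \leq \frac{t-1}{2}\,\mathbb{E}\|w_t - w^*\|^2 - \frac{t}{2}\,\mathbb{E}\|w_{t+1} - w^*\|^2 + \frac{H}{2t}$$

Taking the sum of the above inequality for $t = 1, \ldots, T$, we gain

$$\sum_{t=1}^{T} \mathbb{E}[J(w_t)] - T J(w^*) \leq \frac{H}{2}\sum_{t=1}^{T}\frac{1}{t} \leq \frac{H(\log T + 1)}{2}$$

Dividing by $T$ and using the convexity of $J$, we gain the conclusion.

Theorem 1 shows the inequality for the average solution in the expectation form. In the following theorem, we prove that if we output a single-point solution, the inequality itself holds with high probability.

Theorem 2 Let us consider the running of Algorithm 1. Let $r$ be an integer randomly picked from $\{1, 2, \ldots, T\}$. Given $\delta \in (0, 1)$, with probability at least $1 - \delta$ the following inequality holds

$$J(w_r) - J(w^*) \leq \frac{H(\log T + 1)}{2\delta T}$$

Proof From the proof of Theorem 1, $\mathbb{E}[J(w_r) - J(w^*)] \leq \frac{H(\log T + 1)}{2T}$. Using the Markov inequality, we gain

$$P\left(J(w_r) - J(w^*) \geq \varepsilon\right) \leq \frac{\mathbb{E}[J(w_r) - J(w^*)]}{\varepsilon} \leq \frac{H(\log T + 1)}{2\varepsilon T}$$

Choosing $\delta = \frac{H(\log T + 1)}{2\varepsilon T}$, we gain the conclusion.

We now investigate the number of iterations required to obtain an $\varepsilon$-precision solution with probability at least $1 - \delta$. According to Theorem 2, the number of iterations $T$ must be at least $T_0$, where $T_0$ is the smallest number such that

$$\frac{H(\log T_0 + 1)}{2\delta T_0} \leq \varepsilon$$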
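The threshold $T_0$ has no closed form, but it can be found numerically. The sketch below assumes that the high-probability bound takes the form $H(\log T + 1)/(2\delta T)$ with $H = 4C^2R^2$, which is the form produced by the standard analysis of SGD on a 1-strongly convex objective; the constant and the function names are assumptions made for illustration.

```python
import math


def bound(T, C, R, delta):
    """Assumed bound H * (log T + 1) / (2 * delta * T) with H = 4 C^2 R^2."""
    H = 4.0 * C * C * R * R
    return H * (math.log(T) + 1.0) / (2.0 * delta * T)


def min_iterations(eps, delta, C=1.0, R=1.0):
    """Smallest integer T >= 1 whose bound is at most eps.

    The bound is non-increasing in T for T >= 1, so doubling followed by
    binary search finds the threshold.
    """
    hi = 1
    while bound(hi, C, R, delta) > eps:
        hi *= 2
    lo = max(1, hi // 2)
    while lo < hi:
        mid = (lo + hi) // 2
        if bound(mid, C, R, delta) <= eps:
            hi = mid
        else:
            lo = mid + 1
    return hi
```

For the Gaussian kernel, $K(x, x) = 1$, so $R = 1$ is a natural choice in this sketch.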

Clustering assignment
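After solving the optimization problem, we obtain the decision function

$$f(x) = w^T \phi(x) - 1 = \sum_{i=1}^{N} \alpha_i K(x_i, x) - 1$$

To find the equilibrium points, we need to solve the equation $\nabla f(x) = 0$. To this end, we use the fixed point technique and assume that the Gaussian kernel is used, i.e., $K(x, x') = e^{-\gamma\|x - x'\|^2}$. We then have

$$\nabla f(x) = 2\gamma \sum_{i=1}^{N} \alpha_i (x_i - x)\, e^{-\gamma\|x - x_i\|^2} = 0 \iff x = \frac{\sum_{i=1}^{N} \alpha_i e^{-\gamma\|x - x_i\|^2} x_i}{\sum_{i=1}^{N} \alpha_i e^{-\gamma\|x - x_i\|^2}} \triangleq P(x)$$

To find an equilibrium point, we start with an initial point $x^{(0)} \in \mathbb{R}^d$ and iterate $x^{(j+1)} = P(x^{(j)})$. By the fixed point theorem, the sequence $x^{(j)}$, which can be considered as a trajectory starting at $x^{(0)}$, converges to the point $x^{(0)}_*$ satisfying $P(x^{(0)}_*) = x^{(0)}_*$, i.e., $\nabla f(x^{(0)}_*) = 0$.

Let us denote $B = \{x_i : |f(x_i)| \leq \epsilon\}$, namely the extended boundary for a tolerance $\epsilon > 0$. It follows that the set $B$ forms a strip enclosing the decision boundary $f(x) = 0$. Algorithm 2 is proposed to perform the clustering assignment. In Algorithm 2, the task of clustering assignment is reduced to the same task for the $M$ equilibrium points. To carry out the cluster assignment for the $M$ equilibrium points, we run the $m = 20$ sample-point test as proposed in [1].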
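The fixed point iteration described above can be sketched as follows. The sketch assumes a Gaussian kernel and non-negative coefficients, so that the map $P$ is a kernel-weighted average of the support points; the function names, tolerance, and iteration cap are hypothetical choices.

```python
import math


def gaussian_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def fixed_point_map(x, support, alpha, gamma):
    """P(x): average of the support points weighted by alpha_i * K(x_i, x)."""
    weights = [a * gaussian_kernel(s, x, gamma) for s, a in zip(support, alpha)]
    total = sum(weights)  # assumed positive, i.e. x is not too far from the data
    return tuple(
        sum(w * s[d] for w, s in zip(weights, support)) / total
        for d in range(len(x))
    )


def equilibrium_point(x0, support, alpha, gamma, tol=1e-8, max_iter=1000):
    """Iterate x <- P(x) from x0 until successive iterates differ by at most tol."""
    x = tuple(x0)
    for _ in range(max_iter):
        nxt = fixed_point_map(x, support, alpha, gamma)
        if math.dist(nxt, x) <= tol:
            return nxt
        x = nxt
    return x
```

Starting points that lie in the same dense region end at numerically identical equilibrium points, so in practice the results are merged up to a small distance threshold before the sample-point test is run.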
Algorithm 2 Clustering assignment procedure.
Our proposed clustering assignment procedure differs from the existing procedure proposed in [1]. The procedure in [1] requires running the $m = 20$ sample-point test for every edge connecting $x_i, x_j$ ($i \neq j$) in the training set. Consequently, the computational cost incurred is $O(N(N-1)ms)$, where $s$ is the sparsity level of the decision function (i.e., the number of vectors in the model). Our proposed procedure needs to perform the $m = 20$ sample-point test only for a reduced set of $M$ data samples (i.e., the set of equilibrium points $\{e_1, e_2, \ldots, e_M\}$), where $M$ can be very small compared with $N$. The reason is that many data points in the training set converge to a common equilibrium point, which significantly reduces the size from $N$ to $M$. The computational cost incurred is therefore $O(M(M-1)ms)$.
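The sample-point test on the reduced set can be sketched as below. The sketch assumes that a point lies inside the domain of novelty when the decision function is non-negative and that two equilibrium points share a cluster when all $m$ interior points of the segment joining them lie inside; clusters are then the connected components of the resulting graph. The decision function `f` is passed in as a callable, and all names are illustrative.

```python
def segment_inside(a, b, f, m=20):
    """Return True if m evenly spaced interior points of segment [a, b] satisfy f >= 0."""
    for k in range(1, m + 1):
        lam = k / (m + 1)
        p = tuple((1.0 - lam) * ai + lam * bi for ai, bi in zip(a, b))
        if f(p) < 0.0:
            return False
    return True


def label_equilibrium_points(points, f, m=20):
    """Assign a cluster index to each point via connected components.

    At most M * (M - 1) segment tests are needed for M points, each costing
    m evaluations of f.
    """
    labels = [-1] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] == -1 and segment_inside(points[i], points[j], f, m):
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```

Because each evaluation of `f` costs time proportional to the number of vectors in the model, running this on $M$ equilibrium points instead of $N$ training samples is where the saving described above comes from.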

Visual experiment
To visually demonstrate the clustering quality produced by our proposed SGD-LMSVC, we conduct experiments on three synthetic datasets and visually compare SGD-LMSVC with C-Means and Fuzzy C-Means. In the first experiment, the data samples form a nested structure with two outer rings and one Gaussian distribution at the center. As shown in Fig. 1, SGD-LMSVC perfectly detects the three clusters without any prior information, while both C-Means and Fuzzy C-Means, with the number of clusters set to 3 beforehand, fail to discover the nested clusters. The second experiment is carried out on a two-moon dataset. As observed from Fig. 2, SGD-LMSVC, without any prior knowledge, flawlessly discovers the two moon-shaped clusters, whereas C-Means and Fuzzy C-Means cannot detect the clusters correctly. In the last visual experiment, we generate data from a mixture of 4 Gaussian distributions. As shown in Fig. 3, SGD-LMSVC perfectly detects the 4 clusters corresponding to the individual Gaussian distributions. These visual experiments show that SGD-LMSVC is able to generate cluster boundaries of arbitrary shape and to automatically detect the appropriate number of clusters present in the data.

Experiment on real datasets
To evaluate the performance of the proposed algorithm, we conduct experiments on real datasets. Clustering is essentially an unsupervised learning task and, therefore, there is no perfect measure for comparing two clustering algorithms. We examine five typical clustering validity indices (CVIs): compactness, purity, rand index, Davies-Bouldin index (DB index), and normalized mutual information (NMI). A good clustering algorithm should produce a solution with high purity, rand index, and NMI and with low compactness and DB index.

Clustering validity index
Compactness measures the average pairwise distance between points in the same cluster [5]. A clustering with small compactness is preferred: a small compactness means that the average intra-cluster distance is small and homogeneity is therefore good, i.e., two objects in the same cluster are highly similar to each other.
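One way to compute such a measure is sketched below. The exact normalization is an assumption: the sketch takes the mean pairwise Euclidean distance within each cluster and averages these values weighted by cluster size, with singleton clusters contributing zero. The function name is hypothetical.

```python
import math
from itertools import combinations


def compactness(points, labels):
    """Size-weighted average of the mean intra-cluster pairwise distance."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    n = len(points)
    total = 0.0
    for members in clusters.values():
        pairs = list(combinations(members, 2))
        if not pairs:
            continue  # a singleton cluster has no pairs and contributes zero
        mean_dist = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
        total += len(members) / n * mean_dist
    return total
```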
The second CVI in use is purity, which measures the purity of a clustering solution with respect to the natural classes of the data [14]. Purity is therefore only appropriate for data that come with labels. Let $N_{ij}$ be the number of objects in cluster $i$ that belong to class $j$, and let $N_i \triangleq \sum_{j=1}^{m} N_{ij}$ be the total number of objects in cluster $i$. Let us define $p_{ij} \triangleq \frac{N_{ij}}{N_i}$, i.e., the empirical distribution over class labels for cluster $i$. We define the purity of a cluster as $p_i \triangleq \max_j p_{ij}$ and the overall purity of a clustering solution as

$$\text{purity} \triangleq \sum_i \frac{N_i}{N}\, p_i$$

The purity ranges between 0 (bad) and 1 (good). This CVI reflects the classification ability of a clustering algorithm.
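The purity definition above reduces to counting, for each cluster, the size of its largest class and dividing the total by $N$. A minimal sketch, with a hypothetical function name:

```python
from collections import Counter


def purity(cluster_labels, class_labels):
    """Overall purity: sum_i (N_i / N) * max_j (N_ij / N_i) = (1 / N) * sum_i max_j N_ij."""
    per_cluster = {}
    for k, c in zip(cluster_labels, class_labels):
        per_cluster.setdefault(k, Counter())[c] += 1
    n = len(cluster_labels)
    return sum(max(counts.values()) for counts in per_cluster.values()) / n
```

For example, with clusters `[0, 0, 0, 1, 1, 1]` and classes `["a", "a", "b", "b", "b", "b"]`, the largest classes have sizes 2 and 3, so the purity is 5/6.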
A clustering algorithm that achieves high purity can be appropriately used for classification purposes. The third CVI is the rand index [14]. To calculate this CVI for a clustering solution, we construct a 2 × 2 contingency table containing the numbers of sample pairs TP, TN, FP, and FN. The rand index can be interpreted as the fraction of clustering decisions that are correct, and it ranges between 0 and 1.
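The rand index can be computed by enumerating all pairs of samples. In the sketch below, a pair counts as TP when both samples share a cluster and a class, TN when they share neither, FP when they share a cluster but not a class, and FN when they share a class but not a cluster. The function name is hypothetical, and at least two samples are assumed.

```python
from itertools import combinations


def rand_index(cluster_labels, class_labels):
    """(TP + TN) / (TP + FP + TN + FN) over all unordered pairs of samples."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        if same_cluster and same_class:
            tp += 1
        elif not same_cluster and not same_class:
            tn += 1
        elif same_cluster:
            fp += 1
        else:
            fn += 1
    return (tp + tn) / (tp + tn + fp + fn)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for illustration but not for large datasets.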
The Davies-Bouldin validity index (DBI) is a function of the ratio of the sum of intra-cluster distances to inter-cluster distances [5]. A good clustering algorithm should produce a solution with as small a DBI as possible.
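A sketch of the index in its commonly used form follows. It assumes that the intra-cluster distance is the mean distance of the members to their centroid and that the inter-cluster distance is the distance between centroids; the definition used in [5] may differ in these details. At least two clusters with distinct centroids are required, and the function name is hypothetical.

```python
import math


def davies_bouldin(points, labels):
    """Mean over clusters of the worst (s_i + s_j) / d(c_i, c_j) ratio."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    centroids, scatters = [], []
    for members in clusters.values():
        c = tuple(sum(col) / len(members) for col in zip(*members))
        centroids.append(c)
        scatters.append(sum(math.dist(p, c) for p in members) / len(members))
    k = len(centroids)
    worst = [
        max(
            (scatters[i] + scatters[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k
```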
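The last considered CVI is normalized mutual information (NMI) [14]. This measure allows us to trade off the quality of the clustering against the number of clusters.

$$\text{NMI}(\Omega, C) \triangleq \frac{I(\Omega, C)}{[H(C) + H(\Omega)]/2}$$

where $C = \{c_1, \ldots, c_J\}$ is the set of classes and $\Omega = \{\omega_1, \ldots, \omega_K\}$ is the set of clusters. $I(\Omega, C)$ is the mutual information and is defined as

$$I(\Omega, C) \triangleq \sum_{k}\sum_{j} P(\omega_k \cap c_j) \log \frac{P(\omega_k \cap c_j)}{P(\omega_k)\, P(c_j)}$$

and $H(.)$ is the entropy and is defined as

$$H(\Omega) \triangleq -\sum_{k} P(\omega_k) \log P(\omega_k)$$

Here $P(\omega_k)$, $P(c_j)$, and $P(\omega_k \cap c_j)$ are the probabilities that a sample belongs to cluster $\omega_k$, to class $c_j$, and to both, respectively. The NMI ranges between 0 and 1, and a good clustering algorithm should produce as high an NMI as possible.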
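The NMI can be computed from label counts, with probabilities estimated as relative frequencies. The sketch below is one such computation; returning 0 when both entropies vanish is a convention chosen here, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """H = -sum_k P(k) log P(k), with P estimated by relative frequency."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(cluster_labels, class_labels):
    """I = sum_{k,j} P(k, j) log( P(k, j) / (P(k) P(j)) )."""
    n = len(cluster_labels)
    joint = Counter(zip(cluster_labels, class_labels))
    n_cluster = Counter(cluster_labels)
    n_class = Counter(class_labels)
    return sum(
        (c / n) * math.log(c * n / (n_cluster[k] * n_class[j]))
        for (k, j), c in joint.items()
    )


def nmi(cluster_labels, class_labels):
    """Mutual information normalized by the mean of the two entropies."""
    denom = (entropy(cluster_labels) + entropy(class_labels)) / 2.0
    if denom <= 0.0:
        return 0.0  # convention for the degenerate single-cluster, single-class case
    return mutual_information(cluster_labels, class_labels) / denom
```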
We perform experiments on 15 well-known datasets for the clustering task. The statistics of the experimental datasets are given in Table 1. These datasets are fully labeled and, consequently, CVIs such as purity, rand index, and NMI can be fully evaluated. We compare our proposed SGD-LMSVC with baseline methods, including SVC [1] and FSVC [8].
It is noteworthy that the first phase of our proposed SGD-LMSVC is the SGD-based solution for LMOCSVM (cf. Algorithm 1) and the second phase is given in Algorithm 2. All competing methods are run on a Windows computer with a dual-core 2.6 GHz CPU and 4 GB RAM.

Hyperparameter setting
The RBF kernel, given by $K(x, x') = e^{-\gamma\|x - x'\|^2}$, is employed. The kernel width $\gamma$ is searched on the grid $\{2^{-5}, 2^{-3}, \ldots, 2^{3}, 2^{5}\}$. The trade-off parameter $C$ is searched on the same grid. In addition, the parameters $p$ and $\varepsilon$ in FSVC are searched on the common grid $\{0.1, 0.2, \ldots, 0.9, 1\}$, which is the same as in [8]. Determining the number of iterations in Algorithm 1 is challenging. To resolve this, we use the stopping criterion $\|w_{t+1} - w_t\| \leq \theta = 0.01$, i.e., we stop when the next hyperplane changes only slightly.
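The search grid and the stopping criterion can be sketched as follows. Since the hyperplane lives in the feature space, the sketch evaluates $\|w_{t+1} - w_t\|$ through the kernel, assuming both hyperplanes are stored as coefficient dictionaries over the training samples; this representation and the names used are assumptions.

```python
import math
from itertools import product

# Exponents -5, -3, ..., 3, 5 give the grid {2^-5, 2^-3, ..., 2^3, 2^5}.
GRID = [2.0 ** e for e in range(-5, 6, 2)]
CANDIDATES = list(product(GRID, GRID))  # (gamma, C) pairs to try
THETA = 0.01  # stopping threshold on the change of the hyperplane


def gaussian_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def weight_change(alpha_old, alpha_new, data, gamma):
    """||w_new - w_old|| for w = sum_i alpha_i phi(x_i), via the kernel trick."""
    keys = sorted(set(alpha_old) | set(alpha_new))
    diff = {i: alpha_new.get(i, 0.0) - alpha_old.get(i, 0.0) for i in keys}
    squared = sum(
        diff[i] * diff[j] * gaussian_kernel(data[i], data[j], gamma)
        for i in keys
        for j in keys
    )
    return math.sqrt(max(squared, 0.0))  # guard against tiny negative round-off
```

Training would stop at the first round where `weight_change(...) <= THETA`.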
We report the experimental results for purity, rand index, and NMI in Table 2, compactness and DB index in Table 3, and the training time (i.e., the time for finding the domain of novelty) and clustering time (i.e., the time for clustering assignment) in Table 4. For each CVI, we boldface the method that yields the better outcome, i.e., the highest value for purity, rand index, NMI, and DB index and the lowest value for compactness. As shown in Tables 2 and 3, our proposed SGD-LMSVC is generally comparable with the baselines on the CVIs. In particular, SGD-LMSVC is slightly better than the others on purity, rand index, and NMI, whereas it clearly surpasses the others on compactness. Moreover, SGD-LMSVC is slightly worse than SVC on DB index. Regarding the time taken for training and for clustering assignment, SGD-LMSVC is clearly superior to the others. For the training time, the speedup is significant on the medium-scale and large-scale datasets, including Shuttle, Musk, and Abalone. The speedup is especially significant for the clustering time.

Conclusion
In this paper, we have proposed a fast support-based clustering method that combines the advantages of SGD-based and kernel-based methods. Furthermore, we have proposed a new strategy for clustering assignment. We validated the proposed method on 15 well-known datasets for the clustering task. The experiments show that the proposed method achieves clustering quality comparable to the baselines while being significantly faster than them.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
that $E = \{e_1, e_2, \ldots, e_M\}$
Do the $m$ sample-point test for $E$ to find the cluster indices of $e_1, e_2, \ldots, e_M$.
Each point $x^{(0)} \in B$ is assigned to the cluster of its corresponding equilibrium point $x^{(0)}_* \in E$.
Each point $x \in D \setminus B$ is assigned to the cluster of its nearest neighbor in $B$ using the Euclidean distance.
Output: clustering solution for $D = \{x_1, \ldots, x_N\}$

Fig. 1 Visual comparison of SGD-LMSVC (the orange region is the domain of novelty) with C-Means and Fuzzy C-Means on the two-ring dataset

(1) TP (true positive) is the number of pairs that are in the same cluster and belong to the same class; (2) TN (true negative) is the number of pairs that are in two different clusters and belong to different classes; (3) FP (false positive) is the number of pairs that are in the same cluster but belong to different classes; and (4) FN (false negative) is the number of pairs that are in two different clusters but belong to the same class. The rand index is defined as follows

$$\text{Rand} \triangleq \frac{TP + TN}{TP + FP + TN + FN}$$

Table 1
The statistics of the experimental datasets

Table 3
The compactness and DB index of the clustering methods on the experimental datasets

Table 4
Training time in second (i.e., the time for finding domain of novelty) and clustering time in second (i.e., the time for clustering assignment) of the clustering methods on the experimental datasets