1 Introduction

Cluster analysis is a fundamental problem in pattern recognition in which objects are grouped into clusters based on pairwise similarities so that two criteria, homogeneity and separation, are achieved [21]. Two challenges in cluster analysis are (1) dealing with complicated data that contain nested or hierarchical structures; and (2) automatically detecting the number of clusters. Recently, support-based clustering, e.g., support vector clustering (SVC) [1], has drawn significant research attention because of its applications to difficult and diverse clustering and outlier detection problems [1, 2, 8, 10, 11, 15, 23]. These clustering methods have two main advantages compared with other clustering methods: (1) the ability to generate clustering boundaries of arbitrary shape and to automatically discover the number of clusters; and (2) the capability to handle outliers well.

Support-based clustering methods proceed in two phases. In the first phase, the domain of novelty, e.g., an optimal hypersphere [1, 9, 22] or hyperplane [18], is learned in the feature space. When mapped back into the input space, the domain of novelty becomes a set of contours tightly enclosing the data, which can be interpreted as cluster boundaries. However, this set of contours does not specify how to assign a data sample to its cluster. In addition, the computational complexity of the current solvers [3, 7] for finding the domain of novelty is often over-quadratic [4], which impedes the use of support-based clustering methods on real-world datasets. In the second phase, namely clustering assignment, data samples are assigned to their clusters based on the geometric information carried in the set of contours obtained from the first phase. Several works have been proposed to improve the cluster assignment procedure [2, 8, 11, 15, 23].

Recently, stochastic gradient descent (SGD) frameworks [6, 19, 20] have emerged as building blocks for developing learning methods that efficiently handle large-scale datasets. SGD-based algorithms have the following advantages: (1) they are very fast; (2) they can run in online mode; and (3) they do not require loading the entire dataset into main memory during training. In this paper, we conjoin the advantages of SGD with support-based clustering. In particular, we propose using the optimal hyperplane as the domain of novelty. The margin, i.e., the distance from the origin to the optimal hyperplane, is maximized to make the contours enclose the data as tightly as possible. We then apply the stochastic gradient descent framework proposed in [19] to the first phase of support-based clustering to obtain the domain of novelty. Finally, we propose a new strategy for clustering assignment in which each data sample in the extended decision boundary follows its own trajectory to an equilibrium point, and clustering assignment is then reduced to the same task for those equilibrium points. Our clustering assignment strategy differs from the existing works [8, 11–13] in how the trajectory and its starting point are determined and in the initial set of data samples that need to follow a trajectory to find the corresponding equilibrium point. Experiments on real-world datasets show that our proposed method produces clustering quality comparable with other support-based clustering methods while achieving a computational speedup.

To summarize, the contribution of the paper consists of the following points:

  • Different from the works of [1, 2, 11, 15, 23], which employ a hypersphere to characterize the domain of novelty, we propose using a hyperplane. This allows us to introduce an SGD-based solution for finding the domain of novelty.

  • We propose an SGD-based solution for finding the domain of novelty and perform a rigorous convergence analysis for it. We note that the works of [1, 2, 11, 15, 23] utilize the Sequential-Minimal-Optimization-based approach [17] to find the domain of novelty, whose computational complexity is over-quadratic and which requires loading the entire Gram matrix into main memory.

  • We propose a new clustering assignment strategy which reduces the clustering assignment of the N samples in the entire training set to the same task for M equilibrium points, where M is usually much smaller than N.

  • Compared with the conference version [16], this paper presents a more rigorous convergence analysis with full proofs and explanations. In addition, it introduces a new strategy for clustering assignment. Regarding the experiments, it compares against more baselines and reports more experimental results.

2 Stochastic gradient descent large margin one-class support vector machine

2.1 Large margin one-class support vector machine

Given the dataset \(\mathcal {D}=\left\{ x_{1},x_{2},\ldots ,x_{N}\right\} \), to define the domain of novelty we construct an optimal hyperplane that separates the data samples from the origin such that the margin, i.e., the distance from the origin to the hyperplane, is maximized. The optimization problem is formulated as

$$\begin{aligned} \underset{\mathbf {w},\rho }{\max }\left( \frac{\left| \rho \right| }{\Vert \mathbf {w}\Vert }\right) \end{aligned}$$

subject to

$$\begin{aligned}&\mathbf {w}^{\mathsf {T}}\phi (x_{i})-\rho \ge 0, \quad i=1,\ldots ,N \\&\mathbf {w}^{\mathsf {T}}\mathbf {0}-\rho =-\rho <0 \end{aligned}$$

where \(\phi \) is a transformation from the input space to the feature space and \(\mathbf {w}^{\mathsf {T}}\phi \left( x\right) -\rho =0\) is the equation of the hyperplane. Note that the margin is invariant if we scale \((\mathbf {w},\rho )\) by a positive factor k. Hence, without loss of generality, we can assume that \(\rho =1\), and we arrive at the following optimization problem

$$\begin{aligned} \underset{\mathbf {w}}{\text {min}}\left( \frac{1}{2}\Vert \mathbf {w}\Vert ^{2}\right) \end{aligned}$$

subject to

$$\begin{aligned} \mathbf {w}^{\mathsf {T}}\phi (x_{i})-1\ge 0, \quad i=1,\ldots ,N \end{aligned}$$

Using slack variables, we can extend the above optimization problem to obtain the soft model of the large margin one-class support vector machine (LMOCSVM)

$$\begin{aligned} \underset{\mathbf {w}}{\text {min}}\left( \frac{1}{2}\Vert \mathbf {w}\Vert ^{2}+\frac{C}{N}\sum _{i=1}^{N}\xi _{i}\right) \end{aligned}$$

subject to

$$\begin{aligned}&\mathbf {w}^{\mathsf {T}}\phi (x_{i})-1\ge -\xi _{i}, \quad i=1,\ldots ,N \\&\xi _{i}\ge 0, \quad i=1,\ldots ,N \end{aligned}$$

where \(C>0\) is the trade-off parameter and \(\varvec{\xi }=\left[ \xi _{1},\ldots ,\xi _{N}\right] \) is the vector of slack variables.

We can rewrite the above optimization problem in the primal form as follows

$$\begin{aligned} \underset{\mathbf {w}}{\text {min}}\left( J(\mathbf {w})\triangleq \frac{1}{2}\left\| \mathbf {w}\right\| ^{2}+\frac{C}{N}\sum _{i=1}^{N}\max \left\{ 0,1-\mathbf {w}^{\mathsf {T}}\phi \left( x_{i}\right) \right\} \right) \end{aligned}$$
(1)
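For concreteness, the objective in Eq. (1) can be evaluated directly when the feature map is explicit. The following minimal NumPy sketch assumes the identity feature map \(\phi (x)=x\); the function name `primal_objective` is our own illustrative choice and not part of the original method.

```python
import numpy as np

def primal_objective(w, X, C):
    """Evaluate J(w) from Eq. (1) for an explicit (linear) feature map phi(x) = x.

    w : (d,) weight vector, X : (N, d) data matrix, C : trade-off parameter.
    """
    margins = X @ w                          # w^T phi(x_i) for all i
    hinge = np.maximum(0.0, 1.0 - margins)   # max{0, 1 - w^T phi(x_i)}
    return 0.5 * np.dot(w, w) + (C / X.shape[0]) * hinge.sum()
```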

2.2 SGD-based solution in the primal form

To efficiently solve the optimization problem in Eq. (1), we use the stochastic gradient descent method. We name the resulting method stochastic gradient descent large margin one-class support vector machine (SGD-LMSVC).

At the \(t\mathrm{th}\) round, we sample a data point \(x_{n_{t}}\) from the dataset \(\mathcal {D}\). Let us define the instantaneous function \(g_{t}\left( \mathbf {w}\right) \triangleq \frac{1}{2}\left\| \mathbf {w}\right\| ^{2}+C\,\text {max}\left\{ 0,1-\mathbf {w}^{\mathsf {T}}\phi \left( x_{n_{t}}\right) \right\} \). It is clear that \(g_{t}(\mathbf {w})\) is 1-strongly convex w.r.t. the norm \(\left\| \cdot \right\| _{2}\) over the feature space.

The learning rate is \(\eta _{t}=\frac{1}{t}\) and the sub-gradient is \(\lambda _{t}=\mathbf {w}_{t}-C\mathbb {\mathbf {I}}_{[\mathbf {w}_{t}^{\mathsf {T}}\phi \left( x_{n_{t}}\right) <1]}\phi \left( x_{n_{t}}\right) \in \partial g_{t}\left( \mathbf {w}_{t}\right) \), where \(\mathbb {\mathbf {I}}_{A}\) denotes the indicator function of the event A. Therefore, the update rule is

$$\begin{aligned} \mathbf {w}_{t+1}=\mathbf {w}_{t}-\eta _{t}\lambda _{t}=\left( 1-\frac{1}{t}\right) \mathbf {w}_{t}+\frac{C}{t}\mathbb {\mathbf {I}}_{[\mathbf {w}_{t}^{\mathsf {T}}\phi \left( x_{n_{t}}\right) <1]}\phi \left( x_{n_{t}}\right) \end{aligned}$$
(2)
Algorithm 1 (SGD-based algorithm for finding the domain of novelty)

Algorithm 1 finds the optimal hyperplane which defines the domain of novelty. At each round, one data sample is uniformly sampled from the training set and the update rule in Eq. (2) is applied to determine the next hyperplane, i.e., \(\mathbf {w}_{t+1}\). Finally, the last hyperplane, i.e., \(\mathbf {w}_{T+1}\), is output as the optimal hyperplane. According to the theory presented in the next section, we could also output any randomly chosen intermediate hyperplane and an approximately optimal solution would still be guaranteed after sufficiently long training. Nonetheless, in Algorithm 1 we use the last hyperplane as the output to exploit as much as possible the information accumulated through the iterations. It is worth noting that in Algorithm 1 we store \(\mathbf {w}_{t}\) as \(\mathbf {w}_{t}=\sum _{i}\alpha _{i}\phi \left( x_{i}\right) \).
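For illustration, the sketch below shows one possible implementation of Algorithm 1 using the kernel expansion \(\mathbf {w}_{t}=\sum _{i}\alpha _{i}\phi \left( x_{i}\right) \); the Gaussian kernel and the function names are our own choices, not a transcription of the original pseudocode.

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def sgd_lmsvc_train(X, C, gamma, T, seed=0):
    """Sketch of Algorithm 1: SGD for the large margin one-class SVM.

    Stores w_t implicitly as w_t = sum_i alpha[i] * phi(X[i]) (Gaussian kernel).
    Returns the coefficient vector alpha of the last hyperplane w_{T+1}.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    alpha = np.zeros(N)
    for t in range(1, T + 1):
        n_t = rng.integers(N)                          # uniformly sample one point
        # w_t^T phi(x_{n_t}) = sum_i alpha_i K(x_i, x_{n_t})
        score = sum(alpha[i] * rbf_kernel(X[i], X[n_t], gamma)
                    for i in range(N) if alpha[i] != 0.0)
        alpha *= (1.0 - 1.0 / t)                       # (1 - 1/t) * w_t
        if score < 1.0:                                # indicator in Eq. (2)
            alpha[n_t] += C / t                        # + (C/t) * phi(x_{n_t})
    return alpha
```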

2.3 Convergence analysis

In this section, we show the convergence analysis of Algorithm 1. We assume that the data are bounded in the feature space, that is, \(\left\| \phi \left( x\right) \right\| \le R,\;\forall x\,\in \mathcal {X}\). We denote the optimal solution by \(\mathbf {w}^{*}\), that is, \(\mathbf {w}^{*}=\text {argmin}_{\mathbf {w}}\,\mathcal {J}\left( \mathbf {w}\right) \). We proceed as follows.

Lemma 1 establishes a bound on \(\left\| \mathbf {w}_{T}\right\| \), followed by Lemma 2 which establishes a bound on \(\left\| \lambda _{T}\right\| \).

Lemma 1

The following statement holds

$$\begin{aligned} \left\| \mathbf {w}_{T}\right\| \le CR, \quad \forall \; T \end{aligned}$$

Proof

We have

$$\begin{aligned}&t\mathbf {w}_{t+1} =\left( t-1\right) \mathbf {w}_{t}+C\mathbb {\mathbf {I}}_{[\mathbf {w}_{t}^{\mathsf {T}}\phi \left( x_{n_{t}}\right) <1]}\phi \left( x_{n_{t}}\right) \\&t\left\| \mathbf {w}_{t+1}\right\| \le \left( t-1\right) \left\| \mathbf {w}_{t}\right\| +CR \end{aligned}$$

Summing the above over \(t=1,2,\ldots ,T-1\), we obtain

$$\begin{aligned}&\left( T-1\right) \left\| \mathbf {w}_{T}\right\| \le \left( T-1\right) CR\\&\left\| \mathbf {w}_{T}\right\| \le CR \end{aligned}$$

Lemma 2

The following statement holds

$$\begin{aligned} \left\| \lambda _{T}\right\| =\left\| \mathbf {w}_{T}-C\mathbb {\mathbf {I}}_{[\mathbf {w}_{T}^{\mathsf {T}}\phi \left( x_{n_{T}}\right) <1]}\phi \left( x_{n_{T}}\right) \right\| \le 2CR, \quad \forall \; T \end{aligned}$$

Proof

We have

$$\begin{aligned} \left\| \lambda _{T}\right\|&\le \left\| \mathbf {w}_{T}\right\| +CR\le 2CR \end{aligned}$$

Theorem 1 establishes a bound on regret and shows that Algorithm 1 has the convergence rate \(\text {O}\left( \frac{\log \,T}{T}\right) \).

Theorem 1

Let us consider the running of Algorithm 1. The following statement holds

$$\begin{aligned} \mathbb {E}\left[ \mathcal {J}\left( \overline{\mathbf {w}}_{T}\right) \right] -\mathcal {J}\left( \mathbf {w}^{*}\right) \le \frac{2C^{2}R^{2}\left( \log \,T+1\right) }{T} \end{aligned}$$

where \(\overline{\mathbf {w}}_{T}=\frac{1}{T}\sum _{t=1}^{T}\mathbf {w}_{t}\).

Proof

It is apparent that

$$\begin{aligned} \mathbb {E}\left[ \lambda _{t}|\mathbf {w}_{t}\right] =\mathbf {w}_{t}-\frac{C}{N}\sum _{i=1}^{N}\mathbb {\mathbf {I}}_{[\mathbf {w}_{t}^{\mathsf {T}}\phi \left( x_{i}\right) <1]}\phi \left( x_{i}\right) =\mathcal {J}^{'}\left( \mathbf {w}_{t}\right) \end{aligned}$$

We have the following

$$\begin{aligned} \left\| \mathbf {w}_{t+1}-\mathbf {w}^{*}\right\| ^{2}&=\left\| \mathbf {w}_{t}-\eta _{t}\lambda _{t}-\mathbf {w}^{*}\right\| ^{2}\\&\le \left\| \mathbf {w}_{t}-\mathbf {w}^{*}\right\| ^{2}-2\eta _{t}\lambda _{t}^{\mathsf {T}}\left( \mathbf {w}_{t}-\mathbf {w}^{*}\right) \\&\quad +\eta _{t}^{2}\left\| \lambda _{t}\right\| ^{2} \end{aligned}$$

Taking the conditional expectation of the above w.r.t. \(\mathbf {w}_{t}\) and using the 1-strong convexity of \(\mathcal {J}\), we obtain

$$\begin{aligned}&\mathcal {J}^{'}\left( \mathbf {w}_{t}\right) ^{\mathsf {T}}\left( \mathbf {w}_{t}-\mathbf {w}^{*}\right) \\&\quad \le \frac{\mathbb {E}\left[ \left\| \mathbf {w}_{t} -\mathbf {w}^{*}\right\| ^{2}\right] -\mathbb {E}\left[ \left\| \mathbf {w}_{t+1} -\mathbf {w}^{*}\right\| ^{2}\right] }{2\eta _{t}}+\frac{\eta _{t}}{2}\mathbb {E}\left[ \left\| \lambda _{t}\right\| ^{2}\right] \\&\mathcal {J}\left( \mathbf {w}_{t}\right) -\mathcal {J}\left( \mathbf {w}^{*}\right) +\frac{1}{2}\left\| \mathbf {w}_{t}-\mathbf {w}^{*}\right\| ^{2} \\&\quad \le \frac{\mathbb {E}\left[ \left\| \mathbf {w}_{t} -\mathbf {w}^{*}\right\| ^{2}\right] -\mathbb {E}\left[ \left\| \mathbf {w}_{t+1} -\mathbf {w}^{*}\right\| ^{2}\right] }{2\eta _{t}}+\frac{\eta _{t}}{2}\mathbb {E} \left[ \left\| \lambda _{t}\right\| ^{2}\right] \\&\mathcal {J}\left( \mathbf {w}_{t}\right) -\mathcal {J}\left( \mathbf {w}^{*}\right) \le \frac{t-1}{2}\mathbb {E}\left[ \left\| \mathbf {w}_{t}-\mathbf {w}^{*}\right\| ^{2}\right] \\&\quad -\,\frac{t}{2}\mathbb {E}\left[ \left\| \mathbf {w}_{t+1}-\mathbf {w}^{*}\right\| ^{2}\right] +\frac{2C^{2}R^{2}}{t} \end{aligned}$$

Taking the expectation again, we obtain

$$\begin{aligned} \mathbb {E}\left[ \mathcal {J}\left( \mathbf {w}_{t}\right) \right] -\mathcal {J}\left( \mathbf {w}^{*}\right)&\le \frac{t-1}{2}\mathbb {E}\left[ \left\| \mathbf {w}_{t}-\mathbf {w}^{*}\right\| ^{2}\right] \\&\quad -\frac{t}{2}\mathbb {E}\left[ \left\| \mathbf {w}_{t+1}-\mathbf {w}^{*}\right\| ^{2}\right] +\frac{2C^{2}R^{2}}{t} \end{aligned}$$

Summing the above inequality over \(t=1,\ldots ,T\), dividing by T, and using the convexity of \(\mathcal {J}\) together with Jensen's inequality, we obtain

$$\begin{aligned}&\frac{1}{T}\sum _{t=1}^{T}\mathbb {E}\left[ \mathcal {J}\left( \mathbf {w}_{t}\right) \right] -\mathcal {J}\left( \mathbf {w}^{*}\right) \nonumber \\&\quad \le \frac{2C^{2}R^{2}}{T}\sum _{t=1}^{T}\frac{1}{t}\le \frac{2C^{2}R^{2}\left( \log \,T+1\right) }{T} \\&\mathbb {E}\left[ \mathcal {J}\left( \overline{\mathbf {w}}_{T}\right) \right] -\mathcal {J}\left( \mathbf {w}^{*}\right) \le \frac{2C^{2}R^{2}\left( \log \,T+1\right) }{T}\nonumber \end{aligned}$$
(3)

\(\square \)

Theorem 1 states the inequality for the averaged solution in expectation. In the following theorem, we show that if we instead output a single randomly chosen iterate, the inequality itself holds with high probability.

Theorem 2

Let us consider the running of Algorithm 1. Let r be an integer randomly picked from \(\left\{ 1,2,\ldots ,T\right\} \). Given \(\delta \in \left( 0,1\right) \), with probability greater than \(1-\delta \) the following inequality holds

$$\begin{aligned} \mathcal {J}\left( \mathbf {w}_{r}\right) <\mathcal {J}\left( \mathbf {w}^{*}\right) +\frac{2R^{2}C^{2}\left( 1+\text {log}\,T\right) }{\delta T} \end{aligned}$$

Proof

Let us denote \(X=\mathcal {J}\left( \mathbf {w}_{r}\right) -\mathcal {J}\left( \mathbf {w}^{*}\right) \ge 0\). By definition of r, we have

$$\begin{aligned}&\mathbb {E}_{r}\left[ X\right] =\frac{1}{T}\sum _{t=1}^{T}\mathbb {E}\left[ \mathcal {J}\left( \mathbf {w}_{t}\right) \right] -\mathcal {J}\left( \mathbf {w}^{*}\right) \\&\mathbb {E}\left[ X\right] =\mathbb {E}_{\left( x_{n_{t}}\right) _{t=1}^{T}}\left[ \mathbb {E}_{r}\left[ X\right] \right] \le \frac{2C^{2}R^{2}\left( \log \,T+1\right) }{T} \end{aligned}$$

Using Markov's inequality, we obtain

$$\begin{aligned}&\mathbb {P}\left( X\ge \varepsilon \right) \le \frac{\mathbb {E}\left[ X\right] }{\varepsilon }\le \frac{2C^{2}R^{2}\left( \log \,T+1\right) }{\varepsilon T}\\&\mathbb {P}\left( X<\varepsilon \right) >1-\frac{2C^{2}R^{2}\left( \log \,T+1\right) }{\varepsilon T} \end{aligned}$$

Choosing \(\delta =\frac{2C^{2}R^{2}\left( \log \,T+1\right) }{\varepsilon T}\), we obtain the conclusion. \(\square \)

We now investigate the number of iterations required to obtain an \(\varepsilon \)-precision solution with probability at least \(1-\delta \). According to Theorem 2, the number of iterations T must be greater than \(T_{0}\), where \(T_{0}\) is the smallest number such that

$$\begin{aligned}&\frac{2R^{2}C^{2}\left( 1+\text {log}\,T_{0}\right) }{\delta T_{0}} \le \varepsilon \\&\frac{1+\text {log}\,T_{0}}{T_{0}} \le \frac{\varepsilon \delta }{2R^{2}C^{2}} \end{aligned}$$
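As an illustration, the smallest such \(T_{0}\) can be found numerically; the sketch below and its example constants are our own.

```python
import math

def smallest_T0(eps, delta, R, C):
    """Smallest T_0 with (1 + log T_0) / T_0 <= eps * delta / (2 R^2 C^2)."""
    target = eps * delta / (2.0 * R ** 2 * C ** 2)
    T = 1
    while (1.0 + math.log(T)) / T > target:   # left-hand side decreases in T
        T += 1
    return T

# e.g., eps = 0.1, delta = 0.05, R = 1, C = 1 gives the required number of iterations
print(smallest_T0(0.1, 0.05, 1.0, 1.0))
```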

3 Clustering assignment

After solving the optimization problem, we obtain the decision function

$$\begin{aligned} f\left( x\right) =\sum _{i=1}^{N}\alpha _{i}K\left( x_{i},x\right) -1 \end{aligned}$$

To find the equilibrium points, we need to solve the equation \(\nabla f\left( x\right) =0\). To this end, we use the fixed point technique and assume that the Gaussian kernel is used, i.e., \(K\left( x,x'\right) =e^{-\gamma \Vert x-x'\Vert ^{2}}\). We then have

$$\begin{aligned} \frac{1}{2\gamma }\nabla f\left( x\right) =\sum _{i=1}^{N}\alpha _{i}\left( x_{i}-x\right) e^{-\gamma \Vert x-x_{i}\Vert ^{2}}=0\;\Rightarrow \; x=\frac{\sum _{i=1}^{N}\alpha _{i}e^{-\gamma \Vert x-x_{i}\Vert ^{2}}x_{i}}{\sum _{i=1}^{N}\alpha _{i}e^{-\gamma \Vert x-x_{i}\Vert ^{2}}}=P\left( x\right) \end{aligned}$$

To find an equilibrium point, we start from an initial point \(x^{(0)}\in \mathbb {R}^{d}\) and iterate \(x^{(j+1)}=P\left( x^{(j)}\right) \). By the fixed point theorem, the sequence \(x^{(j)}\), which can be viewed as a trajectory starting at \(x^{(0)}\), converges to a point \(x_{*}^{(0)}\) satisfying \(P(x_{*}^{(0)})=x_{*}^{(0)}\), or equivalently \(\nabla f(x_{*}^{(0)})=0\), i.e., \(x_{*}^{(0)}\) is an equilibrium point.
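A minimal sketch of this fixed point iteration, assuming the coefficients \(\alpha \) obtained from the training phase and the Gaussian kernel; the tolerance and iteration cap are illustrative choices of our own.

```python
import numpy as np

def find_equilibrium(x0, X, alpha, gamma, tol=1e-6, max_iter=1000):
    """Iterate x <- P(x) until convergence to an equilibrium point."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        weights = alpha * np.exp(-gamma * np.sum((X - x) ** 2, axis=1))
        x_new = weights @ X / weights.sum()          # P(x)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```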

Fig. 1 Visual comparison of SGD-LMSVC (the orange region is the domain of novelty) with C-Means and Fuzzy C-Means on the two-ring dataset

Let us denote \(B_{\epsilon }=\left\{ x_{i}:1\le i\le N\,\wedge \,\vert f\left( x_{i}\right) \vert \le \epsilon \right\} \), namely the extended boundary for a tolerance \(\epsilon >0\). The set \(B_{\epsilon }\) forms a strip enclosing the decision boundary \(f\left( x\right) =0\). Algorithm 2 performs the clustering assignment. In Algorithm 2, the task of clustering assignment is reduced to the same task for the M equilibrium points. To fulfill cluster assignment for the M equilibrium points, we run the \(m=20\) sample-point test proposed in [1].

Algorithm 2 (Clustering assignment)

Our proposed clustering assignment procedure differs from the existing procedure proposed in [1]. The procedure of [1] requires running the \(m=20\) sample-point test for every edge connecting \(x_{i}\) and \(x_{j}\) (\(i\ne j\)) in the training set. Consequently, the computational cost incurred is \(\text {O}\left( N\left( N-1\right) ms\right) \), where s is the sparsity level of the decision function (i.e., the number of vectors in the model). Our proposed procedure needs to perform the \(m=20\) sample-point test only for a reduced set of M data samples (i.e., the set of equilibrium points \(\left\{ e_{1},e_{2},\ldots ,e_{M}\right\} \)), where M is typically much smaller than N. The reason is that many data points in the training set converge to a common equilibrium point, which significantly reduces the size from N to M. The computational cost incurred is therefore \(\text {O}\left( M\left( M-1\right) ms\right) \).
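For illustration, the sketch below follows the overall strategy described above (it is not a verbatim transcription of Algorithm 2): equilibrium points are connected whenever the \(m=20\) sample-point test keeps the connecting segment inside the domain of novelty, and clusters are read off as connected components. Helper names and the use of SciPy are our own choices.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def decision_function(x, X, alpha, gamma):
    """f(x) = sum_i alpha_i K(x_i, x) - 1 with the Gaussian kernel."""
    return alpha @ np.exp(-gamma * np.sum((X - x) ** 2, axis=1)) - 1.0

def same_cluster(e_i, e_j, X, alpha, gamma, m=20):
    """m sample-point test of [1]: the segment must stay inside the domain of novelty."""
    for lam in np.linspace(0.0, 1.0, m):
        if decision_function(e_i + lam * (e_j - e_i), X, alpha, gamma) < 0.0:
            return False
    return True

def assign_clusters(equilibria, X, alpha, gamma, m=20):
    """Connect equilibrium points passing the test; clusters = connected components."""
    M = len(equilibria)
    adj = np.zeros((M, M), dtype=int)
    for i in range(M):
        for j in range(i + 1, M):
            adj[i, j] = adj[j, i] = same_cluster(equilibria[i], equilibria[j],
                                                 X, alpha, gamma, m)
    _, labels = connected_components(csr_matrix(adj), directed=False)
    return labels   # labels[k] is the cluster index of equilibrium point k
```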

4 Experiments

4.1 Visual experiment

To visually demonstrate the clustering quality produced by our proposed SGD-LMSVC, we conduct experiments on three synthetic datasets and visually compare SGD-LMSVC with C-Means and Fuzzy C-Means. In the first experiment, the data samples form a nested structure with two outer rings and one Gaussian distribution at the center. As shown in Fig. 1, SGD-LMSVC can perfectly detect the three clusters without any prior information, while both C-Means and Fuzzy C-Means, with the number of clusters set to 3 beforehand, fail to discover the nested clusters. The second experiment is carried out on a two-moon dataset. As observed from Fig. 2, SGD-LMSVC, without any prior knowledge, flawlessly discovers the two moon-shaped clusters, whereas C-Means and Fuzzy C-Means cannot detect the clusters correctly. In the last visual experiment, we generate data from a mixture of 4 Gaussian distributions. As shown in Fig. 3, SGD-LMSVC can perfectly detect the 4 clusters corresponding to the individual Gaussian distributions. These visual experiments demonstrate that SGD-LMSVC is able to generate cluster boundaries of arbitrary shape and to automatically detect the appropriate number of clusters present in the data.

Fig. 2 Visual comparison of SGD-LMSVC (the orange region is the domain of novelty) with C-Means and Fuzzy C-Means on the two-moon dataset

4.2 Experiment on real datasets

To quantitatively evaluate the performance of the proposed algorithm, we conduct experiments on real datasets. Clustering is basically an unsupervised learning task and, therefore, there is no single perfect measure for comparing two clustering algorithms. We examine five typical clustering validity indices (CVI): compactness, purity, rand index, Davies–Bouldin index (DB index), and normalized mutual information (NMI). A good clustering algorithm should produce a solution with high purity, rand index, and NMI, and low compactness and DB index.

4.2.1 Clustering validity index

Compactness measures the average pairwise distances of points in the same cluster [5] and is given as follows

$$\begin{aligned} \mathrm{Compactness}\triangleq \frac{1}{N}\sum _{k=1}^{m}N_{k}\frac{\sum _{x,x'\in C_{k}}d\left( x,x'\right) }{N_{k}\left( N_{k}-1\right) /2} \end{aligned}$$

where the cluster solution consists of m clusters \(C_{1},\,C_{2},\ldots ,C_{m}\) whose cardinalities are \(N_{1},\,N_{2},\ldots ,N_{m}\), respectively.
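A direct implementation of this definition might look as follows (a sketch; `clusters` is assumed to be a list of arrays of points belonging to the same cluster):

```python
import numpy as np
from scipy.spatial.distance import pdist

def compactness(clusters):
    """Average pairwise intra-cluster distance, weighted by cluster size."""
    N = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        if len(c) > 1:
            # pdist(...).mean() averages over the N_k (N_k - 1) / 2 pairs
            total += len(c) * pdist(np.asarray(c)).mean()
    return total / N
```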

Fig. 3 SGD-LMSVC (the orange region is the domain of novelty) can recognize the clusters scattered from a mixture of four Gaussian distributions

A clustering with a small compactness is preferred. A small compactness means that the average intra-cluster distance is small and the homogeneity is thereby good, i.e., two objects in the same cluster are highly similar to each other.

The second CVI in use is purity, which measures the purity of a clustering solution with respect to the natural classes of the data [14]. Note that purity is only applicable to data with ground-truth labels. Let \(N_{ij}\) be the number of objects in cluster i that belong to class j, and let \(N_{i}\triangleq \sum _{j}N_{ij}\) be the total number of objects in cluster i. Let us define \(p_{ij}\triangleq \frac{N_{ij}}{N_{i}}\), i.e., the empirical distribution over class labels for cluster i. We define the purity of cluster i as \(p_{i}\triangleq \max _{j}\,p_{ij}\) and the overall purity of a clustering solution as

$$\begin{aligned} \mathrm{Purity}\triangleq \sum _{i}\frac{N_{i}}{N}\times p_{i} \end{aligned}$$

The purity ranges between 0 (bad) and 1 (good). This CVI reflects the classification ability of a clustering algorithm: a clustering algorithm that achieves a high purity can appropriately be used for classification purposes.
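A sketch of this computation from cluster and class label arrays (assuming non-negative integer class labels; names are illustrative):

```python
import numpy as np

def purity(cluster_labels, class_labels):
    """Overall purity: (1/N) * sum over clusters of the dominant class count."""
    N = len(cluster_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        total += np.bincount(members).max()   # max_j N_ij for cluster c
    return total / N
```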

The third CVI used is the rand index [14]. To calculate this CVI for a clustering solution, we construct a \(2\times 2\) contingency table containing the following numbers: (1) TP (true positives), the number of pairs that are in the same cluster and belong to the same class; (2) TN (true negatives), the number of pairs that are in two different clusters and belong to different classes; (3) FP (false positives), the number of pairs that are in the same cluster but belong to different classes; and (4) FN (false negatives), the number of pairs that are in two different clusters but belong to the same class. The rand index is defined as follows

$$\begin{aligned} \mathrm{Rand}\triangleq \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{TN}+\mathrm{FN}} \end{aligned}$$

This can be interpreted as the fraction of clustering decisions that are correct. Obviously, rand index ranges between 0 and 1.
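A sketch that accumulates the four contingency counts over all sample pairs (quadratic in N, for illustration only):

```python
from itertools import combinations

def rand_index(cluster_labels, class_labels):
    """Rand index = (TP + TN) / (TP + FP + TN + FN) over all sample pairs."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        if same_cluster and same_class:
            tp += 1
        elif not same_cluster and not same_class:
            tn += 1
        elif same_cluster:
            fp += 1
        else:
            fn += 1
    return (tp + tn) / (tp + tn + fp + fn)
```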

The Davies–Bouldin validity index is a function of the ratio of intra-cluster distances to inter-cluster distances [5] and is formulated as follows

$$\begin{aligned} \mathrm{DBI}\triangleq \frac{1}{m}\sum _{i=1}^{m}\underset{j\ne i}{\text {max}}\left\{ \frac{\Delta \left( C_{i}\right) +\Delta \left( C_{j}\right) }{d\left( C_{i},C_{j}\right) }\right\} \end{aligned}$$

where \(\Delta \left( C_{i}\right) \) denotes the intra-cluster distance of cluster \(C_{i}\) and \(d\left( C_{i},C_{j}\right) \) denotes the distance between clusters \(C_{i}\) and \(C_{j}\). A good clustering algorithm should produce a solution with as small a DBI as possible.
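The precise definitions of \(\Delta \) and d are not fixed above, so the sketch below uses one common centroid-based choice, taking \(\Delta (C_{i})\) as the average distance of points in \(C_{i}\) to its centroid and \(d(C_{i},C_{j})\) as the distance between centroids; this is an assumption for illustration only.

```python
import numpy as np

def davies_bouldin(clusters):
    """DB index with centroid-based intra/inter-cluster distances (one common choice)."""
    centroids = [np.mean(c, axis=0) for c in clusters]
    spread = [np.mean(np.linalg.norm(c - mu, axis=1))
              for c, mu in zip(clusters, centroids)]
    m = len(clusters)
    total = 0.0
    for i in range(m):
        total += max((spread[i] + spread[j]) /
                     np.linalg.norm(centroids[i] - centroids[j])
                     for j in range(m) if j != i)
    return total / m
```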

Table 1 The statistics of the experimental datasets
Table 2 The purity, rand index, and NMI of the clustering methods on the experimental datasets

The last considered CVI is normalized mutual information (NMI) [14]. This measure allows us to trade off the quality of the clustering against the number of clusters. It is defined as

$$\begin{aligned} \mathrm{NMI}\triangleq \frac{I\left( \Omega ,C\right) }{\left[ H\left( C\right) +H\left( \Omega \right) \right] /2} \end{aligned}$$

where \(C=\left\{ c_{1},\ldots ,c_{J}\right\} \) is the set of classes and \(\Omega =\left\{ \omega _{1},\ldots ,\omega _{K}\right\} \) is the set of clusters. \(I\left( \Omega ,C\right) \) is the mutual information and is defined as

$$\begin{aligned} I\left( \Omega ,C\right) \triangleq \sum _{k}\sum _{j}P\left( c_{j}\bigcap \omega _{k}\right) \log \frac{P\left( c_{j}\bigcap \omega _{k}\right) }{P\left( c_{j}\right) P\left( \omega _{k}\right) } \end{aligned}$$

and \(H\left( .\right) \) is the entropy and is defined as

$$\begin{aligned} H\left( \Omega \right) \triangleq -\sum _{k}P\left( \omega _{k}\right) \log P\left( \omega _{k}\right) \end{aligned}$$

The NMI ranges between 0 and 1, and a good clustering algorithm should produce an NMI as high as possible.
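A sketch computing the NMI from label arrays, using natural logarithms and empirical joint counts (names are illustrative):

```python
import numpy as np

def nmi(cluster_labels, class_labels):
    """NMI = I(Omega, C) / ((H(C) + H(Omega)) / 2)."""
    clusters, classes = np.unique(cluster_labels), np.unique(class_labels)
    # empirical joint distribution P(omega_k, c_j)
    joint = np.array([[np.mean((cluster_labels == k) & (class_labels == j))
                       for j in classes] for k in clusters])
    p_omega, p_c = joint.sum(axis=1), joint.sum(axis=0)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    mask = joint > 0
    outer = p_omega[:, None] * p_c[None, :]
    mutual_info = np.sum(joint[mask] * np.log(joint[mask] / outer[mask]))
    return mutual_info / ((entropy(p_c) + entropy(p_omega)) / 2)
```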

We perform experiments on 15 well-known clustering datasets. The statistics of the experimental datasets are given in Table 1. These datasets are fully labeled; consequently, the CVIs purity, rand index, and NMI can be fully evaluated. We compare our proposed SGD-LMSVC with the following baselines.

4.2.2 Baselines

  • Support vector clustering (SVC) [1], using LIBSVM [3] for finding the domain of novelty and a fully connected graph for clustering assignment.

  • Fast support vector clustering (FSVC) [8], which uses an equilibrium-based approach for clustering assignment.

It is noteworthy that the first phase of our proposed SGD-LMSVC is the SGD-based solution for LMOCSVM (cf. Algorithm 1) and the second phase is given in Algorithm 2. All competing methods are run on a Windows computer with a dual-core 2.6 GHz CPU and 4 GB RAM.

Table 3 The compactness and DB index of the clustering methods on the experimental datasets
Table 4 Training time in second (i.e., the time for finding domain of novelty) and clustering time in second (i.e., the time for clustering assignment) of the clustering methods on the experimental datasets

4.2.3 Hyperparameter setting

The RBF kernel, given by \(K\left( x,x'\right) =e^{-\gamma \left\| x-x'\right\| ^{2}}\), is employed. The kernel width \(\gamma \) is searched on the grid \(\left\{ 2^{-5},\,2^{-3},\,\ldots ,\,2^{3},\,2^{5}\right\} \). The trade-off parameter C is searched on the same grid. In addition, the parameters p and \(\varepsilon \) in FSVC are searched on the grid \(\left\{ 0.1,0.2,\ldots ,0.9,1\right\} \), the same as in [8]. Determining the number of iterations in Algorithm 1 in advance is challenging. Instead, we use the stopping criterion \(\left\| \mathbf {w}_{t+1}-\mathbf {w}_{t}\right\| \le \theta =0.01\), i.e., we stop when the hyperplane changes only slightly.
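With the kernel expansion \(\mathbf {w}_{t}=\sum _{i}\alpha _{i}\phi (x_{i})\), this stopping criterion can be evaluated through the Gram matrix; a small sketch under that assumption (the helper name is ours):

```python
import numpy as np

def hyperplane_change(alpha_new, alpha_old, K):
    """||w_{t+1} - w_t|| with w = sum_i alpha_i phi(x_i) and Gram matrix K."""
    d = alpha_new - alpha_old
    return np.sqrt(d @ K @ d)

# stop when hyperplane_change(alpha_new, alpha_old, K) <= 0.01
```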

We report the purity, rand index, and NMI in Table 2, the compactness and DB index in Table 3, and the training time (i.e., the time for finding the domain of novelty) and the clustering time (i.e., the time for clustering assignment) in Table 4. For each CVI, we boldface the method that yields the better outcome, i.e., the highest value for purity, rand index, and NMI and the lowest value for compactness and DB index. As shown in Tables 2 and 3, our proposed SGD-LMSVC is generally comparable with the baselines on the CVIs. In particular, SGD-LMSVC is slightly better than the others on purity, rand index, and NMI and clearly surpasses them on compactness, while being slightly worse than SVC on DB index. Regarding the time taken for training and clustering assignment, our proposed SGD-LMSVC is clearly superior to the others. For the training time, the speedup is significant on the medium- and large-scale datasets, including Shuttle, Musk, and Abalone. The speedup is especially pronounced for the clustering time.

5 Conclusion

In this paper, we have proposed a fast support-based clustering method which conjoins the advantages of SGD-based and kernel-based methods. Furthermore, we have proposed a new strategy for clustering assignment. We validated the proposed method on 15 well-known clustering datasets. The experiments show that our method achieves clustering quality comparable with the baselines while being significantly faster.