1 Introduction

Novelty detection is an important research topic in many data analytics and machine learning tasks, ranging from video security surveillance and network abnormality detection to the detection of abnormal gene expression sequences. Different from binary classification, which focuses mainly on balanced datasets, novelty detection aims to learn from an imbalanced dataset in which the majority of the data are normal and abnormal data, or outliers, constitute only a minor portion. The purpose of novelty detection is to find patterns in data that do not conform to expected behaviors [5]. To this end, a data description is constructed to capture the characteristics of normal data, and this description is subsequently used to detect abnormal data that do not fit it well. These anomalous patterns are interesting because they reveal actionable information: the known unknowns and the unknown unknowns. Notable real-world applications of novelty detection include intrusion detection [10], fraud detection (credit card fraud detection, mobile phone fraud detection, insurance claim fraud detection) [7], industrial damage detection [2], and sensor network data processing [13].

Probability density-based and neural network-based approaches have been proposed to address novelty detection problems. Another notable solution is the kernel-based approach. At its crux, data are mapped into a feature space of very high or infinite dimension via a transformation. In this space, a simple geometric shape is learned to discover the domain of novelty [16, 20, 21]. One-class support vector machine (OCSVM) [20] uses an optimal hyperplane, which separates the origin from the data samples, to distinguish abnormality from normality. The domain of novelty in this case is the positive region of the optimal hyperplane with maximal margin, i.e., the distance from the origin to the hyperplane. In another approach, support vector data description (SVDD) [21], the novelty domain of the normal data is defined as an optimal hypersphere in the feature space, which becomes a set of contours tightly covering the normal data when mapped back to the input space.

Data in real-world applications are often collected from many different data sources. Such a heterogeneous dataset requires a mixture of individual data descriptions. A mixture of experts refers to the problem of using and combining many experts (e.g., classification or regression models) for classification or prediction purposes [12]. Two challenging obstacles need to be addressed: (1) how to automatically discover the appropriate number of experts in use, i.e., how to perform model selection automatically; (2) how to train each individual expert with reference to the others.

In this paper, we present a mixture of support vector data descriptions (mSVDD) for the novelty detection task. The idea of mSVDD is to use a set of hyperspheres as a data description for normal data. In mSVDD, we combine the probabilistic framework with the kernel-based method: the model comprises two quantities, a log likelihood that controls the fit of the data to the model (i.e., the empirical error) and a regularization quantity that controls the generalization capacity of the model (i.e., the generalization error). The EM algorithm is used to train the model, and the model is guaranteed to converge gradually to an appropriate set of optimal hyperspheres that describes the data well. In the model, for each hypersphere (expert), a quantity expressing the total fit of the data to this hypersphere is utilized for model selection: we start with a maximal number of hyperspheres and step by step eliminate the redundant ones. In the experimental section, we demonstrate the advantage of mSVDD: although learning in the input space, the set of hyperspheres offered by mSVDD is able to approximate the set of contours formed by a single optimal hypersphere in the feature space. In particular, our experiments on several benchmark datasets show that mSVDD often yields comparable classification accuracy while achieving a significant speed-up compared with the baselines.

To summarize, the contribution of the paper consists of the following points:

  • We have viewed the problem of a mixture of hyperspheres from the probabilistic perspective and proposed a mixture of support vector data descriptions (mSVDD) for novelty detection. We have employed the EM algorithm to train mSVDD. The resultant model can automatically discover the appropriate number of experts in use and the weight of each expert.

  • We have conducted experiments on several benchmark datasets. The results have shown that mSVDD can learn a mixture of hyperspheres in the input space that approximately represents the set of contours generated by the traditional SVDD in the feature space. While the accuracy of mSVDD is comparable, its training time is shorter than that of the kernelized baselines, since all computations are performed directly in the input space.

  • Compared with the conference version [15], we have introduced two new strategies, i.e., approximate mSVDD (amSVDD) and probabilistic mSVDD (pmSVDD), to improve the training time and accuracy over the general mSVDD. We have also conducted more experiments to evaluate these two new strategies and to investigate the behaviors of the proposed algorithms.

2 Related work

We review the studies on mixtures of experts closely related to our work. These works can be divided into two branches: mixtures of hyperplanes and mixtures of hyperspheres. The works presented in [6, 12, 14] applied a divide-and-conquer strategy to partition the input space into many disjoint regions; in each region, a linear or nonlinear SVM can be employed to classify data. In [19], multiple hyperplanes were used for learning to rank. It also follows the divide-and-conquer strategy, and the final rank is aggregated from the ranks offered by the hyperplanes. Multiple hyperplanes were brought into play in [1, 24] for multiclass classification, where each class was characterized by a set of hyperplanes. In [9], a mixture of linear SVMs was used to simulate a nonlinear SVM. This approach has both the efficiency of a linear SVM and the power of a nonlinear SVM for classifying data. Recently, a Dirichlet process mixture of large-margin kernel machines was proposed in [27] for multi-way classification. This work conjoined the advantage of Bayesian nonparametrics in automatically discovering the underlying mixture components with that of the maximum entropy discrimination framework in integrating the large-margin principle with Bayesian posterior inference.

Yet another approach to combining linear classifiers for nonlinear classification is ensemble methods, including bagging [3], boosting [8], and random forests [4]. They implement a strong classifier by integrating many weak classifiers. However, ensemble methods tend to use a great number of base classifiers, and for a stable model like SVM the training is quite robust to data perturbation. Hence, the classifiers obtained at different stages are usually highly correlated, which increases the number of iterations required for convergence and also negatively affects the performance of the combination scheme when such classifiers are used as weak learners.

Most closely related to ours are the works of [17, 18, 25], which share with us the idea of using a set of hyperspheres as the domain of novelty. In [25], the procedure to discover a set of hyperspheres is ad hoc and heuristic and does not conform to any learning principle. The works of [17, 18] are driven by the principle of learning with minimum volume. However, the main drawback of those works is that model selection cannot be performed automatically: the number of hyperspheres in use must be declared a priori.

3 Background

3.1 Expectation maximization principle

The main task in machine learning problems is to find the optimal parameter \(\theta \) of the model given a training set D. From a probabilistic perspective, this is described as the maximization of a log likelihood function. However, estimating parameters using the maximum-likelihood principle is very hard if there are missing data or latent variables (i.e., variables that cannot be observed). The expectation maximization principle [11], EM for short, was proposed to overcome this obstacle. EM is an iterative method in which each iteration consists of an expectation step (E-step) followed by a maximization step (M-step). The basic idea of EM is described as follows.

Let \(\left\{ x_{n}\right\} _{n=1}^{N}\) be the observed data and \(\left\{ z_{n}\right\} _{n=1}^{N}\) the latent variables; the log likelihood function is given as follows:

$$\begin{aligned} l\left( \theta \right) =\sum _{n=1}^{N}\log \,p\left( x_{n}\mid \theta \right) =\sum _{n=1}^{N}\log \left[ \sum _{z_{n}}p\left( x_{n},z_{n}\mid \theta \right) \right] . \end{aligned}$$

As can be seen, the log cannot be pushed inside the sum; hence it is hard to maximize \(l\left( \theta \right) \) directly. Instead of maximizing this likelihood, EM works with the complete data log likelihood:

$$\begin{aligned} l_{c}\left( \theta \right) =\sum _{n=1}^{N}\log \,p\left( x_{n},z_{n}\mid \theta \right) . \end{aligned}$$

In the E-step, the expected complete data log likelihood \(Q\left( \theta ,\theta ^{t-1}\right) \) is computed by formulation as follows:

$$\begin{aligned} Q\left( \theta ,\theta ^{t-1}\right) =\mathbb {E}\left[ l_{c}\left( \theta \right) \mid D,\theta ^{t-1}\right] . \end{aligned}$$

Then, in the M-step, \(\theta ^{t}\) is found by maximizing the Q function w.r.t. \(\theta \):

$$\begin{aligned} \theta ^{t}=\arg \max \,Q\left( \theta ,\theta ^{t-1}\right) . \end{aligned}$$
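To make the E/M alternation concrete, the two steps can be sketched for a simple two-component 1-D Gaussian mixture. This is an illustrative example only: the function name `em_gmm_1d` and all initialization choices are our assumptions, not part of the method developed later in the paper.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """Minimal EM for a 2-component 1-D Gaussian mixture (illustrative sketch)."""
    # Crude initialization of theta = (pi, mu, sigma) from the data.
    pi = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()], dtype=float)
    sigma = np.array([x.std(), x.std()]) + 1e-6
    for _ in range(n_iter):
        # E-step: responsibilities p(z_n = j | x_n, theta^{t-1}).
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
               / (sigma * np.sqrt(2 * np.pi))
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log likelihood Q.
        nk = tau.sum(axis=0)
        pi = nk / len(x)
        mu = (tau * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((tau * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return pi, mu, sigma
```

On data drawn from two well-separated Gaussians, the recovered means approach the true component means after a few iterations.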

3.2 Weighted support vector data description

Let \(D=\left\{ \left( x_{n},y_{n}\right) \right\} _{n=1}^{N}\), where \(x_{n}\in \mathbb {R}^{d}\) and \(y_{n}\in \left\{ -1;1\right\} ,\,n=1,\ldots ,N\), be a training set including normal (labeled 1) and abnormal (labeled \(-1\)) observations. To find the domain of novelty, weighted SVDD [26] learns an optimal hypersphere that encloses all normal observations with some tolerance. Each observation \(x_{n}\) is associated with a weight \(C\lambda _{n}\); a smaller weight implies that the corresponding observation is allowed a bigger error. The optimization problem of weighted SVDD is defined as follows:

$$\begin{aligned} \begin{array}{l} \displaystyle \underset{c,R,\xi }{\text {min}}\left( R^{2}+C\sum _{n=1}^{N}\lambda _{n}\xi _{n}\right) \\ \displaystyle \text {s.t.}:\,\left\| x_{n}-c\right\| ^{2}\le R^{2}+\xi _{n},\,n=1,\ldots ,N;\,y_{n}=1 \\ \displaystyle \left\| x_{n}-c\right\| ^{2}\ge R^{2}-\xi _{n},\,n=1,\ldots ,N;\,y_{n}=-1 \\ \displaystyle \xi _{n}\ge 0,\,n=1,\ldots ,N \end{array} \end{aligned}$$
(1)

where \(R,\,c\) are the radius and center of the optimal hypersphere, respectively.

It follows that the error at the observation \(x_{n}\) is given as

$$\begin{aligned} \xi _{n}=\max \left\{ 0,\,y_{n}\left( \left\| x_{n}-c\right\| ^{2}-R^{2}\right) \right\} \end{aligned}$$

and the optimization problem in Eq. (1) can be rewritten as follows:

$$\begin{aligned} \underset{c,R}{\text {min}}\left( R^{2}+C\sum _{n=1}^{N}\lambda _{n}\max \left\{ 0,\,y_{n}\left( \left\| x_{n}-c\right\| ^{2}-R^{2}\right) \right\} \right) . \end{aligned}$$
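Since the objective above is an unconstrained function of c and \(R^{2}\), it can also be minimized directly by subgradient descent. The sketch below is illustrative only (weighted SVDD is normally solved via its dual quadratic program); the function name, step size, and initialization are our assumptions.

```python
import numpy as np

def weighted_svdd(X, y, lam, C=1.0, lr=0.01, n_iter=500):
    """Subgradient descent on the unconstrained weighted-SVDD objective
    R^2 + C * sum_n lam_n * max(0, y_n * (||x_n - c||^2 - R^2))."""
    c = X[y == 1].mean(axis=0)                            # start at the normal-data mean
    r2 = np.mean(np.sum((X[y == 1] - c) ** 2, axis=1))    # initial squared radius
    for _ in range(n_iter):
        d2 = np.sum((X - c) ** 2, axis=1)
        active = y * (d2 - r2) > 0                        # points with nonzero hinge loss
        # Subgradient w.r.t. R^2: 1 - C * sum_{active} lam_n * y_n
        g_r2 = 1.0 - C * np.sum(lam[active] * y[active])
        # Subgradient w.r.t. c: -2C * sum_{active} lam_n * y_n * (x_n - c)
        g_c = -2.0 * C * np.sum((lam[active] * y[active])[:, None] * (X[active] - c),
                                axis=0)
        r2 = max(r2 - lr * g_r2, 0.0)
        c = c - lr * g_c
    return c, np.sqrt(r2)
```

On a toy dataset with a compact normal cluster and distant abnormal points, the recovered hypersphere encloses most normal points while leaving the abnormal points outside.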

4 Mixture of support vector data descriptions

4.1 Idea of mixture of support vector data descriptions

Let us denote by \(z_{n}\in \left\{ 0;1\right\} ^{m}\) the latent variable that specifies the expert of \(x_{n}\). The latent variable \(z_{n}\) is a bit pattern of size m in which only the jth component is 1 and the others are 0, meaning that the jth expert (i.e., the jth SVDD) is used to classify the observation \(x_{n}\). The graphical model of mSVDD is shown in Fig. 1.

Fig. 1 Graphical model of a mixture of SVDD

In mSVDD, we describe normal data using a set of m hyperspheres denoted by \(\mathbb {S}_{j}\left( c_{j},R_{j}\right) ,\,j=1,\ldots ,m\). Given the jth hypersphere and the observation \(x_{n}\), the conditional probability to classify \(x_{n}\) w.r.t the jth hypersphere is given as follows:

$$\begin{aligned}&p\left( \text {normal}\mid x_{n},\mathbb {S}_{j}\right) =p\left( y_{n}=1\mid x_{n},\mathbb {S}_{j}\right) \nonumber \\&\quad =p\left( y_{n}=1\mid x_{n},z_{n}^{j}=1,\theta \right) =s\left( R_{j}^{2}-\left\| x_{n}-c_{j}\right\| ^{2}\right) \nonumber \\&p\left( \text {abnormal}\mid x_{n},\mathbb {S}_{j}\right) =p\left( y_{n}=-1\mid x_{n},\mathbb {S}_{j}\right) \nonumber \\&\quad =p\left( y_{n}=-1\mid x_{n},z_{n}^{j}=1,\theta \right) =s\left( \left\| x_{n}-c_{j}\right\| ^{2}-R_{j}^{2}\right) ,\nonumber \\ \end{aligned}$$
(2)

where \(s\left( x\right) =\frac{1}{1+e^{-x}}\) is the sigmoid function and \(\theta \) is the model parameter.

Intuitively, \(p\left( \text {normal}\mid x_{n},\mathbb {S}_{j}\right) \) is exactly \(\frac{1}{2}\) if \(x_{n}\) lies on the boundary of the hypersphere \(\mathbb {S}_{j}\); it increases toward 1 as \(x_{n}\) moves inside the hypersphere and decreases toward 0 as \(x_{n}\) moves outside the hypersphere and away from it. Because it is difficult to handle the log of \(s\left( x\right) =\frac{1}{1+e^{-x}}\), we approximate \(s\left( x\right) \) as follows:

$$\begin{aligned} s\left( x\right) =\frac{1}{1+e^{-x}}\approx e^{-\text {max}\left\{ 0;\delta -x\right\} }. \end{aligned}$$
(3)

To see visually how tight the approximation is, we plot the two functions on the same coordinate system for \(\delta =1\). As can be seen in Fig. 2, the blue line (the function \(\frac{1}{1+e^{-x}}\)) is close to the green line (the function \(e^{-\text {max}\left\{ 0;\delta -x\right\} }\)). Therefore, the function \(e^{-\text {max}\left\{ 0;\delta -x\right\} }\) is a good approximation of the sigmoid function \(s\left( x\right) \).
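The tightness of the approximation can also be checked numerically. The short script below is our own illustration (using \(\delta =1\)); it evaluates the gap between the two functions on a grid of points.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def approx(x, delta=1.0):
    # Log-friendly surrogate from Eq. (3): exp(-max(0, delta - x)).
    return np.exp(-np.maximum(0.0, delta - x))

xs = np.linspace(-6, 6, 13)
gap = np.abs(sigmoid(xs) - approx(xs))
print(gap.max())  # ~0.269 on this grid, attained near x = delta
```

The gap vanishes in both tails, where the two functions agree; the largest discrepancy occurs around \(x=\delta \), which is tolerable because the surrogate preserves the monotone, saturating shape of the sigmoid while making its logarithm piecewise linear.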

The marginal likelihood is given as: \(p\left( Y|X,\theta \right) =\prod _{n=1}^{N}p\left( y_{n}\mid x_{n},\theta \right) \), where \(X=\left[ x_{n}\right] _{n=1}^{N}\in \mathbb {R}^{d\times N}\), \(Y=\left[ y_{n}\right] _{n=1}^{N}\in \mathbb {R}^{N}\), and \(\theta =\left( \theta _{1},\theta _{2},\ldots ,\theta _{m}\right) \) encompass the parameters of all experts.

Fig. 2 Plots of the two functions on the same coordinate system when \(\delta =0\)

4.2 Optimization problem

To take into account both the general and empirical risks, the following optimization problem is proposed:

$$\begin{aligned} \underset{\theta }{\text {max}}\left( -\sum _{j=1}^{m}R_{j}^{2}+C\times \text {log}\,p\left( Y\mid X,\theta \right) \right) . \end{aligned}$$
(4)

In the optimization problem shown in Eq. (4), we minimize \(\sum _{j=1}^{m}R_{j}^{2}\) to maximize the generalization capacity of the model. Meanwhile, we maximize the log marginal likelihood to boost the fit of the model to the observed data X. The trade-off parameter C controls the proportion between the two quantities, i.e., it governs the balance between overfitting and underfitting.

To efficiently solve the optimization problem in Eq. (4), we make use of EM principle by iteratively performing two steps: E-step and M-step.

4.2.1 E-step

Given \(\mathbf {z}=\left( z_{1},z_{2},\ldots ,z_{N}\right) \), we first compute the complete data log likelihood as

$$\begin{aligned} l_{c}\left( \theta \mid D\right) =\text {log}\,p\left( Y,\mathbf {z}\mid X,\theta \right) =\sum _{n=1}^{N}\text {log}\,p\left( y_{n},\,z_{n}|x_{n},\theta \right) . \end{aligned}$$
(5)

It is clear that (cf. [15] for details)

$$\begin{aligned}&\text {log}\,p\left( y_{n},\,z_{n}\mid x_{n},\theta \right) \nonumber \\&\quad = \text {log}\left( \prod _{j=1}^{m}\left( p\left( y_{n}\mid z_{n}^{j}=1,x_{n},\theta \right) p\left( z_{n}^{j}=1\mid x_{n},\theta \right) \right) ^{z_{n}^{j}}\right) \nonumber \\&\quad = \sum _{j=1}^{m}z_{n}^{j}\left( -\xi _{j}\left( x_{n}\right) +\text {log}\,\alpha _{j}\right) , \end{aligned}$$
(6)

where \(\alpha _{j}\triangleq p\big (z_{n}^{j}=1\mid \theta \big )\) is interpreted as the mixing proportion of the \(j^{th}\) expert and satisfies \(\sum _{j=1}^{m}\alpha _{j}=1\). Because it is difficult to compute the log of \(p\big (y_{n}\mid x_{n},z_{n}^{j}=1,\theta \big )\), as mentioned before, we approximate \(p\big (y_{n}\mid x_{n},z_{n}^{j}=1,\theta \big )\) as

$$\begin{aligned}&p\left( y_{n}\mid x_{n},z_{n}^{j}=1,\theta \right) \\&\quad =e^{-\xi _{j}\left( x_{n}\right) }=e^{-\max \left\{ 0;\delta -y_{n}\left( R_{j}^{2}-\left\| x_{n}-c_{j}\right\| ^{2}\right) \right\} }. \end{aligned}$$

Therefore, we have

$$\begin{aligned} \text {log}\,p\left( y_{n},\,z_{n}\mid x_{n},\theta \right) =\sum _{j=1}^{m}z_{n}^{j}\left( -\xi _{j}\left( x_{n}\right) +\text {log}\,\alpha _{j}\right) . \end{aligned}$$

Substituting Eq. (6) in Eq. (5), we gain the following:

$$\begin{aligned} l_{c}\left( \theta \mid D\right)= & {} \sum _{n=1}^{N}\sum _{j=1}^{m}z_{n}^{j}\left( -\xi _{j}\left( x_{n}\right) +\text {log}\,\alpha _{j}\right) \\= & {} \sum _{j=1}^{m}\sum _{n=1}^{N}z_{n}^{j}\left( -\xi _{j}\left( x_{n}\right) +\text {log}\,\alpha _{j}\right) . \end{aligned}$$

To fulfill the E-step, we compute the following conditional expectation when \(\mathbf {z}\) is varied (cf. [15] for details):

$$\begin{aligned} \mathbb {E}\left( \theta \right)= & {} \left\langle -\sum _{j=1}^{m}R_{j}^{2}+Cl_{c}\left( \theta \mid D\right) \right\rangle _{z/D,\theta ^{(t)}}\\= & {} -\sum _{j=1}^{m}R_{j}^{2}+C\sum _{j=1}^{m}\sum _{n=1}^{N}\left\langle z_{n}^{j}\right\rangle _{z/D,\theta ^{(t)}}\left( -\xi _{j}\left( x_{n}\right) +\text {log}\,\alpha _{j}\right) \\= & {} -\sum _{j=1}^{m}R_{j}^{2}+C\sum _{j=1}^{m}\sum _{n=1}^{N}\tau _{jn}^{(t)}\left( -\xi _{j}\left( x_{n}\right) +\text {log}\,\alpha _{j}\right) , \end{aligned}$$

where \(\theta ^{(t)}\) is the value of the parameter \(\theta \) at the \(t^{th}\) iteration and \(\tau _{jn}^{(t)}\triangleq p\big (z_{n}^{j}=1\mid x_{n},y_{n},\theta ^{(t)}\big )\) can be computed by the Bayes formula as follows:

$$\begin{aligned} \tau _{jn}^{(t)}&\triangleq p\big (z_{n}^{j}=1\mid x_{n},y_{n},\theta ^{(t)}\big )\\&= \frac{p\left( y_{n}\mid z_{n}^{j}=1,x_{n},\theta ^{\left( t\right) }\right) \alpha _{j}^{\left( t\right) }}{\sum _{k=1}^{m}p\left( y_{n}\mid z_{n}^{k}=1,x_{n},\theta ^{\left( t\right) }\right) \alpha _{k}^{\left( t\right) }}. \end{aligned}$$
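In code, the Bayes computation of the responsibilities \(\tau _{jn}^{(t)}\) with the surrogate likelihood \(e^{-\xi _{j}\left( x_{n}\right) }\) might be sketched as follows; the function name and the (m, N) array layout are our assumptions.

```python
import numpy as np

def responsibilities(X, y, centers, radii, alpha, delta=0.0):
    """E-step: tau[j, n] = p(z_n^j = 1 | x_n, y_n, theta) via Bayes' rule,
    using the surrogate likelihood p(y_n | x_n, S_j) = exp(-xi_j(x_n))."""
    # Squared distances ||x_n - c_j||^2, shape (m, N).
    d2 = np.sum((X[None, :, :] - centers[:, None, :]) ** 2, axis=2)
    # Margin y_n * (R_j^2 - ||x_n - c_j||^2).
    margin = y[None, :] * (radii[:, None] ** 2 - d2)
    lik = np.exp(-np.maximum(0.0, delta - margin))   # p(y_n | z_n^j = 1, x_n, theta)
    num = alpha[:, None] * lik                       # numerator of the Bayes formula
    return num / num.sum(axis=0, keepdims=True)      # normalize over the m experts
```

Each column of the returned array sums to 1, and points close to a given hypersphere receive most of their mass from that expert.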

4.2.2 M-step

In this step, we need to find \(\theta =\left( \theta _{1},\theta _{2},\ldots ,\theta _{m}\right) \) where \(\theta _{j}=\left( \alpha _{j},c_{j},R_{j}\right) ,\) which maximizes \(\mathbb {E}\left( \theta \right) \). To this end, we rewrite this function as

$$\begin{aligned} \mathbb {E}\left( \theta \right)= & {} -\sum _{j=1}^{m}\left( R_{j}^{2}+\sum _{n=1}^{N}C\tau _{jn}^{(t)}\xi _{j}\left( x_{n}\right) \right) +C\sum _{j=1}^{m}\tau _{j}^{(t)}\log \,\alpha _{j}\\= & {} -E_{1}\left( \mathbf {R},\mathbf {c}\right) +CE_{2}\left( \varvec{\alpha }\right) , \end{aligned}$$

where \(\tau _{j}^{\left( t\right) }\triangleq \sum _{n=1}^{N}\tau _{jn}^{\left( t\right) }\), \(\mathbf {R}=\left[ R_{j}\right] _{j=1}^{m}\), \(\mathbf {c}=\left[ c_{j}\right] _{j=1}^{m}\), and \(\varvec{\alpha }=\left[ \alpha _{j}\right] _{j=1}^{m}\).

It follows that the two optimization problems below need to be solved. The first is

$$\begin{aligned}&\underset{\varvec{\alpha }}{\text {max}}\left( \sum _{j=1}^{m}\tau _{j}^{(t)}\log \,\alpha _{j}\right) \\&\text {s.t.}:\,\sum _{j=1}^{m}\alpha _{j}=1,\,\alpha _{j}\ge 0,\,j=1,\ldots ,m. \end{aligned}$$

The rule to update \(\varvec{\alpha }\) is given as: \(\alpha _{j}^{\left( t+1\right) }=\frac{\tau _{j}^{(t)}}{\sum _{k=1}^{m}\tau _{k}^{(t)}}\).

The second is given as:

$$\begin{aligned}&\underset{c_{j},R_{j}}{\text {min}}\left( R_{j}^{2}+\sum _{n=1}^{N}C\tau _{jn}^{(t)}\max \left\{ 0,\delta -y_{n}\left( R_{j}^{2}-\left\| x_{n}-c_{j}\right\| ^{2}\right) \right\} \right) \\&\text {where}\,j=1,\ldots ,m. \end{aligned}$$

If we choose \(\delta =0\), the second indeed splits into m independent weighted SVDD problems:

$$\begin{aligned}&\underset{c_{j},R_{j}}{\text {min}}\left( R_{j}^{2}+\sum _{n=1}^{N}C\tau _{jn}^{(t)}\max \left\{ 0,\,y_{n}\left( \left\| x_{n}-c_{j}\right\| ^{2}-R_{j}^{2}\right) \right\} \right) \,\nonumber \\&\text {where}\,j=1,\ldots ,m. \end{aligned}$$
(7)

4.3 How to do model selection in mixture of SVDDs

We start with the number of hyperspheres \(m=m_{\text {max}}\) and use the criterion \(\alpha _{j}^{(t+1)}<\lambda \), where \(\lambda >0\) is a threshold, to decide whether to eliminate the jth hypersphere. A small value of the mixing proportion \(\alpha _{j}^{(t+1)}\) of the jth hypersphere implies that \(\tau _{j}^{(t)}\) is too small compared with the others. The quantity \(\tau _{j}^{(t)}=\sum _{n=1}^{N}p(z_{n}^{j}=1\mid x_{n},y_{n},\theta ^{(t)})\) is interpreted as the sum of the relevance levels of the observations to the jth hypersphere; a small value implies that this hypersphere is redundant and has a too-small radius.
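Combining the M-step update rule \(\alpha _{j}^{(t+1)}=\tau _{j}^{(t)}/\sum _{k=1}^{m}\tau _{k}^{(t)}\) with the threshold \(\lambda \), the pruning step might be sketched as follows (the function name and array shapes are our assumptions):

```python
import numpy as np

def update_and_prune(tau, centers, radii, lam=0.05):
    """M-step update of the mixing proportions followed by the pruning
    criterion alpha_j < lam. tau has shape (m, N)."""
    tau_j = tau.sum(axis=1)                   # tau_j = sum_n tau_{jn}
    alpha = tau_j / tau_j.sum()               # alpha_j^{(t+1)}
    keep = alpha >= lam                       # drop redundant hyperspheres
    alpha = alpha[keep] / alpha[keep].sum()   # renormalize the survivors
    return alpha, centers[keep], radii[keep]
```

A hypersphere that attracts almost no responsibility mass is removed, and the remaining mixing proportions are renormalized so they still sum to 1.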

4.4 The approaches to make a decision in mSVDD

In mSVDD, a natural way to classify a new observation x is to examine whether it belongs to at least one of m hyperspheres. The decision function is given as follows:

$$\begin{aligned} f\left( x\right)= & {} \max \left\{ \text {sign}\left( R_{1}^{2}-\left\| x-c_{1}\right\| ^{2}\right) ,\ldots ,\right. \\&\left. \text {sign}\left( R_{m}^{2}-\left\| x-c_{m}\right\| ^{2}\right) \right\} \end{aligned}$$

where \(R_{j},\,c_{j}\) are radius and center of the \(j^{th}\) hypersphere, respectively.

The above decision function can be regarded as a hard way to decide normality or abnormality. To take advantage of the probabilistic perspective of mSVDD, we propose a probabilistic approach to making a decision. This approach declares an observation x normal if at least k experts (e.g., \(k=0.8m\)) classify it as normal with a confidence level over \(\rho \), that is

$$\begin{aligned} f\left( x\right)= & {} {\left\{ \begin{array}{ll} 1 &{} \text {if}\,\,\exists j:p\left( \text {normal}\mid x,\mathbb {S}_{j}\right) =1\\ 1 &{} \text {if}\,\,\left| \left\{ j:p\left( \text {normal}\mid x,\mathbb {S}_{j}\right) \ge \rho \right\} \right| \ge k\\ -1 &{} \text {otherwise} \end{array}\right. }, \end{aligned}$$

where the second case means that x is classified as normal if there are more than k experts that predict it as normal with a confidence level over \(\rho \).
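A sketch of this probabilistic decision rule is given below; it is our own illustration, and we treat the first case \(p=1\) up to a numerical tolerance, since the sigmoid never reaches 1 exactly in floating point.

```python
import numpy as np

def decide(x, centers, radii, rho=0.8, k=2):
    """Probabilistic decision rule: x is normal if one expert is (numerically)
    fully confident, or at least k experts assign p(normal) >= rho."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    # p(normal | x, S_j) = sigmoid(R_j^2 - ||x - c_j||^2); clip to avoid overflow.
    p_normal = 1.0 / (1.0 + np.exp(-np.clip(radii ** 2 - d2, -500, 500)))
    if np.any(p_normal >= 1.0 - 1e-12):          # first case: p = 1 (numerically)
        return 1
    if np.sum(p_normal >= rho) >= k:             # second case: k confident experts
        return 1
    return -1
```

A point inside several hyperspheres is voted normal, while a point far from all centers is rejected as abnormal.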

4.5 Approximate mixture of SVDDs

According to Eq. (7), the M-step requires the solutions of m weighted SVDD problems, each of which is trained over the whole training set. It is also noteworthy that the variable \(\tau _{jn}^{(t)}\) governs the influence of the data point \(x_{n}\) on the jth hypersphere: a large value of \(\tau _{jn}^{(t)}\) means that \(x_{n}\) strongly affects the determination of the jth hypersphere, whereas a small value means it affects the hypersphere only slightly. Therefore, a data point \(x_{n}\) with a small value of \(\tau _{jn}^{(t)}\) can be eliminated when training the jth hypersphere without noticeably changing the learning performance. In return, the training time of a mixture of SVDDs is improved because training each SVDD now involves only a small subset of the training set. To realize this idea, we define the parameter \(\mu \in \left( 0,1\right) \) as a threshold for eliminating the data point \(x_{n}\) from the determination of the jth hypersphere (i.e., when \(\tau _{jn}^{(t)}<\mu \)).
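The elimination step has a direct realization: before solving the jth weighted SVDD, the training set is filtered by the threshold \(\mu \). The helper below (our naming) is a minimal sketch.

```python
import numpy as np

def active_subset(tau_j, X, mu=0.2):
    """amSVDD: keep only points whose responsibility for expert j reaches
    the threshold mu; the jth weighted SVDD is then trained on this
    (typically much smaller) subset, with the surviving tau as weights."""
    mask = tau_j >= mu
    return X[mask], tau_j[mask]
```

Points with negligible responsibility for an expert are dropped from that expert's subproblem only; they still contribute fully to the experts for which their responsibility is large.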

5 Experiment

5.1 Experimental settings

We conduct the experiments over five datasets.

The statistics of the experimental datasets are given in Table 1. To form imbalanced datasets, we choose one class as the positive class and randomly select negative data samples from the remaining classes such that the proportion of positive to negative samples is 10:1. We compare our proposed mSVDD with the ball vector machine (BVM) [22], the core vector machine (CVM) [23], and support vector data description (SVDD) [21]. All code is implemented in C/C++. All experiments are run on a computer with an Intel Core i3 2.3 GHz CPU and 16 GB of RAM.

Table 1 Statistics of experimental datasets

5.2 How fast and accurate the proposed method is compared with the baselines

We conduct experiments to compare the performance of our proposed mSVDD in the input space with BVM, CVM and SVDD in the feature space with the RBF kernel, named KBVM, KCVM, and KSVDD, respectively. To demonstrate the necessity of mSVDD, we also compare mSVDD with CVM using the linear kernel (LCVM); we wish to validate that, on a mixture-model dataset, the single hypersphere offered by LCVM is not sufficiently robust for classification. We apply fivefold cross-validation to select the trade-off parameter C for KBVM, LCVM, KCVM, KSVDD, and mSVDD and the kernel width parameter \(\gamma \) for KBVM, KCVM and KSVDD. The considered ranges are \(C\in \left\{ 2^{-3},2^{-1},\ldots ,2^{29},2^{31}\right\} \) and \(\gamma \in \left\{ 2^{-3},2^{-1},\ldots ,2^{29},2^{31}\right\} \). With mSVDD and LCVM, the linear kernel given by \(K\left( x,x'\right) =\left\langle x,x'\right\rangle \) is used, whereas the kernel versions employ the RBF kernel given by \(K\left( x,x'\right) =e^{-\gamma \Vert x-x'\Vert ^{2}}\). The computational cost of the linear kernel is certainly much lower than that of the RBF kernel. We fix \(\delta =0\) in all experiments; in Sect. 5.4 we investigate the behavior of mSVDD when \(\delta \) is varied. We repeat each experiment five times and compute the means of the corresponding measures. We report the training times (cf. Table 5 and Fig. 3a), the one-class accuracy (cf. Table 2 and Fig. 3b), the negative prediction values (NPVs) (cf. Table 3 and Fig. 3c) and the \(F_{1}\) score (cf. Table 4 and Fig. 3d) corresponding to the optimal values of C and \(\gamma \).

Table 2 One-class accuracy (in %) comparison
Fig. 3 Experimental results on the datasets: a training times, b one-class accuracy, c negative prediction value, d \(F_1\) score

Table 3 Negative prediction value (in %) comparison
Table 4 \(F_{1}\) score (in %) comparison
Table 5 Training time (in s) comparison

In addition, the one-class accuracy is computed as \(acc=\frac{\%TP+\%TN}{2}=\frac{acc^{+}+acc^{-}}{2}\). This measure is appropriate for one-class classification, since high values of both \(acc^{+}\) and \(acc^{-}\) are required to produce a high one-class accuracy. The negative prediction value (NPV) is computed as \(NPV=\frac{TN}{TN+FN}\); a smaller number of false negatives (FN) produces a higher NPV. The \(F_{1}\) score is the harmonic mean of precision and recall, \(F_{1}=\frac{2\cdot precision\cdot recall}{precision+recall}\), where \(precision=\frac{TP}{TP+FP}\) and \(recall=\frac{TP}{TP+FN}\). The \(F_{1}\) score can be rewritten as \(F_{1}=\frac{2\cdot TP}{2\cdot TP+FP+FN}\), which shows that a higher \(F_{1}\) score requires small numbers of both false negatives and false positives.
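For reference, the three measures can be computed as follows; this is an illustrative sketch with labels +1 for normal and -1 for abnormal, matching the definitions above.

```python
import numpy as np

def one_class_metrics(y_true, y_pred):
    """One-class accuracy, NPV, and F1 (labels: +1 normal, -1 abnormal)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    acc_pos = tp / (tp + fn)          # %TP: recall on the normal class
    acc_neg = tn / (tn + fp)          # %TN: recall on the abnormal class
    acc = (acc_pos + acc_neg) / 2     # one-class accuracy
    npv = tn / (tn + fn)              # negative prediction value
    f1 = 2 * tp / (2 * tp + fp + fn)  # F1 score
    return acc, npv, f1
```

Note that the balanced average makes a trivial "everything is normal" predictor score only 50% one-class accuracy, which is why this measure suits imbalanced data.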

As observed from the experimental results, our proposed mSVDD is faster than all baselines except LCVM on all datasets. This is reasonable: mSVDD learns m hyperspheres in the input space and is hence slower than LCVM, which learns only one hypersphere in the input space; however, mSVDD is faster than the kernel versions because of its lower kernel computation cost. Regarding the measures of learning performance, mSVDD is comparable to or slightly below the others, especially KBVM and KCVM. The reason is that in mSVDD, each component hypersphere in the input space can simulate a contour generated by mapping the optimal hypersphere of KSVDD, KBVM, or KCVM in the feature space back into the input space. The slightly lower accuracy and NPV of mSVDD compared with the kernel versions come from the fact that the hyperspheres of mSVDD may be less flexible than the contours they represent.

To visually support this explanation, we design experiments on 2-D datasets, as displayed in Figs. 4 and 5. In Fig. 4, mSVDD recommends two hyperspheres in the input space to approximate the two contours formed by a single hypersphere in the feature space. In Fig. 5, mSVDD offers three simple hyperspheres which, as can be seen visually, sufficiently simulate the set of contours formed by a single hypersphere in the feature space. The results of LCVM are worse in all cases because it learns a single hypersphere in the input space, which cannot sufficiently represent the contours formed by a single hypersphere in the feature space.

Fig. 4 Comparison of mSVDD with the linear kernel, SVDD with the RBF kernel, and CVM with the linear kernel

Fig. 5 Comparison of mSVDD with the linear kernel, SVDD with the RBF kernel, and CVM with the linear kernel

5.3 How the approximate and probabilistic approaches improve mSVDD

We empirically compare mSVDD with its two variants, the approximate approach (amSVDD) and the probabilistic approach (pmSVDD). In amSVDD, we use the threshold \(\mu =0.8\) to eliminate data points and thereby reduce the training size of each SVDD problem. pmSVDD applies the probabilistic perspective and allows the experts to vote when predicting new data; we set the parameter k to 2 and the parameter \(\rho \) to 0.8. Tables 6 and 7 and Fig. 6 summarize the performance measures of mSVDD, amSVDD, and pmSVDD. The results show that pmSVDD achieves higher accuracies than mSVDD on all datasets except mushroom, demonstrating that the probabilistic voting scheme really improves mSVDD. For amSVDD, its training time is less than the others' on all datasets except a9a, which clearly comes from the reduction in the training size of each SVDD problem at each step. This allows amSVDD to run faster than the others while still preserving accuracy.

Fig. 6 Experimental results of the datasets on cross-validation: a training times, b one-class accuracy, c negative prediction value, d \(F_1\) score

Table 6 One-class accuracy (in %) and training time (in s) comparison
Table 7 Negative prediction value (in %) and \(F_{1}\) score (in %) comparison

5.4 How variation of the parameter \(\delta \) influences accuracy

To analyze how the parameter \(\delta \) affects the one-class accuracy and the negative prediction value, we conduct an experiment in which \(\delta \) is varied while the other parameters are kept fixed. As observed from Tables 8 and 9, when \(\delta \) increases, the accuracy first rises to its peak and then gradually decreases. This can be partially explained by the fact that altering \(\delta \) also changes the probability with which the observed data are classified with respect to a particular hypersphere. When \(\delta \) decreases, each component hypersphere tends to absorb more data outside it and consequently becomes bigger; therefore, some abnormal data points can be misclassified. Conversely, when \(\delta \) increases, each component hypersphere tends to be smaller and does not give a very high probability to data located inside the hypersphere near the boundary. This prevents the model from misclassifying abnormal data points but has the side effect of increasing the misclassification of normal data.

Table 8 The one-class accuracy when parameter \(\delta \) is varied
Table 9 The negative prediction value when parameter \(\delta \) is varied

To visually manifest this behavior, we also provide a simulation study on 2-D datasets, as displayed in Fig. 7. When we set \(\delta =-0.5\) and start with 3 hyperspheres, the hyperspheres tend to become bigger and, as a result, two of them overlap each other. When we increase \(\delta \) to 1, we obtain a better solution, which yields \(90\,\%\) one-class accuracy. However, if we continue to increase \(\delta \) to 12, the hyperspheres become too small to describe the normal data, and the one-class accuracy declines to \(88.75\,\%\).

Fig. 7 Behavior of mSVDD when parameter \(\delta \) is varied

6 Conclusion

Leveraging the expectation maximization principle, we propose a mixture of support vector data descriptions for the one-class classification, or novelty detection, problem. Instead of learning the optimal hypersphere in the feature space, which involves costly RBF kernel computation, mSVDD learns a set of hyperspheres in the input space. Each component hypersphere is able to simulate a contour generated by mapping the optimal hypersphere in the feature space back into the input space. The experiments on benchmark datasets show that mSVDD obtains shorter training times while achieving comparable one-class classification accuracies compared with the methods operating in the feature space.