Introduction

In many real-world applications, a test sample may come from an unknown class that is not present in the training set. Such samples can be regarded as novelties or anomalies with respect to the known classes, since they lie far away from the distributions of the known classes [1, 2]. This problem is termed novelty detection or anomaly detection. When there is more than one known class, it is called multi-class supervised novelty detection [3, 4]. Novelty detection is widely used in the pattern recognition community. For instance, traffic police want to find illegal traffic flows [5], ophthalmologists want to detect retinal damage [6], cyber security experts need to monitor cyber-intrusions among massive visits [7, 8], and engineers need to analyze big data in the Internet of Things (IoT) [9]; further applications include detecting abrupt changes in air temperature [10], unknown pixels in hyperspectral images [11], and medical ultrasound image analysis [12, 13], to name just a few. During the past several decades, work on supervised novelty detection has mainly focused on training sets that contain only one known class [14, 15], which is called one-class classification [16, 17]. A one-class classifier can only tell us whether a test sample is normal or not [18, 19]. When the training set contains more than one known class, one must either treat all known classes as a single superclass [20] or learn several one-class classifiers [21]. In the former case, a multi-class classifier is still needed to tell us which class a test sample comes from if it is not a novelty.

The kernel null Foley–Sammon transform (KNFST) [3] can handle one-class classification as well as multi-class supervised novelty detection. For multi-class supervised novelty detection, KNFST learns a single model that can tell whether a test sample is a novelty and, if it is a normal sample, which class it comes from. KNFST maps the samples of each class to a single point in a reproducing kernel Hilbert space (RKHS) via null projection directions. Let c denote the number of classes. The label of a test sample is then determined by the minimum distance to the c mapped points; if the distances to all mapped points are very large, the sample is a novelty. However, KNFST only captures global information and neglects the local geometrical structure. It may fail when the local geometrical structure is complex, an issue that also arises in classification [22] and ordinal regression [23]. To address this issue, we propose a manifold learning-novelty detection (MLND) method in which a manifold graph is introduced to regularize the within-class and total scatter matrices, respectively. The manifold graph is used to depict the local geometrical structure. The experimental results demonstrate that MLND is superior to KNFST on several toy and benchmark datasets. The main contributions are summarized in the following three points.

  • First, we introduce a manifold into the within-class scatter and the total scatter to depict the local structure within each class for the Foley–Sammon transform (or Fisher discriminant analysis).

  • Second, a new criterion for the projection directions is proposed, which requires the regularized within-class scatter to be zero and the regularized total scatter to be greater than zero in the projected space.

  • Third, the manifold regularized Foley–Sammon transform is used as a novelty detection method and evaluated on several toy and benchmark datasets.

The rest of this paper is organized as follows. A brief review of supervised novelty detection and the kernel null Foley–Sammon transform (KNFST) is provided in “Related work”. A manifold regularized null Foley–Sammon transform (NFST) is proposed in “Manifold regularized NFST”. In “Experiments and simulations”, we evaluate manifold learning-novelty detection (MLND) on two toy datasets, eight benchmark datasets, and the Gestures dataset. The last section is “Discussion and conclusion”.

Related work

A review of supervised novelty detection

Supervised novelty detection predicts whether a test sample comes from an unknown class by learning a model from a training set consisting of a large number of labelled samples. When the labelled samples follow the i.i.d. assumption, they can be regarded as coming from a single class, and supervised novelty detection reduces to a one-class classification problem [2]. For instance, Schölkopf et al. [16] proposed to find a decision hyperplane that maximizes the margin between the samples in a reproducing kernel Hilbert space (RKHS) and the origin; Tax and Duin [17] proposed to find a hypersphere that encloses most of the training samples with minimum volume; Ruff et al. [24] proposed a one-class classifier based on deep learning; and Iosifidis et al. [25] used extreme learning machines for one-class classification, to name just a few.

When the labelled samples come from a mixture of distributions, they belong to several known classes, and the task becomes a multi-class supervised novelty detection problem [3]. Compared with multi-class classification, multi-class supervised novelty detection can identify whether a test sample is from an unknown class and, if it comes from a known class, which class it belongs to. One way to solve multi-class supervised novelty detection is to treat all known classes as a superclass and learn a one-class classifier to detect whether a test sample is from an unknown class. If the test sample is not from an unknown class, a multi-class classifier is then trained to predict which class it comes from [20]. Obviously, this approach needs two models: a one-class classifier and a multi-class classifier. Additionally, the one-class classifier is affected by the complex structure of the superclass. Another way is to train several one-class classifiers, each associated with one known class [21]. Training several models raises many issues, such as longer training time and more parameters to tune. To address these issues, some researchers proposed to learn a single model that can simultaneously identify whether a test sample is from an unknown class and, if it is from a known class, which class it comes from. For instance, Bodesheim et al. [3] proposed a multi-class supervised novelty detection method in which the samples of each class are mapped to a single point in a reproducing kernel Hilbert space (RKHS) via null projection directions (NPDs). Zhang et al. [26] proposed a semi-supervised version of KNFST and used it for person re-identification. Liu et al. [27] proposed a kernel null space discriminant analysis for incremental supervised novelty detection. Huang et al. [28] used an incremental KNFST for person re-identification. Ali and Chaudhuri [29] combined maximum margin metric learning with the null space for supervised novelty detection. What these methods have in common is that they all adopt the null space technique. However, the null space only considers global information and neglects the local geometrical structure. To solve this issue, we propose a manifold learning-based supervised novelty detection method in which the local geometrical structure is depicted by a manifold.

Recap of Kernel null Foley–Sammon transform (KNFST)

The Foley–Sammon transform, also called the Fisher transform or linear discriminant analysis (LDA), maximizes the between-class scatter and minimizes the within-class scatter simultaneously. Let \(\mathbf {X}_j,j=1,\dots ,c\) denote the set of samples belonging to class j; let \(\mathbf {S}_w\), \(\mathbf {S}_b\), and \(\mathbf {S}_t\) denote the within-class, between-class, and total scatter matrices, respectively; and let \(\varvec{\varphi }\in \mathbb {R}^D\) be one direction of the discriminant subspace. The Fisher discriminant criterion is written as follows.

$$\begin{aligned} J(\varvec{\varphi })=\frac{\varvec{\varphi }^T\mathbf {S}_b\varvec{\varphi }}{\varvec{\varphi }^T\mathbf {S}_w\varvec{\varphi }}. \end{aligned}$$
(1)

Maximizing Eq. (1) can be done via solving a generalized eigenvalue problem as follows:

$$\begin{aligned} \mathbf {S}_b\varvec{\varphi }=\lambda \mathbf {S}_w\varvec{\varphi }. \end{aligned}$$
(2)

The eigenvectors \(\varvec{\varphi }_1,\dots ,\varvec{\varphi }_k\) associated with the k largest eigenvalues \(\lambda _1,\dots ,\lambda _k\) are selected as the discriminant directions.
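As a concrete illustration, the following is a minimal NumPy/SciPy sketch of Eq. (2). It assumes a \(D\times N\) data matrix whose columns are samples and a length-N label vector; the small ridge added to \(\mathbf {S}_w\) is only a numerical convenience and is not part of the original formulation.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y, k, ridge=1e-6):
    """Solve the generalized eigenproblem S_b phi = lambda S_w phi (Eq. 2).
    X: D x N data matrix (columns are samples), y: length-N labels."""
    D, N = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Sw = np.zeros((D, D))
    Sb = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[:, y == c]
        mu_c = Xc.mean(axis=1, keepdims=True)
        Sw += (Xc - mu_c) @ (Xc - mu_c).T
        Sb += Xc.shape[1] * (mu_c - mu) @ (mu_c - mu).T
    # a small ridge keeps S_w positive definite for the generalized solver
    evals, evecs = eigh(Sb, Sw + ridge * np.eye(D))
    order = np.argsort(evals)[::-1]          # largest eigenvalues first
    return evecs[:, order[:k]]               # k discriminant directions
```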

In the null Foley–Sammon transform (NFST), a direction should make the within-class scatter zero and the between-class scatter positive. Therefore, Eq. (1) becomes \(J(\varvec{\varphi })=\infty \), which gives the best separability. The solution of NFST should satisfy

$$\begin{aligned} \varvec{\varphi }^T\mathbf {S}_w\varvec{\varphi }=0 \quad \mathrm{{and}} \quad \varvec{\varphi }^T\mathbf {S}_b\varvec{\varphi }>0. \end{aligned}$$
(3)

The directions satisfying Eq. (3) are called null projection directions (NPDs). Eq. (3) is equivalent to

$$\begin{aligned} \varvec{\varphi }^T\mathbf {S}_w\varvec{\varphi }=0 \quad \mathrm{{and}} \quad \varvec{\varphi }^T\mathbf {S}_t\varvec{\varphi }>0. \end{aligned}$$
(4)

Here, \(\mathbf {S}_t\) is the total scatter, \(\mathbf {S}_t=\mathbf {S}_w+\mathbf {S}_b\). The samples in the same class are mapped to a single point due to \(\varvec{\varphi }^T\mathbf {S}_w\varvec{\varphi }=0\). An illustration of NFST is shown in Fig. 1.

Fig. 1

An illustration of NFST. The stars, x-marks, and pluses belong to class 1, class 2, and class 3, respectively. The left sub-figure shows the original space and the right sub-figure shows the null space. The samples from the same class in the original space are mapped to a single point in the null space

Figure 1 illustrates a three-class problem. Here, c1, c2, and c3 are the mapped points of class 1, class 2, and class 3 in the null space. For a test sample, the associated label is decided by the minimum distance to the points c1, c2, and c3. If the test sample is far away from all of these points, it comes from an unknown class.

In both FST and NFST, the within-class scatter and total scatter only capture global information and neglect the local geometrical structure. In this paper, we adopt a manifold to regularize the within-class scatter and the total scatter in order to describe the local structure within each class.

Manifold regularized NFST

Manifold learning for novelty detection

Manifold learning assumes that if two data points are close in the original distribution, they are also close in the projected subspace. In this paper, we use neighborhood preserving embedding (NPE) to describe the local geometrical structure in the within-class scatter and the total scatter. We then propose a regularized within-class scatter and a regularized total scatter to replace the within-class scatter and the total scatter, respectively.

Definition 1

(Regularized within-class scatter) Given a dataset \(\mathbf {X}\in \mathbb {R}^{N\times D}\) with associated labels \(\mathbf {Y}\) (\(y_i\in \{1,\dots ,c\}\)), let \(\mathbf {X}_j\) consist of all samples belonging to class j. The regularized within-class scatter is defined as

$$\begin{aligned} \mathbf {S}_\mathrm{{wreg}}&=\sum \limits _{i=1}^N\left( \alpha \left( \mathbf {x}_i-\varvec{\mu }_j \right) +\left( 1-\alpha \right) \left( \mathbf {x}_i-\sum \limits _{p=1}^N W_{i,p}\mathbf {x}_p\right) \right) \nonumber \\&\quad \times \left( \alpha \left( \mathbf {x}_i-\varvec{\mu }_j \right) +\left( 1-\alpha \right) \left( \mathbf {x}_i-\sum \limits _{p=1}^N W_{i,p}\mathbf {x}_p\right) \right) ^T. \end{aligned}$$
(5)

Here, \(\varvec{\mu }_j\) is the mean of the class that \(\mathbf {x}_i\) belongs to, and \(\mathbf {W}\) is an adjacency graph. If \(\mathbf {x}_p\) is one of the k-nearest neighbors of \(\mathbf {x}_i\) and has the same label as \(\mathbf {x}_i\), there is an edge between \(\mathbf {x}_i\) and \(\mathbf {x}_p\) (\(W_{i,p}\ne 0\)); otherwise, \(W_{i,p}=0\).

Eq. (5) can be rewritten as follows:

$$\begin{aligned} \mathbf {S}_\mathrm{{wreg}}&=\sum \limits _{i=1}^N\left( \mathbf {x}_i-\alpha \dfrac{1}{N_j}\sum \limits _{\mathbf {x}_p\in \mathbf {X}_j}\mathbf {x}_p - \left( 1-\alpha \right) \sum \limits _{p=1}^N W_{i,p}\mathbf {x}_p\right) \nonumber \\&\quad \times \left( \mathbf {x}_i-\alpha \dfrac{1}{N_j}\sum \limits _{\mathbf {x}_p\in \mathbf {X}_j}\mathbf {x}_p - \left( 1-\alpha \right) \sum \limits _{p=1}^N W_{i,p}\mathbf {x}_p\right) ^T. \end{aligned}$$
(6)

Let \(\mathbf {I}\) be an \(N\times N\) identity matrix and \(\mathbf {L}\) be a block diagonal matrix whose j-th block has size \(N_j\times N_j\) with all elements equal to \(\dfrac{1}{N_j}\). Then \(\mathbf {S}_\mathrm{{wreg}}=\mathbf {X}\left( \alpha \left( \mathbf {I}-\mathbf {L}\right) +\left( 1-\alpha \right) \left( \mathbf {I}-\mathbf {W}\right) \right) \left( \alpha \left( \mathbf {I}-\mathbf {L}\right) +\left( 1-\alpha \right) \left( \mathbf {I}-\mathbf {W}\right) \right) ^T\mathbf {X}^T=\mathbf {X}\left( \mathbf {I}-\left( \alpha \mathbf {L}+\left( 1-\alpha \right) \mathbf {W}\right) \right) \left( \mathbf {I}-\left( \alpha \mathbf {L}+\left( 1-\alpha \right) \mathbf {W}\right) \right) ^T\mathbf {X}^T\) holds.

Let \(\mathbf {X}_w=\mathbf {X}\left( \mathbf {I}-\left( \alpha \mathbf {L}+\left( 1-\alpha \right) \mathbf {W}\right) \right) \). Then, the regularized within-class scatter is rewritten as \(\mathbf {S}_\mathrm{{wreg}}=\mathbf {X}_w\mathbf {X}_w^T\).
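For illustration, here is a minimal NumPy sketch of the matrix form above. It assumes the columns-as-samples convention \(\mathbf {X}\in \mathbb {R}^{D\times N}\) (so that \(\mathbf {X}_w\mathbf {X}_w^T\) is \(D\times D\)) and that row i of the weight matrix \(\mathbf {W}\) holds the reconstruction weights of \(\mathbf {x}_i\), hence the transpose on \(\mathbf {W}\).

```python
import numpy as np

def regularized_within_scatter(X, y, W, alpha):
    """S_wreg = X_w X_w^T with X_w = X (I - (alpha*L + (1-alpha)*W)), cf. Definition 1.
    X: D x N matrix (columns are samples), y: length-N labels,
    W: N x N NPE weight matrix whose row i reconstructs x_i."""
    D, N = X.shape
    L = np.zeros((N, N))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        L[np.ix_(idx, idx)] = 1.0 / len(idx)         # entries 1/N_j within class j
    M = np.eye(N) - (alpha * L + (1 - alpha) * W.T)  # W.T matches sum_p W[i,p] x_p
    Xw = X @ M
    return Xw @ Xw.T, Xw
```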

The weights are computed by minimizing the following objective function.

$$\begin{aligned} \min \quad \left\| \mathbf {x}_i-\sum \limits _{j=1}^N W_{i,j}\mathbf {x}_j\right\| ^2 \quad \mathrm{{s.t.}} \quad \sum \limits _{j=1}^N W_{i,j}=1,\quad i=1,\dots ,N. \end{aligned}$$
(7)

The term \( \sum \nolimits _{\mathbf {x}_j\in KNN(\mathbf {x}_i),\, c(\mathbf {x}_j)= c(\mathbf {x}_i)} W_{i,j}\mathbf {x}_j\) can be regarded as the weighted mean of the k-nearest neighbors of \(\mathbf {x}_i\). The details of solving Eq. (7) can be found in [30].
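A minimal sketch of this weight computation is given below, following the standard constrained least-squares solution of the local reconstruction problem described in [30]. The same-class k-nearest-neighbor restriction and the columns-as-samples convention are carried over from the sketch above; the small regularization of the local Gram matrix is a common numerical safeguard and is not part of Eq. (7).

```python
import numpy as np

def npe_weights(X, y, k, reg=1e-3):
    """Reconstruction weights of Eq. (7): each sample is approximated by its
    k nearest neighbors of the same class, with weights summing to one.
    X: D x N (columns are samples); returns the N x N matrix W."""
    D, N = X.shape
    W = np.zeros((N, N))
    for i in range(N):
        same = np.where(y == y[i])[0]
        same = same[same != i]
        d = np.linalg.norm(X[:, same] - X[:, [i]], axis=0)
        nbrs = same[np.argsort(d)[:k]]
        Z = X[:, nbrs] - X[:, [i]]                    # shift neighbors to the origin
        G = Z.T @ Z
        G += reg * np.trace(G) * np.eye(len(nbrs))    # regularize the local Gram matrix
        w = np.linalg.solve(G, np.ones(len(nbrs)))
        W[i, nbrs] = w / w.sum()                      # enforce the sum-to-one constraint
    return W
```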

Definition 2

(Regularized total scatter) Given a dataset \(\mathbf {X}\in \mathbb {R}^{N\times D}\). The associated label is \(\mathbf {Y}\) (\(y_i\in \{1,\dots ,c\}\)). \(\mathbf {X}_j\) consists of all samples belonging to class j. The regularized total scatter is defined as

$$\begin{aligned}&\mathbf {S}_\mathrm{{treg}}=\sum \limits _{i=1}^N\left( \beta \left( \mathbf {x}_i-\varvec{\mu }\right) +\left( 1-\beta \right) \left( \mathbf {x}_i-\varvec{\mu }'\right) \right) \nonumber \\&\quad \left( \beta \left( \mathbf {x}_i-\varvec{\mu }\right) +\left( 1-\beta \right) \left( \mathbf {x}_i-\varvec{\mu }'\right) \right) ^T \end{aligned}$$
(8)

Here, \(\varvec{\mu }\) is the mean of all training samples, and \(\varvec{\mu }'\) and \(\varvec{\mu }'_j\) are defined in Eqs. (9) and (10), respectively.

$$\begin{aligned}&\varvec{\mu }'=\frac{1}{N_c}\sum \limits _{j=1}^{N_c}\varvec{\mu }'_j. \end{aligned}$$
(9)
$$\begin{aligned}&\varvec{\mu }'_j=\frac{1}{N_j}\sum \limits _{\mathbf {x}_i \in C_j}\sum \limits _{\mathbf {x}_h\in KNN(\mathbf {x}_i)} W_{i,h}\mathbf {x}_h. \end{aligned}$$
(10)

Eq. (8) can be rewritten as follows:

$$\begin{aligned}&\mathbf {S}_\mathrm{{treg}}=\sum \limits _{i=1}^N\left( \mathbf {x}_i-\left( \beta \varvec{\mu }+\left( 1-\beta \right) \varvec{\mu }'\right) \right) \nonumber \\&\quad \left( \mathbf {x}_i-\left( \beta \varvec{\mu }+\left( 1-\beta \right) \varvec{\mu }'\right) \right) ^T \end{aligned}$$
(11)
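The following is a minimal NumPy sketch of Eqs. (8)–(11), under the same columns-as-samples and row-weight conventions as the earlier sketches; the class-wise averaging of Eqs. (9) and (10) is implemented literally.

```python
import numpy as np

def regularized_total_scatter(X, y, W, beta):
    """S_treg of Eq. (11): centre every sample on beta*mu + (1-beta)*mu',
    where mu is the global mean and mu' averages the per-class means of the
    NPE-reconstructed samples (Eqs. 9 and 10). X: D x N, W: N x N weights."""
    D, N = X.shape
    mu = X.mean(axis=1)
    recon = X @ W.T                               # column i holds sum_h W[i,h] x_h
    class_means = [recon[:, y == c].mean(axis=1) for c in np.unique(y)]
    mu_prime = np.mean(class_means, axis=0)       # Eq. (9)
    centre = beta * mu + (1 - beta) * mu_prime
    Xt = X - centre[:, None]                      # Eq. (11) in matrix form
    return Xt @ Xt.T, Xt
```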

The NPDs for regularized within-class scatter and regularized total scatter satisfy the following conditions.

$$\begin{aligned} \varvec{\varphi }^T\mathbf {S}_\mathrm{{wreg}}\varvec{\varphi }=0 \quad \mathrm{{and}} \quad \varvec{\varphi }^T\mathbf {S}_\mathrm{{treg}}\varvec{\varphi }>0. \end{aligned}$$
(12)

Since \(\mathbf {S}_\mathrm{{wreg}}\) is a positive semidefinite matrix and can be represented as \(\mathbf {S}_\mathrm{{wreg}}=\mathbf {X}_w\mathbf {X}_w^T\), we obtain \({{\varvec{\varphi }}^{T}}{{\mathbf {X}}_{w}}\mathbf {X}_{w}^{T}\varvec{\varphi }=0\Rightarrow {{({{\mathbf {X}_w}^{T}}\varvec{\varphi })}^{T}}({{\mathbf {X}_w}^{T}}\varvec{\varphi })=0\Rightarrow {{\mathbf {X}_w}^{T}}\varvec{\varphi }=0\). Multiplying both sides on the left by \({{\mathbf {X}_w}}\) gives \({{\mathbf {X}}_{w}}{{\mathbf {X}}_{w}}^{T}\varvec{\varphi }=0\Rightarrow \mathbf {S}{_\mathrm{{wreg}}}\varvec{\varphi }=0\).

Conversely, any solution of \({{\mathbf {S}}_\mathrm{{wreg}}}\varvec{\varphi }=0\) clearly satisfies \({{\varvec{\varphi }}^{T}}\mathbf {S}{_\mathrm{{wreg}}}\varvec{\varphi }=0\), i.e., \(\mathbf {S}{_\mathrm{{wreg}}}\varvec{\varphi }=0\Rightarrow {{\varvec{\varphi }}^{T}}\mathbf {S}{_\mathrm{{wreg}}}\varvec{\varphi }=0\).

Combining the two directions, we obtain \({{\varvec{\varphi }}^{T}}\mathbf {S}{_\mathrm{{wreg}}}\varvec{\varphi }=0\Leftrightarrow \mathbf {S}{_\mathrm{{wreg}}}\varvec{\varphi }=0\).

Let \(\mathbf {Z}_w=\left\{ \mathbf {z}\vert \mathbf {S}_\mathrm{{wreg}}\mathbf {z}=0\right\} \) be the null space of \(\mathbf {S}_\mathrm{{wreg}}\), \(\mathbf {Z}_t=\left\{ \mathbf {z}\vert \mathbf {S}_\mathrm{{treg}}\mathbf {z}=0\right\} \) be the null space of \(\mathbf {S}_\mathrm{{treg}}\), and \(\mathbf {Z}_t^\perp \) be the orthogonal complement space of \(\mathbf {Z}_t\). The NPDs satisfy

$$\begin{aligned} \varvec{\varphi }\in \mathbf {Z}_t^\perp \cap \mathbf {Z}_w. \end{aligned}$$
(13)

In order to ensure \(\varvec{\varphi }\in \mathbf {Z}_t^\perp \) , each \(\varvec{\varphi }\) can be represented as

$$\begin{aligned} \varvec{\varphi }=\gamma _1\varvec{\theta }_1+\gamma _2\varvec{\theta }_2+\dots +\gamma _m\varvec{\theta }_m=\mathbf {Q}\varvec{\gamma }. \end{aligned}$$
(14)

Here, \(\mathbf {Q}=(\varvec{\theta }_1,\varvec{\theta }_2,\dots ,\varvec{\theta }_m)\) and \(\varvec{\gamma }=(\gamma _1,\gamma _2,\dots ,\gamma _m)\). Since \(\mathbf {S}_\mathrm{{treg}}\) is positive semidefinite, the set of directions satisfying \(\varvec{\varphi }^T\mathbf {S}_\mathrm{{treg}}\varvec{\varphi }>0\) is exactly \(\mathbf {Z}_t^\perp \) (\(\mathbf {Z}_t=\left\{ \mathbf {z}\vert \mathbf {S}_\mathrm{{treg}}\mathbf {z}=0\right\} \)). Let \(\varvec{\theta }_1,\varvec{\theta }_2,\dots ,\varvec{\theta }_m\) be a basis of the subspace spanned by \(\mathbf {x}_1-(\beta \varvec{\mu }+(1-\beta )\varvec{\mu }'),\dots ,\mathbf {x}_N-(\beta \varvec{\mu }+(1-\beta )\varvec{\mu }')\), which can be obtained by principal component analysis (PCA).

Substituting Eq. (14) into \(\varvec{\varphi }^T\mathbf {S}_\mathrm{{wreg}}\varvec{\varphi }=0\) yields \((\mathbf {Q\gamma })^T\mathbf {S}_\mathrm{{wreg}}(\mathbf {Q\gamma })=0\). It is equivalent to the following eigenproblem.

$$\begin{aligned} \left( \mathbf {Q}^T\mathbf {S}_\mathrm{{wreg}}\mathbf {Q}\right) \varvec{\gamma }=0. \end{aligned}$$
(15)

Due to \(\mathbf {X}_w=\mathbf {X}\left( \mathbf {I}-\left( \alpha \mathbf {L}+\left( 1-\alpha \right) \mathbf {W}\right) \right) \) and \(\mathbf {S}_\mathrm{{wreg}}=\mathbf {X}_w\mathbf {X}_w^T\), Eq. (15) can be rewritten as follows.

$$\begin{aligned} \mathbf {H}\mathbf {H}^T\varvec{\gamma }=0. \end{aligned}$$
(16)

Here, \(\mathbf {H}=\mathbf {Q}^T\mathbf {X}_w=\mathbf {Q}^T\mathbf {X}(\mathbf {I}-(\alpha \mathbf {L}+(1-\alpha )\mathbf {W}))\). Eq. (16) is an eigenproblem. After solving Eq. (16) for \(\varvec{\gamma }_1,\varvec{\gamma }_2,\dots ,\varvec{\gamma }_l\), the null projection directions \(\varvec{\varphi }_1,\varvec{\varphi }_2,\dots ,\varvec{\varphi }_l\) can be obtained from Eq. (14).

The vectors \(\varvec{\gamma }_1,\varvec{\gamma }_2,\dots ,\varvec{\gamma }_l\) are associated with different eigenvalues \(\lambda _1,\lambda _2,\dots ,\lambda _l\) (\(\lambda _i\ne \lambda _j\)). From Eq. (16), we can obtain \(\varvec{\gamma }_i^T\mathbf {H}\mathbf {H}^T=\lambda _i\varvec{\gamma }_i^T,\varvec{\gamma }_j^T\mathbf {H}\mathbf {H}^T=\lambda _j\varvec{\gamma }_j^T\Rightarrow \varvec{\gamma }_i^T\mathbf {H}\mathbf {H}^T\varvec{\gamma }_j=\lambda _i\varvec{\gamma }_i^T\varvec{\gamma }_j,\varvec{\gamma }_j^T\mathbf {H}\mathbf {H}^T\varvec{\gamma }_i=\lambda _j\varvec{\gamma }_j^T\varvec{\gamma }_i \Rightarrow 0=\left( \lambda _i-\lambda _j\right) \varvec{\gamma }_i^T\varvec{\gamma }_j\). Due to \(\lambda _i\ne \lambda _j\), \(\varvec{\gamma }_i^T\varvec{\gamma }_j=0\) holds. Therefore, \(\varvec{\gamma }_i\) and \(\varvec{\gamma }_j\) are orthogonal.

In Eq. (14), the matrix \(\mathbf {Q}\) is obtained from PCA, so its column vectors are orthonormal. We can obtain \(\varvec{\varphi }_i^T\varvec{\varphi }_j=\left( \mathbf {Q}\varvec{\gamma }_i\right) ^T\left( \mathbf {Q}\varvec{\gamma }_j\right) =\varvec{\gamma }_i^T\mathbf {Q}^T\mathbf {Q}\varvec{\gamma }_j=\varvec{\gamma }_i^T\varvec{\gamma }_j=0\). Therefore, the directions obtained from Eq. (14) are orthogonal as well.

Let \(\mathbf {P}\) be a matrix whose columns are null projection directions \(\varvec{\varphi }_1,\dots ,\varvec{\varphi }_l\) (\(l<N\)). A test sample \(\mathbf {x}\) is mapped into null space via Eq. (17) and scored via Eq. (18).

$$\begin{aligned}&\mathbf {x}^{\star }=\mathbf {P}^T\mathbf {x}. \end{aligned}$$
(17)
$$\begin{aligned}&\mathrm{{Score}}(\mathbf {x})=\min \limits _{1\le j \le c} \mathrm{{dist}}(\mathbf {x}^{\star },\mathbf {t}_j). \end{aligned}$$
(18)

Here, \(\mathbf {t}_j\) is the mapped point of class j in the null space. The \(\mathrm{{Score}}(\mathbf {x})\) is the novelty score of \(\mathbf {x}\), which reflects how likely the test sample is to come from an unknown class. When it is very large, \(\mathbf {x}\) is a novelty with high probability.

The procedure for finding the NPDs is summarized in Algorithm 1.

[Algorithm 1]
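As an informal companion to Algorithm 1, the sketch below strings together the illustrative helpers defined earlier (npe_weights, regularized_within_scatter, regularized_total_scatter); it is one reading of Eqs. (12)–(18) under the stated conventions, not the authors' Matlab implementation.

```python
import numpy as np

def mlnd_npds(X, y, W, alpha, beta, tol=1e-10):
    """Null projection directions satisfying Eq. (12), via Eqs. (14)-(16)."""
    _, Xw = regularized_within_scatter(X, y, W, alpha)
    _, Xt = regularized_total_scatter(X, y, W, beta)
    # basis Q of Z_t^perp: non-null directions of the centred data (PCA step)
    U, s, _ = np.linalg.svd(Xt, full_matrices=False)
    Q = U[:, s > tol * s.max()]
    # null space of Q^T S_wreg Q, i.e. zero-eigenvalue eigenvectors of H H^T (Eq. 16)
    H = Q.T @ Xw
    evals, evecs = np.linalg.eigh(H @ H.T)
    Gamma = evecs[:, evals < tol * max(evals.max(), 1.0)]
    return Q @ Gamma                              # Eq. (14): phi = Q gamma

def novelty_score(P, X_train, y_train, x):
    """Eqs. (17)-(18): project the test sample and take the minimum distance
    to the per-class target points t_j."""
    targets = [(P.T @ X_train[:, y_train == c]).mean(axis=1) for c in np.unique(y_train)]
    z = P.T @ x
    return min(np.linalg.norm(z - t) for t in targets)
```

In such a sketch, one would first compute W = npe_weights(X, y, k) and then P = mlnd_npds(X, y, W, alpha, beta) before scoring test samples.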

Compared with the null Foley–Sammon transform (NFST), the extra cost of MLND comes from solving Eq. (7) and calculating the regularized within-class scatter and the regularized total scatter. The remaining cost of MLND is the same as that of NFST.

Kernel form manifold learning for novelty detection

MLND assumes that \(\mathbf {S}_\mathrm{{wreg}}\) is singular. When \(\mathbf {S}_\mathrm{{wreg}}\) has full rank, the samples are mapped into a reproducing kernel Hilbert space (RKHS) via the kernel trick to avoid the null space \(\mathbf {Z}_w\) being empty. The mapping of a sample \(\mathbf {x}\) in the RKHS is represented as \(\varPhi (\mathbf {x})\), where \(\varPhi \) is an implicit function. The inner product of the mappings of two samples can be calculated via a kernel function, defined as \(k\left( \mathbf {x}_i,\mathbf {x}_j\right) =\langle \varPhi (\mathbf {x}_i),\varPhi (\mathbf {x}_j)\rangle \), such as the Radial Basis Function (RBF) kernel \(k\left( \mathbf {x}_i,\mathbf {x}_j\right) =\exp \left( -\frac{\Vert \mathbf {x}_i-\mathbf {x}_j\Vert ^2}{2\sigma ^2}\right) \). Obviously, \(\mathbf {S}_\mathrm{{wreg}}\) is then defined in a high-dimensional feature space and is no longer a \(D\times D\) matrix; for instance, it becomes an infinite-dimensional operator when the RBF kernel is adopted.

Let \(\widetilde{\varPhi }(\mathbf {x}_i)=\varPhi (\mathbf {x}_i)-\Big (\frac{\beta }{N}\sum \nolimits _{j=1}^N\varPhi (\mathbf {x}_j)+\frac{1-\beta }{N}\sum \nolimits _{j=1}^N\sum \nolimits _{\mathbf {x}_h\in KNN(\mathbf {x}_j)}W_{j,h}\varPhi (\mathbf {x}_h)\Big )\), \(\widetilde{\mathbf {K}}=(\mathbf {I}-(\beta \mathbf {1}_N+(1-\beta )\mathbf {1}_N\mathbf {W}))\mathbf {K}(\mathbf {I}-(\beta \mathbf {1}_N+(1-\beta )\mathbf {1}_N\mathbf {W}))^T\), \(\widetilde{\mathbf {X}}=[\varPhi (\mathbf {x}_1),\dots ,\varPhi (\mathbf {x}_N)]\), and \(\mathbf {K}\) be the kernel matrix with \(K(i,j)=\langle \varPhi (\mathbf {x}_i),\varPhi (\mathbf {x}_j)\rangle \). Then, \(\mathbf {S}_\mathrm{{treg}}\) in the RKHS is rewritten as follows

$$\begin{aligned} \mathbf {S}_\mathrm{{treg}}=\sum _{i=1}^N\widetilde{\varPhi }(\mathbf {x}_i)\widetilde{\varPhi }(\mathbf {x}_i)^T. \end{aligned}$$
(19)

The eigenvector \(\varvec{\theta }_j\) in the high-dimensional feature space lies in the span of \(\widetilde{\varPhi }(\mathbf {x}_1),\dots ,\widetilde{\varPhi }(\mathbf {x}_N)\), and there exist coefficients \(\delta _{1,j},\dots ,\delta _{N,j}\) satisfying the following equation.

$$\begin{aligned} \varvec{\theta }_j=\sum _{i=1}^N\delta _{i,j}\widetilde{\varPhi }(\mathbf {x}_i). \end{aligned}$$
(20)

The eigenvalues and eigenvectors of \(\mathbf {S}_\mathrm{{treg}}\) satisfy

$$\begin{aligned} \lambda _j\varvec{\theta }_j=\mathbf {S}_\mathrm{{treg}}\varvec{\theta }_j. \end{aligned}$$
(21)

Then,

$$\begin{aligned} \lambda _j\widetilde{\varPhi }(\mathbf {x}_i)^T\varvec{\theta }_j=\widetilde{\varPhi }(\mathbf {x}_i)^T\mathbf {S}_\mathrm{{treg}}\varvec{\theta }_j,\quad i=1,\dots ,N. \end{aligned}$$
(22)

Substituting Eqs. (19) and (20) into Eq. (22), we can obtain

$$\begin{aligned} \lambda _j\widetilde{\mathbf {K}}\varvec{\delta }_j=\widetilde{\mathbf {K}}\widetilde{\mathbf {K}}\varvec{\delta }_j. \end{aligned}$$
(23)

Here, \(\varvec{\delta }_j\) is the vector form of \(\delta _{1,j},\dots ,\delta _{N,j}\) and can be obtained by solving the following eigenvalue problem

$$\begin{aligned} \lambda _j\varvec{\delta }_j=\widetilde{\mathbf {K}}\varvec{\delta }_j. \end{aligned}$$
(24)

Since \(\langle \varvec{\theta }_j,\varvec{\theta }_j\rangle =\varvec{\delta }_j^T\widetilde{\mathbf {K}}\varvec{\delta }_j=\lambda _j\langle \varvec{\delta }_j,\varvec{\delta }_j\rangle \), the orthonormal basis of \(\mathbf {S}_\mathrm{{treg}}\) in high dimensional space is represented as follows:

$$\begin{aligned} \widetilde{\varvec{\theta }}_j=\sum \limits _{i=1}^N\widetilde{\delta }_{i,j}\widetilde{\varPhi }(\mathbf {x}_i). \end{aligned}$$
(25)

Here \(\widetilde{\delta }_{i,j}=\lambda _j^{-\frac{1}{2}}\delta _{i,j}\), so Eq. (25) can be evaluated implicitly through the kernel function. By introducing Eq. (25) and inner products in the reproducing kernel Hilbert space (RKHS), the matrix \(\mathbf {H}\) is rewritten as follows.

$$\begin{aligned}&\mathbf {H}=\left( (\mathbf {I}-(\beta \mathbf {1}_N+(1-\beta )\mathbf {1}_N\mathbf {W}))\widetilde{\mathbf {V}}\right) ^T \nonumber \\&\quad \mathbf {K}\left( \alpha \left( \mathbf {I}-\mathbf {L} \right) +\left( 1-\alpha \right) \left( \mathbf {I}-\mathbf {W}\right) \right) . \end{aligned}$$
(26)

Here, \(\widetilde{\mathbf {V}}=(\varvec{\theta }_1,\dots ,\varvec{\theta }_l)\). Then, substituting Eq. (26) into Eq. (16), we can obtain \(\varvec{\gamma }_j\) in the RKHS. The final null projection directions in the RKHS are obtained as

$$\begin{aligned} \varvec{\varphi }_j=\left( (\mathbf {I}-\mathbf {1}_N)\widetilde{\mathbf {V}}\right) \varvec{\gamma }_j. \end{aligned}$$
(27)

Let \(\mathbf {P}=[\varvec{\varphi }_1,\dots ,\varvec{\varphi }_l]\). In kernel MLND, a test sample \(\mathbf {x}_{\star }\) is mapped into the null space through \(\mathbf {K}_{\star }^T\mathbf {P}\), where \(\mathbf {K}_{\star }\) is the vector \([k(\mathbf {x}_1,\mathbf {x}_{\star });\dots ;k(\mathbf {x}_N,\mathbf {x}_{\star })]\). The novelty score of \(\mathbf {x}_{\star }\) is the minimum distance between its mapped point and the mapped point of each class. The procedure of kernel MLND is summarized in Algorithm 2.

[Algorithm 2]
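The sketch below is one possible reading of Algorithm 2 in NumPy. It assumes that \(\mathbf {1}_N\) denotes the \(N\times N\) matrix with all entries \(1/N\), that \(\widetilde{\mathbf {V}}\) is represented by the coefficient vectors \(\widetilde{\varvec{\delta }}_j\) of Eq. (25), that the centring of Eq. (26) is reused in Eq. (27), and that row i of \(\mathbf {W}\) stores the weights reconstructing \(\mathbf {x}_i\); none of these conventions is spelled out explicitly in the text, so treat the code as illustrative rather than as the authors' implementation.

```python
import numpy as np

def kernel_basis_coefficients(K_tilde, tol=1e-10):
    """Eqs. (24)-(25): eigendecompose the centred kernel matrix and rescale the
    coefficients by lambda^{-1/2} so the implicit basis vectors are orthonormal."""
    evals, evecs = np.linalg.eigh(K_tilde)
    keep = evals > tol * evals.max()                           # drop (near-)zero eigenvalues
    return evecs[:, keep] / np.sqrt(evals[keep])

def kernel_mlnd_fit(K, y, W, alpha, beta, tol=1e-10):
    """Kernel MLND training: returns the coefficient matrix P of the null
    projection directions and the per-class target points in the null space."""
    N = K.shape[0]
    ones = np.full((N, N), 1.0 / N)
    Ct = np.eye(N) - (beta * ones + (1 - beta) * ones @ W.T)   # total-scatter centring
    L = np.zeros((N, N))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        L[np.ix_(idx, idx)] = 1.0 / len(idx)
    Cw = np.eye(N) - (alpha * L + (1 - alpha) * W.T)           # within-scatter centring
    K_tilde = Ct @ K @ Ct.T
    V = kernel_basis_coefficients(K_tilde, tol)                # basis of Z_t^perp
    H = (Ct @ V).T @ K @ Cw                                    # Eq. (26)
    evals, evecs = np.linalg.eigh(H @ H.T)
    Gamma = evecs[:, evals < tol * max(evals.max(), 1.0)]      # null space of H H^T
    P = (Ct @ V) @ Gamma                                       # Eq. (27), coefficient form
    targets = {c: (K[:, y == c].T @ P).mean(axis=0) for c in np.unique(y)}
    return P, targets

def kernel_mlnd_score(P, targets, K_star):
    """Map a test sample via K_star^T P and return the minimum distance to a class."""
    z = K_star @ P
    return min(np.linalg.norm(z - t) for t in targets.values())
```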

When the parameters are set to \(\alpha =1\) and \(\beta =1\), kernel MLND degenerates to KNFST. When \(\alpha =0\) and \(\beta =0\), \(\mathbf {S}_\mathrm{{wreg}}\) is determined solely by the LLE-style manifold term. Compared with the kernel null Foley–Sammon transform (KNFST), the extra cost of MLND comes from solving Eq. (7); the time complexity of the remaining part is the same as that of KNFST.

Experiments and simulations

In this section, manifold learning novelty detection (MLND) is evaluated on several datasets. Here, we use kernel MLND. The code of MLND is implemented in Matlab 2018b. To verify the validity of MLND, we compare it with several state-of-the-art null space methods, including KNFST [3], Local KNFST [31], and NK3ML [29]. The codes of KNFST, Local KNFST, and NK3ML are provided by their authors.

The generalized histogram intersection kernel (HIK) is used as the kernel function in KNFST, Local KNFST, and NK3ML. For a fair comparison, MLND also adopts the HIK, which is defined as \(k(\mathbf {x}_i,\mathbf {x}_j)=\exp (2\kappa _\mathrm{{HIK}}(\mathbf {x}_i,\mathbf {x}_j)-\kappa _\mathrm{{HIK}}(\mathbf {x}_i,\mathbf {x}_i)-\kappa _\mathrm{{HIK}}(\mathbf {x}_j,\mathbf {x}_j))\), where \(\kappa _\mathrm{{HIK}}(\mathbf {x}_i,\mathbf {x}_j)=\sum \limits _{d=1}^D\min (x_{i,d},x_{j,d})\).
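A minimal NumPy sketch of this generalized HIK is given below; it assumes nonnegative (histogram-like) features, under which \(\kappa _\mathrm{{HIK}}(\mathbf {x},\mathbf {x})\) reduces to the sum of the entries of \(\mathbf {x}\).

```python
import numpy as np

def hik(A, B):
    """Histogram intersection kernel kappa(x, y) = sum_d min(x_d, y_d).
    A: D x n and B: D x m matrices (columns are samples); returns n x m."""
    return np.array([[np.minimum(A[:, i], B[:, j]).sum()
                      for j in range(B.shape[1])] for i in range(A.shape[1])])

def generalized_hik(A, B):
    """Generalized HIK: k(x, y) = exp(2*kappa(x, y) - kappa(x, x) - kappa(y, y))."""
    K = hik(A, B)
    kxx = A.sum(axis=0)                     # kappa(x, x) for nonnegative features
    kyy = B.sum(axis=0)
    return np.exp(2 * K - kxx[:, None] - kyy[None, :])
```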

First, we adopt an EMG dataset to demonstrate the effectiveness of MLND on real data; then, two toy datasets are used to further evaluate MLND; lastly, several benchmark datasets collected from the UCI repository or the LIBSVM website [32] are adopted for further evaluation. The experimental results are reported in terms of AUC value, ROC curve, and accuracy. The AUC value and ROC curve are used to evaluate the novelty detection methods: the higher the AUC value, the better the novelty detector. Accuracy is defined as the ratio of correctly predicted normal samples to all normal samples. It is used to measure the classification performance of multi-class supervised novelty detection on normal samples.

Table 1 The experimental results of gestures recognition

Experiments on EMG dataset

In this section, we use Gestures, an electromyogram (EMG) signal dataset, to verify MLND. The signals are collected via a MYO Thalmic bracelet worn on the user's forearm. The bracelet is equipped with eight sensors that collect myographic signals simultaneously. The raw signals are from 36 subjects, and each subject performs two series. Each series contains six or seven basic gestures: hand at rest, hand clenched in a fist, wrist flexion, wrist extension, radial deviations, ulnar deviations, and extended palm. In this experiment, we only consider the former six gestures, since the extended palm is not performed by some subjects. An illustration of the signals of the former six gestures is shown in Fig. 2. The horizontal axis label indicates the channel from which the signal is collected; each channel is associated with one sensor of the MYO Thalmic bracelet.

Fig. 2

The signals of hand gestures. In the first row, the signals correspond to hand at rest, hand clenched in a fist, and wrist flexion, respectively. In the second row, the signals correspond to wrist extension, radial deviations, and ulnar deviations, respectively. Channels ch1 to ch8 are associated with the eight sensors of the MYO Thalmic bracelet

Different from previous gesture recognition work [33], this paper casts gesture recognition as a multi-class supervised novelty detection problem in order to identify unknown gestures. Besides the seven basic gestures, some of the signals are not marked as any basic gesture. In this section, we use hand at rest, hand clenched in a fist, wrist flexion, wrist extension, radial deviations, and ulnar deviations as the normal classes. The extended palm and the unmarked signals are used as anomalies. Therefore, the task of gesture recognition becomes recognizing whether an EMG signal comes from a basic hand gesture and, if so, which basic hand gesture it belongs to. Obviously, this is a multi-class supervised novelty detection problem. Gesture recognition is widely used in robot control [34, 35] and traffic control [36, 37].

We use a 200 ms sampling window, and adjacent windows overlap with a 100 ms step. We then generate 30,240 normal samples (5,040 samples per class) and 10,000 abnormal samples as novelties. The normal samples are divided equally into two parts: one part is used as the training set, and the other part together with the abnormal samples is used as the test set. The features from the eight channels are reorganized into an \(800\times 1\) vector.

In MLND, we directly set the parameters as (\(\alpha =0.5\), \(\beta =0.5\)). The number of nearest neighbors in Definitions 1 and 2 is directly set to 20 to avoid the extra cost of tuning parameters. To reduce randomness, we repeat the experiment 30 times. The results are reported in mean ± std. form in terms of AUC value and accuracy in Table 1.

From Table 1, it can be found that the average AUC value of MLND reaches 0.9251, which is higher than those of KNFST, Local KNFST, and NK3ML; the average accuracy of MLND reaches 93.87%, which is also higher than those of KNFST, Local KNFST, and NK3ML. The ROC curve of one trial is drawn in Fig. 3.

Fig. 3

The ROC curves on Gestures. The green dashed, dotted, dash-dot, and red dashed lines represent the ROC curves of KNFST, Local KNFST, NK3ML, and MLND, respectively

From Fig. 3, the ROC curve of MLND is still superior to those of KNFST, Local KNFST, and NK3ML; MLND performs better than these three methods on Gestures.

Furthermore, we also consider the influence of the parameter k in Definitions 1 and 2 on the performance of MLND. The parameter k ranges from 10 to 100 in steps of 10. Here, the parameters \(\alpha \) and \(\beta \) are both directly set to 0.5. The curve of the AUC value versus the parameter k is shown in the left sub-figure of Fig. 4, and the curve of the accuracy versus the parameter k is shown in the right sub-figure of Fig. 4.

Fig. 4

The performance with different parameter k in MLND. Left: the curve between the AUC value and the parameter k; right: the curve between the accuracy and the parameter k

From the results in Fig. 4, it can be found that both the AUC value and the accuracy decrease as the number of nearest neighbors k increases when \(k>30\). The reason is that the manifold is meant to depict a small region; when the neighborhood is too large, the manifold assumption becomes invalid. When \(k=20\), the AUC value reaches its peak (\(AUC=0.9152\)). When \(k=30\), the accuracy reaches its peak (\(accuracy=94.65\%\)). In our experience, the parameter k in MLND should not be set too large. In the following experiments, the parameter k is directly set to 20.

Experiments on toy datasets

In this subsection, we evaluate MLND on two toy datasets. The first one contains three normal classes and the second one contains two normal classes. In toy 1, the samples in \(\mathbf {X}_j,j=1,2,3\) follow the distributions below:

$$\begin{aligned} \mathbf {X}_j=\mathbf {N}_j. \end{aligned}$$
(28)

Here, \(\mathbf {N}_1\thicksim N\left( [\begin{matrix} 0&0 \end{matrix}],\left[ \begin{matrix} 0.5^2 &{} 0 \\ 0 &{} 1.25^2\end{matrix}\right] \right) \), \(\mathbf {N}_2\thicksim N\left( [\begin{matrix} 2&0 \end{matrix}],\left[ \begin{matrix} 0.5^2 &{} 0 \\ 0 &{} 1.25^2\end{matrix}\right] \right) \), \(\mathbf {N}_3\thicksim N\left( [\begin{matrix} 4&0 \end{matrix}],\left[ \begin{matrix} 0.5^2 &{} 0 \\ 0 &{} 1.25^2\end{matrix}\right] \right) \). An illustration of toy 1 is shown in Fig. 5a.
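For reproducibility, a minimal NumPy sketch that draws the normal classes of toy 1 according to Eq. (28) is shown below; the novelties are not specified by Eq. (28) and are therefore omitted, and the sample sizes follow the description later in this subsection.

```python
import numpy as np

rng = np.random.default_rng(0)                  # fixed seed, only for illustration
cov = np.diag([0.5 ** 2, 1.25 ** 2])            # shared covariance of Eq. (28)
means = [[0, 0], [2, 0], [4, 0]]                # class means of N_1, N_2, N_3
X_train = np.hstack([rng.multivariate_normal(m, cov, 200).T for m in means])
y_train = np.repeat([1, 2, 3], 200)             # 200 training samples per class
```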

In toy 2, the samples in \(\mathbf {X}_j,j=1,2\) follow the distributions below:

$$\begin{aligned} \mathbf {X}_j=\mathbf {N}_j+\left[ 1-\frac{x_{i,2}^3}{25}+\epsilon \right] . \end{aligned}$$
(29)

Here, \(\mathbf {N}_1\thicksim N\left( [\begin{matrix} 0&0 \end{matrix}],\left[ \begin{matrix} 2^2 &{} 0 \\ 0 &{} 2^2\end{matrix}\right] \right) \), \(\mathbf {N}_2\thicksim N\left( [\begin{matrix} 3&3 \end{matrix}],\left[ \begin{matrix} 2^2 &{} 0 \\ 0 &{} 2^2\end{matrix}\right] \right) \), and \(\epsilon \thicksim N\left( 0,0.25^2\right) \). An illustration of toy 2 is shown in Fig. 5b.

Fig. 5

The illustration of toy datasets. In a, the samples from class 1, 2, and 3 are denoted as pluses, stars, and x-marks, respectively; the novelties are denoted as circles. In b, there are two normal classes in which the samples are denoted as pluses and stars, respectively; the circles are novelties

In toy 1, we generate 600 samples for the training set (200 samples per class) and 2000 samples for the test set (500 samples per class and 500 novelties). In toy 2, we generate 400 samples for the training set (200 samples per class) and 1500 samples for the test set (500 samples per class and 500 novelties). To reduce randomness, we repeat the experiments 30 times. The AUC value and accuracy are reported in the form of mean ± std. in Table 2.

Table 2 The experimental results of toy datasets

In Table 2, the sixth column reports the results of MLND with fine-tuned parameters \(\alpha ,\beta \), which are tuned via grid search in the range from 0.1 to 1 with a step of 0.1. The seventh column of Table 2 reports the results of MLND with fixed parameters \(\alpha =0.5,\beta =0.5\).

For Toy 1, the average AUC value of MLND is 0.9589 when tuning the parameters \(\alpha ,\beta \) via grid search and 0.9492 when \(\alpha =0.5,\beta =0.5\). For Toy 2, the average AUC value of MLND is 0.9314 when tuning the parameters \(\alpha ,\beta \) via grid search and 0.9249 when \(\alpha =0.5,\beta =0.5\). The average AUC value of MLND is higher than those of KNFST, Local KNFST, and NK3ML even when the parameters are directly set as \(\alpha =0.5,\beta =0.5\).

For Toy 1, the average accuracy of MLND is 96.71% when tuning the parameters \(\alpha ,\beta \) via grid search and 95.17% when \(\alpha =0.5,\beta =0.5\). For Toy 2, the average accuracy of MLND is 93.47% when tuning the parameters \(\alpha ,\beta \) via grid search and 93.28% when \(\alpha =0.5,\beta =0.5\). The average accuracy of MLND is also higher than those of KNFST, Local KNFST, and NK3ML even when the parameters are directly set as \(\alpha =0.5,\beta =0.5\).

This is because MLND considers both the global information and the local structure within each class. The ROC curves of Toy 1 and Toy 2 are shown in Fig. 6.

Fig. 6

The ROC curves of toy datasets. a Toy 1; b Toy 2. The green dashed, dotted, dashdot, red dashed, and magenta dotted lines represent the ROC curves of KNFST, Local KNFST, NK3ML, MLND with fine-tuned parameters, and MLND with fixed parameters (\(\alpha =0.5\), \(\beta =0.5\)), respectively

From Fig. 6, we can draw the same conclusion as from the AUC values and accuracies for Toy 1 and Toy 2.

Experiments on benchmark datasets

In this subsection, we compare MLND with KNFST, Local KNFST, and NK3ML on several benchmark datasets collected from the UCI repository and the LIBSVM website [32]. The details of these datasets are listed in Table 3.

Table 3 The details of benchmark datasets

These datasets are reorganized to suit the evaluation of multi-class supervised novelty detection. For DNA, protein, satimage, and shuttle, we remove one class from the training set and add its samples to the test set. For pendigits, poker, SVHN, and usps, we remove five classes from the training set and add their samples to the test set. The parameters of MLND are directly set as \(\alpha =0.5\), \(\beta =0.5\), and \(k=20\). The AUC value and accuracy are reported in Tables 4 and 5, respectively.

Table 4 The AUC value results of benchmark datasets
Table 5 The accuracy results of benchmark dataset

In Tables 4 and 5, the last row gives the win-loss-tie (W-L-T) counts of the AUC value and the accuracy, respectively, with MLND used as the base method. From Table 4, it can be found that the AUC value of MLND is higher than that of KNFST on eight datasets, than that of Local KNFST on eight datasets, and than that of NK3ML on seven datasets. From Table 5, it can be found that the accuracy of MLND is higher than that of KNFST on eight datasets, than that of Local KNFST on eight datasets, and than that of NK3ML on six datasets. MLND is superior to KNFST, Local KNFST, and NK3ML on most of these benchmark datasets.

Discussions and conclusion

In this paper, we propose a manifold learning-based novelty detection method. Manifold learning novelty detection (MLND) can be regarded as an improvement of the kernel null Foley–Sammon transform (KNFST). In MLND, we first introduce a manifold into the within-class scatter and the total scatter to depict the local geometrical structure within each class; then we map the samples of each class to a single point via null projection directions. Compared with KNFST, MLND considers both the global information and the local geometrical structure within each class. Therefore, MLND can overcome the weakness of KNFST caused by ignoring the local geometrical structure. We evaluate MLND on an EMG gesture dataset, two toy datasets, and eight benchmark datasets. The experimental results demonstrate that MLND is superior to KNFST and its two improved variants, Local KNFST and NK3ML.