1 Introduction

Outlier detection, that is, the identification of samples that differ from the majority (inliers) in the data, is a fundamental problem in data mining and has been studied for a long time [1, 16, 26]. The demand for outlier detection has grown further as recent advances in sensor and communication technologies have made it possible to collect data for a wide variety of targets [7, 14, 20]. However, as the volume of collected data grows, the cost of labeling it becomes enormous. Therefore, unsupervised outlier detection, which does not require clean (outlier-free) training data, is highly practical and has become increasingly important for real-world problems in recent years [12, 32].

Reconstruction-based methods, such as principal component analysis (PCA) [21], are popular methods for unsupervised outlier detection. These methods assume that outliers are incompressible and thus cannot be effectively reconstructed from low-dimensional projections. PCA linearly decomposes a data matrix X into a low-rank matrix L and an error matrix E that indicates the outlierness of the elements of X such that the \(l_2\)-norm of E is minimized. Because the \(l_2\)-norm is sensitive to outliers, the detection accuracy of PCA for data corrupted by large noise tends to be low. Data corrupted by large noise are hereinafter referred to simply as “corrupted data.” Robust PCA (RPCA) [6], an improved variant of PCA, linearly decomposes X into L and a sparse error matrix S. Because this decomposition is performed under an \(l_0\)-norm constraint on S, RPCA is robust against corrupted data. However, because both methods decompose X linearly, their accuracy is poor for data with nonlinear features.

In recent years, the robust deep autoencoder (RDA) [34] has been proposed as an unsupervised outlier detection method that can accurately detect outliers, even for corrupted and nonlinear data. Instead of the linear manifold used in RPCA, the RDA trains autoencoders (AEs) [15] to reconstruct the data on a low-dimensional nonlinear manifold that flexibly captures nonlinear features of the input data. The RDA aims to decompose data as \(X=L_\mathrm{D}+S\) using an alternating optimization algorithm, where \(L_\mathrm{D}\) is a matrix whose elements are close to the low-dimensional nonlinear manifold and S is a sparse error matrix. To learn a low-dimensional nonlinear manifold that is not affected by corrupted elements, the RDA aims to decompose data under \(l_0\)-norm regularization on S; however, because \(l_0\)-norm regularization is computationally intractable, the RDA actually sparsifies S by solving the relaxed \(l_1\)-norm regularization problem. For corrupted data, \(l_0\)-norm regularization is more robust against outliers than \(l_1\)-norm regularization because it can ignore the magnitudes of the corrupted elements, which \(l_1\)-norm regularization cannot (see Sect. 5.3.1). Additionally, no theoretical convergence analysis has been provided for the alternating optimization method of the RDA. In practice, it has been confirmed that the progress of training the RDA model may be unstable (see Sect. 5.5).

Table 1 Comparison of features of reconstruction-based methods

In this paper, we propose the L0-norm constrained AE (L0-AE), a novel reconstruction-based unsupervised outlier detection method that can learn low-dimensional manifolds under an \(l_0\)-norm constraint on the error term using an AE. Table 1 briefly compares the features of the reconstruction-based methods. Unlike the other reconstruction-based methods, L0-AE provably guarantees the convergence of optimization under the \(l_0\)-norm constraint while also capturing nonlinear features. To the best of our knowledge, no existing reconstruction-based method avoids both relaxation of the \(l_0\)-norm and the restriction to linearity. The key contributions of this work are as follows:

  1. We propose a new alternating optimization algorithm that decomposes data nonlinearly under an \(l_0\)-norm constraint for the error term (Sect. 3.1).

  2. We prove that our alternating optimization algorithm always converges under the mild condition that the AE is trained appropriately using gradient-based optimization, unlike the RDA (Sect. 3.3).

  3. Through extensive experiments, we show that L0-AE is more robust and accurate than other reconstruction-based methods not only for corrupted data but also for datasets that contain well-known outlier distributions. Additionally, for real datasets, we show that L0-AE achieves higher accuracy than the baseline non-robustified method for most parameter values and can learn more stably than the RDA (Sect. 5).

2 Preliminaries

In this section, we describe related reconstruction-based methods. Throughout the paper, we denote a given data matrix by \(X\in \mathbb {R}^{N\times D}\), where N and D denote the number of samples and feature dimensions of X, respectively.

2.1 Robust principal component analysis

RPCA [6] is a modification of PCA [21].

First, we describe PCA. PCA decomposes a data matrix X into a low-rank matrix L and a small error matrix E such that

$$\begin{aligned} X = L + E. \end{aligned}$$
(1)

This decomposition can be formulated as the following optimization problem:

$$\begin{aligned} {\begin{matrix} \min _{L} ||X - L||_{2}^{2}\\ \mathrm{s.t.} \quad \mathrm{rank}(L) \le k', \end{matrix}} \end{aligned}$$
(2)

where \(||\cdot ||_{2}\) is the \(l_2\)-norm and \(k'\) is a parameter that determines the maximum value allowed for \(\mathrm{rank}(L)\); that is, PCA seeks the best low-rank representation L of the given data matrix X with respect to (w.r.t.) the \(l_2\)-norm. PCA can be used for outlier detection because \(E \circ E\) indicates element-wise outlierness, where \(\circ \) is the element-wise product. Generally, the outlierness of each sample is obtained by summing \(E\circ E\) along the feature dimension. However, the \(l_2\)-norm makes PCA sensitive to outliers, which limits its usefulness for outlier detection.
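For illustration, the element-wise and sample-wise outlierness under PCA can be computed from the reconstruction residual as in the following sketch, which uses scikit-learn's PCA; the function name and the data matrix are placeholders and this is not part of the original implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_outlierness(X, k_prime):
    """Outlierness from the PCA residual E = X - L (Eqs. (1)-(2))."""
    pca = PCA(n_components=k_prime)
    L = pca.inverse_transform(pca.fit_transform(X))  # low-rank reconstruction L
    E = X - L                                        # error matrix E
    elementwise = E * E                              # E o E
    samplewise = elementwise.sum(axis=1)             # sum along the feature dimension
    return elementwise, samplewise
```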

To avoid this drawback of PCA, RPCA decomposes X into a low-rank matrix L and a sparse error matrix S such that

$$\begin{aligned} X = L + S \end{aligned}$$
(3)

based on the assumption that there are not many corrupted elements in X. This decomposition problem can be represented as follows: minimize the rank of the matrix L satisfying the sparsity constraint \(||S||_0 \le k\), where \(||\cdot ||_0\) is the \(l_0\)-norm that represents the number of nonzero elements and k is a parameter that determines the maximum value of \(||S||_0\). The Lagrangian reformulation of this optimization problem is

$$\begin{aligned} {\begin{matrix} \min _{L, S} \mathrm{rank}(L) + \lambda ||S||_{0}\\ \mathrm{s.t.} \quad ||X-L-S||_{2}^{2} = 0, \end{matrix}} \end{aligned}$$
(4)

where \(\lambda \) is a parameter that adjusts the sparsity of S instead of k. However, this non-convex optimization problem (4) is NP-hard [2]. A convex relaxation whose solution matches the solution of Eq. (4) under broad conditions has been proposed as follows:

$$\begin{aligned} {\begin{matrix} \min _{L, S} ||L||_{*} + \lambda ||S||_{1}\\ \mathrm{s.t.} \quad ||X-L-S||_{2}^{2} = 0, \end{matrix}} \end{aligned}$$
(5)

where \(||\cdot ||_{*}\) is the nuclear norm and \(||\cdot ||_{1}\) is the \(l_1\)-norm (i.e., the sum of the absolute values of the elements). RPCA is also used for outlier detection because \(S \circ S\) indicates the outlierness of each element.

RPCA is more robust against outliers than PCA because of the use of the \(l_0\)-norm. However, RPCA cannot represent data distributed in a nonlinear manifold because the mapping from X to L is restricted to be linear.
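For reference, the convex program (5) is commonly solved by an augmented-Lagrangian (ADMM-style) procedure that alternates singular-value thresholding for L and entry-wise soft thresholding for S. The sketch below illustrates this standard procedure; the choices of \(\mu\), \(\lambda\) and the stopping tolerance are common defaults and are assumptions here, not the settings of [6] or the implementation [11].

```python
import numpy as np

def soft_threshold(M, tau):
    # entry-wise shrinkage operator
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def rpca_pcp(X, lam=None, tol=1e-7, max_iter=500):
    """Principal component pursuit for Eq. (5) via an ADMM-style algorithm (sketch)."""
    N, D = X.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(N, D))        # common default for the sparsity weight
    mu = N * D / (4.0 * np.abs(X).sum())      # common default step parameter
    S = np.zeros_like(X)
    Y = np.zeros_like(X)                      # dual variable
    norm_X = np.linalg.norm(X)
    for _ in range(max_iter):
        # L-step: singular-value thresholding
        U, sig, Vt = np.linalg.svd(X - S + Y / mu, full_matrices=False)
        L = U @ np.diag(np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # S-step: entry-wise soft thresholding
        S = soft_threshold(X - L + Y / mu, lam / mu)
        # dual update
        Y = Y + mu * (X - L - S)
        if np.linalg.norm(X - L - S) <= tol * norm_X:
            break
    return L, S
```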

2.2 Robust deep autoencoders

The RDA [34] is a method that relaxes the linearity assumption of RPCA. The RDA uses an AE for nonlinear mapping from X to L instead of the linear mapping of RPCA.

First, we describe the AE. The AE [15] is a special type of multilayer neural network in which the number of nodes in the output layer is the same as that in the input layer. Generally, the model parameters \(\theta \) of AEs are trained to minimize the reconstruction error, which is defined by the following optimization problem:

$$\begin{aligned} \min _{\theta } ||X - f_{\mathrm{AE}}(X;\theta )||_{2}^{2}, \end{aligned}$$
(6)

where \(f_{\mathrm{AE}}(X;\theta )\) is an output of the AE with an input X and parameters \(\theta \). If an AE with a bottleneck layer can be trained so that its reconstruction error becomes small, \(f_{\mathrm{AE}}(X;\theta )\) forms a low-dimensional nonlinear manifold; that is, the AE is a generalization of PCA and decomposes data such that

$$\begin{aligned} X = \bar{L} + E, \end{aligned}$$
(7)

where \(\bar{L} = f_{\mathrm{AE}}(X;\theta )\) denotes a matrix that forms a low-dimensional nonlinear manifold. The AE can be used as an outlier detection method that is a nonlinear version of PCA. Similar to PCA, the AE is sensitive to outliers because of the \(l_2\)-error minimization.

Next, we describe the RDA. Similar to RPCA, the RDA aims to decompose X into \(X = L_\mathrm{D} + S\), where S is a sparse error matrix whose nonzero elements indicate the reconstruction difficulty and \(L_\mathrm{D}\) is easily reconstructable data for the AE. This decomposition of the RDA is defined as the following optimization problem:

$$\begin{aligned} {\begin{matrix} \min _{\theta , S} ||L_\mathrm{D} - f_{\mathrm{AE}}(L_\mathrm{D};\theta )||_{2} + \lambda ||S||_{0}\\ \mathrm{s.t.} \quad X-L_\mathrm{D}-S = 0. \end{matrix}} \end{aligned}$$
(8)

This decomposition is more robust than Eq. (6) and can capture nonlinear features of X, unlike Eq. (4).

As Eq. (8) is difficult to optimize, an \(l_1\)-relaxation of the following form has been also proposed [34]:

$$\begin{aligned} {\begin{matrix} \min _{\theta , S} ||L_\mathrm{D} - f_{\mathrm{AE}}(L_\mathrm{D};\theta )||_{2} + \lambda ||S||_{1}\\ \mathrm{s.t.} \quad X-L_\mathrm{D} - S = 0. \end{matrix}} \end{aligned}$$
(9)

2.2.1 RDA for structured outliers

The outlier detection methods mentioned above assume that outliers are unstructured. However, in real applications, outliers are often structured [34]; that is, outliers are concentrated in specific samples. To improve the detection accuracy for data that have structured outliers along the sample dimension, a method with grouped-norm regularization has been proposed:

$$\begin{aligned} {\begin{matrix} \min _{\theta , S} ||L_\mathrm{D} - f_{\mathrm{AE}}(L_\mathrm{D};\theta )||_{2} + \lambda ||S^{T}||_{2,1}\\ \mathrm{s.t.} \quad X-L_\mathrm{D} - S = 0, \end{matrix}} \end{aligned}$$
(10)

where the \(l_{2,1}\)-norm is defined as

$$\begin{aligned} ||X||_{2,1} = \sum _{j=1}^{D}||\varvec{x}_{j}||_{2} = \sum _{j=1}^{D} ( \sum _{i=1}^{N}|x_{ij}|^{2} )^{1/2}. \end{aligned}$$
(11)

The RDA essentially uses Eq. (10) as an objective function for outlier detection.

2.2.2 Alternating optimization algorithm of the RDA

To optimize Eq. (10), the RDA applies an alternating optimization method to \(\theta \) and S. In particular, the RDA uses a back-propagation method to train AE to minimize \(||L_\mathrm{D} - f_{\mathrm{AE}}(L_\mathrm{D};\theta )||_{2}\) with S fixed and a proximal-gradient-based method with a fixed step size to minimize the penalty term \(||S^{T}||_{2,1}\) with \(\theta \) fixed.
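As a sketch of this S-step, the proximal operator of the grouped penalty \(\lambda ||S^{T}||_{2,1}\) reduces to row-wise (sample-wise) block soft thresholding of the residual. The snippet below illustrates the \(l_{2,1}\)-norm of Eq. (11) and this generic proximal operator; it is not the exact update rule of [34].

```python
import numpy as np

def l21_norm(M):
    """l_{2,1}-norm as in Eq. (11): sum of the l_2-norms of the columns of M."""
    return np.linalg.norm(M, axis=0).sum()

def prox_l21_rows(Z, tau):
    """Proximal operator of tau * ||S^T||_{2,1}: shrink each row of Z toward zero."""
    row_norms = np.linalg.norm(Z, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(row_norms, 1e-12), 0.0)
    return scale * Z  # rows with norm <= tau become exactly zero (structured sparsity)
```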

3 L0-norm constrained autoencoders

Although the RDA can detect outliers even for nonlinear data, there are several concerns with the RDA. First, because of the NP-hardness of the problem, the RDA uses the \(l_1\) or \(l_{2,1}\)-norm instead of the \(l_0\)-norm, which makes the result highly sensitive to outliers. Second, the alternating optimization method of the RDA does not guarantee convergence. In practice, it has been experimentally confirmed that the progress of training the RDA model may be unstable (Sect. 5.5).

To address these issues, we propose an unsupervised outlier detection method that can decompose data nonlinearly using AE under an \(l_0\)-norm constraint for the sparse matrix S. We prove that our algorithm always converges under a certain condition. For clarity, we first describe L0-AE for unstructured outliers and then extend it for structured outliers.

3.1 Formulation of the objective function

We consider decomposing a data matrix X into three terms: a low-dimensional nonlinear manifold \(\bar{L} = f_{\mathrm{AE}}(X;\theta )\), a sparse error matrix S and a small error matrix E, similar to stable principal component pursuit [35], which overcomes the drawbacks of RPCA for noisy data, such that

$$\begin{aligned} X = f_{\mathrm{AE}}(X;\theta ) + S + E. \end{aligned}$$
(12)

Note that we can calculate element-wise outlierness using \((X - f_{\mathrm{AE}}(X;\theta )) \circ (X - f_{\mathrm{AE}}(X;\theta ))\) because \(X - f_{\mathrm{AE}}(X;\theta )\) equals \(S + E\), which denotes the total error matrix.

Equation (12) can be satisfied trivially by an arbitrary E if E is not constrained. To train an AE that successfully captures the features of X, \(||E||_{2}^{2}\) needs to be as small as possible. Therefore, we aim to obtain an AE model that successfully reconstructs the data by excluding the influence of outliers, together with a sparse matrix S, by solving the following optimization problem that minimizes \(||E||_{2}^{2} = ||X - f_{\mathrm{AE}}(X;\theta ) - S||_{2}^{2}\) with \(l_0\)-norm regularization of S:

$$\begin{aligned} \min _{\theta , S} ||X - f_{\mathrm{AE}}(X;\theta ) - S||_{2}^{2} + \lambda ||S||_{0}. \end{aligned}$$
(13)

Specifically, this optimization problem (13) means “minimizing the reconstruction error of the AE without considering the sparse noisy elements S.” Because of the \(l_0\)-norm regularization of S, it is robust against outliers contained in X. However, Eq. (13) is difficult to optimize, as is Eq. (8) in the RDA.

To facilitate the optimization described later, we consider the following \(l_0\)-norm constrained optimization problem instead of Eq. (13):

$$\begin{aligned} {\begin{matrix} \min _{\theta , S} ||X - f_{\mathrm{AE}}(X;\theta ) - S||_{2}^{2}\\ \mathrm{s.t.} \quad ||S||_{0} \le k. \end{matrix}} \end{aligned}$$
(14)

Equation (13) can be considered as the Lagrangian reformulation of this optimization problem (14). In unsupervised settings, it is not possible to know the optimal sparsity of S in advance; therefore, \(\lambda \) and k, which adjust the sparsity of S, are both hyperparameters. In practical situations, domain experts can often estimate the rate of outliers in the data. In Sect. 5.3.1, we experimentally show how to set an appropriate \(C_\mathrm{p}=k/N\), which is the value of k normalized by the number of samples, when an estimated rate of outliers is available. If we can solve Eq. (14), we can obtain \(\bar{L}\), which captures the nonlinear features of X while completely avoiding the influence of outliers.

3.2 Alternating optimization algorithm

In the following, we propose an alternating optimization algorithm for \(\theta \) and S for the \(l_0\)-norm constrained optimization problem (14). We denote the reconstruction error matrix \(X - f_{\mathrm{AE}}(X;\theta )\) by \(Z(\theta )\). Then the objective function can be expressed as \(||Z(\theta ) - S||_{2}^{2}\). In the optimization phase of \(\theta \) with S fixed, we use a gradient-based method. In the optimization phase of S with \(\theta \) fixed, the optimal S of \(||Z(\theta ) - S||_{2}^{2}\) can be obtained using the following method.

When \(||S||_{0} \le 1\) (S has at most one nonzero element), S that minimizes \(||Z(\theta ) - S||_{2}^{2}\) is the matrix that zeroes out the element with the largest absolute value of \(Z(\theta )\). Similarly, when \(||S||_{0} \le k\), S that minimizes \(||Z(\theta ) - S||_{2}^{2}\) is the matrix that zeroes out the elements with the top-k largest absolute values in \(Z(\theta )\); that is, S that minimizes Eq. (14) with \(\theta \) fixed can be determined in a closed form as follows:

$$\begin{aligned} s_{ij}=\left\{ \begin{array}{ll} z_{ij} & (|z_{ij}| \ge c) \\ 0 & (\mathrm{otherwise}), \end{array} \right. \end{aligned}$$
(15)

where \(z_{ij}\) is an (ij)-element of \(Z(\theta )\) and c is the kth largest value in \(\{|z_{ij}| \mid 1 \le i \le N, 1 \le j \le D\}\).
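A minimal NumPy sketch of this closed-form update is given below; `Z` stands for the residual matrix \(Z(\theta)\), and ties at the threshold c are broken arbitrarily by the partial sort so that exactly k entries are kept (cf. the tie-breaking rule stated for Eq. (18)).

```python
import numpy as np

def update_S(Z, k):
    """Closed-form S-step of Eq. (15): copy the k entries of Z with the largest
    absolute values into S and leave all other entries at zero."""
    S = np.zeros_like(Z)
    if k <= 0:
        return S
    flat_abs = np.abs(Z).ravel()
    # indices of the top-k absolute values (ties broken arbitrarily)
    top_k = np.argpartition(flat_abs, -k)[-k:]
    rows, cols = np.unravel_index(top_k, Z.shape)
    S[rows, cols] = Z[rows, cols]
    return S
```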

Theorem 1

S obtained by Eq. (15) always minimizes Eq. (14) with \(\theta \) fixed.

Proof

To optimize Eq. (14) w.r.t. S under \(\Vert S\Vert _0\le k\), we can choose at most k nonzero elements. We denote a set of subscripts of nonzero elements in S by \(H=\{(i_1,j_1),\cdots , (i_{k'},j_{k'})\}\) where \(k'\le k\). The optimality of Eq. (15) is proved by showing the following two facts.

  (i) With H fixed, the optimal S is described by \(s_{ij}=z_{ij}\) if \((i,j)\in H\) and \(s_{ij}=0\) otherwise.

  (ii) The optimal H consists of subscripts that have the top-k largest absolute values in \(\{|z_{ij}|\mid 1\le i\le N, 1\le j\le D\}\).

(i) With H fixed, the objective function Eq. (14) is

$$\begin{aligned} \sum _{(i,j)\in H} (s_{ij}-z_{ij})^2 + const. \end{aligned}$$
(16)

Note that, as \(|H|\le k\), we do not need to consider the constraint. Hence, the optimal S is clearly given by \(s_{ij}=z_{ij}\) if \((i,j)\in H\) and \(s_{ij}=0\) otherwise.

(ii) We denote the optimal S given H by \(S^*_H\). First, it is clear that, if \(H\subset H'\), \(S^*_{H'}\) achieves an objective value equal to or better than that of \(S^*_H\), because \(H'\) can make more terms zero than H. This implies that, if |H| is less than k, we can find an equivalent or better \(H'\) with k elements. Hence, without loss of generality, we can assume \(|H|=k\).

Consider an optimal \(H^{*}\). If we assume that there exists a subscript in \(H^{*}\) that does not correspond to any of the top-k absolute values, then by replacing that subscript with a subscript corresponding to any of the top-k absolute values that is not already included in \(H^{*}\), we obtain an \(H'\) that necessarily yields a lower value for the objective function. Therefore, \(H^{*}\) is not optimal, which is a contradiction. Hence, the optimal H consists of subscripts that have the top-k largest absolute values in \(\{|z_{ij}|\mid 1\le i\le N, 1\le j\le D\}\).

From the facts (i) and (ii), S that zeroes out the elements with the top-k largest absolute values in \(Z(\theta )\) with \(\theta \) fixed is the optimal matrix that minimizes the objective function value; that is, S obtained by Eq. (15) always minimizes Eq. (14) with \(\theta \) fixed. \(\square \)

Using the fact that the optimal S is described by \(s_{ij}=z_{ij}\) if \((i,j)\in H^*\) and \(s_{ij}=0\) otherwise, we rewrite our proposed formulation (14) and the alternating optimization method in an algorithmically concise form as follows:

$$\begin{aligned} \min _{A, \theta } ||A \circ Z(\theta )||_{2}^{2} \quad \mathrm{s.t.} \; ||A||_{0} \ge \mathrm{ND} - k, \end{aligned}$$
(17)

where \(A\in \{0,1\}^{N\times D}\) is a binary-valued matrix. In the alternating optimization of Eq. (17), \(\theta \) is optimized using gradient-based optimization and A is optimized using

$$\begin{aligned} a_{ij}=\left\{ \begin{array}{ll} 1 & (|z_{ij}| < c) \\ 0 & (\mathrm{otherwise}). \end{array} \right. \end{aligned}$$
(18)

If multiple elements in \(Z(\theta )\) match c, ties are broken arbitrarily so that \(||A||_{0} \ge \mathrm{ND} - k\).

The procedure of our proposed optimization algorithm is as follows:

Input: \(X \in \mathbb {R}^{N\times D}\), \(k \in [0, N\times D]\) and \(Epoch_{\max } \in \mathbb {N}\)

Initialize \(A \in \{0,1\}^{N\times D}\) as a zero matrix, the epoch counter \(Epoch = 0\) and an AE \(f_{\mathrm{AE}}(\, \cdot \, ;{\theta })\) with randomly initialized parameters.

Repeat the following \(Epoch_{\max }\) times:

1. Obtain reconstruction error matrix \(Z(\theta )\): \(Z(\theta ) = X - f_{\mathrm{AE}}(X;\theta )\)

2. Optimize A with \(\theta \) fixed:

Get threshold \(c = k\)th largest absolute value in \(Z(\theta )\) and update A using Eq. (18)

3. Update \(\theta \) with A fixed:

Minimize \(||A \circ Z(\theta )||_{2}^{2}\) using gradient-based optimization

Return the element-wise outlierness \(R \in \mathbb {R}^{N\times D}\), which is computed as follows:

$$\begin{aligned} R = (X - f_{\mathrm{AE}}(X;\theta )) \circ (X - f_{\mathrm{AE}}(X;\theta )). \end{aligned}$$
(19)

In step 3, the number of iterations of each gradient-based optimization process affects the performance of L0-AE. In practice, a single update per epoch with a gradient-based optimization method (e.g., Adam [22]) is sufficient for both the detection performance and the convergence of L0-AE (see Sect. 5). In this case, the total computational cost of L0-AE is the sum of the cost of a normal AE and that of the (partial) sort needed to obtain the kth largest error value.
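The procedure above can be written compactly as in the following full-batch PyTorch sketch. The network sizes, the single Adam update per epoch, and the tie handling are illustrative assumptions; the experiments in Sect. 5 use mini-batches and the specific architecture described in Sect. 5.2, so this is a minimal sketch of the algorithm rather than the experimental implementation.

```python
import torch
import torch.nn as nn

class SimpleAE(nn.Module):
    """A small symmetric autoencoder; the layer sizes here are illustrative."""
    def __init__(self, d, h1, h2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, h1), nn.ReLU(),
            nn.Linear(h1, h2),
            nn.Linear(h2, h1), nn.ReLU(),
            nn.Linear(h1, d),
        )

    def forward(self, x):
        return self.net(x)

def train_l0_ae(X, k, max_epoch=400, lr=1e-3):
    """Alternating optimization of Eq. (17): update A by Eq. (18) with theta fixed,
    then take one gradient step on ||A o Z(theta)||_2^2 with A fixed.
    X: torch.FloatTensor of shape (N, D); k: number of masked (outlier) elements."""
    N, D = X.shape
    model = SimpleAE(D, max(round(D ** 0.5), 1), max(round(D ** 0.25), 1))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    A = torch.zeros(N, D)                      # initialized as a zero matrix
    for _ in range(max_epoch):
        # Steps 1-2: recompute Z(theta) and the binary mask A with theta fixed
        with torch.no_grad():
            Z = X - model(X)
            if k > 0:
                c = Z.abs().flatten().kthvalue(N * D - k + 1).values  # kth largest |z_ij|
                A = (Z.abs() < c).float()
            else:
                A = torch.ones(N, D)
        # Step 3: one gradient-based update of theta with A fixed
        loss = ((A * (X - model(X))) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Element-wise outlierness R of Eq. (19)
    with torch.no_grad():
        R = (X - model(X)) ** 2
    return R, model
```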

3.3 Convergence property

In this section, we prove that our alternating optimization algorithm always converges under the assumption that AE is trained appropriately using gradient-based optimization. Here, we denote the objective function \(||A \circ Z(\theta )||_{2}^{2}\) by \(K(A, \theta )\) and the variables A and \(\theta \) at the tth step of each alternating optimization phase by \(A^{t}\) and \(\theta ^{t}\), respectively. Under this assumption, the convergence of the proposed alternating optimization method can be shown as follows:

Theorem 2

Suppose \(K(A^t, \theta ^t)\) is updated to \(K(A^{t+1}, \theta ^{t})\) using Eq. (18), and assume that \(K(A^{t+1}, \theta ^{t}) \ge K(A^{t+1}, \theta ^{t+1})\) with gradient-based optimization. Then there exists a value \(a^{\infty } \ge 0\) such that \(\lim _{t\rightarrow \infty } K(A^{t}, \theta ^{t}) = a^{\infty }\).

Proof

The A obtained by Eq. (18) minimizes Eq. (17) for any fixed \(Z(\theta )\). Hence, for any \(\theta ^{t}\), we have \(K(A^t, \theta ^t) \ge K(A^{t+1}, \theta ^{t})\). Furthermore, \(K(A^{t+1}, \theta ^{t}) \ge K(A^{t+1}, \theta ^{t+1})\) holds by assumption, which indicates that \(K(A^{t}, \theta ^{t}) \ge K(A^{t+1}, \theta ^{t+1})\). This implies that the sequence \(\{K(A^t, \theta ^{t})\}\) is monotonically non-increasing and non-negative. Therefore, by the monotone convergence theorem [4], there exists a value \(a^{\infty } = \lim _{t\rightarrow \infty } K(A^{t}, \theta ^{t}) \ge 0\).\(\square \)

Remark

The assumption \(K(A^{t+1}, \theta ^{t}) \ge K(A^{t+1}, \theta ^{t+1})\) holds when the learning rate of the AE model is sufficiently small. Although this assumption might not hold for a fixed learning rate in practice, L0-AE results in better convergence than the RDA (see Sect. 5.5). Note that, by Theorem 2, our training algorithm is stable, although it does not guarantee that the parameters converge to a stationary point.

3.4 Algorithm for structured outliers

In the following, we describe an alternating optimization algorithm for data with structured outliers along the sample dimension. To detect structured outliers, Eqs. (17) and (18) are, respectively, reformulated as follows:

$$\begin{aligned}&{\begin{matrix} \min _{A, \theta } ||A \circ (X - f_{\mathrm{AE}}(X;\theta ))||_{2}^{2}\\ \mathrm{s.t.} \; ||A||_{2,0} \ge N - k, \end{matrix}} \end{aligned}$$
(20)
$$\begin{aligned}&a_{i\cdot }=\left\{ \begin{array}{ll} 1 & (\sum \limits _{j=1}^{D} (z_{ij})^{2} < c') \\ 0 & (\mathrm{otherwise}), \end{array} \right. \end{aligned}$$
(21)

where the \(l_{2,0}\)-norm is defined as

$$\begin{aligned} ||A||_{2,0} = \sum _{i=1}^{N} || \sum _{j=1}^{D}a_{ij}^{2} ||_{0}, \end{aligned}$$
(22)

and \(c'\) is the kth largest value among the row-wise sums \(\sum _{j=1}^{D} z_{ij}^{2}\) over \(i=1,\ldots ,N\). The sample-wise outlierness \(\varvec{r}'\) is calculated using \(R = [r_{ij}]\) defined by Eq. (19) as follows:

$$\begin{aligned} {\textstyle r'_{i} = \sum \limits _{j=1}^{D}r_{ij}}. \end{aligned}$$
(23)

L0-AE uses this version of the formulation and the alternating optimization method for outlier detection.

As with the update of A using Eq. (18), the update of A using Eq. (21) always minimizes the objective function (20) with \(\theta \) fixed. The convergence of this algorithm using Eq. (21) is easily proved in a similar manner to Theorem 2.
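A NumPy sketch of the sample-wise mask update (Eq. (21)) and the sample-wise outlierness (Eq. (23)) is given below; `Z` and `R` denote the residual matrix and the element-wise outlierness, respectively, and the tie handling is again arbitrary.

```python
import numpy as np

def update_A_structured(Z, k):
    """Eq. (21): zero out the rows of A corresponding to the k rows of Z with
    the largest squared-error sums; all other rows are set to 1."""
    N, D = Z.shape
    A = np.ones((N, D))
    if k > 0:
        row_err = (Z ** 2).sum(axis=1)
        worst_rows = np.argpartition(row_err, -k)[-k:]  # indices of the top-k rows
        A[worst_rows, :] = 0.0
    return A

def sample_outlierness(R):
    """Eq. (23): sum the element-wise outlierness R along the feature dimension."""
    return R.sum(axis=1)
```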

Remark

Our optimization methodology for structured outliers can be regarded as a form of least trimmed squares (LTS) [27], in which the sum of all squared residuals except the k largest squared residuals is minimized.

4 Related work

Recently, highly accurate neural network-based anomaly detection methods, such as AE, variational AE (VAE) and generative adversarial network-based methods [3, 28, 36], have been proposed. However, they assume a different problem setting from ours; that is, the training data do not include anomalous data, and finding anomalies in test datasets is the target task. Therefore, these methods do not have a mechanism that excludes outliers during training. In [10], the equivalence of the global optimum of the VAE and RPCA is shown under the condition that a decoder has some type of affinity; however, connections between VAE and RPCA are not shown for general nonlinear activation functions.

The discriminative reconstruction AE (DRAE) [30] has been proposed for unsupervised outlier removal. The DRAE labels samples whose reconstruction errors exceed a threshold as “outliers” and excludes such samples from learning. To determine the threshold appropriately, the loss function of the DRAE has an additional term that separates the reconstruction errors of inliers and outliers. Because of this additional term, the DRAE does not solve an \(l_0\)-norm constrained optimization problem; that is, the learned manifold is affected by outlier values, which degrades the detection performance (see Sect. 5).

RandNet [9] has been proposed as a method to increase robustness through an ensemble scheme. Although this ensemble may improve robustness, it does not completely avoid the adverse effects of outliers because each AE uses a non-regularized objective function. The deep structured energy-based model [31], which is a robust AE-based method that combines an energy-based model and non-regularized AE, has the same drawback. In [19], a method that simply combines AE and LTS was proposed; however, no theoretical analysis for the combined effects of AE and LTS was presented.

5 Experimental results

5.1 Experimental settings

5.1.1 Datasets

We used both artificial and real datasets. Four types of artificial datasets were used to evaluate the detection performance against three types of well-known outliers [13, 17], namely global outliers, clustered outliers and local outliers, in addition to outliers contained in image data. Figure 1 illustrates the artificial datasets. In particular, the data in Fig. 1a have large noise, which can be regarded as an example of corrupted data. Figure 1a shows the dataset containing global outliers: We sampled 9000 inlier samples \((x,2x^2-1)\in \mathbb {R}^2\), where \(x\in [-1,1]\) was sampled uniformly. Furthermore, we sampled 1000 outliers uniformly from \([-1,1]\times [-1,1]\), which were regarded as corrupted samples. Figure 1b shows the two-dimensional dataset containing clustered outliers: We sampled 9000 inlier samples in the same manner as in Fig. 1a. Furthermore, we generated 1000 outliers from a normal distribution with mean (-0.1087, 0.6826) (randomly generated central coordinates) and standard deviation 0.04. Figure 1c shows the two-dimensional dataset containing local outliers, generated by referring to [17]. We sampled a total of 9750 inlier samples: half were sampled uniformly around the center (0.5, 0.5) with distance \(r \in [0, 0.5]\) and angle \(rad \in [0, 2\pi ]\), and the other half were sampled uniformly around the center (-0.5, -0.5) with \(r \in [0.4, 0.5]\) and \(rad \in [0, 2\pi ]\). Furthermore, we generated 250 outliers uniformly within a radius of 0.35 from the center (-0.5, -0.5). Figure 1d shows a part of the image dataset, generated by referring to [34]. This dataset was generated using “MNIST”: the inliers were 4750 randomly selected “4” images, and the outliers were 250 randomly selected images that were not “4.”
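For reproducibility, the following sketch shows one way to generate the global-outlier dataset of Fig. 1a; the random seed and the exact sampling code are our assumptions, and only the distributions follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# 9000 inliers on the parabola (x, 2x^2 - 1), with x ~ U[-1, 1]
x = rng.uniform(-1.0, 1.0, size=9000)
inliers = np.stack([x, 2.0 * x ** 2 - 1.0], axis=1)

# 1000 outliers sampled uniformly from [-1, 1] x [-1, 1] (corrupted samples)
outliers = rng.uniform(-1.0, 1.0, size=(1000, 2))

X = np.vstack([inliers, outliers])
y = np.concatenate([np.zeros(9000), np.ones(1000)])  # 1 = outlier label
```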

Fig. 1 Artificial datasets: black/red points are inliers/outliers in (a)–(c)

Table 2 Summary of the real datasets

As real datasets, we used 11 datasets from Outlier Detection Datasets (ODDS) [25], which are commonly used as benchmarks for outlier detection methods. Additionally, we used the “kdd99_rev” dataset introduced in [36]. Table 2 summarizes the 12 datasets. Before the experiments, we normalized each feature dimension of the datasets to the range of -1 to 1.

5.1.2 Evaluation method

Following the evaluation methods in [9, 23, 24], we compared the area under the receiver operating characteristic curves (AUCs) of the outlier detection accuracy. The evaluation was performed as follows: (1) all samples (whether inlier or outlier) were used for training; (2) the outlierness of each sample was calculated after training; and (3) the AUCs were calculated using outlierness and inlier/outlier labels. Note that we did not need to specify the detection threshold in this evaluation scheme.
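Concretely, given the sample-wise outlierness scores obtained after training and the ground-truth inlier/outlier labels, step (3) can be computed as in the following sketch using scikit-learn.

```python
from sklearn.metrics import roc_auc_score

def evaluate_auc(outlierness, labels):
    """AUC of outlier detection: labels are 1 for outliers and 0 for inliers,
    outlierness is the sample-wise score (e.g., Eq. (23)) computed after training."""
    return roc_auc_score(labels, outlierness)

# Example: auc = evaluate_auc(sample_outlierness(R), y)
```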

5.2 Methods and configurations

5.2.1 Robust PCA (RPCA)

We used the RPCA implemented in [11] as a baseline linear method.

5.2.2 Normal autoencoders (N-AE)

We implemented N-AE with the loss function \(||X - f_{\mathrm{AE}}(X;\theta )||_{2}^{2}\) as the baseline non-regularized detection method. For every AE-based method below, we used common network settings. We used three fully connected hidden layers (with a total of five layers), in which the number of neurons was \(ceil([D, D^{\frac{1}{2}}, D^{\frac{1}{4}}, D^{\frac{1}{2}}, D])\) from the input to the output unless otherwise noted. These were connected as [input layer] \(\rightarrow \) [hidden layer1: linear + relu] \(\rightarrow \) [hidden layer2: linear] \(\rightarrow \) [hidden layer3: linear + relu] \(\rightarrow \) [output layer: linear], where “linear” means that the layer is a fully connected layer with no activation function and “linear + relu” means that the layer is a fully connected layer with the rectified linear unit function. We set the mini-batch size to N/50 and applied Adam [22] (\(\alpha =0.001\)) for optimization with \(Epoch_{\max } = 400\). We used Eq. (23) to calculate the sample-wise outlierness. To prevent undue advantages to our method (L0-AE) and the other AE-based methods, we searched this architecture by maximizing the average AUC of N-AE.
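The following sketch is one way to build this common architecture in PyTorch (the original implementation uses Chainer [8]); the construction reflects our reading of the description above and is not the authors' code.

```python
import math
import torch.nn as nn

def build_common_ae(D):
    """[input] -> [linear + relu] -> [linear] -> [linear + relu] -> [linear output],
    with neuron counts ceil([D, D^(1/2), D^(1/4), D^(1/2), D])."""
    h1 = math.ceil(D ** 0.5)   # D^(1/2) neurons
    h2 = math.ceil(D ** 0.25)  # D^(1/4) neurons (bottleneck, no activation)
    return nn.Sequential(
        nn.Linear(D, h1), nn.ReLU(),
        nn.Linear(h1, h2),
        nn.Linear(h2, h1), nn.ReLU(),
        nn.Linear(h1, D),
    )
```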

5.2.3 Robust deep autoencoders (RDA)

We implemented the RDA [34] with Eq. (10). To make the number of loops equal to that of the other AE-based methods, the parameter inner_iteration, which is the number of iterations required to optimize the AE during one execution of the \(l_1\)-norm optimization, was set to 1. We varied the \(\lambda \) values in 10 ways, \(\{ 1\times 10^{-5}, 2.5\times 10^{-5}, 5\times 10^{-5}, 1\times 10^{-4}, 2.5\times 10^{-4}, 5\times 10^{-4}, 1\times 10^{-3}, 2.5\times 10^{-3}, 5\times 10^{-3}, 1\times 10^{-2} \}\), for each experiment with reference to [34] and adopted the result with the highest average AUC.

5.2.4 RDA-Stbl

This baseline was used to confirm the effect of the \(l_0\)-norm constraint of L0-AE. There are three differences between the objective functions of L0-AE and the RDA: (1) the \(l_0\)-norm constrained problem versus the \(l_1\)-norm regularized problem; (2) for the input to AEs, X versus \(X - S\); and (3) in the first term of the objective function, \(||\cdot ||_{2}^{2}\) versus \(||\cdot ||_{2}\). To bridge gaps (2) and (3), we introduce RDA-Stbl, which minimizes the objective function \(||L_\mathrm{D} - f_{\mathrm{AE}}(X;\theta )||_{2}^{2} + \lambda ||S^{T}||_{2,1}\) such that \(X-L_\mathrm{D} - S = 0\), w.r.t. S and \(\theta \). By comparing the results of this model with those of L0-AE, we can confirm the effects that result from the difference between the \(l_0\)-norm constraint and \(l_1\)-norm regularization. We varied the \(\lambda \) values in 10 ways using the same values as those in the RDA and adopted the result with the highest average AUC.

5.2.5 L0-norm constrained autoencoders (L0-AE)

We used L0-AE for structured outliers (described in Sect. 3.4). The sample-wise outlierness of L0-AE was calculated using Eq. (23). We performed only a single parameter update of the AE in each gradient-based optimization step. Instead of k, we used \(C_\mathrm{p} = k / N \, (0\le C_\mathrm{p} \le 1)\), which is k normalized by the number of samples. This type of normalized parameter is often used in other methods such as the one-class support vector machine (OC-SVM) [29] and isolation forest (IForest) [24]. Note that L0-AE is equivalent to N-AE when \(C_\mathrm{p} = 0\). We varied \(C_\mathrm{p}\) from 0.05 to 0.5 in 0.05 increments for each experiment and adopted the result with the highest average AUC.

5.2.6 Variational autoencoder (VAE)

We adapted VAE to our problem setting. Outlierness was computed using the reconstruction probability [3]. Note that the number of output dimensions of hidden layers 1 and 3 is twice that of the other AE-based methods.

5.2.7 Discriminative reconstruction autoencoder (DRAE)

We chose \(\lambda \), which determines the weight of the term in the objective function for separating the inlier and outlier reconstruction errors (see Sect. 4), as the parameter for the parameter search. We varied the \(\lambda \) values in 10 ways, \(\{ 1\times 10^{-3}, 2.5\times 10^{-3}, 5\times 10^{-3}, 1\times 10^{-2}, 2.5\times 10^{-2}, 5\times 10^{-2}, 1\times 10^{-1}, 2.5\times 10^{-1}, 5\times 10^{-1}, 1\times 10^{0} \}\), for each experiment with reference to [30] and adopted the result with the highest average AUC.

We used Chainer (version 1.21.0) [8] for the implementation of the above AE-based methods. Additionally, we applied the following three conventional methods for a comparison of the AUCs for the real datasets.

5.2.8 One-class support vector machine (OC-SVM)

We used the OC-SVM implemented in scikit-learn and set kernel = “rbf.”

5.2.9 Local outlier factor (LOF)

We used the LOF [5] implemented in scikit-learn and set “k” for the k-nearest neighbors to 100.

5.2.10 Isolation forest (IForest)

We used the IForest from the Python library pyod [33] with “n_estimators” (the number of trees) set to 100.

We tuned the above-mentioned parameters to achieve a high AUC on average over all the real datasets; for the other parameters, we used the recommended (default) values, unless otherwise noted.
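For reference, the three conventional baselines with the settings above can be instantiated as in the following sketch using scikit-learn and pyod; the remaining default arguments and the score conventions noted in the comments are assumptions about these libraries' standard interfaces.

```python
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from pyod.models.iforest import IForest

ocsvm = OneClassSVM(kernel="rbf")            # OC-SVM with RBF kernel
lof = LocalOutlierFactor(n_neighbors=100)    # LOF with k = 100
iforest = IForest(n_estimators=100)          # IForest with 100 trees

# Sample-wise outlierness for a data matrix X (higher = more outlying):
# OC-SVM:  -ocsvm.fit(X).decision_function(X)
# LOF:     -lof.fit(X).negative_outlier_factor_
# IForest:  iforest.fit(X).decision_scores_
```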

Table 3 Average measurements from reconstruction-based methods for global outliers
Fig. 2 Distributions of outlierness for each method in the first run for global outliers: sample IDs of inliers and outliers are 1\(\sim \)9000 and 9001\(\sim \)10,000, respectively

5.3 Evaluation results for artificial datasets

5.3.1 Results for the global outlier dataset

First, we evaluated the AUCs and robustness against the outliers of L0-AE and the baseline reconstruction-based methods using the global outlier dataset that has highly corrupted samples (Fig. 1a). We compared the average AUC, in addition to the average outlierness of inlier samples \(O_{\mathrm{avg}}^i\), average outlierness of outlier samples \(O_{\mathrm{avg}}^o\) and ratio \(O_{\mathrm{avg}}^o / O_{\mathrm{avg}}^i\) (a higher value implies that fewer outliers are close to a low-dimensional manifold). In this experiment, because \(D=2\), we could not set the number of neurons and parameters as mentioned in Sect. 5.2; instead, for N-AE to achieve a high AUC, we used [2, 100, 1, 100, 2], which were obtained empirically.

Table 3 presents the average of these measurements over 20 trials with different initial network weights, and Fig. 2 shows the distributions of outlierness for each method, except VAE, in the first run. Note that “params” in Table 3 indicates the parameters with which each method achieved the highest average AUC. As VAE uses the probability as outlierness, only the AUC is included in Table 3 for VAE. These results show that L0-AE outperformed the other methods in terms of both the AUC and the separation of outlierness between inliers and outliers (i.e., the sparsity of the error matrix).

RPCA performed significantly poorer than the other methods because of its linearity. Among the AE-based methods, L0-AE performed best. In L0-AE, we observed that the learned manifold was almost entirely composed of inliers. Therefore, it can be confirmed that the \(l_0\)-norm constraint of L0-AE functions as intended, and L0-AE can learn by almost completely eliminating the influence of corrupted samples while capturing nonlinear features. By contrast, the performances of the other AE-based methods were inferior to that of L0-AE because the other methods cannot completely exclude the influence of corrupted outliers. VAE was less accurate than the other AE-based methods; we consider that VAE was unable to demonstrate robustness because of the non-affinity of the decoder. For the DRAE, the reconstruction errors of inliers and outliers were relatively well separated, but the DRAE was more strongly affected by outliers than L0-AE because the DRAE objective function depends on how large outliers are, whereas the L0-AE objective function does not.

Fig. 3 Parameter sensitivity of L0-AE for the global outlier datasets with 10 outlier rates; each error bar represents a 95% confidence interval, gray vertical lines indicate the true outlier rates of each dataset, and gray horizontal lines indicate the AUCs of a normal AE (\(C_\mathrm{p}=0\)) for each dataset

Next, we evaluated the parameter sensitivity of L0-AE using the artificial dataset with global outliers (Fig. 1a). We used 10 different datasets in which the outlier rate was varied over \(\{0.05, 0.1, 0.15,\ldots , 0.45, 0.5\}\). The parameter sensitivity was evaluated based on the change in the average AUC over 10 trials with different random seeds when \(C_\mathrm{p}\) was changed from 0.05 to 0.55 in 0.05 increments for each dataset.

Figure 3 shows the AUCs with different \(C_\mathrm{p}\) values for each dataset. From the results, we confirmed that the AUCs remained above 80% even when \(C_\mathrm{p}\) was changed substantially. Therefore, the quality of the distribution of outlierness obtained by L0-AE is robust to changes in \(C_\mathrm{p}\). For all results, the maximum AUC values occurred at \(C_\mathrm{p}\) values moderately greater than the true outlier rates. When \(C_\mathrm{p}\) is moderately greater than the true outlier rate, the outliers are safely excluded from reconstruction, which we believe is why the highest AUCs were achieved there. When \(C_\mathrm{p}\) was less than the true outlier rate, the AUCs were still better than those of N-AE (\(C_\mathrm{p} = 0\)) because some outliers were not reconstructed. Therefore, in practical use, when the true outlier rate can be approximately estimated by a domain expert, it is appropriate to set \(C_\mathrm{p}\) to a value moderately greater than the estimated outlier rate. On the basis of this experiment and the experiments on various datasets presented below, we recommend setting \(C_\mathrm{p}\) to a value approximately 0.1 greater than the estimated outlier rate. By contrast, the parameters of the other robustified reconstruction-based methods are difficult to set even when the true outlier rate is accurately estimated by a domain expert, because they are abstract in the sense that they control the ratio between the reconstruction error and the regularization term.

Table 4 Average measurements from reconstruction-based methods for three types of outliers
Fig. 4 Distributions of outlierness for clustered outliers in the first run: sample IDs of inliers and outliers are 1\(\sim \)9000 and 9001\(\sim \)10,000, respectively

Fig. 5 Distributions of outlierness for local outliers in the first run: sample IDs of inliers and outliers are 1\(\sim \)9750 and 9751\(\sim \)10,000, respectively

Fig. 6 Distributions of outlierness for image outliers in the first run: sample IDs of inliers and outliers are 1\(\sim \)4750 and 4751\(\sim \)5000, respectively

5.3.2 Results for the clustered, local and image outlier datasets

In the following, we evaluate L0-AE for the artificial datasets other than the global outlier dataset. First, we compare the AUCs and the distributions of outlierness. Table 4 presents the average of the measurements over 20 trials with different random seeds and the parameters for achieving the maximum average AUC for the three datasets (Figs. 1b–d). For the clustered and local outlier datasets, we used the same number of neurons used for the global outlier dataset. Figures 4, 5 and 6 show the distributions of outlierness for each method in the first run under the best parameters for the clustered, local and image outlier datasets, respectively.

For the clustered outlier dataset, we can see that L0-AE completely separated the distributions of outlierness of inliers and outliers. This is likely because the appropriate \(C_\mathrm{p}\) setting and the \(l_0\)-norm constraint allowed the manifold to be learned while excluding the effects of all clustered outliers. By contrast, the other methods had a lower average AUC than L0-AE because their learned manifolds were affected by the outliers as a result of not using the \(l_0\)-norm constraint.

Fig. 7 Parameter sensitivity of L0-AE for the clustered, local and image outlier datasets; each error bar represents a 95% confidence interval, gray vertical lines indicate the true outlier rates of each dataset, and gray horizontal lines indicate the AUCs of a normal AE (\(C_\mathrm{p}=0\)) for each dataset

For the local and image outlier datasets, our L0-AE again had the best performance measures, as shown in Table 4, although the degree of improvement was smaller than that for the clustered outlier dataset. A possible reason for this is that there was little room for improvement for the other robustified AE-based methods because the network structure, which was set up for fairness to maximize the accuracy of the unconstrained N-AE, was not suitable for the other methods. Thus, we verified that L0-AE can learn nonlinear manifolds better than the other reconstruction-based methods in all cases while avoiding the influence of outliers.

Next, we evaluated the parameter sensitivity of L0-AE for the three datasets (Figs. 1b–d). Figure 7 shows the AUCs with different \(C_\mathrm{p}\) values for each dataset. As with the global outlier dataset, the AUCs were moderately robust to changes in \(C_\mathrm{p}\), and the maximum AUC values were achieved at \(C_\mathrm{p}\) values moderately greater than the true outlier rates. A similar trend was observed for all well-known outlier types, confirming the utility of L0-AE. The development of an automatic method for finding the optimal \(C_\mathrm{p}\) under the \(l_0\)-norm constraint without ground-truth outlier rates or labels is important future work. For the image outlier dataset, the AUC did not decrease as \(C_\mathrm{p}\) increased. Given the high AUC of RPCA, a linear method, on the image dataset, it is likely that the “4” images in MNIST are confined to a linear subspace and well separated from the images of the other classes. Therefore, even though a high \(C_\mathrm{p}\) reduced the number of inlier samples to be reconstructed, L0-AE could still learn manifolds on which the entire set of inlier samples was reconstructed successfully.

Table 5 Comparison of AUCs [%] with standard deviation
Fig. 8 Parameter sensitivity of L0-AE for real datasets; each error bar represents a 95% confidence interval, gray vertical lines indicate the true outlier rates of each dataset, and gray horizontal lines indicate the AUCs of a normal AE (\(C_\mathrm{p}=0\)) for each dataset

Fig. 9 Examples of the transition of the values of the objective functions from RDA and L0-AE

Table 6 Number of epochs in which the objective function value increased over the previous epoch (summed over all datasets)

5.4 Evaluation results for real datasets

First, we compared the AUCs for the real datasets. The AUC values were averaged over 50 trials with different random seeds. For the RDA, we used \(\lambda =5.0\times 10^{-5}\); for RDA-Stbl, \(\lambda =5.0\times 10^{-4}\); for the DRAE, \(\lambda =1.0\times 10^{-1}\); and for L0-AE, \(C_\mathrm{p}=0.3\). Table 5 presents the average AUCs for each dataset; Avg. AUC, Avg. rank and Avg. time refer to the average AUC, the average rank over the datasets and the average run-time, respectively.

L0-AE achieved the highest average AUC and average rank. Among the reconstruction-based methods, L0-AE achieved the highest AUCs for eight out of 12 datasets. In particular, on kdd99_rev, the AUC of L0-AE was considerably higher than those of the other AE-based methods. Because kdd99_rev has a high rate of outliers and they are distributed close to each other, the methods with \(l_1\)-norm regularization or no regularization could not avoid reconstructing the outliers, whereas L0-AE almost completely avoided reconstructing them because of its \(l_0\)-norm constraint. Furthermore, we observed that the AUCs of the RDA and RDA-Stbl were nearly equal; that is, the performance gap between L0-AE and these methods stems from the \(l_0\)-norm constraint rather than from the other differences in formulation, which shows the importance of the \(l_0\)-norm constraint. L0-AE outperformed the DRAE on average; we consider that L0-AE selectively reconstructs only the inliers, whereas the DRAE reconstructs inliers while reducing the variance of the reconstruction errors within each label, thereby allowing outliers to affect the manifold. Additionally, the computational cost of the DRAE was higher than that of L0-AE because of the calculation of the threshold. For VAE, training was unstable for some datasets. One possible reason is that VAE involves random sampling in the reparameterization trick, which increased the randomness of the results under these experimental settings. By contrast, among the AE-based methods, L0-AE achieved stable results. The RPCA results were relatively good for some datasets, which suggests that these datasets have linear features for which \(l_0\)-norm regularization works well; L0-AE performed well even for the other datasets by capturing nonlinear features. The reason that RPCA outperformed some AE-based methods on average is that RPCA automatically detects the rank of the inlier subspace, whereas the AE-based methods have a fixed latent dimension (there is no known method for obtaining an appropriate latent dimension in an unsupervised setting).

Next, we evaluated the parameter sensitivity of L0-AE using real datasets. Figure 8 shows the AUCs with different \(C_\mathrm{p}\) values for L0-AE (averaged over 50 trials). For most \(C_\mathrm{p}\), the AUCs of L0-AE were higher than those of the normal AE (\(C_\mathrm{p}=0\)), that is, the baseline of the non-robustified method. Therefore, even when the appropriate \(C_\mathrm{p}\) is unknown, L0-AE is still practical enough to be used. Additionally, as in the case of the artificial datasets, the maximum AUC values essentially occurred at \(C_\mathrm{p}\) values moderately greater than the true outlier rates. Therefore, in practical terms, it is easier to select an appropriate \(C_\mathrm{p}\) if the true outlier rate can be estimated.

5.5 Evaluation of convergence

We compared the convergence of L0-AE with that of the RDA. Here, we did not use mini-batch training to remove the effect of randomness. Table 6 presents the sum of the number of epochs in which the value of the objective function increased over the previous epoch during 20 trials (96,000 epochs in total). The results of N-AE are also included for reference. Additionally, Fig. 9 shows two transition examples of the values of the objective functions of the RDA and L0-AE. Among them, Fig. 9d shows the results of the only trial in which the objective function increased in L0-AE; the epochs in which the objective function increased were 294–302 when \(C_\mathrm{p} = 0.3\), with an average increase of 0.23, which is considerably less than the value of the objective function. In Table 6 and Fig. 9, we observe that L0-AE converged well, regardless of the parameter \(C_\mathrm{p}\), unlike the RDA. This empirically demonstrates the validity of Theorem 2, which states that our alternating optimization algorithm converges when the gradient-based optimization behaves ideally. For the RDA, when \(\lambda \) was small, the value of the objective function was unstable, but when \(\lambda \) was large, it was stable. We believe that this is because the sparsity of S improves as \(\lambda \) increases, thus reducing the gap in the objective function between phases of alternating optimization. We observe that, with N-AE, the values of the objective function did not increase, which implies that our gradient-based optimization generally satisfies the assumption in Theorem 2.

6 Conclusion

In this paper, we proposed L0-AE for unsupervised outlier detection. L0-AE nonlinearly decomposes data into a low-dimensional manifold that captures the nonlinear features of the data and a sparse error matrix under an \(l_0\)-norm constraint. We proposed an efficient alternating optimization algorithm for training L0-AE and proved that this algorithm converges under a mild condition. We conducted extensive experiments with real and artificial datasets and confirmed that L0-AE is highly robust to outliers. We also confirmed that this high robustness leads to higher outlier detection accuracy, measured by the AUC, than that of existing unsupervised outlier detection methods.