Bounding open space risk with decoupling autoencoders in open set recognition

One-vs-Rest (OVR) classification aims to distinguish a single class of interest (COI) from other classes. The concept of novelty detection and robustness to dataset shift becomes crucial in OVR when the scope of the rest class is extended from the classes observed during training to unseen and possibly unrelated classes, a setting referred to as open set recognition (OSR). In this work, we propose a novel architecture, namely the decoupling autoencoder (DAE), which provides a proven upper bound on the open space risk and minimizes open space risk via a dedicated training routine. Our method is benchmarked within three different scenarios, each isolating different aspects of OSR, namely plain classification, outlier detection, and dataset shift. The results conclusively show that DAE achieves robust performance across all three tasks. This level of cross-task robustness is not observed for any of the seven potent baselines from the OSR, OVR, outlier detection, and ensembling domains, which, apart from ATA (Lübbering et al., From imbalanced classification to supervised outlier detection problems: adversarially trained auto encoders. In: Artificial neural networks and machine learning—ICANN 2020, 2020), tend to fail on at least one of the tasks. Similar to DAE, ATA is based on autoencoders and leverages the reconstruction error to predict the inlierness of a sample. However, unlike DAE, it does not provide any uncertainty scores and therefore lacks rudimentary means of interpretation. Our adversarial robustness and local stability results further support DAE's superiority in the OSR setting, emphasizing its applicability in safety-critical systems.


Introduction
Deep neural networks (DNNs) have achieved state-of-the-art classification results in diverse domains such as natural language understanding [73], computer vision [27], and speech recognition [39]. Despite these successes, DNNs tend to provide overconfident predictions [3] and to fail when exposed to data from unseen distributions.
This issue often goes unnoticed, as state-of-the-art classification results are primarily obtained in extremely controlled benchmark environments with an inherent closed world assumption, raising the question of applicability to real-world scenarios [28,29,37,53]. Specifically, DNNs tend to generalize well only within the concepts they were trained on and tend to provide incorrect predictions with exaggerated confidence when exposed to samples from unseen distributions [3,21,28,29,37,51], jeopardizing model robustness. As visualized in Fig. 1b, the multi-layer perceptron (MLP) learns to separate the two half-moons and XOR circles but fails to generalize to the uniform noise. Note that the MLP and MiMo [24] models were trained via empirical risk minimization (ERM) to separate the two classes, so there is no preference as to which class the outliers should be attributed. However, we would expect a well-generalizing model to be uncertain about these samples and not assign them to one of the classes with high confidence.

Fig. 1 MLP, MiMo [24], and DAE predictions visualized as contours: in contrast to MiMo and MLP, DAE learns a hull around the red class of interest (COI) for the three datasets, enabling the method to not only separate the two inlier classes but also to reject the unseen uniform noise. MLP and MiMo only wrap the inliers if the rest samples encourage such a decision boundary, as shown for the Bounding Gaussians dataset.

In this work, we consider open set recognition (OSR) as the generalized version of one-vs-rest (OVR) classification [5], in which the rest samples stem not only from known classes but also from different unknown sources of outlier-generating processes [6,62]. Therefore, the aforementioned deficiencies of DNNs also pose a severe threat to OSR. As suggested by Scheirer et al. [62], these deficiencies can be alleviated by framing the optimization problem in OSR as a combination of ERM and open space risk minimization.
The open space is defined as the set of points that have at least distance d to any of the inlier samples, i.e., the closed space. The open space risk is then defined as the relative measure of the open space volume of false inliers over the closed space volume classified as inliers (i.e., true inliers), as illustrated in Fig. 2. Consequently, minimizing the open space risk forces the model to learn a hull around the inliers leading to increased model robustness.
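This definition of the open space risk lends itself to a numerical approximation. The following sketch (not from the paper; the recognition functions, sampling box, and all names are illustrative) estimates the risk via Monte-Carlo sampling and shows that a hull-shaped acceptance region keeps the risk small, whereas an unbounded half-space does not:

```python
import numpy as np

def open_space_risk(recognizer, inliers, d, box=10.0, n_samples=20_000, seed=0):
    """Monte-Carlo estimate of the open space risk: volume accepted as inlier
    inside the open space (points farther than d from every inlier) relative
    to the accepted volume inside the closed space."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(-box, box, size=(n_samples, inliers.shape[1]))
    # distance of each probe point to its nearest inlier sample
    dists = np.min(np.linalg.norm(pts[:, None, :] - inliers[None, :, :], axis=-1), axis=1)
    accepted = recognizer(pts)                   # boolean: classified as inlier
    open_vol = np.sum(accepted & (dists >= d))   # false inliers in the open space
    closed_vol = np.sum(accepted & (dists < d))  # true inliers in the closed space
    return open_vol / max(closed_vol, 1)

# A radial "hull" recognizer has small risk; a half-space (hyperplane) does not.
inliers = np.random.default_rng(1).normal(0.0, 1.0, size=(100, 2))
hull = lambda p: np.linalg.norm(p, axis=1) < 3.0   # bounded acceptance region
plane = lambda p: p[:, 0] > -1.0                   # unbounded acceptance region
print(open_space_risk(hull, inliers, d=2.0), open_space_risk(plane, inliers, d=2.0))
```

The hull's risk stays low because its acceptance region is confined to the neighborhood of the inliers, while the half-space accepts an arbitrarily large part of the open space.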
The difference between ERM and ERM in conjunction with open space risk minimization is illustrated in Fig. 1. Due to its bounded open space risk, DAE learns a hull around the inlier data, whereas the MLP assigns an infinite space to the inlier class, resulting in infinite open space risk. In practice, this problem is often mitigated by the long-established background class setup [41], in which all rest classes are subsumed within a single background class [19,57]. This approach can be an effective measure to learn a working OSR model, as illustrated in Fig. 1h. On the downside, this approach is also highly data-intensive, rendering it often infeasible and error-prone, while still leaving the open space risk unbounded.
In OSR, we usually deal with significant class imbalance. When the focus lies on filtering a single, narrow class of interest (COI), similar to the needle-in-the-haystack problem, this inlier class tends to be underrepresented. When the focus shifts toward outlier detection, for instance in the case of computer virus detection, inlier samples outnumber instances of the rest class. Further, only some classes of the problem domain are usually known for the set of rest classes (RCs), and outliers or dataset shifts are only witnessed at test time, making OSR a highly imbalanced, semi-supervised setting. Throughout the paper, we refer to COI samples as inliers.

To measure model robustness in isolation, we subdivide the OSR task into three disjoint subtasks by gradually increasing the scope of RC: (1) OVR classification task T_c: the model is evaluated on the COI and the RCs it was trained on. (2) Contextual outlier detection task T_o: comprises the evaluation on the COI and conceptually related RCs of T_c. In practice, the rest samples stem from the same dataset, but from RCs the model was not trained on. Consistent with the literature, we define samples originating from RC in this case as contextual outliers, since they are possibly generated from a completely different underlying mechanism [1,2,25]. (3) Dataset shift task T_d: in this case, rest samples stem from a new dataset, equivalent to the evaluation approach in [29]. By extending the scope of RC in tasks T_o and T_d to rest classes unseen during training, aspects of outlier detection and robustness to dataset shift become predominant. In fact, tasks T_o and T_d can be seen as semi-supervised one-class classification (OCC): in this semi-supervised setting, the model is only exposed to COI samples during training and is supposed to learn to reject samples that deviate from the COI representation, i.e., outliers [48,60].
With our three-task experiment design, we are therefore able to bridge the gap between OVR and OCC within OSR and precisely pinpoint the classification and robustness performance for each algorithm.
Due to the widespread application of OSR in safety-critical environments such as medical diagnosis [9,64,69], fraud detection [13,55,76] and intrusion detection [18,33], the extension of deep learning methods toward generalization and robustness with open space risk regularization is of the essence.
To this end, we propose the decoupling autoencoder (DAE) method, a novel autoencoder-based architecture that learns a radial basis function (RBF) kernel mapping the reconstruction error to class probabilities. The reconstruction error as a measure of outlierness is learned by a novel adversarial loss function that separates inliers from rest samples in reconstruction error space and is based on gradient ascent as suggested by Lübbering et al. [45]. The inlier and outlier distributions are separated by a decision boundary that is optimized end-to-end to be as close as possible to the inlier distribution. Thus, observed and unobserved rest samples can be effectively rejected, resulting in increased robustness. While the related ATA method applies gradient ascent to samples irrespective of their reconstruction error and estimates the decision boundary offline in a brute-force fashion, the DAE loss function scales down the reconstruction error with increasing distance from the decision boundary. Therefore, samples close to the decision boundary are heavily optimized for better separation; hence the term "decoupling" in DAE. Furthermore, ATA only provides binary predictions, therefore lacking rudimentary means of interpretation. In contrast, DAE yields subjective probability scores on the inlierness of a sample, substantially improving the model's interpretability, especially when deployed in safety-critical environments.
Generally, our method combines the merits of MLPs succeeding on classification tasks and autoencoder methods with their strong robustness. Further, we prove the existence of an upper bound on the open space risk for DAE and leverage this insight to actively minimize open space risk within DAE's loss function. Throughout a variety of experiments, we empirically show its benefits by outperforming the two most prominent OSR methods C2AE [52] and OpenMax [4], outlier detection methods ATA [45] and one-class autoencoder (OCA), ensemble method MiMo [24], and one-vs-rest MLP.
Since this is an extension paper to Lübbering et al. [43], the previous contributions are summarized as follows:
• We provide a novel framework for evaluating OSR w.r.t. classification, outlier detection, and dataset shift, partially based on earlier work by Hendrycks and Gimpel [28]. With this setup, we are able to show that the three subtasks T_c, T_o, and T_d of OSR pose different challenges for deep learning methods.
• We propose the new method DAE, which provides robust results across all subtasks of OSR. This kind of robustness across all three tasks of OSR is not observed for any of the four baselines MLP, MiMo, ATA, and OCA. In contrast to the autoencoder-based baselines ATA and OCA, DAE can be trained in an end-to-end fashion by alleviating the vanishing gradient problem [67]. Additionally, DAE provides probability scores for the COI, which is an important feature for deployment in safety-critical systems.
• While our findings suggest that DAE, MLP, and MiMo become uncalibrated w.r.t. the T_d and T_o settings, we find indications that calibration improvements on T_c will generalize better to T_d and T_o for DAE than for MiMo and MLP.
This paper comprises the following additional contributions: While the previous paper focused on OVR, we examine DAE in light of the OSR framework in this paper and prove its upper bound on the open space risk. The benefits of the bound and DAE's ability to minimize open space risk are empirically verified on a range of text and image classification datasets with a varying number of inlier classes. These experiments reveal the resulting robustness gain and pinpoint DNNs' deficiencies with their lack of a bounded open space risk. We additionally benchmark DAE against recent OSR methods C2AE and OpenMax and show its superiority from an intuitive and empirical point of view.
Further, we visually compare the reconstruction differences on two image datasets between DAE and the autoencoder-based baselines and explain the benefits of the adversarial loss function. Moreover, we show DAE's robustness w.r.t. local stability in feature space and adversarial robustness using the FGSM method [21]. Finally, we provide an ablation study empirically verifying that only the combination of all loss terms yields the desired classification and robustness properties.
With its added theoretical foundation and empirical verification, DAE becomes a promising candidate for deployment in safety-critical systems.

Related work
The OVR classification strategy is often applied to binary models in order to extend them to multi-class classification [5]. Naturally, there is no necessity for OVR classification for DNNs, since they already support multi-class classification by design. However, there are common situations where OVR becomes relevant, e.g., if faced with only positively labeled samples while all remaining samples with potentially unknown sources are assigned to a single negative class [16,31,66], or if the goal, as in this paper, is to filter normal samples from abnormal ones [6]. As motivated previously, vanilla MLPs are unsuitable in this case due to their infinite open space risk and consequential robustness deficiencies toward outliers [6]. While in modern architectures researchers often try to circumvent this problem by subsuming all rest classes within a single background class [19,57], this effectively only reduces but does not solve the unbounded open space risk [6]. However, there have been a few attempts to bound the open space risk in DNNs. For instance, Bendale and Boult [4] and Rudd et al. [59] leverage extreme value theory (EVT) to determine a compact abating probability model based on the deep features of the full network outputs. Notably, both approaches are offline and are therefore not involved in training the network. The autoencoder-based approach C2AE proposed by Oza and Patel [52] also uses EVT to determine the decision boundary in reconstruction error space and requires at least two inlier classes. Other approaches try to bound open space risk by using tent activation functions [58], which have been shown to increase robustness to adversarial attacks while potentially compromising classification performance. In conclusion, as stated in a recent survey, OSR is still largely unsolved within the deep learning domain [6].
Other contributions focus on calibration, as DNNs tend to provide wrong predictions with overly high confidence estimates for out-of-distribution samples [23,37]. Since OVR classification incorporates outliers when extended to OCC, this issue is also prominent in OVR classification. Several methods have been proposed: either by adding a calibration task to the model, which aligns it with the target probability distribution [12], or by incorporating diverse predictions from ensemble methods such as deep ensembles [37] and MiMo [24]. These methods combine multiple neural networks as weak classifiers, whose diverse outputs are aggregated to well-calibrated predictions, thus compensating for overly confident predictions. Conversely, as pointed out by Van Amersfoort et al. [67] and also supported by our findings, the diversity of the weak classifiers within the ensemble methods is not strong enough to generalize well to out-of-distribution samples. Instead, Van Amersfoort et al. propose the kernel-based method DUQ, which learns centroids of classes in a lower-dimensional space [67]. Uncertainty is measured as the distance from the class centroids to out-of-distribution points.
In contrast to Van Amersfoort et al. [67], our DAE method incorporates radial basis functions to estimate the outlierness of a sample via the distance to its reconstruction and therefore does not require any centroid updating routines. Further, ATA [45] optimizes the decision boundary w.r.t. F1 score in an offline brute-force line search in reconstruction error space, which does not actively minimize the open space risk. For DAE, we designed a customized loss function that allows learning the decision boundary end-to-end and minimizes open space risk. Furthermore, it forces the classes to be more separated in reconstruction error space than ATA's adversarial loss function.

Decoupling autoencoders
Similar to existing autoencoder-based approaches for outlier detection [8,26,45], the decoupling autoencoder (DAE) method learns the outlierness of a sample via its reconstruction error. Existing approaches estimate the decision boundary via brute-force algorithms [1,45] or learn the decision boundary via a subsequent downstream layer [44]. In contrast, DAE learns the decision boundary end-to-end while optimizing for a pessimistic decision boundary that is as close as possible to the inlier samples without compromising generalization performance. Thus, the decision boundary's open space risk is actively minimized, a favorable setting in safety-critical systems.
From an architectural point of view, as displayed in Fig. 3, the network reconstructs a sample x ∈ R^n using the autoencoder ϕ(x) = d(e(x)), consisting of an encoder e(·) and a decoder d(·). The reconstruction error e_MSE between the original sample x and the reconstructed sample x̂ = ϕ(x) ∈ R^n computes to e_MSE(x, x̂) = (1/n) Σ_{i=1}^{n} (x_i − x̂_i)². The reconstruction error is mapped to the inlier probability via the Gaussian g : z ↦ exp(−z²/(2σ_1²)) with a mean of zero. Thus, the higher the reconstruction error, the smaller the inlier probability. More formally, the full network f is given by f(x) = g(e_MSE(x, ϕ(x))). Note that the standard deviation σ_1 is directly coupled with the decision boundary t via t = √(−2σ_1² ln(1/2)), as the threshold is fixed at the 0.5 level of function g, also shown in Fig. 4. During training, the network has three objectives: (a) to minimize inlier reconstruction errors and maximize rest-sample reconstruction errors, such that the inlier samples are easily distinguishable from rest samples within the one-dimensional reconstruction error space; (b) to classify samples correctly; and (c) to keep the decision boundary t as small as possible. The overall loss function L̃ incorporates these three training objectives by combining the adversarial loss function L_R, the binary cross-entropy (BCE) classification loss L_BCE(ŷ, y) = −[y ln ŷ + (1 − y) ln(1 − ŷ)], and a regularizer term |t|, as follows: L̃ = L_R + λ_2 L_BCE(ŷ, y) + λ_3 |t|, where y and ŷ denote the target label and the model's predicted COI probability of sample x, respectively. The factors λ_2 and λ_3 scale the classification loss term and the |t| regularization.
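The output mapping and the coupling between σ_1 and the decision boundary t can be sketched numerically as follows (a minimal illustration; the autoencoder ϕ is a stand-in and all names are ours):

```python
import numpy as np

def e_mse(x, x_hat):
    """Reconstruction error between a sample and its reconstruction."""
    return np.mean((x - x_hat) ** 2, axis=-1)

def g(z, sigma1):
    """Zero-mean RBF kernel mapping reconstruction error to inlier probability."""
    return np.exp(-z ** 2 / (2 * sigma1 ** 2))

def dae_predict(x, phi, sigma1):
    """Full network f(x) = g(e_MSE(x, phi(x)))."""
    return g(e_mse(x, phi(x)), sigma1)

def boundary(sigma1):
    """Decision boundary t located at the 0.5 level of g."""
    return np.sqrt(-2 * sigma1 ** 2 * np.log(0.5))

sigma1 = 0.8
t = boundary(sigma1)
print(g(t, sigma1))                                   # 0.5 by construction
print(dae_predict(np.zeros(3), lambda x: x, sigma1))  # perfect reconstruction -> 1.0
```

Choosing t therefore fixes σ_1 and vice versa, which is what allows the boundary to be learned end-to-end through the kernel width.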
The adversarial reconstruction loss L_R comprises the minimization and maximization of reconstruction errors w.r.t. inliers and rest samples, where the scaling factors λ_0 ∈ R⁺ and λ_1 ∈ R⁺ determine the minimization and maximization magnitude, respectively. Within L_R, the mean squared error L_MSE(x, x̂) = (1/n) Σ_{i=1}^{n} (x_i − x̂_i)² is weighted by w_i : L_MSE → R for inliers and by w_o : L_MSE → R for rest samples. These two functions, given by Eqs. (6) and (7), push the reconstruction errors of inliers and rest samples away from the decision boundary t toward the origin and ∞, respectively, thereby providing a clear class separation. Note that in both cases, the standard deviation σ_2 of the Gaussian is a hyperparameter that determines how far the two classes are separated. σ_2 is not to be confused with σ_1, which is coupled with the decision boundary t. The reconstruction error maximization of rest samples is achieved in Eq. (7) by a negation, which is equal to flipping the loss gradients [45] and thus corresponds to gradient ascent. The reconstruction error objective L_R is conceptually visualized in Fig. 5.

Fig. 3 Decoupling autoencoder (DAE) architecture: a joint architecture composed of an autoencoder ϕ(x) = d(e(x)) for sample reconstruction, a reconstruction error module e_MSE for outlierness estimation, and an RBF kernel g with standard deviation σ_1 for classification.
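As an illustration of the decoupling behavior, the sketch below uses an assumed Gaussian weighting centered at the boundary t with width σ_2; the exact functional forms of w_i and w_o here are hypothetical stand-ins for Eqs. (6) and (7), not the paper's definitions:

```python
import numpy as np

def w_in(err, t, sigma2):
    # Assumed Gaussian weight for inliers: errors near the boundary t are
    # weighted heavily, errors pushed far toward the origin decouple.
    return np.exp(-(err - t) ** 2 / (2 * sigma2 ** 2)) * err

def w_out(err, t, sigma2):
    # Assumed weight for rest samples: the negation flips the gradient
    # (gradient ascent), driving their reconstruction error toward infinity.
    return -np.exp(-(err - t) ** 2 / (2 * sigma2 ** 2)) * err

def adversarial_loss(err, y, t, sigma2, lam0=1.0, lam1=1.0):
    """L_R sketch: minimize inlier errors (y=1), maximize rest errors (y=0)."""
    return np.mean(y * lam0 * w_in(err, t, sigma2)
                   + (1 - y) * lam1 * w_out(err, t, sigma2))

err = np.array([0.1, 0.5, 2.0, 2.5])  # toy reconstruction errors
y   = np.array([1,   1,   0,   0  ])  # 1 = inlier, 0 = rest sample
print(adversarial_loss(err, y, t=1.0, sigma2=1.0))
```

The key property reproduced here is that samples close to the boundary dominate the loss, while well-separated samples receive vanishing weight and are effectively decoupled from the optimization.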
Due to the spacious separation of the inlier set and rest sample set in reconstruction loss space, there is a wide range of possible decision boundaries t. We argue that the best t is as close as possible to the inlier set without compromising the classification performance, thus offering a reasonable trade-off between classification (T_c) and outlier detection (T_o)/dataset shift (T_d) while reducing the open space risk. This trade-off is modeled via the second and third addends of L̃ in Eq. (2). The |t| regularizer pushes the decision boundary t toward 0, which would eventually lead to an impractical classifier always predicting RC independently of x. This impractical solution is prevented by the classification loss term L_BCE acting as a stopping criterion, as visualized in Fig. 4. Note that the four scaling factors λ_0, …, λ_3 trade off the three objectives of reconstruction minimization/maximization, classification performance, and out-of-distribution robustness.
The combination of L_R and L_BCE also solves the vanishing gradient problem, which would occur for large e_MSE due to the Gaussian output activation function. When solely minimizing the classification loss L_BCE jointly with the regularizer term, i.e., L̃* = L_BCE + |t|, the total derivative w.r.t. the weights Θ of autoencoder ϕ computes via the chain rule to dL̃*/dΘ = (∂L_BCE/∂ŷ) · (∂g/∂e_MSE) · (∂e_MSE/∂Θ). As the reconstruction error tends to infinity, Gaussian g becomes a horizontal line, resulting in gradients equal to 0: lim_{z→∞} ∂g/∂z = lim_{z→∞} −(z/σ_1²) exp(−z²/(2σ_1²)) = 0. Thus, the total derivative zeros out, which is why the gradient updates become ineffective for inliers with large e_MSE. As L_R is independent of Gaussian g, it is not affected by this problem and enforces convergence by minimizing the reconstruction errors of these inliers.
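The vanishing gradient of the Gaussian output activation can be verified numerically; the sketch below evaluates the analytic derivative of g for growing reconstruction errors:

```python
import numpy as np

def g(z, sigma1=1.0):
    """Gaussian output activation mapping reconstruction error to probability."""
    return np.exp(-z ** 2 / (2 * sigma1 ** 2))

def dg_dz(z, sigma1=1.0):
    """Analytic derivative of g: -(z / sigma1^2) * exp(-z^2 / (2 sigma1^2))."""
    return -(z / sigma1 ** 2) * np.exp(-z ** 2 / (2 * sigma1 ** 2))

# For inliers with large reconstruction error, the BCE gradient is multiplied
# by dg/dz, which collapses toward zero -- the updates stall.
for z in [0.5, 2.0, 5.0, 10.0]:
    print(z, dg_dz(z))
```

The factor dg/dz already drops below 1e-20 at z = 10, which is why an inlier stuck at a large reconstruction error receives essentially no classification gradient, while L_R keeps pulling it toward the origin.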

Classification concern conflicting with robustness
During training of DNNs, we minimize a surrogate loss function L, such as the negative log-likelihood (NLL), instead of the non-differentiable 0-1 loss, over a given empirical data distribution p̂_data(x, y), as the true p_data(x, y) is unknown [20]. This corresponds to minimizing the empirical risk E_{x,y∼p̂_data}[L] = (1/N) Σ_{i=1}^{N} L(f(x_i; Θ), y_i), where N is the training set size and f the model with parameters Θ. This procedure optimizes for a discriminating function that correctly separates the classes in the training set, i.e., it optimizes for classification performance. Under the assumption that the empirical data-generating distribution p̂_data(x, y) is similar to the true data-generating distribution p_data(x, y), the discriminatory function will generalize to unseen data x, y ∼ p_data(x, y) within the problem domain.
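As a concrete instance, the sketch below computes the empirical risk with the NLL surrogate over a toy dataset (the fixed logistic scorer and the data are hypothetical, chosen only to make the average loss tangible):

```python
import numpy as np

def nll(p_pred, y):
    """Negative log-likelihood, the differentiable surrogate for the 0-1 loss."""
    return -(y * np.log(p_pred) + (1 - y) * np.log(1 - p_pred))

def empirical_risk(model, xs, ys):
    """Average surrogate loss over the N training samples."""
    return np.mean([nll(model(x), y) for x, y in zip(xs, ys)])

# Toy model and data: a fixed logistic scorer on 1-D inputs.
model = lambda x: 1.0 / (1.0 + np.exp(-x))
xs = np.array([-2.0, -1.0, 1.0, 2.0])
ys = np.array([0, 0, 1, 1])
print(empirical_risk(model, xs, ys))
```

A model that separates the training set well drives this average below the ln 2 ≈ 0.693 achieved by an uninformative constant predictor, but the quantity says nothing about inputs that are unlikely under p̂_data.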
A problem arises when the model is exposed to samples that are highly unlikely according to p̂_data(x, y), i.e., outliers, because the model was not optimized w.r.t. such samples. This issue is visualized in Fig. 1 for MLP and MiMo on the Half-Moon and XOR Circles datasets. Both methods learn to separate the observed data (i.e., p̂_data(x, y)). When we consider only the training data (red and blue samples), the learned model indeed minimizes the empirical risk. However, for out-of-distribution data (orange points), we would like to observe high uncertainty. Since this is not reflected in the training objective, the model often predicts one of the two classes with high confidence, even though the out-of-distribution data cannot be attributed to any of the two classes [28].
A simple solution is to employ reconstructive representation learning with, e.g., autoencoders, forcing the reconstruction error to capture the outlierness of a sample. As shown in Fig. 1 for DAE and in Fig. 6 for ATA, each model has learned a representation of the COI and rejects any major deviation from it as RC. While this indeed increases robustness [44,45], it can also harm classification performance, since a representation for all input features needs to be learned [67]. Uninformative, non-causal features can therefore also have a diminishing effect on classification performance. Models trained within the ERM framework do not suffer from this problem, as the feature extractor would neglect these features [67].
In conclusion, there is a trade-off between classification performance and robustness to outliers/dataset shift which DAE aims to alleviate within the OSR framework.

Achieving bounded open set recognition with autoencoders
In recent years, novel deep learning algorithms have advanced the state of the art in many classification tasks. However, as noted in the previous section, it has also been shown that these algorithms, when solely optimized for empirical risk, often give wrong predictions with high confidence when exposed to dataset shift and outliers. In this section, we formalize this issue in line with Scheirer et al. [63] and prove that our approach has an upper bound on the open space risk, a primary criterion for robust OSR. OSR was first defined by Scheirer et al. [63], was recently surveyed by Geng et al. [17] and Boult et al. [6], and is still a largely unsolved topic within the deep learning domain [4,52,63]. OSR formalizes the problem of distinguishing a class of interest (i.e., samples originating from an observed set of classes) from samples derived from, e.g., outlier-generating processes, dataset shifts, or other possibly unobserved but related classes. As deep learning classifiers are generally trained based on ERM by leveraging a surrogate loss such as cross-entropy, they only learn to differentiate the observed classes. This can be viewed as closed set classification, which is illustrated in Fig. 1: the MLP successfully learns to distinguish the two half-moons resembling the closed set. A problem arises when we zoom out of the problem domain and consider samples from the open set (i.e., outside of the closed set): samples that are far away from the closed set are still assigned to one of the two half-moons. As a solution, OSR proposes an indicator function f over the input space χ that maps inliers to 1 and rest samples to 0. Partially following the notation of Scheirer et al. [63], let V be the COI and S_V = {x ∈ χ | min_{s∈V} |x − s| < d} be the corresponding closed set, i.e., the set of all points within χ that are in d proximity to at least one of the inlier samples [4,58,63].
Therefore, we have to turn toward deep learning architectures, such as autoencoders, that are capable of learning manifolds on the input space and by design learn a hull around the inliers:

Lemma 1
Any autoencoder with saturating activation functions (e.g., sigmoid) within at least one of its layers and a reconstruction error output module acting as a manifold learner has a bounded open space risk R_O.
Proof The recognition function f, given by f(x) = 1 if e_MSE(x, ϕ(x)) ≤ d and f(x) = 0 otherwise, is an indicator function which maps the reconstruction error function of autoencoder ϕ : R^n → R^n to {0, 1} based on threshold d ∈ R.
Let ϕ comprise at least one layer with an activation function that is saturating toward both tails (here, sigmoid). Assuming layer ϕ_s to be the last layer with sigmoid activation and ϕ_s to have m neurons, the image this layer maps to is fixed within the hypercube (0, 1)^m. Therefore, the image of all subsequent layers ϕ_i, ∀i > s, is also fixed. It follows that lim_{‖x‖→∞} ‖ϕ(x) − x‖ = ∞, as the image of ϕ(x) is bounded. In conclusion, when starting from a sample classified as inlier and moving in any fixed direction in feature space, the reconstruction error will approach infinity. This forces the reconstruction error to cross the decision boundary of the recognition function f, proving the existence of a bound on the open space risk.
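Lemma 1 can be illustrated numerically: for a toy autoencoder with a sigmoid hidden layer (the weights below are arbitrary stand-ins, not a trained model), the reconstruction error diverges along any fixed direction in feature space:

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(4, 2))
W_dec = rng.normal(size=(2, 4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def phi(x):
    # Toy autoencoder whose hidden layer saturates: its image is confined
    # to the region reachable from the hypercube (0, 1)^4 via W_dec.
    return sigmoid(x @ W_enc.T) @ W_dec.T

def e_mse(x):
    return np.mean((x - phi(x)) ** 2, axis=-1)

# Moving away from the origin in a fixed direction: the bounded image of
# phi cannot follow x, so the reconstruction error grows without limit
# and must eventually cross any decision boundary d.
direction = np.array([1.0, 1.0]) / np.sqrt(2)
errors = [float(e_mse(r * direction)) for r in (1.0, 10.0, 100.0, 1000.0)]
print(errors)
```

Since the reconstruction lives in a fixed bounded set while x does not, the error grows roughly quadratically with the distance from the origin, so every sufficiently remote point is rejected.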

Lemma 2
The open space risk can be approximated solely based on the weights of the layers succeeding the last layer with saturating activations.
Proof As previously defined, let f(x) be the recognition function based on the autoencoder ϕ, with ϕ_s being the last layer with saturating activation function a_s and α_s the corresponding activations. Neurons in the subsequent layers can have a monotonic, non-saturating activation function a_{s+i}, ∀i > 0 (such as ReLU). For simplicity, we assume every layer ϕ_i to comprise m neurons. Since α_s ∈ (0, 1)^m, the activation of neuron j in layer s+1 can be bounded by α_j^{s+1} ≤ a_{s+1}(Σ_{i=1}^{m} |w_{i,j}^{s+1}| + |b_j^{s+1}|), where w_{i,j}^k denotes the weight between neuron i of layer k−1 and neuron j of layer k, and b_j^k denotes the bias of neuron j in layer k. It follows recursively for the subsequent layers s+1+l that α_j^{s+1+l} ≤ a_{s+1+l}(Σ_{i=1}^{m} |w_{i,j}^{s+1+l}| · sup α_i^{s+l} + |b_j^{s+1+l}|). Therefore, the image of ϕ can be bounded by a hypercube with its center at the origin. It follows that the space classified as inlier by the recognition function f, and with it the open space risk, can be approximated via the union of the set of points within the hypercube and those in less than d proximity to the hypercube.
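The weight-based bound of Lemma 2 can be sketched as follows: starting from activations in the hypercube (0, 1)^m, per-neuron suprema are propagated through the succeeding ReLU layers (the weights below are toy stand-ins; illustrative only):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def output_bound(weights, biases):
    """Propagate per-neuron suprema through the layers after the last
    saturating layer, whose activations lie in the hypercube (0, 1)^m."""
    sup = np.ones(weights[0].shape[1])            # alpha_s bounded by 1
    for W, b in zip(weights, biases):
        sup = relu(np.abs(W) @ sup + np.abs(b))   # |W| sup + |b| bounds ReLU out
    return sup

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 3)) for _ in range(2)]
bs = [rng.normal(size=3) for _ in range(2)]
bound = output_bound(Ws, bs)

# Empirical check: no activation pattern from the hypercube exceeds the bound.
def forward(alpha):
    for W, b in zip(Ws, bs):
        alpha = relu(W @ alpha + b)
    return alpha

samples = rng.uniform(0, 1, size=(1000, 3))
outs = np.array([forward(a) for a in samples])
print(np.all(outs <= bound + 1e-9))
```

The resulting per-neuron suprema define the hypercube bounding the image of ϕ, which is exactly the quantity a model-selection procedure could compute from the weights alone.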
Lemmas 1 and 2 have multiple practical implications. Firstly, our autoencoder methods are capable of bounding the open space risk and are therefore by design superior to MLP-based architectures in the OSR setting. Secondly, approximating the open space risk enables us to filter models with higher robustness during the model selection process. Thirdly, given that the approximated bound is a hypercube centered at the origin of the feature space, it is recommendable to perform a feature transformation such that the inlier class is located close to the origin, thus allowing for a smaller hypercube. Furthermore, Lemma 2 depicts the dependency of the open space risk on the weights after the last layer with saturating activation functions. By regularizing these weights in conjunction with centering the inlier samples, it is possible to actively minimize the bound on the open space risk. While this idea is out of scope for this contribution, it is a promising future research direction. For instance, it is possible to handcraft the connections in the first layer of ϕ such that the weights perform translation and scaling of the feature space.

Toward adversarial robustness and local stability
Adversarial perturbations offer an effective way of measuring a model's robustness locally as well as globally. As initially described by Goodfellow et al. [21], the idea of gradient-based adversarial attacks is to confuse a neural network f with parameters Θ by adding an imperceptible perturbation to the original sample x with target y. The perturbation itself is not arbitrary but maximizes the loss L of the network. A common methodology for calculating these adversarial examples is the fast gradient sign method (FGSM) [21], which determines the perturbation by taking the sign of the gradients w.r.t. the sample: x_adv = x + ε · sign(∇_x L(f(x; Θ), y)), where the scaling factor ε determines the volume of change. Note that since ε is fixed, the perturbation's volume is also fixed for all steps and across models. While in practice adversarial examples are often used to improve model robustness via adversarial training [21], in this work we use the FGSM framework for robustness estimation of trained models. By perturbing a sample in a step-wise adversarial fashion, the development of the model confidence can be tracked, which provides deep insights into local stability and, with an increasing number of steps, also into the global robustness of the model. Technically, at a given step i, a sample x_{i−1} is perturbed according to x_i = x_{i−1} + ε · sign(∇_{x_{i−1}} L(f(x_{i−1}; Θ), y)), and the difference in confidence c_i at step i w.r.t. the original sample is defined as c_i = f(x_i; Θ) − f(x_0; Θ), where x_0 denotes the original sample. By varying the step size ε and the number of steps, the aforementioned local stability and global robustness can be easily estimated. Further, there also exists another adversarial robustness metric [74], which computes the Kullback-Leibler divergence D_KL in its denominator as the relative entropy between the two confidence estimates. However, we did not consider this measure, as D_KL changes rapidly around (0, 0) and (1, 1), and small changes in this area therefore lead to overly pessimistic robustness scores.
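The step-wise FGSM procedure can be sketched on a toy logistic model with an analytic BCE gradient (the model, weights, step size, and step count below are illustrative, not the paper's experimental setup):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_steps(x0, y, w, eps, n_steps):
    """Step-wise FGSM: repeatedly move x by eps in the direction of the sign
    of the loss gradient, tracking the confidence drift c_i per step."""
    x = x0.copy()
    confidences = [sigmoid(w @ x)]
    for _ in range(n_steps):
        grad_x = (sigmoid(w @ x) - y) * w   # analytic BCE gradient for the
        x = x + eps * np.sign(grad_x)       # logistic model f(x) = sigmoid(w.x)
        confidences.append(sigmoid(w @ x))
    c = [conf - confidences[0] for conf in confidences]
    return np.array(confidences), np.array(c)

w = np.array([1.0, -0.5])
x0 = np.array([2.0, -1.0])      # initially classified as y = 1 with confidence
conf, c = fgsm_steps(x0, y=1.0, w=w, eps=0.1, n_steps=30)
print(conf[0], conf[-1])        # confidence erodes step by step
```

Tracking c_i over the steps shows how quickly a fixed-volume perturbation can push a confidently classified sample across the decision boundary, which is the per-model stability signal used here.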
Generally, we hypothesize that DAE, as a representative of an OSR architecture, is more robust than the MLP, as the latter slices the space into discriminative hyperplanes w.r.t. the two classes. Therefore, we can expect a given dataset shift sample to be closer to one of the MLP's hyperplanes than to DAE's hull, which is learned directly around the inlier class. For an inlier sample, we would assume similar adversarial robustness between MLP and DAE.

Experiments and results
In this section, we discuss the experiment setup and compare the performance of DAE to the aforementioned baselines. The approaches are compared on an algorithmic level by applying nested cross-validation (CV) [32,56,68]. The best models are selected by the highest area under the precision-recall curve (AUPR) score and are reported along with the area under the receiver operating characteristic curve (AUROC) and F1 score with respective confidence measurements.

Selected baselines
Since OSR touches upon multiple machine learning areas such as binary classification and outlier detection, we decided to consider the two most prominent methods published under the OSR framework and five potent baselines from the outlier detection, OVR classification, and ensembling domains.
OpenMax: this is an offline OSR method that replaces the softmax layer within a fully trained multi-class DNN [4]. It applies extreme value theory to the network activations, which yields a final layer that outputs a probability score for the outlierness of a sample and closed-set probability scores. While the method's theoretical foundations are sound, it assumes that outlier activations differ significantly from inlier activations. In practice, this requirement is not always fulfilled [67], leading to robustness scores similar to those of the SoftMax baseline. By design, this method is limited to OSR problems with multiple COIs.
C2AE: this OSR approach [52] is derived from OCA, which is used in outlier detection. The C2AE autoencoder is trained in a two-step fashion: (1) during closed set training, the encoder is trained jointly with a downstream classifier to perform closed set classification. (2) For decoder training, the latent vector is conditioned on an inlier-class-specific vector. During inference, it is assumed that an inlier has a lower reconstruction error compared to rest samples when the inlier's latent vector is conditioned on the respective inlier-class-specific vector. Similar to OCA, this requires rest samples to be uncorrelated with inlier samples. By design, this method is limited to OSR problems with multiple COIs and requires one inference step per closed-set class for each sample.
OCA: classic semi-supervised, autoencoder-based method from the outlier and novelty detection domain that is trained to reconstruct inliers. The reconstruction error is assumed to be lower for inliers than for outliers, thus rendering the reconstruction error predictive of the inlierness of a sample. As shown by Lübbering et al. [46], this method requires rest samples to be true outliers, as rest samples correlated with inliers tend to get reconstructed accurately.
ATA: this is a recent autoencoder-based method from supervised outlier detection, which, in contrast to OCA, actively maximizes the reconstruction error of rest samples [45]. Therefore, rest samples that are correlated with the COI also get maximized, thus alleviating OCA's aforementioned deficiencies.
MLP: OVR classification DNN which subsumes the set of rest classes within a single background class [19,41,57]. This binary neural network requires the rest samples to wrap the COI in feature space such that the model learns a decision boundary hull around the COI, making this approach highly data-intensive.
SoftMax: this DNN has a softmax layer for inlier probabilities as its final layer. The outlierness of a sample is estimated offline by the entropy of the sample's predicted inlier class probabilities (softmax output). Since the softmax predictions are generally overly confident for outlier data [28], the utility of this method is limited. By design, this method is restricted to problems with multiple COIs.
MiMo: this ensemble method [24] combines several weak classifiers into a single DNN by weight sharing, making this method more efficient compared to traditional ensemble DNNs. Similar to the OVR setup in MLP, this baseline subsumes non-COI classes into a single rest class. As shown by Lakshminarayanan et al. [37], ensemble methods can yield accurate predictive uncertainty estimates, which could ultimately improve OSR performance.
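The autoencoder-based methods above (OCA, ATA, C2AE) and DAE all share one scoring mechanism: the reconstruction error of a sample predicts its inlierness. A minimal sketch of that mechanism, where the threshold `t` and the toy vectors are illustrative assumptions:

```python
def mse_error(x, x_hat):
    """Per-sample reconstruction error e_MSE between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def predict_inlier(x, x_hat, t):
    """A sample is predicted as inlier if its reconstruction error stays below t.

    The reconstruction x_hat would come from a trained autoencoder; here it
    is simply passed in so the scoring rule itself can be inspected."""
    return mse_error(x, x_hat) <= t
```

The baselines differ in how the autoencoder is trained (reconstruct inliers only, adversarially distort rest samples, condition on class vectors) and in how `t` is chosen, but the inference-time rule is this one.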

Evaluation approach
Evaluating OSR models w.r.t. classification performance and robustness toward outliers and dataset shifts is not straightforward. Firstly, the OSR setting is often highly imbalanced, with observed rest samples significantly outnumbering the COI samples, whereas outliers/corruptions generally have a low prevalence in practice. Secondly, depending on the application's deployment domain, the focus between precision and recall can shift: for instance, in medical diagnosis systems, recall is often of the utmost importance, whereas precision is generally favored in equity trading. Due to this, threshold-dependent metrics such as F1 score or accuracy can be misleading and fail to capture the big picture.
As a viable solution to this issue, there are threshold-independent metrics such as AUPR and AUROC, which evaluate the model at all possible thresholds [11]. The AUROC metric calculates the area under the receiver operating characteristic (ROC) curve, which maps the false positive rate (FPR) onto the true positive rate (TPR) for each threshold. The two rates are defined by FPR = FP/(FP+TN) and TPR = TP/(TP+FN), where TP, FP, TN and FN denote the number of true positives, false positives, true negatives and false negatives at a certain threshold, respectively. Note that TPR is also often referred to as recall in the literature. From an interpretation point of view, AUROC yields the probability of a random rest sample being ranked higher than a random inlier [15,28] and is therefore invariant to class imbalance. This invariance to class imbalance allows for interpretable results across datasets, since a random classifier achieves an expected AUROC score of 50% and a perfect classifier a score of 100%.
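The rank interpretation of AUROC lends itself to a direct computation. The pure-Python sketch below is illustrative only; labels follow the convention that the positive class is 1:

```python
import random

def auroc(scores, labels):
    """AUROC via its rank interpretation: the probability that a random
    positive-class sample is scored higher than a random negative-class
    sample (ties counted half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking yields 1.0, and random scores approach 0.5 regardless of how imbalanced the two classes are, matching the invariance noted above.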
Since AUROC in isolation is insufficient for OVR classification with its prevalent class imbalance, we additionally consider AUPR. This metric is generally employed when faced with the "needle in the haystack" problem, as AUPR takes the different class base rates into account [28]. Similar to AUROC, the PR curve maps the recall onto the precision = TP/(TP+FP) for each threshold. Since the AUPR metric is base rate dependent, there is no fixed baseline AUPR score for a random classifier as there is for AUROC. In fact, given a random classifier, the AUPR score is roughly equal to the random classifier's precision, which is equal to the rate of the positive class [28,61]. Assuming a random classifier predicts sample i as the positive class with confidence p_i ~ U[0, 1], then for any given but fixed threshold q we get two sample sets: (1) the set of positively predicted samples and (2) the set of negatively predicted samples. Naturally, any random subsample set has an expected positive rate of POS/N, where POS and N are the number of positive samples and the number of all samples in the entire dataset, respectively. Therefore, for an arbitrary but fixed threshold q ∈ [0, 1], the subsample set of positive predictions contains an expected rate of true positives of POS/N, which is equivalent to the precision. Similar considerations hold for recall: given any threshold q, we sample approx. q · N many samples, of which approx. q · POS are positive, i.e., the recall is approx. equal to q. This is why the random classifier has a constant PR curve at the precision level of POS/N and an AUPR score of the same value. In conclusion, it is essential to always communicate AUPR scores w.r.t. the random classifier performance and the pre-defined positive class; otherwise, these metrics lack means of interpretation.
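The random-classifier baseline derived above can be verified empirically. The average-precision computation below is a common AUPR estimate; the sample size and positive rate are illustrative assumptions:

```python
import random

def aupr(scores, labels):
    """Average precision: for each positive encountered in the descending
    score ranking, record the precision at that cut-off and average over
    all positives."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = 0
    ap = 0.0
    for _, y in ranked:
        if y == 1:
            tp += 1
            ap += tp / (tp + fp)
        else:
            fp += 1
    return ap / tp

# Random scores against labels with a 10% positive base rate: the AUPR
# of a random classifier concentrates around that base rate.
random.seed(0)
n, pos_rate = 20000, 0.1
labels = [1 if random.random() < pos_rate else 0 for _ in range(n)]
scores = [random.random() for _ in range(n)]
baseline = aupr(scores, labels)
```

Running this yields a baseline close to 0.1, illustrating why AUPR scores are only interpretable relative to the positive-class base rate.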
Additionally, the F1 score is reported to show whether each algorithm can learn reasonable decision boundaries, especially w.r.t. the classification task T_c. In contrast to the aforementioned AUROC and AUPR metrics, this metric evaluates the model at a fixed 0.5 threshold level, which is reasonable for task T_c but less conclusive for tasks T_o and T_d. Similar to AUPR, the F1 score is also affected by the base rate of the positive class; following the argumentation for the AUPR baseline, the F1 score of a random classifier is likewise determined by this base rate.

Further, the algorithms are evaluated based on the correctness of their subjective uncertainty estimation (calibration) in terms of the class-wise expected calibration error (CECE) [23,36,49] and the Brier score [7]. While calibration is a crucial criterion for the trustworthiness of machine learning models, it is noteworthy that it is a concern orthogonal to model accuracy; e.g., in the trivial case, a uniformly random classifier is perfectly calibrated on a balanced dataset but inaccurate [37].
The CECE metric is defined as

$$\mathrm{CECE} = \frac{1}{K} \sum_{j=1}^{K} \sum_{i=1}^{m} \frac{|B_{i,j}|}{N} \underbrace{\left| y(B_{i,j}) - \hat{p}_j(B_{i,j}) \right|}_{\text{bin-CE}_{i,j}},$$

where the parameters $K$, $m$, $N$ denote the number of classes, the number of bins, and the dataset size, respectively. The set $B_{i,j}$ contains the samples whose confidence prediction w.r.t. class $j$ falls into the $i$-th bin. The actual ratio of class-$j$ samples and the average predicted confidence of the samples in the bin are denoted by $y(B_{i,j})$ and $\hat{p}_j(B_{i,j})$, respectively. Therefore, $\text{bin-CE}_{i,j}$ denotes the calibration error for class $j$ within bin $i$, and CECE is computed as the weighted average over all $\text{bin-CE}_{i,j}$.
The Brier score for binary classification is defined as

$$\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} \left( f(x_i; \theta) - y_i \right)^2,$$

where $N$ is the number of samples, $y_i$ is the target of sample $x_i$, and $f(x_i; \theta)$ is the respective prediction (probability) of the model.

For a comprehensive robustness study, we also evaluate the algorithms w.r.t. their robustness to adversarial attacks and the related concern of local stability via the difference in confidence $\Delta c_i$ [see Eq. (17)]. We empirically determined that a perturbation scaling factor $\epsilon = 0.001$ and #steps = 300 capture both adversarial robustness and local stability. While for MLP it is straightforward to calculate the sample's perturbation w.r.t. the binary cross-entropy loss, DAE's training loss yields gradients that are almost 0 for small and large reconstruction errors due to the Gaussian's flatness at its mean and limits. Similar to Eq. (8), this results in dead updates within FGSM. As a solution, we calculate the gradients directly on the reconstruction error $e_{MSE}$.
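Both calibration metrics are straightforward to compute. The sketch below covers the binary case (K = 2) with equal-width bins, reflecting the definitions above; the bin count m = 10 is an illustrative default, not a value stated in the paper:

```python
def brier_score(probs, targets):
    """Mean squared difference between predicted probability and target."""
    return sum((p - y) ** 2 for p, y in zip(probs, targets)) / len(probs)

def cece(probs, targets, m=10):
    """Class-wise expected calibration error for the binary case (K = 2).

    For each class, samples are binned by their predicted confidence for
    that class; each bin contributes |empirical class ratio - mean
    confidence| weighted by its relative size, averaged over both classes."""
    n = len(probs)
    total = 0.0
    for cls in (0, 1):
        conf = [p if cls == 1 else 1.0 - p for p in probs]
        hits = [1.0 if y == cls else 0.0 for y in targets]
        for i in range(m):
            lo, hi = i / m, (i + 1) / m
            idx = [k for k in range(n)
                   if lo < conf[k] <= hi or (i == 0 and conf[k] == 0.0)]
            if not idx:
                continue
            avg_conf = sum(conf[k] for k in idx) / len(idx)
            ratio = sum(hits[k] for k in idx) / len(idx)
            total += len(idx) / n * abs(ratio - avg_conf)
    return total / 2  # average over the K = 2 classes
```

A perfectly confident and correct classifier scores 0 on both metrics, while a maximally uncertain one (all probabilities 0.5) pays a Brier penalty of 0.25 per sample.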

Datasets
To evaluate the models on OSR with its classification subtask T_c and the more challenging subtasks of contextual outlier detection T_o and dataset shift T_d, we extended four textual datasets and three image datasets: Reuters dataset: this multi-label dataset is a standard benchmark for document classification and outlier detection, containing 10,788 news documents from 90 different categories published by the news outlet Reuters. Since multi-label classification is out of the scope of this work, we only consider documents that are assigned to a single class.
ATIS dataset: this dataset comprises 5871 transcribed queries that passengers requested to the air travel information system (ATIS) for flight related information. These queries were grouped into 17 categories.
Newsgroups dataset: this dataset contains 18,000 newsgroup posts from 20 different topics and is a standard dataset for text classification and text clustering.
TREC dataset: question classification dataset contains 5500 questions not limited to any particular topic domain [40]. This makes the dataset compelling for dataset shift evaluation.
FMNIST dataset: image classification dataset comprising 70,000 fashion items [70]. Each of the 28 × 28 grayscale images shows one of the ten fashion items t-shirt, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, or ankle boot.
EMNIST-letter dataset: dataset for image classification providing 145,600 grayscale images of alphabetic characters. Similar to TREC, we use this dataset solely for evaluation.
Due to the high number of different categories and their large size, each dataset is a viable benchmark dataset for OSR model evaluation. As shown in Table 1, we evaluated DAE and the baselines in seven different experiment setups. For each experiment setup, there is a single train split S t and up to five test splits. All splits share the same set of inlier classes, which depending on the derived dataset vary from a narrow to a broader range of topics. The rest class covers a wide range of topics with an increasing scope from T c , T o to T d . Specifically, classification split S c shares all training classes and therefore resembles classic OVR classification T c . Split S o increases the scope by incorporating contextual outliers from the same dataset, as defined by the outlier detection task T o . Finally, S d1 , S d2 and S d3 have maximum RC scope by providing rest samples from a completely new dataset, representative of dataset shift task T d . We vectorized the samples of the textual datasets as pooled 100-dimensional dense Glove embeddings [54] and z-transformed the image samples.

Results
To benchmark DAE against the aforementioned baselines, we train all models with approximately comparable complexity in terms of the number of trainable parameters. For text classification, the models MiMo, OpenMax, SoftMax, and MLP have three hidden layers of sizes 50, 25, and 12. The autoencoder-based approaches have three hidden layers of sizes 50, 25, and 12 for the encoder and for the decoder in reverse order. For image classification, OpenMax, SoftMax and MLP have hidden layers of sizes 410, 256, 128, 64, and 43; MiMo has hidden layer sizes of 120, 32, and 16. The autoencoder-based image classifiers have hidden layers of sizes 256 and 128 for the encoder and for the decoder in reverse order. C2AE has additional downstream classification layers of sizes 128, 32, and 5. MiMo has five ensemble components and, therefore, an input size five times that of the other approaches. All neurons have sigmoid activation functions.
Within the nested CV, we performed a hyperparameter search concerning the learning rate, balanced sampling, and weight decay for all approaches. Specifically for DAE, we optimized the initial decision boundary t and σ², as well as the loss scaling factors λ_i as defined in Eqs. (2) and (7). ATA was optimized w.r.t. the outlier weighting factor and bin range. We found that across various experiments all baselines showed their best performance when optimized with Adam [34]. Tables 3 and 4 show the results for each task, with the best performing approaches on each dataset highlighted in boldface. To make model robustness comparable, a model is counted as weak when its AUROC score drops at least 5 percentage points below that of the best performing model. These scores are underlined within the result tables. Each result table aggregates the best and weakly performing models in the first column. Since the class base rates fluctuate significantly across datasets and splits, AUPR and F1 score, as base rate-dependent metrics, were not considered for the model robustness evaluation. Tables 3 and 4 present the results on datasets with a single COI and multiple COIs, respectively.
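The 5-percentage-point weakness rule used for the robustness tally can be stated compactly. The model names and scores below are purely illustrative:

```python
def flag_weak_models(auroc_scores, margin=0.05):
    """Mark every model whose AUROC drops at least `margin` (5 percentage
    points by default) below the best model on the same split."""
    best = max(auroc_scores.values())
    return {name: score <= best - margin
            for name, score in auroc_scores.items()}
```

Applied per split, summing the `True` flags over all splits yields the weak counts aggregated in the first column of the result tables.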
On the classification task T c , represented by split S c , MLP and MiMo yield the best results on all datasets. DAE and ATA provide competitive results on this task, whereas OCA, C2AE, OpenMax, and SoftMax fail on almost all S c splits in terms of AUROC.
For the contextual outlier detection task T_o, we see that DAE and ATA outperform MLP and MiMo in terms of AUPR and AUROC scores on the multi-COI datasets. DAE and ATA provide similar results to MiMo and MLP on the single-COI datasets. As expected, the semi-supervised methods OCA, SoftMax, OpenMax, and C2AE do not achieve any performance gains.
Concerning dataset shift task T d , the autoencoder-based methods yield by far the strongest results with DAE being the only method with a zero weak count and OCA providing the most top scores. In contrast, the performance of MLP and MiMo diminishes further from T o to T d . The results of OpenMax, C2AE, and SoftMax improve on T d compared to T o . On average, the autoencoder-based methods have a weak performance rate of 6%, whereas the remaining baselines have a weak performance rate of 80%, clearly depicting the superiority of the autoencoder-based methods on the task T d .
Taking the architectural properties of each method into account, we can conclude the following: the OVR baselines MLP and MiMo require the one-vs-rest relationship to be reflected within the data, similar to the Bounding Gaussians dataset example in Fig. 1. Only in this case does the ERM objective encourage a hull around the inlier data that generalizes to unseen rest classes. However, with an increasing number of COI classes (see the FMNIST 0,1,2,3,7 results), there is no data-inherent rest preference for unseen classes, leading to poor robustness scores on the outlier detection and dataset shift tasks. Compliant with earlier research [28,29], MLP and MiMo reveal the most severe robustness deficiencies when facing dataset shift. While in practice OSR is often approached by subsuming all rest samples within a single background class, our results demonstrate the insufficiency of this approach.
Conversely, the semi-supervised OCA and C2AE are not able to learn the OVR relationship in the problem domain, indicating that inliers and rest samples within T_c and T_o are too correlated in feature space. When the scope of OVR is extended to dataset shift, OCA outperforms all baselines, and C2AE becomes competitive. While the results for SoftMax and OpenMax express the same behavior, the underlying reasons are different: ERM has no intrinsic mechanism that prevents outliers from being mapped to inlier feature representations, a problem described as feature collapse that can be prevented via, e.g., two-sided Lipschitz constant regularization [67]. OpenMax and SoftMax both suffer from this effect, since both apply offline uncertainty estimates based solely on the activations. Anecdotally, we also replaced the sigmoid activation functions within OpenMax with ReLU, since the network then becomes piece-wise linear with possibly more expressive activations. However, this did not lead to any robustness improvements.
In contrast to the aforementioned baselines, DAE and ATA do not express any robustness deficiencies. In fact, they provide competitive results on all three OSR subtasks, showing that they are able to distinguish the two OVR classes in the problem domain while being highly robust to dataset shift. Nevertheless, we find that there is still a small tradeoff between accuracy and robustness, which has also been reported in previous research for various deep learning methods [30,42]. Both methods use an adversarial loss function that minimizes/maximizes the reconstruction errors of inlier and rest samples, respectively. Therefore, these methods resolve OCA's issue of correlated rest and inlier samples within T_c and T_o. Additionally, due to the bounded open space risk, they suffer less from remote artifact areas that map outliers to inlier data, as seen for MLP and MiMo. While DAE and ATA are the most robust models, DAE is the best performing model in 7 cases compared to ATA, which performs best in only 3 cases. Since both methods mainly differ in terms of decision boundary estimation, the results suggest that DAE's loss function, with its BCE term and t regularization, has a positive effect.
The robustness properties of DAE and the baselines can also be seen in Figs. 1 and 6. DAE, ATA, MiMo, and MLP are all capable of separating the red inliers from the blue rest samples, albeit in fundamentally different ways. While DAE and ATA learn a representation for the inlier class, MiMo and MLP learn a separating line, which does not generalize to the unobserved orange outliers. As shown in Fig. 1h, the background class setup enables the ERM models MLP and MiMo to learn a proper hull around the inlier samples only if the observed rest samples bias the ERM toward such a decision boundary. Finally, OCA learns to separate the inliers from the orange outliers but passively minimizes the rest samples along with the inliers. This explains the poor classification performance of OCA on T_c and T_o, but its high robustness to dataset shift.
Similar conclusions on the autoencoder-based methods can be drawn from Fig. 7, which displays samples from each split of the MNIST_7 and FMNIST_{3,7} datasets. DAE and ATA can reconstruct inliers and distort rest samples, resulting in a reconstruction error that is highly predictive of the inlierness of a sample. In contrast, OCA learns to reconstruct rest samples along with the inliers, in line with its passive minimization noted above.

With adversarial robustness and local stability, there are two additional crucial aspects of robustness, which can be measured by the change in model confidence after step-wise applying adversarial perturbations, as defined in Eq. (17). As shown in Fig. 8, both DAE and MLP are similarly stable when exposed to adversarially perturbed inliers of MNIST_7. Regarding rest samples, there is a substantial robustness gain from S_c to S_d on MNIST_7, with DAE being significantly more robust than MLP. On FMNIST_{3,7}, the increased diversity of inlier samples diminishes the MLP's adversarial robustness. This supports our presumption that the MLP's recognition function has artifact areas far from the inliers that erroneously map to the inlier class. Conversely, the increase in inlier diversity enhances the inlier adversarial robustness of DAE. Presumably, this forces DAE to learn a more voluminous decision boundary hull that is less susceptible to inlier perturbations.
As shown in Table 5, we also investigated the well-known problem of neural networks being poorly calibrated on out-of-distribution examples after being trained via ERM [24,37]. Recalibration methods have been proposed which could exploit the calibration similarity across tasks.

[Table 6: Ablation study on MNIST_{2,7} w.r.t. the loss terms in $\bar{L}$ controlled by the hyperparameters $\lambda_i$; see Eqs. (2) and (5). Best scores highlighted in boldface. The respective loss histograms are shown in Fig. 9.]

[Fig. 9: Inlier and outlier loss histograms for the different cases of the ablation study in Table 6. The model output $g(e_{MSE})$, mapping reconstruction errors to probabilities, is plotted in green; the decision boundary t is plotted as a dotted red line. Note that in some cases, e.g., Fig. 9a-d, the maximum reconstruction errors are capped to highlight the accuracy of the decision boundary. The results show that the combination of loss terms in $\bar{L}$ leads to inlier and outlier reconstruction errors being minimized/maximized, respectively, and to the decision boundary lying as close as possible to the inliers, maximizing robustness without jeopardizing classification performance. All other combinations lead to poor classification performance or a lack of outlier robustness (color figure online).]

The results clearly indicate that the combination of all loss terms yields the highest dataset shift robustness with a slight degradation in classification performance. Without the adversarial loss term (i.e., $\lambda_1 = 0$), the models express significant robustness deficiencies. The two cases without $L_{BCE}$ lead to unusable results, as t becomes 0. Note that the F1 scores can deviate from the previous results in Table 4, for which the best models were selected based on AUROC scores on S_c. If the F1 score is a concern, we suggest filtering models whose decision boundary t has converged to a constant value and subsequently selecting the best model based on AUROC.
As shown in Table 6, we performed an ablation study w.r.t. the different loss terms in $\bar{L}$ to show that only this specific combination leads to the desired classification and robustness properties. The results clearly show that the loss function $\bar{L}$ has the highest robustness with a minor classification degradation. If we remove the classification loss term $L_{BCE}$ from $\bar{L}$, then the decision boundary t converges to 0, which classifies all samples as outliers irrespective of their true class. If the model is solely trained on $L_{BCE}$ or $L_{BCE} + |t|$, then the classification performance increases slightly, however accompanied by severe robustness deficiencies toward outliers and dataset shift.
These findings can be explained by Fig. 9, which plots the loss histograms for each hyperparameter combination in Table 6. The strong robustness performance of $\bar{L}$ can be attributed to the wide inlier/outlier separation and the small decision boundary t, which allows outliers to be rejected effectively, as shown in Fig. 9a-d. Interestingly, $L_{BCE}$ (Fig. 9m-p) and $L_{BCE} + |t|$ (Fig. 9q-t) can separate inliers from rest classes on S_c but fail to generalize to unseen classes. Especially without |t| regularization, the inlier reconstruction errors are less minimized, leading to dataset shift samples becoming indistinguishable from inliers (see Fig. 9o). If $L_{BCE}$ is jointly optimized with |t| regularization, then the minimization of inliers improves, but outliers get less maximized due to the vanishing gradient problem derived in Eq. (8), resulting in poor robustness to OOD data (see Fig. 9s). The adversarial component within $L_{R2}$ does not suffer from the vanishing gradient limitation and enforces the maximization of outliers, which becomes apparent when comparing Fig. 9a with Fig. 9q. If the adversarial component is missing in $L_R$, then only inliers are minimized, which causes a significant overlap of inliers and rest classes, especially on S_c (see Fig. 9i). In conclusion, the combination of all loss terms in $\bar{L}$ yields the best separation of inliers and outliers due to the effective minimization of inliers and maximization of outliers, and it solves the vanishing gradient problem. Moreover, the minimization of the decision boundary t, with $L_{BCE}$ acting as an antipole, enables the model to robustly reject outliers without jeopardizing classification performance.
From this extensive analysis, we can conclude that DAE, as an OSR method with a bounded open space risk, clearly shows its superiority compared to the potent baselines ranging from OSR, OVR to outlier detection methods. Apart from ATA, every baseline consistently failed at more than one of the three subtasks of OSR, questioning their applicability in safety-critical systems. The consistent classification performance across all three tasks T c , T o and T d combined with an increased (adversarial) robustness shows the benefits of DAE's reduced and bounded open space risk and exposes the deficiencies of the ERM and semi-supervised baselines.

Conclusion
Open set recognition (OSR) is a common setup in machine learning applications. Whenever the objective is to distinguish at least one class of interest (COI) from all remaining, possibly unknown classes (RC), e.g., ordinary internet traffic from novel intrusion attempts or general discussions from all types of hate speech, OSR methods seem natural. Our analysis revealed that when extending the scope of RC, OSR poses severe challenges of outlier detection and dataset shift to deep neural networks solely optimized via empirical risk minimization. We provide an effective solution to these deficiencies with our proposed decoupling autoencoder (DAE) architecture. We proved the existence of a bounded open space risk for DAE and supported its classification and (adversarial) robustness benefits across three different subtasks of OSR. Specifically, we benchmarked DAE against capable baselines from various domains (DNNs, ensemble methods, outlier detection, and OSR) w.r.t. the OSR subtasks of classification, outlier detection, and dataset shift. DAE showed superior robustness across all subtasks throughout all experiments compared to the baselines, which failed on at least one of the tasks apart from ATA. In comparison with ATA, DAE can actively minimize the open space risk and does not require an offline brute-force line search for decision boundary estimation.
For future work, it would be interesting to extend DAE toward multi-class classification with a bounded open space risk, which would allow for robust multi-class classification under extreme dataset shift conditions. Finally, a promising idea could be the development of feature extractors that prevent the model from learning representations of noisy or uninformative features, thereby further alleviating the tradeoff between classification and robustness performance.