Introduction

Over recent years, complex classification tasks such as natural image classification have been solved with high accuracy [33]. One major step toward high classification accuracy was deep learning, since it made it possible to train large models in an end-to-end manner without manual feature engineering [16]. However, most of the research has been limited to closed set problems in which the number of occurring classes is known in advance. This is not the case in many real-world applications. For example, in face recognition, faces that were not seen during training may occur during inference [8]. In such a case, open set recognition is necessary.

In open set recognition (OSR), samples of K known classes are given during training. During inference, samples of both K known and U unknown classes may occur. An open set recognizer is able to classify samples from known classes and to reject samples from unknown classes [27]. Figure 1 visualizes the difference between a closed set classifier and an open set recognizer.

Fig. 1: Exemplary comparison between closed set classification and open set recognition based on three known classes A, B and C. (b) A closed set classifier can only learn decision boundaries that divide the feature space into three parts and thus cannot be used to detect unknown samples. (c) In contrast, in open set recognition, tight decision boundaries around the known classes are desired; the remaining gray space represents unknown classes.

Since its goal is closed and tight decision boundaries, OSR requires special methods. Conventional OSR methods are mostly based on decision scores obtained by closed set classifiers, such as support vector machines (SVMs) [34]. They compare the decision scores with a selected threshold to decide whether a test sample is rejected. In contrast, state-of-the-art methods often utilize generative models to generate fake samples which model unknown classes. More details about related work are presented in the next section.

Fig. 2: Basic idea of the ICS method: split the given data from K known classes, here A, B and C, into typical and atypical subsets. By discriminating the K typical subsets and one additional class combining all atypical subsets from each other, a trained \(K{+}1\)-class classifier is expected to reject samples from unknown classes (gray).

Recently, we proposed a novel approach toward OSR based on intra-class splitting (ICS) [30]. Its idea is to split the given training samples into two subsets: typical and atypical samples. The atypical samples are then used to model unknown classes. This makes it possible to transform a K-class open set problem into a \(K{+}1\)-class closed set problem as shown in Fig. 2. This transformation is also common for generation-based methods. However, the ICS-based method achieved better results. A possible reason is that atypical subsets of training samples are closer to real unknown classes than generated samples.

Although the ICS-based method showed good performance, it requires two stages: the intra-class splitting and the training of an open set recognizer. As the splitting is performed by training a separate model, two different deep neural networks have to be trained. Consequently, the training procedure of the first stage implicitly becomes a general hyperparameter, which limits usability. Is it possible to combine both training steps?

In this paper, we answer this question by compressing the ICS-based method into a one-stage method. Furthermore, we provide insights into open set recognition and the new method. To this end, extensive experiments on image datasets were conducted.

Related Work

Many previous studies group OSR algorithms into two categories: conventional machine learning methods and deep learning methods. However, this grouping hides the key ideas of the different approaches to solving OSR problems. Therefore, in this paper, we propose to group prior work into two categories: threshold-based methods and generation-based methods. In particular, threshold-based approaches use predicted scores or re-calibrated probabilities from a conventional classifier together with a predefined threshold to reject samples from unknown classes. In contrast, generation-based methods try to model the unknown classes and thereby transform an OSR problem into a classification problem by discriminating generated samples from the given known classes.

Threshold-based Open Set Recognizers Open set recognition was first formally formulated by Scheirer et al. [27]. In this work, a 1-vs-set support vector machine was proposed by adding an additional hyperplane between a learned decision boundary and non-matching data. Thus, samples located in the space between the two hyperplanes were considered to belong to unknown classes. However, this method was only a linear model and resulted in a loose decision boundary.

Afterward, Scheirer et al. proposed the Weibull support vector machine (WSVM) [28]. It utilizes decision scores obtained by a one-class support vector machine (OCSVM) [32] and a binary support vector machine [5] to fit a Weibull distribution [23]. Subsequently, the fitted cumulative distribution function was used to calibrate scores for classification. Although this model could outperform previous approaches to OSR problems, it relied on thresholds and thus had sensitive hyperparameters. Similarly, Jain et al. [11] proposed a \(P_I\)-SVM based on WSVM by introducing an automatic threshold estimation.

Moreover, Rudd et al. [24] proposed the extreme value machine (EVM) which used distances among training samples to fit a Weibull distribution. Given a new sample, an EVM estimates inclusion probabilities for each known class. According to a threshold, a sample is then either rejected or assigned to the class with the highest inclusion probability.

As the first deep-learning-based method, Bendale et al. [3] proposed OpenMAX to be used as an output layer in a deep neural network, enabling the rejection of unknown classes. However, it introduced three additional hyperparameters, which should be carefully selected based on unknown classes or validation sets. This is not always feasible in practice. Moreover, the OpenMAX layer was not used during training. In other words, the OpenMAX method can be considered as an extended EVM for a feature space learned by a deep neural network.

Generation-based Open Set Recognizers One recent generation-based method is the counterfactual image generation method (CF) proposed by Neal et al. [19]. It models unknown classes by generated samples, which makes it possible to reformulate an OSR problem as a classification task. Concretely, the authors first trained a modified Wasserstein-GAN [7, 18, 35] to obtain an encoder and a decoder. Then, the encoder transformed samples of known classes into latent representations. With a pretrained multi-class neural network, they sought latent representations in the feature space which were close to known classes but had low decision scores. Subsequently, these representations were decoded into counterfactual images. Accordingly, an OSR problem could be reformulated as a classification problem by discriminating among the known classes and the counterfactual image class. Although this method achieved state-of-the-art performance in the literature, it suffers from common problems of GAN-based methods such as mode collapse. Furthermore, the numbers of training steps for the GAN and optimization steps for the counterfactual image generation are difficult to determine, because there are few quantitative metrics to evaluate the generation quality. Beyond that, there are further generation-based studies. For example, Jo et al. generated fake unknown data by modeling a noisy distribution in a latent space [12].

Intra-class Splitting Intra-class splitting (ICS) is a strategy to model unknown classes [30, 31]. More precisely, the given training samples are split into typical and atypical subsets. The atypical subsets are then used to model unknown classes. This is motivated by the observation that samples with less frequent patterns are often less important for modeling the known classes and thus are not representative of them. For example, in the field of image classification, training a classifier on images with many redundant details may mislead the classifier and decrease the performance as shown in [29, 30]. Similarly, Li et al. [17] stated that misclassified samples from a training dataset can be considered as outliers or hard examples, which are not representative.

Fig. 3: The DICS network architecture consisting of a deep feature extractor, an open set (OS) layer with \(K{+}1\) outputs and a closed set (CS) layer with K output neurons. During inference, the deep feature extractor and the OS layer form the final open set recognizer.

Proposed Method

Similar to our previous approach [30], the proposed method transforms a K-class open set problem into a \(K{+}1\)-class closed set problem. Thereby, intra-class splitting is used to find atypical samples which then serve as an additional class representing unknown classes. In addition, the proposed method shares the same neural network structure as [30], shown in Fig. 3. As in the previous approach, the closed set (CS) layer with K outputs serves as a closed set regularization to increase the closed set classification performance on atypical samples. During inference, only the combination of the deep feature extractor and the open set (OS) layer with \(K{+}1\) outputs is used as an open set recognizer.
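
For illustration, the following minimal sketch shows how such a two-head network could be set up in Keras. The convolutional backbone is only a placeholder assumption and does not reproduce the exact feature extractor of [30, 31].

```python
# Minimal sketch of the two-head architecture in Fig. 3 (TensorFlow/Keras).
# The backbone below is a placeholder, not the feature extractor from [30, 31].
import tensorflow as tf
from tensorflow.keras import layers

def build_dics_model(input_shape, num_known_classes):
    K = num_known_classes
    inputs = tf.keras.Input(shape=input_shape)

    # Deep feature extractor f0 (placeholder CNN)
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.GlobalAveragePooling2D()(x)
    features = layers.Dense(128, activation="relu")(x)

    # Open set (OS) head with K+1 outputs; index 0 models "unknown"
    os_out = layers.Dense(K + 1, activation="softmax", name="os")(features)
    # Closed set (CS) head with K outputs, used as closed set regularization
    cs_out = layers.Dense(K, activation="softmax", name="cs")(features)

    return tf.keras.Model(inputs, [os_out, cs_out])
```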

In contrast to the previous approach, intra-class splitting is now performed by directly using the outputs of the CS layer instead of training a separate neural network. Moreover, the splitting is not performed once before training, but dynamically epoch by epoch. We call this dynamic intra-class splitting (DICS). Based on DICS, the original two-stage method is transformed into a one-stage method in which only one neural network must be trained. In the following, DICS and the training procedure are described in more detail.

Dynamic Intra-class Splitting (DICS)

Let \({\mathcal {D}}=\{(\varvec{x}_1, y_1), (\varvec{x}_2, y_2), \ldots , (\varvec{x}_N, y_N)\}\) denote a training dataset with N samples affiliated to K known classes. Correspondingly, each sample \(\varvec{x}_i\in {\mathcal {X}} = \{\varvec{x}_1, \varvec{x}_2, \ldots , \varvec{x}_N\}\) has an individual class label \(y_i \in \{1, 2, \ldots , K\}\). After the e-th training epoch, the score \(s^{(e)}_i\) of an input sample \(\varvec{x}_i\) depends on its predicted class label \({\hat{y}}^{(e)}_{i,\text {cs}}\):

$$\begin{aligned} \forall \varvec{x}_i \in {\mathcal {X}}: s^{(e)}_i=\left\{ \begin{array}{cl} {\hat{P}}(y_i|\varvec{x}_i), &\quad \text {if } {\hat{y}}_{i,\text {cs}}^{(e)} = y_i \\ 0, &\quad \text {otherwise} \end{array} \right. \end{aligned}$$
(1)

where \({\hat{P}}(y_i|\varvec{x}_i)\) is the conditional class probability modeled by the closed set classifier, i.e., the composition of the deep feature extractor and the CS layer in Fig. 3. Note that a higher score means a more typical sample. In each training epoch e, the scores of all training samples are collected in a score set \({\mathcal {S}}^{(e)} = \{s^{(e)}_1, s^{(e)}_2, \ldots , s^{(e)}_N\}\). Let \(0<\rho <1\) be a predefined intra-class splitting ratio and \({\mathcal {S}}_\rho ^{(e)}\) be the \(\rho \)-fraction of \({\mathcal {S}}^{(e)}\) with the lowest scores. Then, \(\tau _\rho ^{(e)}=\max {\mathcal {S}}_\rho ^{(e)}\) acts as a threshold between atypical and typical samples. Hence, the training dataset is split according to:

$$\begin{aligned} \forall \varvec{x}_i \in {\mathcal {X}}: \varvec{x}_i\in \left\{ \begin{array}{cl} {\mathcal {X}}_\text {typical}, &\quad \text {if } s^{(e)}_i>\tau _\rho ^{(e)}\\ {\mathcal {X}}_\text {atypical}, &\quad \text {else} \end{array} \right. \end{aligned}$$
(2)

Thereby, the goal of the scoring procedure is to find those samples which are either incorrectly classified or correctly classified but with low confidence. As a result, \(\rho \) specifies the fraction of samples from known classes that is allowed to be incorrectly rejected as unknown, similar to [31].
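
A compact sketch of Eqs. 1 and 2 in NumPy is given below, assuming the CS head outputs softmax probabilities and class labels are encoded as \(0, \ldots, K{-}1\); using a quantile to obtain \(\tau _\rho ^{(e)}\) is an implementation-level assumption.

```python
# Sketch of dynamic intra-class splitting (Eqs. 1 and 2) with NumPy.
# cs_probs: (N, K) softmax outputs of the closed set head after epoch e;
# labels:   (N,)   ground-truth labels encoded as 0, ..., K-1.
import numpy as np

def intra_class_split(cs_probs, labels, rho):
    predictions = cs_probs.argmax(axis=1)
    # Eq. 1: score = predicted probability of the true class if the sample
    # is classified correctly, otherwise 0
    scores = np.where(predictions == labels,
                      cs_probs[np.arange(len(labels)), labels],
                      0.0)
    # Eq. 2: the rho-fraction of samples with the lowest scores is atypical
    threshold = np.quantile(scores, rho)
    return scores <= threshold  # True marks atypical samples
```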

Training

Let the deep feature extractor from Fig. 3 be denoted as \(f_0(\cdot )\), the OS layer be named \(f_\text {os}(\cdot )\) and the CS layer be denoted as \(f_\text {cs}(\cdot )\). Then, the resulting open set recognizer is defined as

$$\begin{aligned} f_\text {osr}(\cdot ) = (f_\text {os}\,\circ\,f_0)(\cdot )~, \end{aligned}$$
(3)

while the conventional closed set regularization is denoted as

$$\begin{aligned} f_\text {csr}(\cdot ) = (f_\text {cs}\,\circ\,f_0)(\cdot )~. \end{aligned}$$
(4)

Based on these definitions, the objective of the proposed method at each training epoch can be defined as:

$$\begin{aligned} \min _{f_0,f_\text {os}} \left( {\mathbb {E}}_{(\varvec{x},y)\sim {\mathcal {D}}}[{\mathcal {L}}_\text {os}(f_\text {osr}(\varvec{x}), \zeta ^{(e)}(\varvec{x})\cdot y)] + \lambda \cdot {\mathbb {E}}_{(\varvec{x},y)\sim {\mathcal {D}}}[{\mathcal {L}}_\text {cs}(f_\text {csr}(\varvec{x}), y)]\right) , \end{aligned}$$
(5)

where the OS loss \({\mathcal {L}}_{\mathrm {os}}\) and CS loss \({\mathcal {L}}_{\mathrm {cs}}\) are the learning objectives for regular \(K{+}1\)- and K-class classification problems, respectively. Moreover, the hyperparameter \(\lambda \) controls the trade-off between both losses.

In this work, the categorical cross-entropy loss is used for both terms in the objective function. Note that \(\zeta ^{(e)}(\cdot )\) is an indicator function that returns 1 if a given sample is affiliated to the typical subset and 0 otherwise. This means that typical samples maintain their original ground truths while atypical samples are assigned the new label zero during the optimization. The superscript (e) emphasizes that the outputs of \(\zeta ^{(e)}(\cdot )\) may change epoch by epoch because of the dynamic ICS.

Consequently, minimizing the first term in Eq. 5 corresponds to forcing the decision boundary to lie between the typical and atypical samples, i.e., minimizing the open space risk [27]. In contrast, minimizing the second term corresponds to minimizing the empirical risk on the training data from the known classes. Hence, the decision boundary is forced to enclose the known classes.
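
Putting the pieces together, one training epoch could look as follows. This is a sketch only: the optimizer, the value of \(\lambda \), and the batching details are illustrative assumptions, not the settings used in the experiments.

```python
# Sketch of one training epoch implementing Eq. 5, using the model and the
# intra_class_split function sketched above.
import numpy as np
import tensorflow as tf

lam = 1.0  # trade-off hyperparameter lambda (assumed value)
optimizer = tf.keras.optimizers.Adam()
cce = tf.keras.losses.SparseCategoricalCrossentropy()

def train_one_epoch(model, x_train, y_train, rho, batch_size=64):
    # Dynamic ICS: recompute the split at the beginning of every epoch
    _, cs_probs = model.predict(x_train, verbose=0)
    is_atypical = intra_class_split(cs_probs, y_train, rho)

    # OS labels: typical samples keep their class (shifted by +1 so that
    # index 0 stays free), atypical samples get the "unknown" label 0
    y_os = np.where(is_atypical, 0, y_train + 1)

    data = tf.data.Dataset.from_tensor_slices((x_train, y_os, y_train))
    for xb, yb_os, yb_cs in data.shuffle(len(x_train)).batch(batch_size):
        with tf.GradientTape() as tape:
            os_pred, cs_pred = model(xb, training=True)
            loss = cce(yb_os, os_pred) + lam * cce(yb_cs, cs_pred)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
```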

Evaluation

Setup

In an open set recognition scenario, K known classes and U unknown classes are present. In the conducted experiments, K was set to six to be consistent with previous studies, while U varied depending on the dataset.

To evaluate the performance of an open set recognizer, the balanced accuracy (BACCU) [4] was used as the fundamental metric. In order to be consistent with prior work [19, 26–28], known classes were denoted as positive classes, while unknown classes were considered negative classes. Accordingly, BACCU is defined as

$$\begin{aligned} \text {BACCU} = \frac{1}{2}\cdot \bigg (\frac{\textit{TN}}{\# \text { negative samples}} + \frac{\textit{TP}}{\# \text { positive samples}}\bigg ), \end{aligned}$$
(6)

where \(\textit{TN}\) (true negatives) is the number of correctly rejected negative samples and \(\textit{TP}\) (true positives) is the number of correctly classified positive samples. The BACCU gives the same weight to rejecting negative samples and to correctly classifying positive samples. Finally, in order to be consistent with prior work [19, 28], the area under the curve (AUC) and the closed set accuracy (CSACCU) were taken into consideration, too.
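
A minimal sketch of this metric, assuming the open set recognizer assigns label 0 to rejected samples, could look as follows.

```python
# Balanced accuracy (Eq. 6) from open set predictions. Label 0 marks samples
# rejected as unknown; this labeling convention is an assumption.
import numpy as np

def balanced_accuracy(y_true, y_pred, unknown_label=0):
    positives = y_true != unknown_label          # samples from known classes
    negatives = ~positives                       # samples from unknown classes
    tp = np.sum(positives & (y_pred == y_true))  # correctly classified knowns
    tn = np.sum(negatives & (y_pred == unknown_label))  # correctly rejected unknowns
    return 0.5 * (tn / negatives.sum() + tp / positives.sum())
```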

The backbone neural network architecture of the DICS method shared the same settings with [31]. The batch size was set to 64 and the network was trained for 80 epochs in each experiment.

Baseline Methods

We selected seven baselines including state-of-the-art methods from the literature for comparison.

Multi-Class Neural Network with Rejection Option (CRO) A multi-class classifier was trained on the known classes in a closed set configuration. Then, a rejection threshold \(\delta \) was selected such that at most 10% of the training samples were incorrectly rejected as unknown-class samples. During inference, samples with predicted scores lower than \(\delta \) were rejected as belonging to unknown classes. Note that this multi-class classifier shared the same architecture and hyperparameters as the proposed method.
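
A possible implementation of this threshold selection, assuming the maximum softmax probability is used as the predicted score, is sketched below.

```python
# Sketch of the CRO baseline: the rejection threshold delta is chosen as the
# 10% quantile of the maximum softmax scores on the training data.
import numpy as np

def select_rejection_threshold(train_probs, reject_fraction=0.10):
    # train_probs: (N, K) softmax outputs of the closed set classifier
    return np.quantile(train_probs.max(axis=1), reject_fraction)

def predict_with_rejection(test_probs, delta):
    preds = test_probs.argmax(axis=1) + 1       # known classes relabeled 1..K
    preds[test_probs.max(axis=1) < delta] = 0   # reject as unknown (label 0)
    return preds
```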

Extreme Value Machine (EVM) EVM was implemented based on [24] with the default suggested hyperparameters. \(\delta \) was set as 0.99 according to a grid search in the set \(\{0.01, 0.05, 0.1, 0.5, 0.9, 0.99, 0.999\}\).

One-Class Support Vector Machine with Multi-Class Classifier (OCMC) An OCSVM [32] with \(\nu =0.01\) and a radial basis function (RBF) kernel [6] was trained on the given known samples to reject unknown samples during inference. Then a multi-class classifier was trained on the known classes for the closed set prediction. Note that this multi-class classifier shared the same architecture and hyperparameters as the proposed method.
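
The rejection part of this baseline could, for instance, be realized with scikit-learn as sketched below; feeding raw or separately extracted feature vectors is an assumption.

```python
# Sketch of the OCMC rejection step: a one-class SVM trained on the known
# classes decides which test samples are treated as unknown.
from sklearn.svm import OneClassSVM

def ocmc_reject_mask(x_train_features, x_test_features):
    ocsvm = OneClassSVM(nu=0.01, kernel="rbf").fit(x_train_features)
    # predict returns +1 for inliers (known) and -1 for outliers (unknown)
    return ocsvm.predict(x_test_features) == -1  # True marks rejected samples
```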

Weibull Support Vector Machine (WSVM) The OCSVM and binary SVM were implemented using [22]. Both SVMs utilized an RBF kernel. We selected the hyperparameters as follows: \(\nu =0.01\) for OCSVM, \(C=2, \gamma = 0.03125\) for the binary SVM as suggested in [28]. A Weibull distribution was fitted according to [26]. The decision thresholds were set as \(\delta _\tau =0.001, \delta _R=0.5\).

OpenMAX We modified the code from [3] as little as possible to fit our datasets. In order to have a fair comparison, the backbone network shared the same architecture as our method. As suggested in [3], we used \(\alpha =1, \eta =20\) for all experiments.

Counterfactual Image Generation for OSR (CF) We translated the original code [19] from PyTorch [21] into Keras [1] in order to maintain a consistent experimental environment for all baselines. All hyperparameters were kept the same as in [19].

Intra-class Splitting (ICS) We kept all hyperparameters as in [30].

Fig. 4: Exemplary images from the datasets.

Datasets

We used three image datasets to validate the effectiveness of our method and to evaluate the sensitivity to the key hyperparameters. The first dataset, MNIST [15], contains gray-scale images of handwritten digits from 0 to 9. The number of training samples is around 6000 per class, while the number of test samples is around 1000 per class. The second dataset, SVHN [20], consists of color digit images from 0 to 9 obtained in the real world. Most classes contain around 5000 training samples and 2000 test samples. The third dataset, CIFAR-10 [13], is the most difficult of the considered datasets as it contains images of real-world objects such as airplanes, dogs and trucks. Each class in CIFAR-10 consists of 5000 training and 1000 test samples. Figure 4 shows exemplary images from the three datasets.

Table 1 Results with performance metrics (std.) in %

Comparison

First, the proposed DICS method was compared to the other baselines on all three datasets. In each experiment on a dataset, 6 classes from the training set were randomly selected as the known classes for training. Subsequently, we used all samples from the test set for the evaluation, i.e., 6 known classes and 4 unknown classes, as sketched below. The experiment was repeated five times for each dataset, and the results were reported as means and standard deviations (std.).
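
For example, for CIFAR-10, such an open set split could be prepared as follows; the random seed and the label remapping are illustrative assumptions.

```python
# Sketch of the open set split: six CIFAR-10 classes are randomly chosen as
# known classes for training, while the test set keeps all ten classes so that
# the remaining four act as unknown classes.
import numpy as np
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
y_train, y_test = y_train.flatten(), y_test.flatten()

rng = np.random.default_rng(seed=0)            # seed chosen for illustration
known = np.sort(rng.choice(10, size=6, replace=False))

train_mask = np.isin(y_train, known)
x_known, y_known = x_train[train_mask], y_train[train_mask]
# Remap the six known classes to contiguous labels 0..5 for the CS head
label_map = {int(c): i for i, c in enumerate(known)}
y_known = np.array([label_map[int(c)] for c in y_known])
```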

Table 1 shows the resulting BACCU on the three datasets mentioned above. DICS achieved a comparable or better performance than the original ICS method. On the datasets SVHN and CIFAR-10, the DICS method outperformed the original ICS. We argue that DICS adds stochastic behavior to the selection of atypical samples: at each epoch, a small number of typical samples is wrongly labeled as atypical, which leads to a higher robustness of the entire open set recognizer and thus a better performance on more complex datasets.

In an open set configuration, the AUC measures the ability of an open set recognizer to correctly reject samples from unknown classes, independent of a manually selected threshold. As shown in Table 1, the AUC shows a similar trend as the BACCU. Thereby, bold values mark the best result. Both ICS and DICS outperformed the other considered baselines. Comparing ICS and DICS, the dynamic splitting procedure seems to be superior for complex datasets such as CIFAR-10.

The major weakness of both ICS and DICS is the performance regarding the closed set accuracy as shown in Table 1. In order to achieve a tight decision boundary, the key idea of all ICS-based methods is to use atypical samples to shrink the resulting decision boundaries. Therefore, samples from known classes that are located near the decision boundaries are sometimes rejected as unknown classes. Nevertheless, in practice, a slightly lower closed set accuracy is tolerable [2, 19].

Fig. 5: Performance metrics over different openness.

Openness

Openness is used to describe how “open” an OSR problem is [28]. It is defined as:

$$\begin{aligned} \text {openness}= 1 - \sqrt{\frac{2\cdot K}{K+C}} \in [0,1)~, \end{aligned}$$
(7)

where K equals the number of known classes. Furthermore, \(C = K + U\), where U is the number of unknown classes encountered during testing. The more unknown classes are encountered during inference, the more open an OSR problem is. Thereby, an OSR problem with a higher openness typically requires a more advanced OSR algorithm. In other words, an optimal OSR algorithm should have consistent performance over different degrees of openness.
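
As a worked example, the comparison setup of the previous section with \(K=6\) known and \(U=4\) unknown classes, i.e., \(C=10\), corresponds to an openness of

$$\begin{aligned} \text {openness} = 1 - \sqrt{\frac{2\cdot 6}{6+10}} = 1 - \sqrt{0.75} \approx 0.134~. \end{aligned}$$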

In this work, we compared the performance of WSVM, CF, ICS and the proposed DICS method under different openness. These four methods were trained on six randomly selected classes from the dataset CIFAR-10. During inference, randomly selected images from the datasets CIFAR-100 [13] and Tiny ImageNet [25] were used as unknown classes. Thereby, seven different numbers of unknown classes were considered in this work: 5, 10, 20, 50, 75, 100 and 200. The images for the last case were randomly sampled from Tiny ImageNet, while the images for the other six cases were randomly selected from CIFAR-100. Here we only report the resulting balanced accuracy in Fig. 5a and AUC in Fig. 5b, because the closed set accuracy did not change with the number of unknown classes.

The ICS and DICS methods outperformed the other two models regarding BACCU and AUC. Figure 5 shows the performance over different openness with corresponding standard deviations of the four methods. On average, the proposed DICS method achieved a balanced accuracy comparable to the original ICS method, as depicted in Fig. 5a. As discussed before, we attribute this behavior to the robustness gained by the dynamic splitting at each epoch.

Fig. 6: Performance metrics over different splitting ratios.

Splitting Ratio

Similar to the original ICS method, the splitting ratio \(\rho \) plays a crucial role in the proposed DICS method. As shown in Fig. 6a, the balanced accuracy on all three datasets first increased and then decreased with ascending splitting ratios. Indeed, a small \(\rho \), such as \(\rho =1\%\), means that almost all training data maintain their original ground truths. Such a training procedure is similar to closed set classification. Therefore, the trained model cannot reliably reject samples from unknown classes. On the contrary, a large splitting ratio such as \(\rho =75\%\) means that the majority of the given training data has new labels differing from the original ground truths. In this case, the proposed closed set regularization cannot guarantee a high closed set accuracy. This trade-off can also be observed in the closed set accuracy shown in Fig. 6c: the proposed DICS achieved a high closed set accuracy with a small splitting ratio and vice versa. Interestingly, as shown in Fig. 6b, the AUC does not vary much over different splitting ratios. In a wide range of splitting ratios, such as \(1\%\le \rho \le 50\%\), the AUC remains on a similar level.

Initialization

In the DICS method, the given training data from the known classes are first split based on a randomly initialized network, i.e., the deep feature extractor and the CS layer. Hence, the splitting results in the first training epoch may not be reliable because the atypical samples are essentially selected at random in the beginning.

Therefore, the proposed method was evaluated under different network initializations to examine their impact. In this subsection, the proposed DICS method was tested on the datasets MNIST, SVHN and CIFAR-10 with a randomly selected combination of known classes. After fixing the known classes for training, we repeated the experiment five times with randomly initialized network weights and reported the averaged results and their standard deviations.

Figure 7 shows the resulting performance. On the three datasets, the DICS method achieved a consistent performance for different initializations of the neural networks. In particular, DICS had a low standard deviation regarding BACCU and AUC. As discussed before, the introduction of dynamic splitting enables a higher robustness in rejecting unknown classes. In contrast, we noticed that CSACCU had a slightly higher standard deviation than BACCU. A possible reason is that DICS confuses the classifier regarding the true labels of some known-class samples during the early training epochs.

Fig. 7: Impact of different initializations.

Discussion

Although one reason to develop a new method for OSR was the dependence of existing methods on hyperparameters, the proposed DICS method also depends on one crucial hyperparameter, the splitting ratio \(\rho \). Why is this better than other methods?

In OSR, there is a trade-off between correctly rejecting unknown classes and identifying known classes. Hence, if no prior information at all about unknown classes is available, there must be at least one inevitable hyperparameter that sets the trade-off between the two contradictory objectives. Regarding the proposed DICS method, this hyperparameter corresponds to the splitting ratio \(\rho \) which sets the fraction of training samples to be considered as atypical. In practice, the inevitable hyperparameter in OSR can often be set based on given regulations or experience.

In contrast to the proposed method, existing methods depend not only on one inevitable hyperparameter, but also on several algorithm-dependent hyperparameters. For example, WSVM depends on \(\nu \), \(\delta _{\tau }\) and \(\delta _R\) [28]. Likewise, CF has many algorithm-dependent hyperparameters. For example, CF utilizes the generated counterfactual images to represent unknown classes. Therefore, the quality of the generated images plays a key role in this algorithm [19]. Accordingly, the entire training procedure can be considered a general hyperparameter of CF, including the number of optimization steps and the weighting among all losses.

In fact, such algorithm-dependent hyperparameters complicate the usage of open set recognizers, because choosing them often requires a full understanding of the algorithm. Furthermore, algorithm-dependent hyperparameters are often sensitive and hard to fine-tune for new datasets or domains. Some of them even depend on the number of unknown classes, which is not known in practice. In contrast, the splitting ratio \(\rho \) of DICS is inevitable and easy to interpret.

Finally, it should be noted that the concrete network architectures are also hyperparameters in DICS. However, they can be considered as inevitable since they are common in almost all deep learning-based approaches.

Conclusion

We proposed a new method for open set recognition. By applying dynamic intra-class splitting (DICS), the method makes it possible to use an arbitrary deep neural network as a one-stage, end-to-end open set recognizer. Experiments on several image datasets showed its superiority over state-of-the-art methods regarding the compromise between closed set accuracy and rejection capability. In addition, the proposed method achieves a comparable or better performance than the formerly proposed two-stage ICS method. Admittedly, DICS still depends on a hyperparameter. However, we argue that this hyperparameter is inevitable and easier to choose than the algorithm-dependent hyperparameters of existing methods due to its easy interpretability. The experiments also indicated that DICS did not have the best closed set accuracy, although this may be tolerable in specific cases. Therefore, further research could focus on improving the closed set accuracy, for example by combining DICS with generative models or by choosing a more sophisticated network architecture.