1 Introduction

Neural networks (NNs) are increasingly used in safety-critical applications because they perform well even on complex problems. However, their notorious unreliability makes safety assurance all the more important. In particular, even if an NN is well trained on the data it is given and works well on similar data (so-called in-distribution (ID) data), it is unclear how it behaves when presented with a significantly different input (so-called out-of-distribution (OOD) data). For instance, what if an NN for traffic sign recognition trained on pictures taken in Nevada is now presented with a traffic sign in rainy weather, a European one, or a billboard with an elephant?

To ensure safety in all situations, we must at least recognize that an input is OOD and that, consequently, the network’s answer is unreliable, no matter its confidence. Verification, a classic approach for proving safety, is extremely costly and essentially infeasible for practical NNs [34]. Moreover, it is mainly done for ID or related data [6, 34]. For instance, robustness is typically proven for neighborhoods of essential points, which may ensure correct behavior in the presence of noise or rain, but not elephants [18, 24, 25, 35]. In contrast, runtime verification, and runtime monitoring in particular, provides a cheap alternative. The industry also finds it appealing, as it is currently the only formal-methods approach applicable to industrial-sized NNs.

OOD runtime monitoring methods have recently started flourishing  [7, 14, 20, 22, 32, 42]. Such a runtime monitor tries to detect if the current input to the NN is OOD. To this end, it typically monitors the behavior of the network (e.g., the output probabilities or the activation values of the neurons) and evaluates whether the obtained values resemble the ones observed on known ID data. If not, the monitor raises an alarm to convey suspicion about OOD data.

Fig. 1. Illustration of challenges for OOD detection

Challenges: While this approach has demonstrated potential, several practical issues arise:

  • How can we compare two monitors and determine which one is better? Consider the example of autonomous driving: an OOD input could arise because the sensors introduced noise or the brightness of the environment was perturbed. A monitor might perform well on one kind of OOD input but poorly on another [44], as better performance on one class of OOD data does not imply the same on another class (see Fig. 1a).

  • Applying a particular monitoring technique to a concrete NN involves significant tweaking and hyperparameter tuning, with no push-button technology available. OOD monitors typically compute a value from the input and the behavior of the NN. The input is considered OOD if this value is smaller than a configurable threshold \(\tau \) (see Fig. 1b); a minimal sketch of such a threshold-based monitor follows this list. The value of this threshold has a significant influence on the performance of the monitor: more inputs are classified as OOD if the threshold is high, and vice versa. Moreover, OOD monitors generally have multiple parameters that require tuning, which further aggravates the complexity of manual configuration.

  • As OOD monitoring can currently be described as a search for good heuristics, many more heuristics are bound to appear, which calls for streamlining their handling and fair comparison.
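
To make the role of the scoring value and the threshold \(\tau \) concrete, the following minimal sketch (assuming a PyTorch classifier; this is an illustration, not Monitizer’s interface) scores each input by its maximum softmax probability [17] and flags it as OOD when the score falls below \(\tau \).

```python
# Minimal sketch of a threshold-based OOD monitor: score inputs by the
# maximum softmax probability [17]; flag them as OOD when the score is
# below a configurable threshold tau. Not Monitizer's actual interface.
import torch
import torch.nn.functional as F

def max_softmax_score(net: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Per-input confidence score; higher means 'more in-distribution'."""
    with torch.no_grad():
        logits = net(x)
    return F.softmax(logits, dim=1).max(dim=1).values

def is_ood(net: torch.nn.Module, x: torch.Tensor, tau: float) -> torch.Tensor:
    """Boolean mask over the batch: True where the input is flagged as OOD."""
    return max_softmax_score(net, x) < tau
```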

In this paper, we provide the infrastructure for users and developers of NN monitors that aim at detecting OOD inputs (henceforth simply “monitors”).

Our contributions can be summarized as follows:

  • We provide a modular tool called Monitizer for automatically constructing (or learning), optimizing, and evaluating monitors.

  • Monitizer supports (i) easy practical use: it provides various recent monitors from the literature, which can be directly optimized and applied to user-given networks and datasets with no further inputs required; the push-button solution automatically chooses the best available monitor without requiring any expertise on the side of the user; and (ii) advanced development use: new monitors or new evaluation techniques can easily be integrated. The framework also anticipates and allows the integration of monitors for properties other than OOD.

  • We provide a library of 19 well-known monitors from the scientific literature to be used off-the-shelf, accompanied by 9 datasets and 15 NNs, which can be used for easy but rich automatic evaluation and comparison of monitors on various OOD categories.

  • We demonstrate the functionality on principled use cases, accompanied by examples and a case study comparing several recent monitoring approaches.

Altogether, we give users the infrastructure for automatically creating monitors, developing new methods, and comparing them to similar approaches.

2 Related Work

NN Monitoring Frameworks. OpenOOD [47, 48] contains task-specific benchmarks for OOD detection, each consisting of an ID dataset and multiple OOD datasets for specific tasks (e.g., Open Set Recognition and Anomaly Detection). Both OpenOOD and Monitizer contain several different monitors and benchmarks. Monitizer additionally provides functionality to tune the monitors for a given objective, supports a comprehensive evaluation of monitors on a specific ID dataset by automatically generating OOD inputs (e.g., by adding noise), and can easily be extended with more datasets. OpenOOD, in contrast to Monitizer, supports neither hyperparameter tuning nor the generation of OOD inputs.

Samuels et al. propose a framework to optimize an OOD monitor at runtime on newly experienced OOD inputs [26]. While this approach also involves optimization, the framework is specific to one monitor and is based on active learning. Monitizer, in contrast, works in an offline setting and optimizes a monitor before it is deployed. Additionally, Monitizer is built for extensibility and reusability, whereas the other framework is not (e.g., it lacks an executable).

PyTorch-OOD [27] is a library for OOD detection, yet despite its name, it is not part of the official PyTorch library. It includes several monitors and datasets, and supports the evaluation of the integrated monitors. Both Monitizer and PyTorch-OOD provide a library of monitors and datasets. However, there are significant differences: Monitizer supports the optimization of monitors, allowing it to return monitors that are optimal for a chosen objective, provides a more structured view of the datasets, and offers a transparent and detailed evaluation showing how a monitor performs on different OOD classes. Besides, we provide a one-click solution to evaluate the whole set of monitors and automatically return the best available option, fine-tuned to the case at hand. Consequently, Monitizer is much easier to use and extend. Last but not least, it is an alternative implementation that allows cross-checking outcomes, thereby making monitoring more trustworthy.

OOD Benchmarking. Various datasets have been published for OOD benchmarking [15, 16, 19, 37, 38]. Breitenstein et al. present a classification of different types of OOD data in automated driving [5], and Ferreira et al. propose a benchmark set for OOD detection with several different categories [11].

3 Monitizer

Monitizer aims to assist the developers and users of NN monitors and developers of new monitoring techniques by supporting optimization and transparent evaluation of their monitors. It structures OOD data in a hierarchy of classes, and a monitor can be tuned for any (combination) of these classes. It also provides a one-click solution to evaluate a set of monitors and return the best available option optimized for the given requirement.

3.1 Overview

Monitizer offers two main building blocks, as demonstrated in Fig. 2: optimization and evaluation of NN monitors. NN monitors are typically parameterized and usually depend on the NN and dataset. Before one can evaluate them, they need to be configured and possibly tuned. We refer to monitors that are not yet configured as monitor templates. Monitizer optimizes the monitor templates and evaluates them afterward on several different OOD classes, i.e., types of OOD data.

Fig. 2. Architecture of Monitizer: The required inputs are an NN and the dataset (both can be chosen from existing options). The dashed area indicates optional inputs, and the bold-faced option indicates the default value. The icons indicate which types of users are expected to use each of the options.

Monitizer needs at least two inputs (see Fig. 2): an NN and an ID dataset. The user can also provide a monitor template and an optimization configuration (consisting of an optimization objective and an optimization method). If these are not provided, Monitizer reverts to the default behavior, i.e., evaluating all monitors using the AUROC score without optimization. For both the NN and the dataset, the user can choose from the options we offer or provide a custom implementation.

Monitizer optimizes the provided monitor based on the optimization objective and method on the given ID dataset. An example of an optimization objective would be: maximize the detection accuracy on blurry images while keeping the accuracy on ID images at 70% or above. Optimization is necessary to obtain a monitor that is ready to use. However, it is also possible to evaluate a monitor template with the default values of its parameters using the AUROC score (Area Under the Receiver Operating Characteristic curve).
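
As an illustration of this default metric, the following sketch (using scikit-learn and an assumed `monitor_score` function; not Monitizer’s actual code) computes the AUROC from monitor scores on ID and OOD samples.

```python
# Sketch of threshold-free AUROC evaluation of a monitor. `monitor_score` is
# an assumed callable mapping an input to a scalar score, where higher scores
# should indicate ID. Not Monitizer's actual code.
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc(monitor_score, id_inputs, ood_inputs) -> float:
    scores_id = np.array([monitor_score(x) for x in id_inputs], dtype=float)
    scores_ood = np.array([monitor_score(x) for x in ood_inputs], dtype=float)
    # Label ID as 1 and OOD as 0; AUROC measures how well the score separates them.
    labels = np.concatenate([np.ones(len(scores_id)), np.zeros(len(scores_ood))])
    scores = np.concatenate([scores_id, scores_ood])
    return float(roc_auc_score(labels, scores))
```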

On successful execution, Monitizer provides the user with a configuration of the monitor template and the evaluation result. In the case of optimization, the result is a table with the OOD-detection accuracy for each OOD dataset, together with a parallel-coordinates plot of the same data; otherwise, it is the AUROC score.

3.2 Use Cases

We envision three different types of users for Monitizer:

  1. The End User

    Context: The end user of a monitor, e.g., an engineer in the aviation industry, is interested in the end product, not in the intricacies of the underlying monitoring technique. She has an NN that needs to be monitored and her own proprietary ID dataset, e.g., the one on which the NN was trained. She intends to evaluate one or all monitors provided by Monitizer for her custom NN and dataset and to decide which one to use. She wants a monitor fulfilling some requirement, e.g., one that is optimal on average over all OOD classes or one that can detect a specific type of OOD that her NN is not able to handle properly.

    Usage: Such a user can obtain a monitor tuned to her needs with little effort; Monitizer supports this out of the box. It provides various monitors (19 at present) that can be optimized for a given network. If she wants to use a custom NN or dataset, she has to provide the NN as a PyTorch dump or in ONNX format [4] and add a few lines of code implementing the interface for loading her data (a hypothetical sketch of such a loader is shown after the use cases below).

    Required Effort: After providing the interface for her custom dataset, the user only has to trigger the execution. The execution time depends on the hardware quality, the NN’s size, the chosen monitor’s complexity, and the dataset’s size.

  2. The Developer of Monitors

    Context: The developer of monitoring techniques, e.g., a researcher working in runtime verification of NNs, aims to create novel techniques and assess their performance in comparison to established methods.

    Usage: Such a user can plug their novel monitor into Monitizer and evaluate it. Monitizer directly provides the most commonly used NNs and datasets for academic evaluation.

    Required Effort: The code for the monitor needs to be in Python and should implement the functions specified in the interface for monitors in Monitizer. Afterward, she can trigger the evaluation of her monitoring technique.

  3. The Scholar

    Context: An expert in monitoring, e.g., an experienced researcher in NN runtime verification, intends to explore beyond the current boundaries. She might want to adapt an NN monitor to properties other than OOD, or to experiment with custom NNs or datasets.

    Usage: Monitizer provides interfaces and instructions for integrating new NNs, datasets, monitors, and custom optimization methods and objectives.

    Required Effort: The required integration effort depends on the complexity of the concrete use case. For example, adding an NN would take much less time than developing a new monitor.

More detailed examples are available in [1].
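
For concreteness, the following hypothetical sketch shows the kind of glue code an end user might write to expose a custom ID dataset; all class and method names here are illustrative assumptions and do not reflect Monitizer’s actual interface.

```python
# Hypothetical sketch of a custom ID-dataset loader (illustrative only; the
# class and method names are assumptions, not Monitizer's actual interface).
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class MyDatasetLoader:
    """Exposes a proprietary ID dataset, stored as image folders, as DataLoaders."""

    def __init__(self, root: str = "./my_data", batch_size: int = 64):
        self.root = root
        self.batch_size = batch_size
        self.transform = transforms.ToTensor()

    def train_loader(self) -> DataLoader:
        ds = datasets.ImageFolder(f"{self.root}/train", transform=self.transform)
        return DataLoader(ds, batch_size=self.batch_size, shuffle=True)

    def test_loader(self) -> DataLoader:
        ds = datasets.ImageFolder(f"{self.root}/test", transform=self.transform)
        return DataLoader(ds, batch_size=self.batch_size, shuffle=False)
```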

3.3 Phases of Monitizer

An execution of Monitizer is typically a sequence of three phases: parse, optimize, and evaluate. As mentioned, the user can decide to skip the optimization or the evaluation.

Parse. This phase parses the input, loads the NN and the dataset, and instantiates the monitor. It also performs sanity checks on the inputs, e.g., that the datasets are available in the file system and that the provided monitor is implemented correctly.

Optimize. This phase tunes the parameters of a given monitor template to maximize an objective. It depends on two inputs that the user has to provide: the optimization method and the optimization objective.

An illustrative depiction of this process can be found in [1]. The optimization method defines the search space and generates a new candidate monitor by setting its parameters. Monitizer then uses the optimization objective to evaluate this candidate. If the objective is to optimize at least one OOD class, Monitizer evaluates the monitor on a validation set of this class, which is distinct from the test set used in the evaluation later. The optimization method obtains this result and decides whether to continue optimizing or stop and return the best monitor that it has found.

Monitizer provides three optimization methods: random search, grid search, and gradient descent. Random search tries a specified number of random parameter sets and returns the monitor that performed best among them. Grid search defines a grid on the search space spanned by the minimal and maximal values of the parameters; for each grid vertex, the monitor is instantiated with the corresponding parameters and evaluated on the objective. Gradient descent follows the gradient of the objective function towards an optimum.
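
A minimal sketch of the random-search strategy for a single threshold parameter (with an assumed `evaluate_objective` callback and parameter bounds; not Monitizer’s actual implementation) could look as follows.

```python
# Sketch of random search over a single monitor parameter (a threshold).
# `evaluate_objective` is an assumed callback returning a score to maximize,
# e.g., detection accuracy on a validation set of the chosen OOD class.
import random

def random_search(evaluate_objective, low: float, high: float, trials: int = 100):
    best_tau, best_score = None, float("-inf")
    for _ in range(trials):
        tau = random.uniform(low, high)   # sample a candidate threshold
        score = evaluate_objective(tau)   # evaluate the candidate monitor
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau, best_score
```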

Monitizer supports multi-objective optimization of monitors. A user can specify a set of OOD classes to optimize for and the minimum required accuracy of ID detection. Single-objective optimization is the special case in which only one OOD class is specified. Based on a configuration value, Monitizer generates a set of weight combinations for the objectives and creates and evaluates a monitor for each combination. If there are two objectives, Monitizer generates a Pareto-frontier plot; for more than two objectives, it generates a table. In either case, the user obtains the performance of the optimized monitor for each weight combination of objectives.
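
The weight combinations can be read as a weighted-sum scalarization of the per-class objectives; the following sketch (with an assumed `accuracy_on` helper; not Monitizer’s actual code) illustrates the idea for two OOD classes.

```python
# Sketch of weighted-sum scalarization over OOD classes. `accuracy_on` is an
# assumed helper returning the detection accuracy of a monitor (configured
# with threshold tau) on a validation set of the given OOD class.
import numpy as np

def scalarized_objective(tau, weights, ood_classes, accuracy_on) -> float:
    accuracies = [accuracy_on(tau, c) for c in ood_classes]
    return float(np.dot(weights, accuracies))

# Sweeping weights (w, 1 - w) for two classes traces an approximate Pareto frontier:
# weight_grid = [(w, 1.0 - w) for w in np.linspace(0.0, 1.0, num=11)]
```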

Fig. 3. Class diagram depicting the different types of OOD data.

Evaluate. The evaluation of NN monitors in Monitizer is structured according to the OOD classification (detailed in the next section). We introduce this classification of OOD data to enable a clearer evaluation and to learn which monitor performs well on which particular class of OOD. Typically, no monitor performs well on every class of OOD [44]. We highlight this in our evaluation to ensure a fair and meaningful comparison between monitors rather than restricting ourselves to a non-transparent and possibly biased average score.

After evaluation, Monitizer reports the detection accuracy for each OOD class and can also produce a parallel-coordinates plot displaying the reported accuracies. Monitizer can also provide confidence intervals for the evaluation quality, as explained in [1].

3.4 Classification of Out-of-Distribution Data

We now introduce our classification of OOD data. At the top level, an OOD input can either be generated, i.e., obtained by distorting ID data [3, 14, 17, 31, 41], or it can be collected using data from some other available dataset.

Fig. 4. Examples for OOD

The notion of generated OOD is straightforward: these classes are created by slightly distorting ID data, for example, by increasing the contrast or adding noise. An important factor is the amount of distortion, e.g., the amount of noise, as it influences the NN’s performance and needs to be high enough to turn an ID input into an OOD one.
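
As an illustration, the following sketch (operating on PyTorch/torchvision image tensors; not Monitizer’s actual generation code) produces two generated OOD variants of an ID image by adding Gaussian noise and increasing the contrast.

```python
# Sketch of generating OOD inputs by distorting ID images (Gaussian noise and
# increased contrast). The distortion strength must be large enough to push
# the image out of distribution.
import torch
import torchvision.transforms.functional as TF

def add_gaussian_noise(img: torch.Tensor, std: float = 0.3) -> torch.Tensor:
    """img: float tensor in [0, 1] with shape (C, H, W)."""
    return (img + std * torch.randn_like(img)).clamp(0.0, 1.0)

def increase_contrast(img: torch.Tensor, factor: float = 2.0) -> torch.Tensor:
    """Contrast factor > 1 increases contrast; 1 leaves the image unchanged."""
    return TF.adjust_contrast(img, factor)
```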

We explain the idea of collected OOD with the help of the example shown in Fig. 4. Consider an ID dataset that consists of textures (Fig. 4a). Images containing objects (Fig. 4b) differ strongly from images showing just a texture. However, when we consider a dataset of numbers as ID (Fig. 4c), a dataset of letters (Fig. 4d) seems much more similar to it than objects are to textures. In the first case, the datasets share no common meaning or concept, as if the new data belonged to a new world. In the second case, the environment and the underlying concept are similar, but an unseen object is placed in it.

Figure 3 shows our classification of the OOD data. It is based on the kind of OOD data we found in the literature (discussed in Sect. 2). [1] contains a detailed description of each class and an illustrative figure.

OOD Benchmarks Implementation. Note that generated OOD data is automatically created by Monitizer for any given ID dataset, whereas collected OOD data has to be selected manually. We provide a few preselected datasets in the tool (for example, KMNIST [9] as unseen objects for MNIST [29]), and a user can easily add more when needed. However, for a user such as the developer of monitors, MNIST and CIFAR-10 are often sufficient to test new monitoring methodologies, as related work has shown [13, 20].
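
As an example of collected OOD data, the following sketch (using standard torchvision datasets; not Monitizer’s actual loading code) pairs MNIST as the ID dataset with KMNIST as an unseen-objects OOD dataset.

```python
# Sketch: MNIST as the ID dataset and KMNIST as a collected OOD dataset
# ("unseen objects"), both loaded via standard torchvision datasets.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()
id_set = datasets.MNIST(root="./data", train=False, download=True, transform=to_tensor)
ood_set = datasets.KMNIST(root="./data", train=False, download=True, transform=to_tensor)

id_loader = DataLoader(id_set, batch_size=64, shuffle=False)
ood_loader = DataLoader(ood_set, batch_size=64, shuffle=False)
```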

3.5 Library of Monitors, NNs, and Datasets

Monitizer currently includes 19 monitors, accompanied by 9 datasets and 15 NNs. In the following, we give an overview of the available options.

Monitors. Monitizer provides several highly cited monitors, which are also included in other tools such as OpenOOD and PyTorch-OOD. We extended this list by adding monitors from the formal-methods community (e.g., the Box monitor and the Gaussian monitor). The following monitors are available in Monitizer: ASH-B, ASH-P, ASH-S [10], Box monitor [20], DICE [42], Energy [32], Entropy [33], Gaussian [13], GradNorm [23], KL Matching [15], KNN [43], MaxLogit [50], MDS [30], Softmax [17], ODIN [31], ReAct [41], Mahalanobis [39], SHE [49], Temperature [12], VIM [45].
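
For instance, the Energy monitor [32] scores an input by the log-sum-exp of the logits (the negative free energy); a minimal sketch (assuming a PyTorch classifier and temperature \(T = 1\); not Monitizer’s implementation) is shown below.

```python
# Minimal sketch of the energy-based score from [32]:
# score(x) = T * logsumexp(f(x) / T), where f(x) are the logits.
# Higher scores indicate ID; inputs whose score falls below a tuned
# threshold are flagged as OOD. T = 1 here for simplicity.
import torch

def energy_score(net: torch.nn.Module, x: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    with torch.no_grad():
        logits = net(x)
    return T * torch.logsumexp(logits / T, dim=1)
```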

Datasets. The following datasets are available in Monitizer: CIFAR-10, CIFAR-100 [28], DTD [8], FashionMNIST [46], GTSRB [21], ImageNet [40], K-MNIST [9], MNIST [29], SVHN [36].

Neural Networks. Monitizer provides at least one pretrained NN for each available dataset. The library contains additional NNs trained on datasets commonly used in academia, such as MNIST and CIFAR-10, allowing users to evaluate monitors on different architectures. [1] contains a detailed description of the pretrained NNs.

4 Summary of Evaluation by Case Study

Table 1. Comparison of the AUROC scores of all implemented monitors on different OOD datasets, multiplied by 100 and rounded to the nearest integer. All monitors were evaluated on a fully connected network trained on MNIST. The cells are colored according to the relative performance of a monitor (column) in a specific OOD class (row). The monitors are divided into three ranks, and a darker color represents better performance. If several monitors have the same score, they all belong to the better group.

We demonstrate the necessity of having a clear evaluation in Table 1. The full table containing all available OOD datasets can be found in [1]. We evaluate the available monitors on a network trained on the MNIST dataset on a GPU and depict the AUROC score. The values of MDS and Mahalanobis can differ when switching between CPU and GPU; refer to [1] for details. The Box monitor [20] is not included as it does not have a single threshold and, therefore, no AUROC score can be computed. The table shows the ranking of the monitors for the detection of Gaussian noise, increased contrast, color inversion, rotation, and a new, albeit similar dataset (KMNIST). A darker color indicates a better ranking. One can see that there is barely any common behavior among the monitors. For example, while GradNorm performs best on Gaussian noise, it performs worst on inverted images.

This also shows that it is important for the user to define her goal for the monitor. Not every monitor will be great at detecting a particular type of OOD, and she must carefully choose the right monitor for her setting. Monitizer eases this task. In addition, it highlights the need for a clear evaluation of new monitoring methods in scientific publications.

We illustrate further features of Monitizer using the following four monitors: Energy [32], ODIN [31], Box [20], and Gaussian [13]. The first two were proposed by the machine-learning community, and the latter two by the formal methods community.

Fig. 5. The monitor templates were optimized on MNIST as ID data and for detecting New World / CIFAR-10 data as OOD while keeping at least 70% accuracy on ID data. All monitors were optimized using random search.

The output produced by Monitizer in the form of tables and plots (depicted in Fig. 5) helps the user see how the choice of monitor, optimization objective, and dataset affects the monitor’s effectiveness. Monitizer allows users to experiment with different choices and select the one suitable for their needs. Figure 5 shows the evaluation of the mentioned monitors with the MNIST dataset as ID data, optimized with the goal of detecting preselected images of the CIFAR-10 dataset, as those are entirely unknown to the network. The optimization was performed with random search. As a result, the Gaussian monitor correctly classifies only around 70% of ID data, whereas the other monitors have higher accuracy on ID data. Consequently, the other monitors perform worse than the Gaussian monitor in detecting OOD data, as there is a tradeoff between good performance on ID and OOD data. This highlights the necessity of proper optimization for each monitor. See [1] for a detailed evaluation, where we report on experiments with different monitors, optimization objectives, and datasets.

Our experiments show that different monitors have different strengths and limitations. One can tune a monitor for a specific purpose (e.g., detecting a particular OOD class with very high accuracy); however, this affects its performance in other OOD classes.

5 Conclusion

Monitizer is a tool for automating the design and evaluation of NN monitors. It supports developers of new monitoring techniques, potential users of available monitors, and researchers attempting to improve the state of the art. In particular, it optimizes the monitor for the objectives specified by the user and thoroughly evaluates it.

Monitizer provides a library of 19 monitors, accompanied by 9 datasets and 15 NNs (at least one for each dataset), and three optimization methods (random, grid-search, and gradient descent). Additionally, all these inputs can be easily customized by a few lines of Python code, allowing a user to provide their monitors, datasets, and networks. The framework is extensible so that the user can implement their custom optimization methods and objectives.

Monitizer is an open-source tool providing a freely available platform for new monitors and easing their evaluation. It is publicly available at https://gitlab.com/live-lab/software/monitizer.