1 Introduction

Neural networks (NNs) are increasingly used in safety-critical applications because they perform well even on complex problems. However, their notorious unreliability makes safety assurance all the more important. In particular, even if an NN is well trained on the data it is given and works well on similar data (so-called in-distribution (ID) data), it is unclear how it behaves when presented with a significantly different input (so-called out-of-distribution (OOD) data). For instance, what if an NN for traffic sign recognition trained on pictures taken in Nevada is now presented with a traffic sign in rainy weather, a European one, or a billboard with an elephant?

To ensure safety in all situations, we must at least recognize that an input is OOD and that, consequently, the network’s answer is unreliable, no matter its confidence. Verification, a classic approach for proving safety, is extremely costly and essentially infeasible for practical NNs [34]. Moreover, it is mainly done for ID or related data [6, 34]. For instance, robustness is typically proven for neighborhoods of essential points, which may ensure correct behavior in the presence of noise or rain, but not elephants [18, 24, 25, 35]. In contrast, runtime verification, and runtime monitoring in particular, provides a cheap alternative. The industry also finds it appealing, as it is currently the only formal-methods approach applicable to industrial-sized NNs.

OOD runtime monitoring methods have recently started flourishing  [7, 14, 20, 22, 32, 42]. Such a runtime monitor tries to detect if the current input to the NN is OOD. To this end, it typically monitors the behavior of the network (e.g., the output probabilities or the activation values of the neurons) and evaluates whether the obtained values resemble the ones observed on known ID data. If not, the monitor raises an alarm to convey suspicion about OOD data.

Fig. 1. Illustration of challenges for OOD detection

Challenges: While this approach has demonstrated potential, several practical issues arise:

  • How can we compare two monitors and determine which one is better? Consider the example of autonomous driving: an OOD input could arise because the sensors introduced noise or the brightness of the environment was perturbed. A monitor might perform well on one kind of OOD input but poorly on another [44], as better performance on one class of OOD data does not imply the same on another class (see Fig. 1a).

  • Applying a particular monitoring technique to a concrete NN involves significant tweaking and hyperparameter tuning, with no push-button technology available. OOD monitors typically compute a value from the input and the behavior of the NN. The input is considered OOD if this value is smaller than a configurable threshold \(\tau \) (see Fig. 1b); a minimal sketch of such a threshold-based monitor follows this list. The value of this threshold has a significant influence on the performance of the monitor: more inputs are classified as OOD if the threshold is high, and vice versa. Moreover, OOD monitors generally have multiple parameters that require tuning, which further aggravates the complexity of manual configuration.

  • As OOD monitoring can currently be described as a search for good heuristics, many more heuristics are bound to appear, which calls for streamlining their handling and fair comparison.
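
To make the role of the scoring value and the threshold \(\tau \) concrete, the following minimal sketch (assuming a PyTorch classifier; this is an illustration, not Monitizer’s interface) scores each input by its maximum softmax probability [17] and flags it as OOD when the score falls below \(\tau \).

```python
# Minimal sketch of a threshold-based OOD monitor: score inputs by the
# maximum softmax probability [17]; flag them as OOD when the score is
# below a configurable threshold tau. Not Monitizer's actual interface.
import torch
import torch.nn.functional as F

def max_softmax_score(net: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Per-input confidence score; higher means 'more in-distribution'."""
    with torch.no_grad():
        logits = net(x)
    return F.softmax(logits, dim=1).max(dim=1).values

def is_ood(net: torch.nn.Module, x: torch.Tensor, tau: float) -> torch.Tensor:
    """Boolean mask over the batch: True where the input is flagged as OOD."""
    return max_softmax_score(net, x) < tau
```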

In this paper, we provide the infrastructure for users and developers of NN monitors that aim at detecting OOD inputs (henceforth simply “monitors”).

Our contributions can be summarized as follows:

  • We provide a modular tool called Monitizer for automatically constructing (or learning), optimizing, and evaluating monitors.

  • Monitizer supports (i) easy practical use: it provides various recent monitors from the literature, which can be directly optimized and applied to user-given networks and datasets with no further inputs required; the push-button solution automatically chooses the best available monitor without requiring any expertise on the side of the user; and (ii) advanced development use: new monitors or new evaluation techniques can easily be integrated. The framework also anticipates and allows the integration of monitors for properties other than OOD.

  • We provide a library of 19 well-known monitors from the scientific literature to be used off-the-shelf, accompanied by 9 datasets and 15 NNs, which can be used for easy but rich automatic evaluation and comparison of monitors on various OOD categories.

  • We demonstrate the functionality on principled use cases, accompanied by examples and a case study comparing several recent monitoring approaches.

Altogether, we give users the infrastructure for automatically creating monitors, developing new methods, and comparing them to similar approaches.

2 Related Work

NN Monitoring Frameworks. OpenOOD [47, 48] contains task-specific benchmarks for OOD detection, each consisting of an ID dataset and multiple OOD datasets for specific tasks (e.g., Open Set Recognition and Anomaly Detection). Both OpenOOD and Monitizer contain several different monitors and benchmarks. Monitizer additionally provides functionality to tune the monitors for a given objective, supports a comprehensive evaluation of monitors on a specific ID dataset by automatically generating OOD inputs (e.g., by adding noise), and can easily be extended with more datasets. OpenOOD, in contrast to Monitizer, supports neither hyperparameter tuning nor the generation of OOD inputs.

Samuels et al. propose a framework to optimize an OOD monitor at runtime on newly experienced OOD inputs [26]. While this approach also involves optimization, the framework is specific to one monitor and is based on active learning. Monitizer, in contrast, works in an offline setting and optimizes a monitor before it is deployed. Additionally, Monitizer is built for extensibility and reusability, whereas the other framework is not (e.g., it lacks an executable).

PyTorch-OOD [27] is a library for OOD detection, yet despite its name, it is not part of the official PyTorch library. It includes several monitors and datasets, and supports the evaluation of the integrated monitors. Both Monitizer and PyTorch-OOD provide a library of monitors and datasets. However, there are significant differences: Monitizer supports the optimization of monitors, allowing it to return monitors that are optimal for a chosen objective, provides a more structured view of the datasets, and offers a transparent and detailed evaluation showing how a monitor performs on different OOD classes. Besides, we provide a one-click solution to evaluate the whole set of monitors and automatically return the best available option, fine-tuned to the case at hand. Consequently, Monitizer is much easier to use and extend. Last but not least, it is an alternative implementation that allows cross-checking outcomes, thereby making monitoring more trustworthy.

OOD Benchmarking. Various datasets have been published for OOD benchmarking [15, 16, 19, 37, 38]. Breitenstein et al. present a classification of different types of OOD data in automated driving [5], and Ferreira et al. propose a benchmark set for OOD detection with several different categories [11].

3 Monitizer

Monitizer aims to assist the developers and users of NN monitors and developers of new monitoring techniques by supporting optimization and transparent evaluation of their monitors. It structures OOD data in a hierarchy of classes, and a monitor can be tuned for any (combination) of these classes. It also provides a one-click solution to evaluate a set of monitors and return the best available option optimized for the given requirement.

3.1 Overview

Monitizer offers two main building blocks, as demonstrated in Fig. 2: optimization and evaluation of NN monitors. NN monitors are typically parameterized and usually depend on the NN and dataset. Before one can evaluate them, they need to be configured and possibly tuned. We refer to monitors that are not yet configured as monitor templates. Monitizer optimizes the monitor templates and evaluates them afterward on several different OOD classes, i.e., types of OOD data.

Fig. 2. Architecture of Monitizer: The required inputs are an NN and the dataset (both can be chosen from existing options). The dashed area indicates optional inputs, and the bold-faced option indicates the default value. The icons indicate which types of users are expected to use each of the options.

Monitizer needs at least two inputs (see Fig. 2): an NN and an ID dataset. The user can also provide a monitor template and an optimization configuration (consisting of an optimization objective and an optimization method). If these are not provided, Monitizer reverts to the default behavior, i.e., evaluating all monitors using the AUROC score without optimization. For both the NN and the dataset, the user can choose from the options we offer or provide a custom implementation.

Monitizer optimizes the provided monitor based on the optimization objective and method on the given ID dataset. An example of an optimization objective would be: maximize the detection accuracy on blurry images while keeping the accuracy on ID images at 70% or above. Optimization is necessary to obtain a monitor that is ready to use. However, it is also possible to evaluate a monitor template with the default values of its parameters using the AUROC score (Area Under the Receiver Operating Characteristic curve).
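
As an illustration of this default metric, the following sketch (using scikit-learn and an assumed `monitor_score` function; not Monitizer’s actual code) computes the AUROC from monitor scores on ID and OOD samples.

```python
# Sketch of threshold-free AUROC evaluation of a monitor. `monitor_score` is
# an assumed callable mapping an input to a scalar score, where higher scores
# should indicate ID. Not Monitizer's actual code.
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc(monitor_score, id_inputs, ood_inputs) -> float:
    scores_id = np.array([monitor_score(x) for x in id_inputs], dtype=float)
    scores_ood = np.array([monitor_score(x) for x in ood_inputs], dtype=float)
    # Label ID as 1 and OOD as 0; AUROC measures how well the score separates them.
    labels = np.concatenate([np.ones(len(scores_id)), np.zeros(len(scores_ood))])
    scores = np.concatenate([scores_id, scores_ood])
    return float(roc_auc_score(labels, scores))
```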

On successful execution, Monitizer provides the user with a configuration of the monitor template and the evaluation result. In the case of optimization, the result is a table with the OOD-detection accuracy for each OOD dataset, together with a parallel-coordinates plot of the same data; otherwise, it is the AUROC score.

3.2 Use Cases

We envision three different types of users for Monitizer:

  1. The End User

    Context: The end user of a monitor, e.g., an engineer in the aviation industry, is interested in the end product, not in the intricacies of the underlying monitoring technique. She has an NN that needs to be monitored and her own proprietary ID dataset, e.g., the one on which the NN was trained. She intends to evaluate one or all monitors provided by Monitizer for her custom NN and dataset and to decide which one to use. She wants a monitor fulfilling some requirement, e.g., one that is optimal on average over all OOD classes or one that can detect a specific type of OOD that her NN is not able to handle properly.

    Usage: Such a user can obtain a monitor tuned to her needs with little effort; Monitizer supports this out of the box. It provides various monitors (19 at present) that can be optimized for a given network. If she wants to use a custom NN or dataset, she has to provide the NN as a PyTorch dump or in ONNX format [4] and add a few lines of code implementing the interface for loading her data (a hypothetical sketch of such a loader is shown after the use cases below).

    Required Effort: After providing the interface for her custom dataset, the user only has to trigger the execution. The execution time depends on the hardware quality, the NN’s size, the chosen monitor’s complexity, and the dataset’s size.

  2. The Developer of Monitors

    Context: The developer of monitoring techniques, e.g., a researcher working in runtime verification of NNs, aims to create novel techniques and assess their performance in comparison to established methods.

    Usage: Such a user can plug their novel monitor into Monitizer and evaluate it. Monitizer directly provides the most commonly used NNs and datasets for academic evaluation.

    Required Effort: The code for the monitor needs to be in Python and should implement the functions specified in the interface for monitors in Monitizer. Afterward, she can trigger the evaluation of her monitoring technique.

  3. The Scholar

    Context: An expert in monitoring, e.g., an experienced researcher in NN runtime verification, intends to explore beyond the current boundaries. She might want to adapt an NN monitor to properties other than OOD, or to experiment with custom NNs or datasets.

    Usage: Monitizer provides interfaces and instructions for integrating new NNs, datasets, monitors, and custom optimization methods and objectives.

    Required Effort: The required integration effort depends on the complexity of the concrete use case. For example, adding an NN would take much less time than developing a new monitor.

More detailed examples are available in [1].
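
For concreteness, the following hypothetical sketch shows the kind of glue code an end user might write to expose a custom ID dataset; all class and method names here are illustrative assumptions and do not reflect Monitizer’s actual interface.

```python
# Hypothetical sketch of a custom ID-dataset loader (illustrative only; the
# class and method names are assumptions, not Monitizer's actual interface).
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class MyDatasetLoader:
    """Exposes a proprietary ID dataset, stored as image folders, as DataLoaders."""

    def __init__(self, root: str = "./my_data", batch_size: int = 64):
        self.root = root
        self.batch_size = batch_size
        self.transform = transforms.ToTensor()

    def train_loader(self) -> DataLoader:
        ds = datasets.ImageFolder(f"{self.root}/train", transform=self.transform)
        return DataLoader(ds, batch_size=self.batch_size, shuffle=True)

    def test_loader(self) -> DataLoader:
        ds = datasets.ImageFolder(f"{self.root}/test", transform=self.transform)
        return DataLoader(ds, batch_size=self.batch_size, shuffle=False)
```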

3.3 Phases of Monitizer

An execution of Monitizer is typically a sequence of three phases: parse, optimize, and evaluate. As mentioned, the user can decide to skip the optimization or the evaluation.

Parse. This phase parses the input, loads the NN and the dataset, and instantiates the monitor. It also performs sanity checks on the inputs, e.g., that the datasets are available in the file system and that the provided monitor is implemented correctly.

Optimize. This phase tunes the parameters of a given monitor template to maximize an objective. It depends on two inputs that the user has to provide: the optimization method and the optimization objective.

An illustrative depiction of this process can be found in [1]. The optimization method defines the search space and generates a new candidate monitor by setting its parameters. Monitizer then uses the optimization objective to evaluate this candidate. If the objective is to optimize at least one OOD class, Monitizer evaluates the monitor on a validation set of this class, which is distinct from the test set used in the evaluation later. The optimization method obtains this result and decides whether to continue optimizing or stop and return the best monitor that it has found.

Monitizer provides three optimization methods: random search, grid search, and gradient descent. Random search tries a specified number of random parameter sets and returns the monitor that performed best among them. Grid search defines a grid on the search space spanned by the minimal and maximal values of the parameters; for each grid vertex, the monitor is instantiated with the corresponding parameters and evaluated on the objective. Gradient descent follows the gradient of the objective function towards an optimum.
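
A minimal sketch of the random-search strategy for a single threshold parameter (with an assumed `evaluate_objective` callback and parameter bounds; not Monitizer’s actual implementation) could look as follows.

```python
# Sketch of random search over a single monitor parameter (a threshold).
# `evaluate_objective` is an assumed callback returning a score to maximize,
# e.g., detection accuracy on a validation set of the chosen OOD class.
import random

def random_search(evaluate_objective, low: float, high: float, trials: int = 100):
    best_tau, best_score = None, float("-inf")
    for _ in range(trials):
        tau = random.uniform(low, high)   # sample a candidate threshold
        score = evaluate_objective(tau)   # evaluate the candidate monitor
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau, best_score
```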

Monitizer supports multi-objective optimization of monitors. A user can specify a set of OOD classes to optimize for and the minimum required accuracy of ID detection. Single-objective optimization is the special case in which only one OOD class is specified. Based on a configuration value, Monitizer generates a set of weight combinations for the objectives and creates and evaluates a monitor for each combination. If there are two objectives, Monitizer generates a Pareto-frontier plot; for more than two objectives, it generates a table. In either case, the user obtains the performance of the optimized monitor for each weight combination of objectives.
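
The weight combinations can be read as a weighted-sum scalarization of the per-class objectives; the following sketch (with an assumed `accuracy_on` helper; not Monitizer’s actual code) illustrates the idea for two OOD classes.

```python
# Sketch of weighted-sum scalarization over OOD classes. `accuracy_on` is an
# assumed helper returning the detection accuracy of a monitor (configured
# with threshold tau) on a validation set of the given OOD class.
import numpy as np

def scalarized_objective(tau, weights, ood_classes, accuracy_on) -> float:
    accuracies = [accuracy_on(tau, c) for c in ood_classes]
    return float(np.dot(weights, accuracies))

# Sweeping weights (w, 1 - w) for two classes traces an approximate Pareto frontier:
# weight_grid = [(w, 1.0 - w) for w in np.linspace(0.0, 1.0, num=11)]
```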

Fig. 3. Class diagram depicting the different types of OOD data.

Evaluate. The evaluation of NN monitors in Monitizer is structured according to the OOD classification (detailed in the next section). We introduce this classification of OOD data to enable a clearer evaluation and to learn which monitor performs well on which particular class of OOD. Typically, no monitor performs well on every class of OOD [44]. We highlight this in our evaluation to ensure a fair and meaningful comparison between monitors rather than restricting ourselves to a non-transparent and possibly biased average score.

After evaluation, Monitizer reports the detection accuracy for each OOD class and can also produce a parallel-coordinates plot displaying the reported accuracies. Monitizer can also provide confidence intervals for the evaluation quality, as explained in [1].

3.4 Classification of Out-of-Distribution Data

We now introduce our classification of OOD data. At the top level, an OOD input can either be generated, i.e., obtained by distorting ID data [3, 14, 17, 31, 41], or it can be collected using data from some other available dataset.

Fig. 4. Examples for OOD

The notion of generated OOD is straightforward: these classes are created by slightly distorting ID data, for example, by increasing the contrast or adding noise. An important factor is the amount of distortion, e.g., the amount of noise, as it influences the NN’s performance and needs to be high enough to turn an ID input into an OOD one.
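
As an illustration, the following sketch (operating on PyTorch/torchvision image tensors; not Monitizer’s actual generation code) produces two generated OOD variants of an ID image by adding Gaussian noise and increasing the contrast.

```python
# Sketch of generating OOD inputs by distorting ID images (Gaussian noise and
# increased contrast). The distortion strength must be large enough to push
# the image out of distribution.
import torch
import torchvision.transforms.functional as TF

def add_gaussian_noise(img: torch.Tensor, std: float = 0.3) -> torch.Tensor:
    """img: float tensor in [0, 1] with shape (C, H, W)."""
    return (img + std * torch.randn_like(img)).clamp(0.0, 1.0)

def increase_contrast(img: torch.Tensor, factor: float = 2.0) -> torch.Tensor:
    """Contrast factor > 1 increases contrast; 1 leaves the image unchanged."""
    return TF.adjust_contrast(img, factor)
```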

We explain the idea of collected OOD with the help of the example shown in Fig. 4. Consider an ID dataset that consists of textures (Fig. 4a). Images containing objects (Fig. 4b) differ strongly from images showing just a texture. However, when we consider a dataset of numbers as ID (Fig. 4c), a dataset of letters (Fig. 4d) seems much more similar to it than objects are to textures. In the first case, the datasets share no common meaning or concept, as if the new data belonged to a new world. In the second case, the environment and the underlying concept are similar, but an unseen object is placed in it.

Figure 3 shows our classification of the OOD data. It is based on the kind of OOD data we found in the literature (discussed in Sect. 2). [1] contains a detailed description of each class and an illustrative figure.

OOD Benchmarks Implementation. Note that generated OOD data is automatically created by Monitizer for any given ID dataset, whereas collected OOD data has to be selected manually. We provide a few preselected datasets in the tool (for example, KMNIST [9] as unseen objects for MNIST [29]), and a user can easily add more when needed. However, for a user such as the developer of monitors, MNIST and CIFAR-10 are often sufficient to test new monitoring methodologies, as related work has shown [13, 20].
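
As an example of collected OOD data, the following sketch (using standard torchvision datasets; not Monitizer’s actual loading code) pairs MNIST as the ID dataset with KMNIST as an unseen-objects OOD dataset.

```python
# Sketch: MNIST as the ID dataset and KMNIST as a collected OOD dataset
# ("unseen objects"), both loaded via standard torchvision datasets.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()
id_set = datasets.MNIST(root="./data", train=False, download=True, transform=to_tensor)
ood_set = datasets.KMNIST(root="./data", train=False, download=True, transform=to_tensor)

id_loader = DataLoader(id_set, batch_size=64, shuffle=False)
ood_loader = DataLoader(ood_set, batch_size=64, shuffle=False)
```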

3.5 Library of Monitors, NNs, and Datasets

Monitizer currently includes 19 monitors, accompanied by 9 datasets and 15 NNs. In the following, we give an overview of the available options.

Monitors. Monitizer provides several highly cited monitors, which are also included in other tools such as OpenOOD and PyTorch-OOD. We extended this list by adding monitors from the formal-methods community (e.g., the Box monitor and the Gaussian monitor). The following monitors are available in Monitizer: ASH-B, ASH-P, ASH-S [10], Box monitor [20], DICE [42], Energy [32], Entropy [33], Gaussian [13], GradNorm [23], KL Matching [15], KNN [43], MaxLogit [50], MDS [30], Softmax [17], ODIN [31], ReAct [41], Mahalanobis [39], SHE [49], Temperature [12], VIM [45].
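
For instance, the Energy monitor [32] scores an input by the log-sum-exp of the logits (the negative free energy); a minimal sketch (assuming a PyTorch classifier and temperature \(T = 1\); not Monitizer’s implementation) is shown below.

```python
# Minimal sketch of the energy-based score from [32]:
# score(x) = T * logsumexp(f(x) / T), where f(x) are the logits.
# Higher scores indicate ID; inputs whose score falls below a tuned
# threshold are flagged as OOD. T = 1 here for simplicity.
import torch

def energy_score(net: torch.nn.Module, x: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    with torch.no_grad():
        logits = net(x)
    return T * torch.logsumexp(logits / T, dim=1)
```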

Datasets. The following datasets are available in Monitizer: CIFAR-10, CIFAR-100 [28], DTD [8], FashionMNIST [46], GTSRB [21], ImageNet [40], K-MNIST [9], MNIST [29], SVHN [36].

Neural Networks. Monitizer provides at least one pretrained NN for each available dataset. The library contains additional NNs trained on datasets commonly used in academia, such as MNIST and CIFAR-10, allowing users to evaluate monitors on different architectures. [1] contains a detailed description of the pretrained NNs.

4 Summary of Evaluation by Case Study

Table 1. Comparison of the AUROC scores of all implemented monitors on different OOD datasets, multiplied by 100 and rounded to the nearest integer. All monitors were evaluated on a fully connected network trained on MNIST. The cells are colored according to the relative performance of a monitor (column) in a specific OOD class (row). The monitors are divided into three ranks, and a darker color represents better performance. If several monitors have the same score, they all belong to the better group.

We demonstrate the necessity of having a clear evaluation in Table 1. The full table containing all available OOD datasets can be found in [1]. We evaluate the available monitors on a network trained on the MNIST dataset on a GPU and depict the AUROC score. The values of MDS and Mahalanobis can differ when switching between CPU and GPU; refer to [1] for details. The Box monitor [20] is not included as it does not have a single threshold and, therefore, no AUROC score can be computed. The table shows the ranking of the monitors for the detection of Gaussian noise, increased contrast, color inversion, rotation, and a new, albeit similar dataset (KMNIST). A darker color indicates a better ranking. One can see that there is barely any common behavior among the monitors. For example, while GradNorm performs best on Gaussian noise, it performs worst on inverted images.

This also shows that it is important for the user to define her goal for the monitor. Not every monitor will be great at detecting a particular type of OOD, and she must carefully choose the right monitor for her setting. Monitizer eases this task. In addition, it highlights the need for a clear evaluation of new monitoring methods in scientific publications.

We illustrate further features of Monitizer using the following four monitors: Energy [32], ODIN [31], Box [20], and Gaussian [13]. The first two were proposed by the machine-learning community, and the latter two by the formal methods community.

Fig. 5. The monitor templates were optimized on MNIST as ID data and for detecting New World / CIFAR-10 data as OOD while keeping at least 70% accuracy on ID data. All monitors were optimized using random search.

The output produced by Monitizer in the form of tables and plots (depicted in Fig. 5) helps the user see how the choice of monitor, optimization objective, and dataset affects the monitor’s effectiveness. Monitizer allows users to experiment with different choices and select the one suitable for their needs. Figure 5 shows the evaluation of the mentioned monitors with the MNIST dataset as ID data, optimized with the goal of detecting preselected images of the CIFAR-10 dataset, as those are entirely unknown to the network. The optimization was performed with random search. As a result, the Gaussian monitor correctly classifies only around 70% of ID data, whereas the other monitors have higher accuracy on ID data. Consequently, the other monitors perform worse than the Gaussian monitor in detecting OOD data, as there is a tradeoff between good performance on ID and OOD data. This highlights the necessity of proper optimization for each monitor. See [1] for a detailed evaluation, where we report on experiments with different monitors, optimization objectives, and datasets.

Our experiments show that different monitors have different strengths and limitations. One can tune a monitor for a specific purpose (e.g., detecting a particular OOD class with very high accuracy); however, this affects its performance in other OOD classes.

5 Conclusion

Monitizer is a tool for automating the design and evaluation of NN monitors. It supports developers of new monitoring techniques, potential users of available monitors, and researchers attempting to improve the state of the art. In particular, it optimizes the monitor for the objectives specified by the user and thoroughly evaluates it.

Monitizer provides a library of 19 monitors, accompanied by 9 datasets and 15 NNs (at least one for each dataset), and three optimization methods (random, grid-search, and gradient descent). Additionally, all these inputs can be easily customized by a few lines of Python code, allowing a user to provide their monitors, datasets, and networks. The framework is extensible so that the user can implement their custom optimization methods and objectives.

Monitizer is an open-source tool providing a freely available platform for new monitors and easing their evaluation. It is publicly available at https://gitlab.com/live-lab/software/monitizer.