Detecting unusual input to neural networks

Evaluating a neural network on an input that differs markedly from the training data might cause erratic and flawed predictions. We study a method that judges the unusualness of an input by evaluating its information content relative to the learned parameters. This technique can be used to judge whether a network is suitable for processing a certain input and to raise a red flag that unexpected behavior might lie ahead. We compare our approach to various methods for uncertainty evaluation from the literature for various datasets and scenarios. In particular, we introduce a simple, effective method that allows one to directly compare the output of such metrics for single input points, even if these metrics live on different scales.


Introduction
Introduction
The reliability and performance of a machine learning algorithm depend crucially on the data that were used for training it [2]. An incomplete training set, or input data with unprecedented deviations, might lead to unexpected, erroneous behavior [3,4,5]. In this work we consider the task of judging whether an input to a neural network, trained for classification, is unusual in the sense that it is different from the training data. For this purpose we consider a quantity based on the Fisher information matrix [6]. A similar quantity was shown by the authors to be useful for the related task of detecting adversarial examples [7]. Two kinds of scenarios are studied:
• The input data were modified, e.g. by being noisy or contorted.
• The training set is incomplete, missing some structural component that will emerge in practice.

[Figure 1 caption: An image from the Intel Image Classification dataset [1] in its original form (left), after inverting its green color channel in RGB format (middle) and after adding Gaussian noise (right). Below each image we denote the prediction ŷ of the neural network from Section 3.1 together with its error probability 1 − p(ŷ|x) and the quantity F_θ(x) based on the Fisher information we introduce in Section 2. We observe in this example that the naive error probability can be misleading, while the Fisher information indicates an unusual input due to its value close to 100%.]
Our considerations are carried out across various datasets and in comparison with related methods from the literature. In particular, we present an easy and efficient method to directly compare different metrics that judge the uncertainty behind a prediction. We consider neural networks that are trained for classifying inputs into C classes 1, . . . , C. After application of a softmax activation, the output of a neural network for an input x can be read as a vector of probabilities (p_θ(y|x))_{y=1,...,C}, where we write θ for the parameters of the neural network (usually weights and biases) [8]. The class ŷ where this vector is maximal is then the prediction of the network. It is quite tempting to consider p_θ(ŷ|x) as a measure for the confidence behind this prediction. The leftmost column of Figure 1 shows an image from the Intel Image Classification dataset [1] and the output of a neural network that was trained on it; details will follow in Section 3 below. The image is classified correctly as "sea" with p(ŷ|x) = 98.8%. In this article we are more interested in the probability of misclassification

1 − p_θ(ŷ|x) .   (1)

One might expect that the higher this probability, the more unreliable the prediction of the network. The second column of Figure 1 shows the same image but with an inverted green color channel. While the classification remains "sea", the probability of error rises to 14.1%, which appears to be a sensible behavior. Unfortunately, the behavior of the output probability is not always meaningful. The rightmost column of Figure 1 shows the same picture with Gaussian noise added. The image is misclassified as showing a "forest", but with a quite small error probability 1 − p(ŷ|x) of around 2.1%. The fact that the softmax probability is not fully exhaustive, or even misleading, in judging the reliability of the prediction of the network led to various developments in the literature, like a Bayesian judgement of uncertainty [9,10,11,12,13,14,15] and deep ensembles [16].
We here study the behavior of a quantity, called the Fisher form and denoted by F_θ, that evaluates whether the input x differs from what the network "has learned". F_θ(x) was first introduced by the authors in [7], with slight modifications, for the purpose of adversarial detection [17,18]. Below Figure 1 we listed the values of ⟨F_θ(x)⟩, where ⟨·⟩ denotes the normalization we introduce below. As we see in Figure 1, ⟨F_θ(x)⟩ shows a quite natural behavior on the depicted images, where values close to 100% indicate an unusual input.
This article is structured as follows: In Section 2 we introduce and motivate the method used in this article and introduce the normalization that allows us to compare different metrics. Section 3 presents the results for various datasets and splits into two halves: in Subsection 3.1 we study the effect of modifying the input, while Subsection 3.2 considers the case where the training data lack some structural aspects. Finally, we provide some conclusions and an outlook for future research.

Method
A slightly more complete summary of the output probabilities than (1) can be gained by considering the entropy instead [6,12]:

H_θ(x) = − Σ_{y=1}^{C} p_θ(y|x) log p_θ(y|x) .   (2)

If the entropy is high, the output probabilities (p_θ(y|x))_{y=1,...,C} are roughly equal in magnitude and the prediction is uncertain. H_θ(x) still suffers from the same drawback as p_θ(ŷ|x): it completely ignores the uncertainty of θ and uses only its point estimate. One way around this problem is to use another, more elaborate distribution, such as an approximation to the posterior predictive or the mixture distribution from a deep ensemble.
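As a concrete illustration, both summaries (1) and (2) can be computed directly from the softmax output. The following sketch assumes a generic logit vector; the helper names are ours, not from the implementation used in this article:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def entropy(probs):
    """Shannon entropy H = -sum_y p(y) log p(y) of a probability vector."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p))

logits = np.array([2.0, 0.1, -1.0])   # hypothetical network output
p = softmax(logits)
print(1.0 - p.max())   # naive misclassification probability (1)
print(entropy(p))      # entropy (2), larger when the probabilities are spread out
```

The entropy peaks at log C for a uniform output and vanishes for a one-hot output, which is why it summarizes the whole probability vector rather than only its maximum.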
In [7] the authors introduced a quantity that gives a deeper insight than (2) and is based on the Fisher information. The Fisher information matrix for a neural network with output p_θ(y|x) for an input x and a p-dimensional parameter θ equals

F_θ(x) = Σ_{y=1}^{C} ∇_θ p_θ(y|x) · ∇_θ log p_θ(y|x)^T .

[Figure 2 caption: The misclassification probability (1), the entropy H_θ(x) from (2) and the Fisher form F_θ(x) from (3), ignoring the fact that these quantities live on different scales. The histograms depict the behavior of these quantities for images from the Intel Image Classification dataset [1] unmodified (in green) and after inverting the green color channel (in magenta) as in the middle of Fig. 1.]

Even a smaller network has around p = 10^6 parameters, which makes F_θ a matrix with 10^12 entries and thus, in practice, infeasible. A way out, proposed in [7], is to consider the effect of this matrix in a specific direction v, i.e. to consider the quadratic form

v^T F_θ(x) v ,   (3)

where v has the same dimensionality as θ. We will refer to (3) as the Fisher form. The quantity (3) measures how much information is gained/lost in a (small) step in the direction v. More precisely, the Kullback-Leibler divergence between p_θ(y|x) and p_{θ+εv}(y|x) can be written as (ε²/2) v^T F_θ(x) v + O(ε³) [19]. After the parameter θ is learned, this divergence will be large for x that are informative, that is, different from those used to infer the learned value of θ. To use (3) we have to fix a suitable direction v. A natural choice, which we will use throughout this article, is the negative gradient of the entropy,

v = −∇_θ H_θ(x) ,   (4)

where H_θ(x) is as in (2). The motivation behind a choice like (4) is as follows: Once trained, the network will produce for an input the highest probability for one class ŷ, its "prediction", and lower probabilities for all other classes y ≠ ŷ. The vector v then denotes the direction in parameter space that decreases the entropy, in other words that tends to increase the probability for ŷ further and to decrease all other probabilities. If the input is unusual and leads to a wrong classification ŷ, then a step in the direction of v will increase the information substantially, as the used x will usually be completely different from the training data that were used for inferring θ. For another choice of v compare [7].
The outer-product structure of F_θ(x) has the useful consequence that we can rewrite the quadratic form in (3) as

v^T F_θ(x) v = Σ_{y=1}^{C} (∇_θ p_θ(y|x) · v)(∇_θ log p_θ(y|x) · v) ,

which can be computed either directly or using a finite difference approximation that avoids the need for backpropagation [7].
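To make this computation concrete, here is a minimal sketch of the Fisher form for a toy softmax classifier, using central finite differences both for the entropy gradient in (4) and for the directional derivatives in (3). The model and all function names are illustrative stand-ins, not the networks used in this article:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def probs(theta, x, C):
    """Toy classifier p_theta(y|x) = softmax(W x), with theta = W flattened."""
    W = theta.reshape(C, x.size)
    return softmax(W @ x)

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def entropy_gradient(theta, x, C, eps=1e-5):
    """Central finite-difference gradient of H_theta(x) w.r.t. theta."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (entropy(probs(theta + e, x, C))
                - entropy(probs(theta - e, x, C))) / (2 * eps)
    return g

def fisher_form(theta, x, v, C, eps=1e-4):
    """v^T F_theta(x) v = sum_y (d_v p_theta(y|x)) (d_v log p_theta(y|x)),
    with the directional derivatives d_v taken by central differences,
    so no backpropagation is needed."""
    p = probs(theta, x, C)
    dp = (probs(theta + eps * v, x, C) - probs(theta - eps * v, x, C)) / (2 * eps)
    dlogp = dp / np.clip(p, 1e-12, 1.0)   # d_v log p = (d_v p) / p
    return float(np.sum(dp * dlogp))

rng = np.random.default_rng(0)
C, d = 3, 4
theta = rng.normal(size=C * d)
x = rng.normal(size=d)
v = -entropy_gradient(theta, x, C)   # the choice (4): v = -grad_theta H
print(fisher_form(theta, x, v, C))   # nonnegative, since F is positive semidefinite
```

Note that only forward evaluations of the network enter, which is what makes the finite difference variant attractive when gradients are expensive.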
Before analyzing how the quantity (3) performs in practice, there is one final point that needs some consideration. While quantities such as the Fisher form (3) or the entropy (2) can each be compared between two different datapoints, there is no direct way to compare the value of the entropy with the value of the Fisher form for the same datapoint. Moreover, requiring someone applying these quantities to first build an intuition before they can judge whether a value is "high enough" to raise suspicion is rather unsatisfactory.
When faced with a binary decision problem, such as "will produce unexpected behavior or not", we can use the receiver operating characteristic (ROC) and the associated area under the ROC curve (AUC) to compare different metrics, cf., for example, [7]. However, a comparison based on the ROC/AUC still has the drawback that it is not applicable to single datapoints and, moreover, that not all problems can be cast as a binary decision. We here describe a rather simple, yet efficient approach that solves these issues. Let us introduce the following normalization for a quantity q(x) and a test set T:

⟨q(x)⟩ = #{x′ ∈ T : q(x′) < q(x)} / #T ,   (5)

where # counts the number of elements. In other words, ⟨q(x)⟩ denotes the fraction of test samples x′ in T for which q(x′) < q(x). By construction, ⟨q(x)⟩ depends on the input x and on the test set T that is used. This normalization has a few nice properties: It is
• bounded, as it always lies between 0 and 1,
• invariant under strictly monotone transformations such as taking a logarithm or scaling.
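The normalization (5) is an empirical rank statistic and is straightforward to implement; a minimal sketch (the function name is ours):

```python
import numpy as np

def normalize(q_x, q_test):
    """Normalization (5): fraction of test samples x' in T with q(x') < q(x).
    Lies in [0, 1] and is invariant under strictly monotone transformations."""
    q_test = np.asarray(q_test, dtype=float)
    return float(np.sum(q_test < q_x)) / q_test.size

q_test = [0.2, 0.5, 1.3, 2.0, 7.1]   # values of q on a test set T
print(normalize(1.5, q_test))        # 3 of the 5 test values are smaller -> 0.6
```

Applying a logarithm to q_x and to every element of q_test leaves the result unchanged, which is the invariance property claimed above.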
This normalization allows us to compare different measures of reliability/uncertainty, since relations such as ⟨F_θ(x)⟩ = 75% and ⟨H_θ(x)⟩ = 75% have a similar meaning: 75% of the test samples have a smaller value than the one for x.

[Figure 3a caption: Normalized uncertainty measures for the softmax output, Gaussian dropout and a deep ensemble (in black, blue and cyan) and the Fisher form ⟨F_θ(x_λ)⟩ (in purple) when interpolating two images from the MNIST dataset [20] using a variational autoencoder. x_λ and λ are as in (6). The small crosses indicate the classification as 4 (in green) and 9 (in blue).]

[Figure 3b caption: Normalized uncertainty measures for the softmax output, Bernoulli dropout and a deep ensemble (in black, blue and cyan) and the Fisher form ⟨F_θ(x_λ)⟩ (in purple) when perturbing an image from the Intel Image Classification dataset by Gaussian noise. x_λ and λ are as in (7). The crosses indicate whether the street has been classified correctly (in green) or not (in red).]

If we see the magnitude of a quantity such as F_θ(x) as an indicator for the "unusualness" of a datapoint x, then a value of ⟨F_θ(x)⟩ close to 100% can be seen as a strong indication that the reliability of a prediction based on x will be quite questionable. In particular, a relation such as ⟨F_θ(x)⟩ > ⟨H_θ(x)⟩ for an x that causes a wrong prediction indicates that F_θ(x) detected the underlying uncertainty more clearly than H_θ(x) (based on the test set), so that the normalization via ⟨·⟩ makes measures of reliability directly comparable. Figure 2 shows the distribution of ⟨1 − p_θ(ŷ|x)⟩ (left), the normalized version of (1), the normalized entropy ⟨H_θ(x)⟩ (middle) and the normalized Fisher form ⟨F_θ(x)⟩ (right) for the same neural network as in Figure 1, for the original images (in green) and those with inverted green color channel (in magenta). The test set for the normalization was chosen equal to the test set from [1]. Note that only the Fisher information has a distinct peak at high values for the modified images.

Experiments
In this section we discuss the behavior of the Fisher form F_θ, the entropy H_θ, the entropy H_θ^DE for the mixture distribution predicted by a deep ensemble [16] and the entropy H_θ^{GD/BD} of an approximation to the posterior predictive. More precisely, H_θ^DE refers to the average of the entropies produced by each member of the ensemble, and H_θ^{GD/BD} to the mean of the entropies obtained by repeated predictions of the trained net using dropout. All deep ensembles were constructed from 5 networks, trained independently of each other. To approximate the posterior predictive we used Gaussian dropout [10] for MNIST and Bernoulli dropout [9] (with rate 0.5) for all other examples, as the latter seemed to converge substantially faster. The entropy for the distribution produced by Gaussian dropout will be denoted by H_θ^GD, and we will write H_θ^BD for Bernoulli dropout. The architectures of all networks used are sketched in the Appendix of this article.
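As described above, H_θ^DE and H_θ^{GD/BD} are means of the entropies of several predictive distributions, one per ensemble member or per dropout pass. A minimal sketch with hypothetical member probabilities standing in for real network outputs:

```python
import numpy as np

def entropy(p):
    """Row-wise Shannon entropy of probability vectors."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def mean_member_entropy(member_probs):
    """Mean of the entropies of the individual predictive distributions:
    one row per ensemble member (for H_DE) or per dropout pass (for H_GD/BD)."""
    member_probs = np.asarray(member_probs, dtype=float)
    return float(np.mean(entropy(member_probs)))

# two hypothetical members of an ensemble would contribute like this:
probs = [[0.7, 0.2, 0.1],
         [0.6, 0.3, 0.1]]
print(mean_member_entropy(probs))
```

For a dropout approximation the rows would come from repeated stochastic forward passes of the same trained network rather than from independently trained networks.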
In Section 3.1 we will analyze the behavior of F_θ, H_θ, H_θ^DE and H_θ^{GD/BD} on datapoints that are modified, either by adding noise or by a transformation based on a variational autoencoder. Section 3.2 studies the effect of features missing during training.

Modified data
As a first example we consider the MNIST dataset [20] for digit recognition. We trained a convolutional neural network (Fig. 9a) for 25 epochs until it reached an accuracy of around 99% on both the test and the training set. In addition, we trained a variational autoencoder (VAE) [11], sketched in Figure 9b, for 100 epochs until it reached a mean squared error of around 1% of the pixel range. Using a VAE is a popular way of interpolating between datapoints [21,22]. To this end, we split the VAE into an encoder E and a decoder D. For single input images x₀ and x₁ we take z₀ = E(x₀), z₁ = E(x₁) and set for λ between 0 and 1:

x_λ = D((1 − λ) z₀ + λ z₁) .   (6)

Figure 3a shows the interpolation between an image showing a 4 and an image showing a 9 for different values of λ. The curves above the reconstructed images show the dependency of ⟨H_θ(x_λ)⟩ (in black), ⟨F_θ(x_λ)⟩ (in magenta), the Gaussian dropout entropy ⟨H_θ^GD(x_λ)⟩ (in blue) and the deep ensemble entropy ⟨H_θ^DE(x_λ)⟩ (in cyan) on λ. We used the MNIST test set as the set T for the normalization. All methods show a similar behavior with a distinct peak in the transition phase.

As a next dataset we consider the Intel Image Classification dataset [1], which contains images of 6 different classes (street, building, forest, mountain, glacier, sea). We already touched on this example in the introduction of this article. We trained a convolutional neural network (Fig. 10b) on this dataset for 15 epochs until it reached an accuracy of around 80% on the test set of [1], which we also used for the normalization. We here want to consider the effect of two modifications of the datapoints. The effect of inverting the green color channel was already shown in Figure 1 and Figure 2. This modification lets the accuracy of the network drop from around 80% to 30%. This effect is further illustrated in Figure 4. Next, we consider the effect of noise.
The upper plot in Figure 3b shows the evolution of ⟨H_θ(x_λ)⟩, ⟨H_θ^BD(x_λ)⟩, ⟨H_θ^DE(x_λ)⟩ and ⟨F_θ(x_λ)⟩, where x_λ is now given as

x_λ = (1 − λ) x + λ ε ,   (7)

with x being a single image from the Intel Image Classification dataset that shows a street and ε being an array of standard normal noise of the same shape as x. The lower row of Figure 3b displays x_λ for various λ. The crosses in Figure 3b depict whether the network classifies the image correctly as a street (in green) or chooses the wrong class (in red). We observe that while all quantities ⟨H_θ(x_λ)⟩, ⟨F_θ(x_λ)⟩, ⟨H_θ^DE(x_λ)⟩ and ⟨H_θ^BD(x_λ)⟩ share a similar trend, the Fisher form is well above all other quantities in the transition phase where the noise starts flipping the classification. Inspecting the maximal softmax output p(ŷ|x_λ) shows that beyond λ = 0.3 this probability is almost identical to 1, which finally pushes all quantities down towards 0.
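Both interpolation schemes, the latent interpolation (6) and the noise path (7), are convex combinations and can be sketched in a few lines. The encoder and decoder below are toy stand-ins (a trained VAE's E and D would replace them):

```python
import numpy as np

def latent_interpolation(encode, decode, x0, x1, lam):
    """Eq. (6): x_lambda = D((1 - lambda) E(x0) + lambda E(x1))."""
    z0, z1 = encode(x0), encode(x1)
    return decode((1.0 - lam) * z0 + lam * z1)

def noise_interpolation(x, eps, lam):
    """Eq. (7): x_lambda = (1 - lambda) x + lambda eps,
    with eps standard normal noise of the same shape as x."""
    return (1.0 - lam) * x + lam * eps

# hypothetical E and D, chosen so that D inverts E:
encode = lambda x: 2.0 * x
decode = lambda z: 0.5 * z

x0, x1 = np.zeros(4), np.ones(4)
print(latent_interpolation(encode, decode, x0, x1, 0.5))  # halfway point

rng = np.random.default_rng(0)
eps = rng.standard_normal(x1.shape)
x_lam = noise_interpolation(x1, eps, 0.3)  # 30% of the way towards pure noise
```

In the experiments above, each uncertainty metric is evaluated on x_λ for a grid of λ values and then normalized via (5), which yields the curves in Figures 3a and 3b.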

Incomplete training data
While in Section 3.1 we looked at the behavior of F_θ, H_θ, H_θ^DE and H_θ^{GD/BD} for modified data, we will now analyze another scenario. What happens if the data used in training were incomplete, in the sense that they lacked one or several important features that will occur in the application after training [3,24,25]?
As a first example we use the Credit Card Fraud Detection dataset [23,26]. This extremely unbalanced dataset contains 284,807 transactions with 492 frauds, which are indicated by a binary label. The data were anonymized using a principal component analysis. The distribution of the main component, dubbed 'V1', is shown in Figure 5a. We split the data into two halves: a training set, where V1 is below a threshold of -3.0 (in green), and a test set, where V1 is greater than the threshold (in red). On the green part we trained, balancing the classes, a small fully connected neural network (Fig. 10b) for 20 epochs until it reached an accuracy of 95% on a smaller subset of the "train" set that was excluded from the training process. On the "test" set from Figure 5a the accuracy is, at 79%, markedly lower. Figure 5b shows the ROC for detecting whether a datapoint x is from the red part of Figure 5a.

We will now consider a problem where the splitting is more subtle and, maybe, more realistic. For this purpose we use the DogsVsCats dataset from [27]. We trained a variational autoencoder (Fig. 11b) [11] on the combined collection of dog images from [27,28] for 30 epochs until it reached a mean squared error of around 2% of the pixel range. A reconstructed image can be seen in Figure 6a. We then selected the bottleneck node, called En from now on, that exhibits the largest variance when evaluated on these dog images. Figure 6b shows the distribution of En for dog images from [27] together with median(En) = 0.3. What is the meaning of En? Figure 6c shows the same reconstructed image as in 6a but with flipped sign of En.
We can see that the background of the depicted image becomes darker. Looking at various datapoints from the dataset strengthens this impression, so that we will take this as the "meaning" of En, although there is, probably, no one-to-one correspondence with a human interpretation.

[Figure 5a caption: Distribution of the main component V1 of the Credit Card Fraud Detection dataset [23]. We split this dataset in two parts, where one is used for training a neural network (in green) and the other one for testing on unusualness (in red) using the metrics from Section 2.]

We trained a convolutional neural network (Fig. 11a) on the data from [27], but omitted those dog images with

En < median(En) = 0.3 ,   (8)

i.e. those on the left side of the median in Figure 6b or, with the interpretation above, those images with a dark background. We trained this network for 20 epochs until it reached an accuracy of around 80% on a subset of the images with En ≥ median(En) that we ignored while training. Evaluating the network on dog images from [27,28] with En < median(En) lets the accuracy drop to 1%! In other words, almost all dog images with En < 0.3 are classified as cats. Our task is now to detect whether an input is unusual in the sense that (8) holds. Figure 7 depicts the evolution of the AUC for the quantities F_θ, H_θ, H_θ^DE and H_θ^BD for each epoch, together with the accuracy of the trained network (in dashed gray). The difference between the Fisher form F_θ and all other quantities is striking. The quantities based on entropy seem to completely fail in detecting the "unusualness" of images with En < 0.3. In fact they yield AUCs below 0.5, thereby marking the actual training images as more suspicious than the omitted images on which the classification actually fails.

[Figure 6a caption: Output of the used VAE for a dog image from the DogsVsCats dataset [27], where the variable in latent space was taken to be the mean of the distribution of the bottleneck node En. For this image the bottleneck node En is larger than the threshold 0.3.]

[Figure 6c caption: The same reconstruction as in Figure 6a but with the sign of En flipped, so that (8) is satisfied. Note that while the background in Fig. 6a was bright, it is now turned dark.]
The Fisher form F_θ, however, yields a value well above 0.5 for a wide range of the training. After around 15 epochs the AUC for the Fisher form starts to drop as well. As this coincides roughly with the point where the accuracy starts stabilizing, we suspect that this is a similar effect as we observed in Figure 3b: Once the maximal probability of the softmax output becomes almost identical to one, this forces the Fisher information to decrease. In fact, it turns out that after 20 epochs the maximal probabilities on the unusual data are on average around 99%. Compare this to the tiny accuracy of 1%! We will now focus on a point before this, apparently unreasonable, saturation happened, namely on the trained state after 8 epochs, marked by a gray cross. The normalization from (5), for which we here used the training data, allows us to compare the behavior of ⟨H_θ(x)⟩, ⟨H_θ^DE(x)⟩, ⟨H_θ^BD(x)⟩ and ⟨F_θ(x)⟩ for single datapoints x. As all entropies show quite a similar behavior for this example, we will only compare H_θ^DE and F_θ. We drew 1200 images, all showing dogs, and plotted the value of En versus ⟨H_θ^DE(x)⟩ and ⟨F_θ(x)⟩ in Figure 8. Wrong classifications as cats are marked in red. The median 0.3 of En is drawn as a blue dashed line, so that the "unusual" datapoints lie below this line. The marginal distributions of the detection quantities and of En are shown, split according to whether the image was classified correctly (in black) or wrongly (in red).
Note first that indeed most of the images satisfying (8) are classified wrongly, so that the network is in fact not suitable for classifying these datapoints. Moreover, the deep ensemble entropy is not able to capture that there is something off with these data. Even worse, the corresponding points accumulate in the lower left corner, which would actually indicate a high confidence in the predictions for these points. The Fisher information, on the other hand, leads to an accumulation in the lower right corner and thus correctly detects many of the "unusual" datapoints. In fact, even for the "normal" datapoints we observe that most of those that were classified wrongly have an encoder value En close to 0.3 and again possess a high ⟨F_θ⟩.
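The AUC values used throughout this section can be computed directly from the detection scores on the two groups via the Mann-Whitney statistic; a small sketch (the function name is ours):

```python
import numpy as np

def auc(scores_unusual, scores_normal):
    """AUC of a detection score, computed as the Mann-Whitney statistic:
    the probability that a randomly chosen 'unusual' point receives a
    higher score than a randomly chosen 'normal' one (ties count 1/2)."""
    u = np.asarray(scores_unusual, dtype=float)[:, None]
    n = np.asarray(scores_normal, dtype=float)[None, :]
    return float(np.mean((u > n) + 0.5 * (u == n)))

print(auc([0.9, 0.8, 0.7], [0.1, 0.2, 0.75]))  # 8 of the 9 pairs ordered correctly -> 8/9
```

An AUC of 0.5 corresponds to a score that carries no information about the split, and values below 0.5 mean the score systematically ranks the groups the wrong way around, which is exactly the failure mode observed for the entropy-based quantities in Figure 7.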

Conclusion and outlook
We studied the Fisher form as a means to detect whether an input is unusual, in the sense that it differs from the data that were used to infer the learned parameters. Spotting such a disparity is important, as it can substantially decrease the reliability of the predictions. Several examples of this were presented in Section 3. We observed that the Fisher form performs on par with or better than competing methods in detecting unusual inputs. In particular, we introduced a normalization that allows one to directly compare such metrics for single datapoints.
The last example treated in this article showed an effect that could be a starting point for future research. As we observed in Figure 7, the ability to detect unusual data might erode when training for too long. When only looking at the accuracy on an incomplete test set, as we did in Figure 7, such a process might go unnoticed. This somewhat different flavor of overfitting could be worth studying.
Another point that would deserve a more detailed investigation is of a more conceptual nature. For each of the examples treated in this work we used a single network to evaluate its Fisher form. One might wonder whether the effect we observed when considering unusual data will differ for another network whose training was carried out with, say, different initial conditions. As neural networks are bound to end up in local minima, it is not obvious whether a datapoint will be unusual for two of these networks, even if they are trained on the same training set. However, this is not what we observed. In fact, the behavior of the Fisher form seems rather robust in this regard. It is hard to say whether this indicates that neural networks tend to learn similar patterns when presented with the same data. Looking at the quadratic form (3) for more than only one direction v could help to shed a bit of light on this question.

[Figure 8 caption: Value of the detection quantities versus the bottleneck node for 1200 dog images from [27,28]. The ordinate shows the value of the bottleneck node En which was used to formulate the criterion (8). Red marks those datapoints that were classified wrongly (as a cat). The histograms show the marginal distributions, again split into correctly and wrongly classified datapoints.]

[Figure 9b caption: Architecture of the variational autoencoder we train on the MNIST dataset and use in Section 3.1. Note that the bottleneck is random: From the 10 "means" and 10 "variances" in the bottleneck we construct 10 Gaussian distributions from which we draw 10 random numbers that are then fed through the decoder.]

[Figure 11b caption: Architecture of the variational autoencoder we train on the dog images from [27,28] and use in Section 3.2. Note that the bottleneck is random: From the 500 "means" and 500 "variances" in the bottleneck we construct 500 Gaussian distributions from which we draw 500 random numbers that are then fed through the decoder.]