Popular existing UQ frameworks for DNNs place parametric densities, most commonly Gaussian densities, over the DNN parameters or predictions. Instead of using specific parametrized densities, our INN method relies on bounding distributions using intervals. This results in a flexible and modular method that can be applied post hoc to a given DNN \(\varvec{\varPhi }\) that has already been trained. A schematic illustration is provided in Fig. 1: The INN is formed by wrapping additional weight and bias intervals around the weights and biases of the underlying prediction DNN. This allows us to equip the DNN \(\varvec{\varPhi }\) with uncertainty capabilities without the need to modify \(\varvec{\varPhi }\) itself. After training the INN, we obtain prediction intervals that are guaranteed to contain the original prediction of the underlying network and are easy to interpret. They provide exact upper and lower bounds for the range of possible values that the DNN prediction may take when the network parameters are slightly modified within the prescribed weight and bias intervals.
Previously, the capacity of neural networks with interval weights and biases was evaluated for fitting interval-valued functions [11]. In contrast to [11], our targets \(\varvec{x}_i\) are neither interval-valued nor univariate, leading to a different loss function which allows us to equip trained neural networks with uncertainty capabilities post hoc. For a direct comparison, see the loss (3) in the "Training Interval Neural Networks" section and Equation (18) in [11]. Further, [17, 30] explored neural networks implementing interval arithmetic for robust classification. However, in their setting, the focus is purely on representing the inputs or outputs as intervals, but not the weights and biases. In contrast, our proposed INNs determine interval bounds for all network parameters with the goal of providing uncertainty scores for the predictions of an underlying DNN.
Arithmetic of Interval Neural Networks
We now describe the INN mechanisms that deviate from standard DNNs. The forward propagation of a single input \(\varvec{z}\) through a DNN is replaced by the forward propagation of a component-wise interval-valued input \([\underline{\varvec{z}}, \overline{\varvec{z}}]\) through the INN. This can be expressed analogously to standard feed-forward neural networks, but using interval arithmetic instead. For interval-valued weight matrices \([\underline{\varvec{W}}, \overline{\varvec{W}}]\) and bias vectors \([\underline{\varvec{b}}, \overline{\varvec{b}}]\), the propagation through the \(\ell \)-th network layer reads
$$\begin{aligned} \left[ \underline{\varvec{z}}, \overline{\varvec{z}}\right] ^{(\ell +1)} = \varrho \left( \left[ \underline{\varvec{W}}, \overline{\varvec{W}}\right] ^{(\ell )} \left[ \underline{\varvec{z}}, \overline{\varvec{z}}\right] ^{(\ell )}+ \left[ \underline{\varvec{b}}, \overline{\varvec{b}}\right] ^{(\ell )} \right) . \end{aligned}$$
(2)
For nonnegative \([\underline{\varvec{z}}, \overline{\varvec{z}}]^{(\ell )}\), for example when using a nonnegative activation function \(\varrho \) such as the ReLU in the previous layer, we can explicitly rewrite (2) as
$$\begin{aligned} \overline{\varvec{z}}^{(\ell +1)}&= \varrho \left( \min \left\{ \overline{\varvec{W}}^{(\ell )},\varvec{0}\right\} \underline{\varvec{z}}^{(\ell )} +\max \left\{ \overline{\varvec{W}}^{(\ell )}, \varvec{0}\right\} \overline{\varvec{z}}^{(\ell )}+ \overline{\varvec{b}}^{(\ell )} \right) ,\\ \underline{\varvec{z}}^{(\ell +1)}&= \varrho \left( \max \left\{ \underline{\varvec{W}}^{(\ell )}, \varvec{0}\right\} \underline{\varvec{z}}^{(\ell )}+ \min \left\{ \underline{\varvec{W}}^{(\ell )}, \varvec{0}\right\} \overline{\varvec{z}}^{(\ell )}+ \underline{\varvec{b}}^{(\ell )} \right) , \end{aligned}$$
where the maximum and minimum are computed component-wise. Similarly, for point intervals \(\underline{\varvec{z}}^{(\ell )}=\overline{\varvec{z}}^{(\ell )}=:\varvec{z}^{(\ell )}\), for example, as inputs to the first network layer, we can rewrite (2) as
$$\begin{aligned} \overline{\varvec{z}}^{(\ell +1)}&= \varrho \left( \overline{\varvec{W}}^{(\ell )} \max \{ \varvec{z}^{(\ell )},\varvec{0}\}+ \underline{\varvec{W}}^{(\ell )} \min \{ \varvec{z}^{(\ell )},\varvec{0}\}+ \overline{\varvec{b}}^{(\ell )} \right) ,\\ \underline{\varvec{z}}^{(\ell +1)}&= \varrho \left( \underline{\varvec{W}}^{(\ell )} \max \{ \varvec{z}^{(\ell )},\varvec{0}\}+ \overline{\varvec{W}}^{(\ell )} \min \{ \varvec{z}^{(\ell )},\varvec{0}\}+ \underline{\varvec{b}}^{(\ell )} \right) , \end{aligned}$$
regardless of whether \(\varvec{z}^{(\ell )}\) is nonnegative or not. Optimizing the INN parameters requires obtaining the gradients of these operations. This can be achieved using automatic differentiation (backpropagation) in the same way as for standard neural networks.
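As an illustration of these propagation rules, the following PyTorch sketch implements a single affine interval layer followed by a ReLU; the function name `interval_linear` and the row-vector batch convention are our own choices for this sketch and are not taken from the original implementation.

```python
import torch

def interval_linear(W_lo, W_hi, b_lo, b_hi, z_lo, z_hi, point_input=False):
    """Propagate an interval input [z_lo, z_hi] through one affine interval layer.

    Implements the two explicit forms of Eq. (2): the point-interval case
    (z_lo == z_hi), e.g. at the first layer, and the nonnegative-interval case
    (z_lo >= 0), e.g. after a ReLU in the previous layer. A ReLU is applied at
    the end, so the output interval is again nonnegative.
    """
    if point_input:
        z = z_lo                                   # point interval: z_lo == z_hi =: z
        z_pos, z_neg = z.clamp(min=0), z.clamp(max=0)
        out_hi = z_pos @ W_hi.T + z_neg @ W_lo.T + b_hi
        out_lo = z_pos @ W_lo.T + z_neg @ W_hi.T + b_lo
    else:
        # nonnegative interval input: split the weight bounds by sign
        out_hi = z_lo @ W_hi.clamp(max=0).T + z_hi @ W_hi.clamp(min=0).T + b_hi
        out_lo = z_lo @ W_lo.clamp(min=0).T + z_hi @ W_lo.clamp(max=0).T + b_lo
    return torch.relu(out_lo), torch.relu(out_hi)
```

Stacking such layers, with `point_input=True` only at the first layer, yields the interval-valued network outputs used in the next section.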
Training Interval Neural Networks
Let \(\varvec{W}^{(\ell )}\) and \(\varvec{b}^{(\ell )}\) be the weights and biases of the underlying prediction network \(\varvec{\varPhi }\), and let \(\overline{\varvec{\varPhi }}:\mathbb {R}^n \rightarrow \mathbb {R}^n\) and \(\underline{\varvec{\varPhi }}:\mathbb {R}^n \rightarrow \mathbb {R}^n\) denote the functions mapping a point interval input \(\varvec{z}\) to the upper and the lower interval bounds in the output layer of the INN, respectively. Given data samples \(\left\{ \varvec{z}_i,\varvec{x}_i \right\} _{i=1}^m\), the INN parameters \([\underline{\varvec{W}}, \overline{\varvec{W}}]^{(\ell )}\) and \([\underline{\varvec{b}}, \overline{\varvec{b}}]^{(\ell )}\) are trained by minimizing the empirical loss
$$\begin{aligned}&\sum _{i=1}^{m}\big \Vert \max \{\varvec{x}_i-\overline{\varvec{\varPhi }}(\varvec{z}_i),\varvec{0}\} \big \Vert _2^2 + \big \Vert \max \{\underline{\varvec{\varPhi }}(\varvec{z}_i)-\varvec{x}_i,\varvec{0}\} \big \Vert _2^2\nonumber \\&\quad + \beta \cdot \big \Vert \overline{\varvec{\varPhi }}(\varvec{z}_i)-\underline{\varvec{\varPhi }}(\varvec{z}_i)\big \Vert _1, \end{aligned}$$
(3)
subject to the constraints \(\underline{\varvec{W}}^{(\ell )}\le \varvec{W}^{(\ell )}\le \overline{\varvec{W}}^{(\ell )}\) and \(\underline{\varvec{b}}^{(\ell )}\le \varvec{b}^{(\ell )}\le \overline{\varvec{b}}^{(\ell )}\) for each layer. This way \(\underline{\varvec{\varPhi }}(\varvec{z})\le \varvec{\varPhi }(\varvec{z})\le \overline{\varvec{\varPhi }}(\varvec{z})\) is always guaranteed. The first two terms in (3) encourage the predicted interval \([\underline{\varvec{\varPhi }}(\varvec{z}_i),\overline{\varvec{\varPhi }}(\varvec{z}_i)]\) to contain the target signal \(\varvec{x}_i\), penalizing each component that lies outside by its squared distance to the nearest interval bound. The third term penalizes the interval size, so that the predicted intervals cannot grow arbitrarily large. While a quadratic penalty on the interval size is also possible and leads to theoretical bounds similar to (4), we choose to minimize the \(\ell _1\)-norm to make the intervals more inclusive of outliers. In addition, the tightness parameter \(\beta > 0\) can further tune the outlier-sensitivity of the intervals. This allows for a calibration of the INN uncertainty scores according to an application-specific risk budget. In practice, we found that choosing \(\beta \) comparable to the mean absolute error of the underlying prediction network yields a good trade-off between coverage [9] and tightness.
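A minimal PyTorch sketch of the loss (3) is given below, together with one way to satisfy the box constraints by construction: the interval bounds are parametrized as the frozen pretrained weights plus learned offsets. Both the names (`inn_loss`, `IntervalWeight`) and the offset parametrization are illustrative assumptions, not the authors' implementation.

```python
import torch

def inn_loss(x, out_lo, out_hi, beta):
    """Empirical INN loss (3) for one batch of targets x and interval outputs."""
    above = torch.clamp(x - out_hi, min=0)        # components above the upper bound
    below = torch.clamp(out_lo - x, min=0)        # components below the lower bound
    coverage = (above ** 2).sum() + (below ** 2).sum()
    tightness = (out_hi - out_lo).abs().sum()     # l1 penalty on the interval width
    return coverage + beta * tightness

class IntervalWeight(torch.nn.Module):
    """Interval [W_lo, W_hi] wrapped around a fixed pretrained weight W (illustrative)."""
    def __init__(self, W, init_width=1e-3):
        super().__init__()
        self.register_buffer("W", W.detach().clone())   # underlying weights stay frozen
        self.d_lo = torch.nn.Parameter(init_width * torch.ones_like(W))
        self.d_hi = torch.nn.Parameter(init_width * torch.ones_like(W))

    def bounds(self):
        # nonnegative offsets enforce W_lo <= W <= W_hi by construction
        return self.W - self.d_lo.abs(), self.W + self.d_hi.abs()
```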
Properties of Interval Neural Networks
The uncertainty estimate of an INN is given by the width of the prediction interval, i.e., \(\varvec{u}(\varvec{z}) = \overline{\varvec{\varPhi }}(\varvec{z}) - \underline{\varvec{\varPhi }}(\varvec{z})\). In terms of computational overhead, evaluating an INN scales linearly in the cost of evaluating the underlying prediction DNN with a constant factor 2. In contrast, the popular MCDrop [10] scales with a factor \(T\), the number of stochastic forward passes, for which its authors recommend at least \(T=10\), see "Baseline UQ methods" section.
Further, INNs come with theoretical coverage guarantees that can be derived from the Markov inequality: Assuming that the loss (3) is optimized during training to yield an INN with vanishing expected gradient with respect to the data distribution, we obtain
$$\begin{aligned} \mathbb {P}_{(\varvec{z},\varvec{x})}\left[ \underline{\varvec{\varPhi }}(\varvec{z})_i-\lambda \beta< \varvec{x}_i < \overline{\varvec{\varPhi }}(\varvec{z})_i+\lambda \beta \right] \ge 1-\frac{1}{\lambda }, \end{aligned}$$
(4)
for any \(\lambda > 0\). In other words, for an input and target pair \((\varvec{z},\varvec{x})\), the probability of each individual component of the target lying inside the corresponding predicted interval enlarged by \(\lambda \beta \) is at least \(1-\frac{1}{\lambda }\). As \(\beta \) is usually very small, this ensures a fast decay of the probability of a component of \(\varvec{x}\) lying outside the predicted interval bounds. Consequently, a component with a small uncertainty score was, with high probability, correctly reconstructed up to a small error. Of course, the training distribution needs to be representative of the true data distribution to extrapolate this property to unseen data.
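The following heuristic sketch indicates how a bound of this form arises; it treats the interval bounds themselves as the optimization variables, which is a simplification of the stationarity assumption above and not a verbatim reproduction of the formal argument. If the expected gradient of (3) with respect to the upper bound of component \(i\) vanishes, then
$$\begin{aligned} 0 = \frac{\partial }{\partial \overline{\varvec{\varPhi }}(\varvec{z})_i}\, \mathbb {E}_{(\varvec{z},\varvec{x})}\Big [ \max \big \{\varvec{x}_i-\overline{\varvec{\varPhi }}(\varvec{z})_i,0\big \}^2 + \beta \big (\overline{\varvec{\varPhi }}(\varvec{z})_i-\underline{\varvec{\varPhi }}(\varvec{z})_i\big )\Big ] = -2\,\mathbb {E}\big [\max \big \{\varvec{x}_i-\overline{\varvec{\varPhi }}(\varvec{z})_i,0\big \}\big ] + \beta , \end{aligned}$$
so that \(\mathbb {E}[\max \{\varvec{x}_i-\overline{\varvec{\varPhi }}(\varvec{z})_i,0\}] = \beta /2\). Markov's inequality applied to this nonnegative excess gives
$$\begin{aligned} \mathbb {P}_{(\varvec{z},\varvec{x})}\big [\varvec{x}_i-\overline{\varvec{\varPhi }}(\varvec{z})_i \ge \lambda \beta \big ] \le \frac{\mathbb {E}\big [\max \big \{\varvec{x}_i-\overline{\varvec{\varPhi }}(\varvec{z})_i,0\big \}\big ]}{\lambda \beta } = \frac{1}{2\lambda }, \end{aligned}$$
and analogously \(\mathbb {P}[\underline{\varvec{\varPhi }}(\varvec{z})_i-\varvec{x}_i \ge \lambda \beta ] \le \tfrac{1}{2\lambda }\); a union bound over the two tail events yields (4).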
Finally, the optimization of the loss (3) yields additional information: If the prediction \(\varvec{\varPhi }(\varvec{z})\) lies closer to one boundary of the predicted interval, the true target \(\varvec{x}\) has a higher probability of lying on the other side of the interval. Consequently, INNs can provide directional uncertainty scores. A quantitative assessment of this capability is given in Fig. 3c+d. We note that it is also possible to explore asymmetric uncertainty estimates in the probabilistic setting, e.g., via exponential family distributions [29] or quantile regression [24]. In contrast to INNs, these methods cannot be applied post hoc as they require substantial modifications to the underlying prediction network.
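One simple way to turn this observation into a score is sketched below; the specific normalization by the interval width is an illustrative choice and not necessarily the definition underlying Fig. 3c+d.

```python
import torch

def directional_score(pred, out_lo, out_hi, eps=1e-8):
    """Signed, normalized position of the prediction inside its interval.

    Values near +1 mean the prediction sits close to the lower bound, hinting
    that the target is more likely to lie above it; values near -1 indicate
    the opposite. The eps guard against zero-width intervals is our addition.
    """
    width = (out_hi - out_lo).clamp(min=eps)
    return ((out_hi - pred) - (pred - out_lo)) / width
```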
Baseline UQ methods
In addition to our INN approach, we consider two other related and popular UQ baseline methods for comparison. First, Monte Carlo dropout (MCDrop) [10] obtains uncertainty scores as the component-wise sample standard deviation of multiple stochastic forward passes of the same input signal. In other words, if \(\varvec{\varPhi }_1,\dots ,\varvec{\varPhi }_T\) are realizations of independent draws of random dropout masks for the same underlying network \(\varvec{\varPhi }\), the component-wise uncertainty estimate is \(\varvec{u}_{{\textsc {MCDrop}}}(\varvec{z}) = (\tfrac{1}{T-1} ( \sum _{t=1}^T \varvec{\varPhi }_t(\varvec{z})^2- \tfrac{1}{T} (\sum _{t=1}^T \varvec{\varPhi }_t(\varvec{z}))^2 ))^{1/2}\). Second, a direct variance estimation (ProbOut) was proposed in [22] and later expanded in [12]. Here, the number of output components of the prediction network is doubled, and the network is trained to approximate the mean and variance of a Gaussian distribution. The resulting network \(\varvec{\varPhi }_{{\textsc {ProbOut}}}:\mathbb {R}^n\rightarrow \mathbb {R}^n\times \mathbb {R}^n, \varvec{z}\mapsto (\varvec{\varPhi }_{\text {mean}}(\varvec{z}), \varvec{\varPhi }_{\text {var}}(\varvec{z}))\) is trained by minimizing the empirical loss \(\sum _i \Vert (\varvec{x}_i-\varvec{\varPhi }_{\text {mean}}(\varvec{z}_i)) / \sqrt{\varvec{\varPhi }_{\text {var}}(\varvec{z}_i)}\Vert _2^2 + \Vert \log \varvec{\varPhi }_{\text {var}}(\varvec{z}_i)\Vert _1\). The component-wise uncertainty score of ProbOut is \(\varvec{u}_{{\textsc {ProbOut}}}(\varvec{z}) = (\varvec{\varPhi }_{\text {var}}(\varvec{z}))^{1/2}\). Note that, in contrast to INN and MCDrop, the ProbOut approach requires the incorporation of UQ already during training. Thus, it cannot be applied post hoc to an already trained underlying network \(\varvec{\varPhi }\). The role of the actual prediction network is taken by \(\varvec{\varPhi }_\text {mean}\).
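For concreteness, the sketch below computes the MCDrop uncertainty from \(T\) stochastic forward passes and the ProbOut training loss. Keeping dropout active at inference time (e.g., by leaving the model in training mode) and the small `eps` for numerical stability are our assumptions, and the function names are illustrative.

```python
import torch

def mcdrop_uncertainty(model, z, T=10):
    """Component-wise sample standard deviation over T stochastic forward passes.

    Assumes `model` contains dropout layers that remain active during these
    forward passes (e.g., the model is kept in training mode).
    """
    with torch.no_grad():
        samples = torch.stack([model(z) for _ in range(T)])  # shape (T, ...)
    return samples.std(dim=0)                                # unbiased, i.e. 1/(T-1), by default

def probout_loss(x, mean, var, eps=1e-8):
    """Empirical ProbOut loss for targets x and predicted mean/variance outputs."""
    data_term = ((x - mean) ** 2 / (var + eps)).sum()        # variance-weighted squared residual
    reg_term = torch.log(var + eps).abs().sum()              # l1 norm of the log-variance
    return data_term + reg_term
```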