Experimental data are always characterized by some level of intrinsic variability and imprecision. Models trained on such data inherit that uncertainty and are also affected by another kind of uncertainty arising from insufficient training samples, which is often difficult to quantify, especially for complex models such as DNNs. For these reasons, modeling uncertainty in DNNs has recently attracted great interest and currently represents a major research direction in the field [5]. Accounting for uncertainty means more than outputting a confidence score for a given input. It also means changing the way predictions are made, taking the concept of “unknown” into account during the training and/or inference phases, and making sense of the resulting uncertainty estimates, which must satisfy certain principles to be truly useful and trustworthy for users. The latter point is especially important since it has been shown that modern neural networks, despite being highly accurate, are often “over-confident” in their output probabilities (e.g., a DNN could say that a certain image is 99% likely to be a cat, while the true observed confidence is much lower) [5]. Recent progress on uncertainty estimation in DNNs has emerged in the computer vision field, mainly focusing on convolutional neural networks (CNNs). In that context, uncertainty prediction is often studied to overcome interpretability and safety limitations of modern computer vision applications such as autonomous driving [6].
When we consider scientific data and applications, uncertainty estimation takes on a particular relevance. Experimental datasets are often comparatively small (being costly to generate), sparse, and affected by various kinds of inherent imprecision such as experimental errors, lack of coherent ontologies, and misreporting [16]. On top of this, common requirements of scientific applications put particular emphasis on uncertainty. For example, drug discovery revolves around exploring the “uncharted” chemical space; estimating the uncertainty of predictions made there is crucial, since there will always be a knowledge boundary beyond which predictions start to degrade. In such cases, uncertainty estimation becomes closely related to the problem of defining a domain of applicability for a model [13].
We study this problem in the context of molecular property prediction, formally referred to as Quantitative Structure-Activity Relationship (QSAR). In the last few years, pioneering neural network architectures for QSAR, such as graph neural networks (GNNs), have been proposed. Such models, combined with the increasing availability of data and computational power, have led to state-of-the-art performance for this task. However, these models still suffer from key limitations, such as limited interpretability and generalization ability [16].
In this respect, we investigate how uncertainty can be modeled in DNNs, theoretically reviewing existing methods and experimentally testing them on GNNs for molecular property prediction. In parallel, we develop a framework to qualitatively and quantitatively evaluate the estimated uncertainties from multiple points of view. An overview of the methodology is shown in Fig. 1.
2.1 A Bayesian Graph Neural Network for Molecular Property Prediction
Uncertainty can be the result of inherent data noise, or it can stem from what the model does not yet know. These two kinds of uncertainty, aleatoric and epistemic, can be combined to obtain the total predictive uncertainty of the model. We extend a GNN to model both uncertainty components.
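For a regression target, a standard way to write this combination found in the literature (not necessarily the exact formulation used in [13]) is the law of total variance; here \(\mu_{\theta}(\mathbf{x})\) and \(\sigma^2_{\theta}(\mathbf{x})\) denote the mean and noise variance predicted under weights \(\theta\), notation introduced only for illustration:
\[
\operatorname{Var}\left(\mathbf{y}\mid \mathbf{x}, \mathcal{D}\right)
= \underbrace{\mathbb{E}_{p(\theta\mid\mathcal{D})}\!\left[\sigma^2_{\theta}(\mathbf{x})\right]}_{\text{aleatoric}}
+ \underbrace{\operatorname{Var}_{p(\theta\mid\mathcal{D})}\!\left[\mu_{\theta}(\mathbf{x})\right]}_{\text{epistemic}}.
\]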
When not explicitly modeled, the inherent observation noise is assumed to be constant for every observed molecule. However, this assumption does not hold in many realistic settings, such as chemistry applications, where input-dependent noise needs to be modeled. Data-dependent aleatoric uncertainty is referred to as heteroscedastic, and its importance for DNNs has recently been highlighted [6]. Since aleatoric uncertainty is a property of the data, it can be learned directly from the data by adapting the model and the loss function. However, aleatoric uncertainty does not account for epistemic uncertainty. This can be overcome by performing Bayesian inference, through the definition of a Bayesian neural network.
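As a minimal sketch of how the loss function can be adapted, the following PyTorch-style snippet implements the usual heteroscedastic Gaussian negative log-likelihood. It assumes the GNN's regression head outputs two values per molecule, a predicted mean and a log noise variance; the class name and interface are illustrative, not the implementation of [13].

```python
import torch
import torch.nn as nn


class HeteroscedasticLoss(nn.Module):
    """Gaussian NLL with a learned, input-dependent (aleatoric) noise variance.

    The network is assumed to predict, for each molecule, both the property
    value `mean` and the log of the noise variance `log_var` (predicting the
    log keeps the variance positive and the optimization stable).
    """

    def forward(self, mean: torch.Tensor, log_var: torch.Tensor,
                target: torch.Tensor) -> torch.Tensor:
        precision = torch.exp(-log_var)  # 1 / sigma^2
        # 0.5 * (y - mu)^2 / sigma^2 + 0.5 * log(sigma^2), averaged over the batch
        return torch.mean(0.5 * precision * (target - mean) ** 2 + 0.5 * log_var)
```

Intuitively, the model can attenuate the squared error on noisy molecules by predicting a larger variance, but the \(0.5 \log \sigma^2\) term penalizes it for doing so everywhere.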
In a Bayesian neural network, the weights of the model \(\theta\) are distributions learned from the training data \(\mathcal{D}\), instead of point estimates, and therefore it is possible to predict the output distribution of \(\mathbf{y}\) for some new input \(\mathbf{x}\) through the predictive posterior distribution \(p\left(\mathbf{y}\mid \mathbf{x}, \mathcal{D}\right) = \int p\left(\mathbf{y}\mid \mathbf{x}, \theta\right) p\left(\theta \mid \mathcal{D}\right) d\theta\). Monte Carlo integration over \(M\) samples of the posterior distribution can approximate this intractable integral; however, obtaining samples from the true posterior is virtually impossible for DNNs. Therefore, an approximate posterior \(q\left(\theta\right) \approx p\left(\theta \mid \mathcal{D}\right)\) is introduced. A common technique to derive \(q\left(\theta\right)\) is variational inference (VI). Approximate VI techniques can scale to the large datasets and models of modern applications; major examples are MC-Dropout and ensembling-based methods. In [13], different techniques of this type have been experimentally compared.
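The sketch below illustrates the Monte Carlo approximation in the MC-Dropout style, assuming a GNN with dropout layers and the heteroscedastic head from the previous example (i.e., `model(x)` returns a mean and a log-variance per molecule). Function and variable names are hypothetical; this is one way to realize the scheme, not necessarily the code used in [13].

```python
import torch


def mc_dropout_predict(model: torch.nn.Module, x, n_samples: int = 50):
    """Approximate the predictive posterior with M stochastic forward passes.

    Keeping dropout active at inference time means each pass uses a different
    set of dropped weights, i.e., a different sample from the approximate
    posterior q(theta).
    """
    model.train()  # keep dropout active (simplification: also affects e.g. batch norm)
    means, variances = [], []
    with torch.no_grad():
        for _ in range(n_samples):
            mean, log_var = model(x)
            means.append(mean)
            variances.append(log_var.exp())
    means = torch.stack(means)          # shape: (M, batch)
    variances = torch.stack(variances)  # shape: (M, batch)

    predictive_mean = means.mean(dim=0)
    aleatoric = variances.mean(dim=0)   # E[sigma^2]: learned observation noise
    epistemic = means.var(dim=0)        # Var[mu]: disagreement across posterior samples
    total = aleatoric + epistemic       # total predictive variance
    return predictive_mean, aleatoric, epistemic, total
```

An ensembling-based variant would follow the same pattern, replacing the `n_samples` dropout passes with one deterministic pass per independently trained model.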
Experimental results on major public datasets [13] show that the computed uncertainty estimates correctly approximate the expected errors in many cases, in particular when test molecules are relatively similar to the training molecules. When this is not the case, uncertainty tends to be underestimated, but it still allows ranking test predictions by confidence. Moreover, experiments show that modeling both types of uncertainty is generally beneficial, and that the relative contribution of each uncertainty type to the total uncertainty is dataset dependent. Additionally, it has been shown that modeling uncertainty has a consistently positive impact on the model's accuracy.
The methodology, experimental results and additional analyses are detailed in [13] and in Chap. 3 of [12].