Introduction

With recent advances in deep neural networks (DNNs), machine learning has been widely applied to molecular property prediction and has successfully facilitated development pipelines in many applications [1, 2], including drug design [3], chemical biology [4], retrosynthesis [5, 6], and reaction engineering [7]. However, the key to the success of machine learning is comprehensive, high-quality datasets, which can be challenging to obtain in some areas of chemistry. Although large amounts of chemical data have accumulated in the literature over the years, the heterogeneous quality of data derived from different sources can significantly impact the harmonization of information and, hence, model performance [8]. Moreover, because research is performed with a clearly defined goal and question in mind, the data available in the literature usually concentrate on certain regions of chemical space, so the accuracy of data-driven models is not always satisfactory in new research fields [9]. Therefore, assessing when and to what extent a prediction can be considered reliable is crucial for applying machine learning to molecular property prediction, especially when targeting new chemicals that have not been investigated before [10].

Significant progress toward this end has been achieved by estimating the variance of predictions with uncertainty quantification methods [11,12,13,14,15,16,17,18,19,20,21]. Bayesian neural networks (BNNs) have long been studied as an effective way to model uncertainty in DNN predictions by treating weights and outputs as probability distributions [22, 23]. However, learning distributions over weights makes BNNs more complicated to train and use than other neural networks. Therefore, Bayesian approximation methods such as Deep Ensembles [24], Monte Carlo dropout [25], Bayes by Backprop [26], and Discriminative Jackknife [27], as well as conformal prediction methods such as Locally Valid and Discriminative confidence intervals (LVD) [28] and Conformalized Quantile Regression (CQR) [29], have been proposed to quantify uncertainty in deep learning-based molecular property prediction [16, 28, 30]. These uncertainty quantification methods are designed to model aleatoric uncertainty, epistemic uncertainty, or both [31,32,33], which refer to the irreducible and reducible parts of the uncertainty, respectively [32]. In the context of molecular property prediction, aleatoric uncertainty usually refers to the output uncertainty induced by the inherent noise in the data, caused by the limited resolution of experimental techniques. When not explicitly modeled, aleatoric uncertainty is often assumed to be the same for all samples (homoscedastic aleatoric uncertainty) [31, 34]. However, this assumption does not always hold because, in chemistry applications, one often needs to collect data from multiple sources of different accuracy, which leads to data-dependent aleatoric uncertainty and hence requires determining the degree of uncertainty in each datapoint (heteroscedastic aleatoric uncertainty) [31, 34]. On the other hand, epistemic uncertainty refers to the uncertainty arising from distributions over model parameters. In principle, epistemic uncertainty reflects what the model does not yet know and can be reduced by observing more data in the sparse or unknown domains of chemical space that the model has not fully learned [33]. A graphic illustration of these uncertainties is shown in Fig. 1.

Fig. 1

An illustration of the difference between aleatoric and epistemic uncertainties. The dots on the plot represent the available data points. Aleatoric uncertainty captures varying degrees of inherent noise in the data, while epistemic uncertainty reflects the ignorance gap due to a lack of data

Although separately quantifying aleatoric and epistemic uncertainties allows one to characterize the sources of uncertainty [35], rationalizing the estimated uncertainty of a prediction through the chemical structure of the query molecule remains challenging. In practice, reasoning about the failure of a prediction for a specific molecular structure is often done manually based on human intuition. Since predictions from black-box models such as deep learning methods are challenging to interpret and analyze due to their non-transparency [36], Explainable Artificial Intelligence (XAI) has recently received much attention [37,38,39]. Explainability refers to the ability to explain why an artificial intelligence model has reached a particular decision or prediction [39]. Equipped with explanations that fit human intuition, the internal mechanisms of models become more understandable and trustworthy when applied to safety-critical tasks that demand careful decision-making [38]. For molecular property prediction, significant progress has been made toward a better understanding of model characteristics and behaviors by analyzing molecular graphs, compounds, atoms, or feature representations [40,41,42]. For the same reason, it is highly desirable to rationalize the estimated uncertainty through chemical structures to aid in understanding the reason behind the failure of a prediction (e.g., unrecognized functional groups or chemical structures that are rare in the dataset). Explaining the estimated uncertainty through molecular structures is also useful for determining out-of-domain chemicals and improving model performance through the automatic selection of informative data, which are important research topics in active learning [14, 20] and drug discovery [17, 21, 43].

In this work, we develop an explainable uncertainty quantification method for deep learning-based prediction of molecular properties. This method can separately quantify aleatoric and epistemic uncertainties and attribute these uncertainties to the atoms in a molecule, which allows one to assess the reason behind the failure of a prediction. The atom-based uncertainty quantification method proposed in this work is adapted from the Deep Ensembles method [24], which has been used in many applications [12, 13, 44, 45]. However, similar to what Busk et al. observed [12], we found that Deep Ensembles can produce poorly calibrated aleatoric uncertainty estimates. To address this issue, we propose a post-hoc calibration method to refine the aleatoric uncertainty of Deep Ensembles. Unlike previous works that emphasize finding a scaling factor for calibrating the uncertainty of out-of-domain datasets [46,47,48], we focus on fine-tuning the weights of selected layers of the ensemble models to obtain better calibrated aleatoric uncertainty estimates.

In short, the main contributions are listed as follows.

  1. We develop an atom-based uncertainty model that can attribute the uncertainty to the atoms present in a molecule, which provides a better understanding of the chemical insights captured by the model.

  2. We propose a post-hoc calibration scheme to improve the aleatoric uncertainty calculated with Deep Ensembles for better uncertainty quantification.

Methods

In this section, we first introduce how Deep Ensembles calculate aleatoric and epistemic uncertainties, and then discuss the post-hoc calibration method we propose to improve Deep Ensembles for better uncertainty quantification. Lastly, we introduce the architecture of the atom-based uncertainty model, the evaluation metrics, and the datasets used to benchmark the performance of the different uncertainty estimation models.

Approximating uncertainty with Deep Ensembles

The concept of quantifying both aleatoric and epistemic uncertainty in one framework was presented by Kendall and Gal [31]. Meanwhile, the idea of applying the ensemble method to estimate the model uncertainty of deep learning models (Deep Ensembles) was first proposed by Lakshminarayanan et al. [24]. In practice, Deep Ensembles can be considered an alternative approximation to Bayesian inference [49] and can be implemented with two approaches: ensembling and bootstrapping [13, 24], both of which assemble several independently trained networks. Ensembling trains multiple networks with different initial weights such that each loss reaches a different local minimum, so the prediction for a query may vary across the networks. The extent of the discrepancy among the predictions reflects the epistemic uncertainty of the model. Bootstrapping, on the other hand, trains multiple networks by randomly sampling data from the dataset with replacement. With partially different training data, each network learns to predict a certain portion of the original dataset. In this work, we apply Deep Ensembles with the ensembling approach, as recommended by Lakshminarayanan et al. [24].

In this study, we assume the inherent noise in the data (aleatoric uncertainty) follows a Gaussian distribution [13]. Since Deep Ensembles combine the predictions of \(M\) networks, the final predictive distribution is assumed to be a uniformly weighted mixture of Gaussian distributions [24]. We note that if the type of noise is known in advance, the output distribution does not need to be Gaussian and can be approximated with a function closer to the actual noise distribution [13, 24]. To predict a Gaussian distribution with a neural network, the last layer of the network can be modified into two parallel layers that output the mean (\(\mu (x)\)) and variance (\({\sigma }^{2}(x)\)) of the Gaussian function [50]. The training objective is then to maximize the Gaussian likelihood of the observed data. Given a dataset \(\mathcal{D}=\{{x}_{k}, {y}_{k}\}_{k=1}^{N}\) where \({y}_{k}=\mu \left({x}_{k}\right)+\epsilon \left({x}_{k}\right)\) with \(\epsilon \left({x}_{k}\right)\sim \mathcal{N}\left(0, {\sigma }^{2}\left({x}_{k}\right)\right)\) [51], the target probability distribution for input \({x}_{k}\) can be written as

$$P\left( y_{k} \mid x_{k} \right) = \left( 2\pi \sigma^{2} \left( x_{k} \right) \right)^{ - \frac{1}{2}} \exp \left( - \frac{\left( y_{k} - \mu \left( x_{k} \right) \right)^{2}}{2\sigma^{2} \left( x_{k} \right)} \right)$$
(1)

where \(\mu ({x}_{k})\) is the mean and \({\sigma }^{2}\left({x}_{k}\right)\) is the variance [52].

Given a neural network model \(m\) and assuming a predictive distribution \(\widehat{{\varvec{y}}}\sim \mathcal{N}({\mu }_{m}({x}_{k}), {\sigma }_{m}^{2}({x}_{k}))\) with mean \({\mu }_{m}({x}_{k})\) and variance \({\sigma }_{m}^{2}({x}_{k})\), the optimal weights can be found by maximum likelihood estimation (Eq. 1), which is equivalent to minimizing the negative log-likelihood (NLL), i.e., the heteroscedastic loss, of the predictive distributions.

$$- \ln \left( L \right) \propto \mathop \sum \limits_{k = 1}^{N} \frac{1}{{2\sigma_{m}^{2} \left( {x_{k} } \right)}}(y_{k} - \mu_{m} \left( {x_{k} } \right))^{2} { } + \frac{1}{2}\ln \left( {\sigma_{m}^{2} \left( {x_{k} } \right)} \right) + \frac{1}{2}\ln \left( {2\pi } \right)$$
(2)

An uncertainty model trained with the heteroscedastic loss minimizes the NLL by tuning the predicted mean \({\mu }_{m}(x)\) and variance \({\sigma }_{m}^{2}(x)\) at the same time. Since aleatoric uncertainty is the noise in the data, the output variance \({\sigma }_{m}^{2}({x}_{k})\) is defined as the aleatoric uncertainty of sample \(k\), whose value depends on the squared error between the true value \({y}_{k}\) and the mean \({\mu }_{m}({x}_{k})\) predicted by model \(m\) (Eq. 2) [50]. The underlying assumption of this approach is that the error between \({y}_{k}\) and \({\mu }_{m}({x}_{k})\) is solely caused by the data noise in \({y}_{k}\). However, in practice, the function approximation for \({\mu }_{m}({x}_{k})\) may also contribute to the error, so the aleatoric uncertainty predicted by this method is model-dependent and may be overestimated when the data are poorly predicted by the model [53].
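For illustration, a minimal PyTorch sketch of the heteroscedastic loss (Eq. 2) is given below. Predicting the log-variance to keep \({\sigma }^{2}\) positive is an implementation choice assumed here, not a detail prescribed in the text.

```python
import math
import torch

def heteroscedastic_nll(mu, log_var, y):
    """Gaussian negative log-likelihood (Eq. 2), averaged over a batch.
    mu, log_var, y: tensors of the same shape; predicting the
    log-variance keeps sigma^2 strictly positive."""
    var = torch.exp(log_var)
    nll = (0.5 * (y - mu) ** 2 / var        # error term weighted by 1/(2*sigma^2)
           + 0.5 * torch.log(var)           # penalty for predicting a large variance
           + 0.5 * math.log(2 * math.pi))   # constant term
    return nll.mean()
```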

Because Deep Ensembles combine the predictions of \(M\) models, the ensemble prediction is a mixture of Gaussians \({\widehat{{\varvec{y}}}}_{{\varvec{e}}{\varvec{n}}{\varvec{s}}}=\frac{1}{M}\sum_{m=1}^{M}\widehat{{\varvec{y}}}\), where the ensemble mean \({\mu}_{ens}\) is calculated by averaging the output means of the \(M\) models

$$\mu_{ens} = \frac{1}{M}\mathop \sum \limits_{m = 1}^{M} \mu_{m}$$
(3)

and the ensemble variance \({\sigma }_{ens}^{2}\) is given by

$$\sigma_{ens}^{2} = \frac{1}{M}\mathop \sum \limits_{m = 1}^{M} \left( {\sigma_{m}^{2} + \mu_{m}^{2} } \right) - \mu_{ens}^{2} = \frac{1}{M}\mathop \sum \limits_{m = 1}^{M} \sigma_{m}^{2} + \frac{1}{M}\mathop \sum \limits_{m = 1}^{M} \left( {\mu_{m} - \mu_{ens} } \right)^{2}$$
(4)

where \(\frac{1}{M}\sum_{m=1}^{M}{\sigma }_{m}^{2}\) and \(\frac{1}{M}\sum_{m=1}^{M}{\left({\mu}_{m}-{\mu}_{ens}\right)}^{2}\) are the aleatoric and epistemic uncertainties of the ensemble prediction, respectively [12, 24, 31]

$$\sigma_{ale}^{2} = \frac{1}{M}\mathop \sum \limits_{m = 1}^{M} \sigma_{m}^{2} ,\,\, \sigma_{epi}^{2} = \frac{1}{M}\mathop \sum \limits_{m = 1}^{M} \left( {\mu_{m} - \mu_{ens} } \right)^{2}$$
(5)
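As an illustration of Eqs. 3–5, the snippet below combines per-model means and variances into the ensemble mean and the aleatoric and epistemic components; the array shapes and names are assumptions of this sketch.

```python
import numpy as np

def combine_ensemble(means, variances):
    """means, variances: arrays of shape (M, N) holding mu_m(x_k) and
    sigma_m^2(x_k) for M models and N samples."""
    mu_ens = means.mean(axis=0)                        # Eq. 3
    sigma_ale = variances.mean(axis=0)                 # aleatoric part of Eq. 5
    sigma_epi = ((means - mu_ens) ** 2).mean(axis=0)   # epistemic part of Eq. 5
    sigma_ens = sigma_ale + sigma_epi                  # total variance, Eq. 4
    return mu_ens, sigma_ale, sigma_epi, sigma_ens
```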

Post-hoc calibration for Deep Ensembles

Deep Ensembles, proposed by Lakshminarayanan et al. [24], is a simple and popular non-Bayesian approximation for modeling epistemic uncertainty. In Deep Ensembles, the aleatoric uncertainty is estimated by averaging the predicted variances of the \(M\) models (Eq. 5). Since each model is trained with the heteroscedastic loss function (Eq. 2), the aleatoric uncertainty predicted by model \(m\) (\({\sigma }_{m}^{2}\)) should be well calibrated with respect to the errors between the true values and the means predicted by model \(m\) (\({\mu }_{m}\)). Because the ensemble model estimates its mean by averaging the predicted means of the \(M\) models, the error of the ensemble model is expected to be reduced [12], which should, in principle, be accompanied by a lower aleatoric uncertainty than that of an individual model. However, averaging the aleatoric uncertainties predicted by the individual models (Eq. 5) does not generally reduce the magnitude of the aleatoric uncertainty, which leads to an overestimation of \({\sigma }_{ale}^{2}\) and an underconfident ensemble model [12].

To address this issue, we propose an intuitive post-hoc calibration method that improves the quality of the aleatoric uncertainty of the ensemble model by retraining a portion of the network weights. As shown in Fig. 2A, a neural network with predicted uncertainty contains two output layers that predict the mean and variance of the predictive distribution. In the post-hoc calibration, the networks in the ensemble are retrained with only the weights in the variance layers (VL) relaxed and all other weights frozen, so the calibration only affects the value of the aleatoric uncertainty (Fig. 2B). During calibration, the mean and aleatoric uncertainty derived from the ensemble (\({\mu }_{ens}\) and \({\sigma }_{ale}^{2}\)) are used to calculate the heteroscedastic loss, ensuring that \({\sigma }_{ale}^{2}\) correctly represents the errors between the true values and the \({\mu }_{ens}\) predicted by the ensemble model. Note that since only the weights in the variance layers are retrained during post-hoc training, the output mean \({\mu }_{m}\) of each model remains unchanged, so the values of \({\mu }_{ens}\) and \({\sigma }_{epi}^{2}\) stay the same after the calibration procedure.
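A minimal sketch of this calibration loop is shown below. It assumes each network exposes its variance layer as a `variance_layer` attribute and returns a `(mu_m, sigma_m^2)` pair; these names and the optimizer settings are assumptions of the sketch, not details stated in the text.

```python
import torch

def posthoc_calibrate(models, loader, epochs=50, lr=1e-4):
    """Retrain only the variance layers so that the averaged aleatoric
    uncertainty matches the errors of the ensemble mean (Fig. 2B)."""
    params = []
    for m in models:
        for p in m.parameters():
            p.requires_grad = False              # freeze everything ...
        for p in m.variance_layer.parameters():  # ... except the VL weights
            p.requires_grad = True
            params.append(p)
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            mus, vars_ = zip(*(m(x) for m in models))   # each model returns (mu_m, sigma_m^2)
            mu_ens = torch.stack(mus).mean(dim=0)       # unchanged, since mean layers are frozen
            sigma_ale = torch.stack(vars_).mean(dim=0)  # Eq. 5, the quantity to calibrate
            # heteroscedastic loss (Eq. 2) with mu_ens and sigma_ale; constant term dropped
            loss = (0.5 * (y - mu_ens) ** 2 / sigma_ale
                    + 0.5 * torch.log(sigma_ale)).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
```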

Fig. 2

Distribution mean and variance predicted by (A) a neural network model and (B) Deep Ensembles. Because Deep Ensembles estimate the mean value \({\mu }_{ens}\) by averaging the predicted means of \(M\) neural networks, the variance associated with Deep Ensembles (\({\sigma }_{ale}^{2}\)) is expected to be lower than the variance of one network in the ensemble (\({\sigma }_{m}^{2}\)). Therefore, one may overestimate \({\sigma }_{ale}^{2}\) by averaging the \({\sigma }_{m}^{2}\) of the networks in the ensemble (Eq. 5). This problem can be resolved by refining the weights in the variance layers (highlighted in yellow) to minimize the heteroscedastic loss function calculated with \({\mu }_{ens}\) and \({\sigma }_{ale}^{2}\) in a second round of training. FC, ML, and VL refer to the fully-connected layer, mean layer, and variance layer, respectively

Molecule- and atom-based uncertainty models

Various schemes have been proposed to encode molecular structures into vector representations suitable for conventional machine learning algorithms [54]. In this work, we adopt the Directed Message Passing Neural Network (D-MPNN) [55], a 2D graph convolutional model, to encode molecular structures. This model contains a message passing phase and a readout phase, as shown in Fig. 3. The implementation of the message passing phase in this work follows the Chemprop model [1], the details of which can be found in the work of Yang et al. [1]. In brief, the input is a graph containing the node (atom) and edge (bond) information of a molecule. The D-MPNN concatenates the information of each atom with that of the bonds linked to it to form initial fingerprints (\({h}_{i}^{0}\)). The atom and bond features contained in \({h}_{i}^{0}\) are summarized in Additional file 1: Tables S1 and S2, and were selected following the work of Chen et al. [56]. In the bond-level message passing procedure, each atom collects information from its neighboring atoms, with bond direction considered, and passes it through layers and activation functions for \(t\) iterations, resulting in atomic fingerprints with local, global, and directional knowledge (\({h}_{i}^{t}\)). In the original setting of Chemprop, these hidden vertex features are summed together to derive a molecular fingerprint, which is then passed to the readout phase to predict the molecular property and uncertainty (Fig. 3A) [1, 13].

Fig. 3

The architecture of the (A) molecule-based and (B) atom-based uncertainty models. The network takes molecular graphs with initial atom and bond information as input (\({h}_{i}^{0}\)). With \(t\) iterations of D-MPNN message passing, each atom exchanges information with its neighboring atoms to generate the learned atomic fingerprints \({h}_{i}^{t}\). In (A), the learned atomic fingerprints are summed to form the learned molecular fingerprint. The molecular representation is passed into the fully-connected layer (FC Layer), and then into the mean layer (ML) and variance layer (VL), respectively, to obtain the mean \({\mu }_{m}\) and variance \({\sigma }_{m}^{2}\) of the molecular property distribution \(\widehat{y}\). In (B), the learned atomic fingerprints are passed into the FC Layer, ML, and VL to predict the property distribution \({\widehat{y}}_{i}\) of each atom separately. The molecular property distribution \(\widehat{y}\) is obtained by aggregating the \({\widehat{y}}_{i}\) of each atom in the molecule

In this work, we introduce the atom-based uncertainty method, in which the learned atomic fingerprints are passed separately to the readout phase to predict atom-based properties and uncertainties instead of being pooled together to form the molecular fingerprint. As shown in Fig. 3B, we modified the readout phase of Chemprop to predict the atomic property contributions and the associated uncertainties, which are then aggregated to derive the molecular property and molecular uncertainty. The algorithm in the readout phase is the main difference between our work and the molecule-based uncertainty model (Fig. 3A) proposed previously [1, 13]. In the atom-based uncertainty method, the molecular property distribution \(\widehat{{\varvec{y}}}\) with mean \({\mu }_{m}\) and variance \({\sigma }_{m}^{2}\) is regarded as the sum of the atomic Gaussian distributions \({\widehat{{\varvec{y}}}}_{{\varvec{i}}}\) of the \(n\) atoms in a molecule.

$${\widehat{\varvec{y}}} \sim N(\mu_{m} , \sigma_{m}^{2} ),\,\,{\widehat{\varvec{y}}_{\varvec{i}}} \sim N(\mu_{m,i} , \sigma_{m,i}^{2} )$$
(6)
$$\widehat{\varvec{y}} = \mathop \sum \limits_{{{i} = 1}}^{{n}} \widehat{\varvec{y}}_{{\varvec{i}}}$$
(7)

In detail, the atomic fingerprint of atom \(i\) derived from the message passing phase (\({h}_{i}^{t}\)) is passed into the fully-connected layers, which predict the atomic mean \({\mu }_{m,i}\) through the mean layer \({f}_{m}\left(\cdot \right)\) and the atomic standard deviation \({\sigma }_{m,i}\) through the variance layer \({g}_{m}\left(\cdot \right)\). The mean of the molecular property \({\mu }_{m}\) is simply the sum of the atomic means.

$$\mu_{m} = \mathop \sum \limits_{i = 1}^{n} \mu_{{m,i{ }}} = \sum_{i = 1}^{n} f_{m} \left( {h_{i}^{t} } \right)$$
(8)

On the other hand, the molecular variance \({\sigma }_{m}^{2}\) can be derived by summing the elements of a covariance matrix whose diagonal elements correspond to the atomic variances \({\sigma }_{m,i}^{2}\) and whose off-diagonal elements correspond to the covariance terms \(cov(\widehat{{{\varvec{y}}}_{{\varvec{i}}}}, \widehat{{{\varvec{y}}}_{{\varvec{j}}}})\) between pairs of atoms in the molecule [57]

$$\sigma_{m}^{2} = \mathop \sum \limits_{i = 1}^{n} \sigma_{{m,i{ }}}^{2} + { }2\mathop \sum \limits_{i = 2}^{n} \mathop \sum \limits_{j = 1}^{i - 1} cov\left( {\widehat{{{\varvec{y}}_{{\varvec{i}}} }},{ }\widehat{{{\varvec{y}}_{{\varvec{j}}} }}} \right)$$
(9)
$$\sigma_{m,i}=g_{m} \left( {h_{i}^{t} } \right),\,cov(\widehat{{\varvec{y}_{\varvec{i}} }},\widehat{{\varvec{y}_{\varvec{j}} }}) = \rho_{ij} \cdot \sigma_{i} \cdot \sigma_{j}$$
(10)

where \({\sigma }_{i}\) and \({\sigma }_{j}\) are the standard deviation of \(\widehat{{{\varvec{y}}}_{{\varvec{i}}}}\) and \(\widehat{{{\varvec{y}}}_{{\varvec{j}}}}\), and \({\rho }_{ij}\) is the correlation coefficient between atoms \(i\) and \(j\). In this work, we use the Pearson correlation coefficient [58] of the learned atomic fingerprints \({h}_{i}^{t}\) and \({h}_{j}^{t}\) to estimate the correlation between the property values of atom i and j

$$\rho_{ij} = \rho \left(\widehat{{\varvec{y}_{\varvec{i}} }},\widehat{{\varvec{y}_{\varvec{j}} }}\right) = \rho_{m} \left(h_{i}^{t} ,h_{j}^{t} \right) \cong \frac{\sum_{v = 1}^{d} \left( h_{i,v}^{t} - \overline{h_{i}^{t}} \right)\left( h_{j,v}^{t} - \overline{h_{j}^{t}} \right)}{\sqrt{\sum_{v = 1}^{d} \left( h_{i,v}^{t} - \overline{h_{i}^{t}} \right)^{2}}\ \sqrt{\sum_{v = 1}^{d} \left( h_{j,v}^{t} - \overline{h_{j}^{t}} \right)^{2}}}$$
(11)

where \({h}_{i,v}^{t}\) is the \(v\)-th element of the fingerprint \({h}_{i}^{t}\) and \(\overline{{h}_{i}^{t}}\) is the mean over its \(d\) elements. Similar to the molecule-based uncertainty method, one can aggregate the outputs of several atom-based uncertainty models to derive an ensemble mean and variance following the procedures discussed in the previous subsection (Eqs. 3–5).
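As a sketch of Eqs. 7–11, the function below aggregates atomic means, standard deviations, and fingerprints into a molecular mean and variance. The tensor shapes and the small epsilon guarding against zero-norm fingerprints are assumptions of this sketch.

```python
import torch

def aggregate_atomic_outputs(atom_mu, atom_sigma, atom_fp, eps=1e-8):
    """atom_mu, atom_sigma: (n,) atomic means and standard deviations;
    atom_fp: (n, d) learned atomic fingerprints h_i^t."""
    mol_mu = atom_mu.sum()                                   # Eq. 8
    centered = atom_fp - atom_fp.mean(dim=1, keepdim=True)   # center each fingerprint over its d elements
    norm = centered.norm(dim=1, keepdim=True).clamp_min(eps)
    rho = (centered / norm) @ (centered / norm).T            # Pearson correlation matrix, Eq. 11
    cov = rho * atom_sigma[:, None] * atom_sigma[None, :]    # rho_ij * sigma_i * sigma_j, Eq. 10
    mol_var = cov.sum()                                      # atomic variances + covariances, Eq. 9
    return mol_mu, mol_var
```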

Evaluation metrics

We use mean absolute error (MAE) and root mean square error (RMSE) as the evaluation metrics of property prediction accuracy

$$MAE = \frac{1}{N}\left( {\mathop \sum \limits_{k = 1}^{N} \left| {\mu_{ens} \left( {x_{k} } \right){ } - y_{k} } \right|} \right)$$
(12)
$$RMSE = \sqrt {\frac{1}{N}\left( {\mathop \sum \limits_{k = 1}^{N} \left( {\mu_{ens} \left( {x_{k} } \right){ } - y_{k} } \right)^{2} } \right)}$$
(13)

where \(N\) is the number of samples used for evaluation. Since there is no ground truth for uncertainties, evaluating predicted uncertainty with traditional benchmarks is difficult. In this work, we use the expected calibration error (ECE) and the expected normalized calibration error (ENCE) as the evaluation metrics for predicted uncertainties [13, 30]. The details of these two metrics are discussed below.

Confidence-based Calibration Curve and ECE The outputs of the uncertainty model are assumed to be the mean and variance of a Gaussian distribution. In principle, one can use the percentage of samples whose true values fall within the confidence interval defined by the predictive distribution to evaluate the quality of the predicted variance. For a well-calibrated model, the probability that \({y}_{k}\) falls within a confidence interval should equal the corresponding confidence level. The confidence-based calibration curve examines the fraction of data that actually falls within each confidence level. The average absolute difference between the confidence level (e.g., a 60% confidence level) and the empirical fraction (e.g., 57% of data falling within the confidence interval) over all considered levels is defined as the ECE [13, 45]

$$ECE = \frac{1}{B}\left( {\mathop \sum \limits_{b = 1}^{B} \left| {CL_{b} - EF_{b} } \right|} \right)$$
(14)

where \(B\) is the number of confidence levels considered, \(C{L}_{b}\) is the percentage of confidence level \(b\), and \(E{F}_{b}\) is the fraction of data points falling within confidence interval \(b\).
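A minimal sketch of the confidence-based calibration check (Eq. 14) under the Gaussian assumption is given below; the choice of evenly spaced confidence levels is an assumption of this sketch.

```python
import numpy as np
from scipy.stats import norm

def expected_calibration_error(mu, var, y, n_levels=19):
    """ECE (Eq. 14): average absolute gap between each confidence level
    CL_b and the empirical fraction EF_b of true values falling inside
    the corresponding Gaussian prediction interval."""
    sigma = np.sqrt(var)
    levels = np.linspace(0.05, 0.95, n_levels)
    gaps = []
    for cl in levels:
        half_width = norm.ppf(0.5 + cl / 2.0) * sigma   # symmetric interval around mu
        ef = np.mean(np.abs(y - mu) <= half_width)       # empirical fraction EF_b
        gaps.append(abs(cl - ef))                        # |CL_b - EF_b|
    return float(np.mean(gaps))
```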

Error-based Calibration Curve and ENCE The error-based calibration curve examines the consistency between the expected error (measured by mean squared error, MSE) and the predicted uncertainty \({\sigma }^{2}(x)\) under the assumption that the estimator is unbiased [59].

$$\forall \sigma , {\mkern 1mu} {\mathbb{E}}\left[ {\left( {\mu \left( x \right) - y} \right)^{2} | \sigma^{2} \left( x \right) = \sigma^{2} } \right] = \sigma^{2}$$
(15)

In practice, the testing data are sorted by the predicted uncertainty and divided into \(B\) bins with \(K\) data points in each bin. The error-based calibration curve is a parity plot between the RMSE (Eq. 13) and the root mean uncertainty (RMU) of the data in each bin

$$RMU = \sqrt {\frac{1}{K}\mathop \sum \limits_{k = 1}^{K} \sigma_{k}^{2} }\,\,.$$
(16)

The ENCE measures the normalized difference between the predicted uncertainty (RMU) and the observed error (RMSE) in each bin, and a lower ENCE indicates better calibration

$$ENCE = \frac{1}{B}\mathop \sum \limits_{b}^{B} \frac{{\left| {RMSE_{b} - RMU_{b} } \right|}}{{RMU_{b} }}\,\,.$$
(17)
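Similarly, a sketch of the error-based calibration metric (Eqs. 16–17); the equal-size binning of samples sorted by predicted uncertainty follows the description above, while the default bin count is an assumption.

```python
import numpy as np

def expected_normalized_calibration_error(mu, var, y, n_bins=10):
    """ENCE (Eq. 17): per-bin normalized gap between RMSE and RMU,
    with samples sorted by predicted uncertainty."""
    order = np.argsort(var)
    bins = np.array_split(order, n_bins)                    # B bins of roughly K samples each
    ence = 0.0
    for idx in bins:
        rmu = np.sqrt(np.mean(var[idx]))                    # root mean uncertainty, Eq. 16
        rmse = np.sqrt(np.mean((mu[idx] - y[idx]) ** 2))    # Eq. 13 within the bin
        ence += abs(rmse - rmu) / rmu
    return ence / n_bins
```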

Computational details

The datasets used for the benchmarks in this work include QM9 [60], Zinc15 [61], Delaney [62], and Lipophilicity [63] (Table 1), which were accessed from MoleculeNet [64]. The molecule-based uncertainty model proposed by Scalia et al. [13] (Fig. 3A) was taken as the base model to validate the applicability of the post-hoc calibration method and the performance of the atom-based uncertainty quantification method. In this work, each ensemble contains 30 networks. We apply the heteroscedastic loss (Eq. 2) during training to acquire aleatoric uncertainty and Deep Ensembles to acquire epistemic uncertainty. As shown in Fig. 3, we use 2 D-MPNN layers to encode input molecules and 2 fully-connected layers, where the last layer contains two parallel layers outputting the mean and variance of the predictive distribution. Each dataset is randomly split into training, validation, and testing data in a ratio of 8:1:1. Early stopping halts training if the heteroscedastic loss on the validation data fails to decrease for more than 15 epochs.
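For reference, a minimal sketch of the 8:1:1 random split and the early-stopping criterion described above is shown below; the function and class names are illustrative assumptions rather than parts of our implementation.

```python
import numpy as np

def split_indices(n_samples, seed=0):
    """Random 8:1:1 train/validation/test split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train, n_val = int(0.8 * n_samples), int(0.1 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

class EarlyStopping:
    """Stop training when the validation heteroscedastic loss has not
    decreased for `patience` consecutive epochs (15 in this work)."""
    def __init__(self, patience=15):
        self.patience, self.best, self.wait = patience, float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.wait = val_loss, 0
        else:
            self.wait += 1
        return self.wait > self.patience  # True => halt training
```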

Table 1 Summary of the Benchmark Datasets

Results and discussion

This section is organized as follows. We first present how the post-hoc calibration scheme improves the quality of aleatoric uncertainty of the ensemble model. Next, we compare the prediction accuracy and uncertainty performance between the molecule- and atom-based uncertainty models, and discuss how the atom-based uncertainty model can help to identify the chemical structures that lead to the failure of a prediction.

Post-hoc calibration of aleatoric uncertainty

The post-hoc calibration scheme aims to fine-tune the aleatoric uncertainty overestimated by the ensemble scheme. Since each network in the ensemble was trained to minimize its own heteroscedastic loss (Eq. 2), the calibration curve based on the aleatoric uncertainty of each network is close to the diagonal line (perfect calibration), resulting in a low ECE, as shown in Fig. 4A. However, because the error of the ensemble model is often lower than that of an individual model, simply averaging the aleatoric uncertainty of each individual model (Eq. 5) may overestimate the \({\sigma }_{ale}^{2}\) of the ensemble model. Therefore, as shown in Figs. 4B and 5A, the confidence-based and error-based calibration curves for Deep Ensembles are far from perfect calibration, leading to a higher ECE and ENCE than those of a single model. This problem can be alleviated using the post-hoc calibration procedure shown in Fig. 2B, which retrains the variance layers to output a lower and better calibrated uncertainty for the ensemble (Figs. 4C and 5B). See the Supporting Information for more discussion of how the aleatoric uncertainty varies before and after post-hoc calibration (Additional file 1: Fig. S8).

Fig. 4

Confidence-based calibration curves and ECEs based on the aleatoric uncertainty for the Zinc15 testing set. The aleatoric uncertainty is calculated with (A) a single atom-based uncertainty model, (B) an ensemble of atom-based uncertainty models, and (C) an ensemble of atom-based uncertainty models after post-hoc calibration. The calibration procedure reduces the ECE of the ensemble method from 0.3952 to 0.0722. The shaded area shown in (A) is 95% CI calculated with 30 independent models

Fig. 5

Error-based calibration curves and ENCEs based on the aleatoric uncertainty for the Zinc15 testing set. The aleatoric uncertainty is calculated with (A) an ensemble of atom-based uncertainty models and (B) an ensemble of atom-based uncertainty models after post-hoc calibration. The calibration procedure reduces the ENCE of the ensemble method from 0.8148 to 0.1721

Table 2 summarizes the ECE and ENCE values calculated with the atom-based and molecule-based uncertainty models for different chemical datasets before and after the post-hoc calibration. The confidence- and error-based calibration curves for these datasets can be found in the Supporting Information (Additional file 1: Figs. S1–S15). For most of the datasets we examined, the ECE and ENCE decrease after calibration, suggesting that the quality of the aleatoric uncertainty is generally improved through the calibration procedure. We note that the effectiveness of the post-hoc calibration depends on the error reduction that the ensemble model achieves relative to the individual models it aggregates. When ensembling greatly reduces the error, the aleatoric uncertainty obtained by averaging the \({\sigma }_{m}^{2}\) of the individual models is largely overestimated, and the effect of the post-hoc calibration is therefore pronounced. However, there are also cases in which ensembling does not significantly improve model performance. For example, the ensemble model does not outperform single models for the Lipophilicity dataset, so the predicted aleatoric uncertainty is not significantly overestimated before calibration (Table 2). In this case, the room for improving the aleatoric uncertainty is very limited.

Table 2 ECE and ENCE performance of ensemble models before and after post-hoc calibration for different datasets

Comparison of atom- and molecule-based uncertainty models

The property prediction accuracy and uncertainty performance of the atom- and molecule-based uncertainty models are listed in Table 3. For most of the testing sets, the MAE, RMSE, ECE, and ENCE of the atom-based uncertainty model are comparable to those of the previously proposed molecule-based uncertainty model [13], which validates the usefulness of the atom-based architecture (Fig. 3B) for molecular property and uncertainty predictions. The advantage of the atom-based uncertainty model is that it provides an extra layer of chemical insight into the predicted uncertainty. Taking a molecular graph as input, the atom-based uncertainty model outputs not only the molecular property but also the atomic contributions to the property and the associated uncertainties. With these outputs, one can better understand how the model attributes the property prediction and uncertainty to the atoms in the molecule, and therefore quickly assess the reason behind the potential failure of a prediction. Examples illustrating this point are given in the following subsection.

Table 3 Prediction accuracy and uncertainty performance of the atom-based uncertainty model (AtomUnc) and the molecule-based uncertainty model (MolUnc)

Analysis of atomic uncertainty

Because the atom-based uncertainty model attributes the predicted uncertainty to the atoms in a molecule, it can help to identify chemical structures that are under-represented in the dataset and types of species whose data potentially contain significant noise. To illustrate this point, we carried out two experiments with modified QM9 datasets to mimic scenarios in which data quality and quantity vary for different types of species. In the first experiment, artificial noise is added to the data of nitrogen-containing molecules in QM9 to examine the capacity of the atom-based uncertainty model to capture the origin of data noise. In the second experiment, nitrogen-containing species are removed from QM9 to verify the ability of the atom-based uncertainty model to identify under-represented chemical structures. The results of these two experiments are discussed below.

Heterogeneous data quality To verify that the predicted atomic aleatoric uncertainty can recognize the source of noise in a molecule, we created a noisy dataset \({\mathcal{D}}^{noise}=\{{x}_{k}, {y}_{k}^{noise}\}_{k=1}^{N}\) by adding \(r\) independent Gaussian noise terms (mean \(=\) 0, variance \(=\) 1) to the labels of molecules containing nitrogen atoms

$$y_{k}^{noise} = y_{k} + \sum\limits_{j = 1}^{r} { \epsilon_{j} }$$
(18)

where \({\epsilon }_{j}\sim \mathcal{N}(0, 1)\), \({y}_{k}\) is the true property value, and \(r\) is the number of nitrogen atoms in molecule \(k\). Note that the property values of molecules without nitrogen atoms remain unchanged. A nitrogen-noisy model was trained with \({\mathcal{D}}^{noise}\), and a base model was trained with the unmodified dataset \(\mathcal{D}=\{{x}_{k}, {y}_{k}\}_{k=1}^{N}\) for comparison.
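For illustration, a sketch of the noise-injection procedure (Eq. 18) is shown below; the use of RDKit to count nitrogen atoms from SMILES is an assumption of this example, not a detail given in the text.

```python
import numpy as np
from rdkit import Chem

def add_nitrogen_noise(smiles_list, y, seed=0):
    """Eq. 18: add r independent N(0, 1) noise terms to each label,
    where r is the number of nitrogen atoms in the molecule."""
    rng = np.random.default_rng(seed)
    y_noise = np.array(y, dtype=float)
    for k, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        r = sum(atom.GetSymbol() == "N" for atom in mol.GetAtoms())
        y_noise[k] += rng.normal(0.0, 1.0, size=r).sum()  # unchanged if r == 0
    return y_noise
```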

Aleatoric uncertainties of the testing data calculated with the nitrogen-noisy and base models are shown in Fig. 6. With the base model, most of the molecules have low predicted aleatoric uncertainties. In contrast, with the nitrogen-noisy model, the distribution of aleatoric uncertainty shifts to higher values as the number of nitrogen atoms in the molecules increases, which suggests that the model successfully learns the artificial noise introduced in \({\mathcal{D}}^{noise}\). Figure 7 shows four test molecules with their molecular and atomic aleatoric uncertainties. Because these molecules contain nitrogen atoms, the molecular uncertainties predicted with the nitrogen-noisy model (Fig. 7B) are higher than those of the base model (Fig. 7A). Through the analysis of the atomic aleatoric uncertainty, one can see that the increase in uncertainty is concentrated at the nitrogen atoms.

Fig. 6

Aleatoric uncertainty distributions of QM9 testing data calculated with (A) the base model and (B) the nitrogen-noisy model. The testing data are grouped by the number of nitrogen atoms in the molecule. The molecules without nitrogen atoms are denoted as 0 N, molecules containing one nitrogen atom are denoted as 1 N, and so on

Fig. 7

Aleatoric uncertainties of molecules with nitrogen atoms predicted by (A) the base model and (B) the nitrogen-noisy model. Numbers labeled at each atom are the predicted atomic aleatoric uncertainty

Heterogeneous data quantity Epistemic uncertainty indicates how unfamiliar the model is with a molecule. To examine whether the atomic epistemic uncertainty can detect unseen chemical structures, we removed the nitrogen-containing molecules from the QM9 dataset to train a nitrogen-ignorant model, and then compared it with the base model trained with the original QM9 dataset.

Figure 8 shows the epistemic uncertainty of the test data predicted by the nitrogen-ignorant model and the base model. Because the base model has seen all types of molecules in the original dataset, most of the epistemic uncertainties it predicts for the test molecules are low. On the other hand, because the nitrogen-ignorant model has not seen nitrogen-containing species, its epistemic uncertainties increase greatly for the nitrogen-containing molecules, which indicates that the model is aware of its own ignorance of the unseen domain. We note that because the sizes of the training and validation datasets are smaller for the nitrogen-ignorant model, its overall error and uncertainty are larger than those of the base model, even for molecules containing no nitrogen atoms (group 0 N in Fig. 8). Four test molecules with their atomic epistemic uncertainties are shown in Fig. 9. The nitrogen-ignorant model assigns relatively higher atomic epistemic uncertainty to the nitrogen atoms, indicating that the atom-based model is capable of identifying the unseen chemical structures.

Fig. 8

Epistemic uncertainty distributions of QM9 testing data calculated with (A) the base model and (B) the nitrogen-ignorant model. The testing data are grouped by the number of nitrogen atoms in the molecule. The molecules without nitrogen atoms are denoted as 0 N, molecules containing one nitrogen atom are denoted as 1 N, and so on

Fig. 9

Epistemic uncertainties of molecules with nitrogen atoms predicted by (A) the base model and (B) the nitrogen-ignorant model. Numbers labeled at each atom are the predicted atomic epistemic uncertainty

The experiments discussed above show that the model estimates a higher aleatoric uncertainty for the species whose data are associated with significant noise and a larger epistemic uncertainty for the species that are under-represented. However, we note that when one uses Deep Ensembles, a high estimate of aleatoric uncertainty is not always caused by data noise. For instance, when we removed the nitrogen-containing molecules from the QM9 dataset, we observed an increase in the estimate of aleatoric uncertainty for the nitrogen-containing species in the test set (Additional file 1: Fig. S18). This is because the weights associated with the nitrogen atom were not trained, so the network outputs for nitrogen-containing species (including the estimate of aleatoric uncertainty) were significantly mispredicted. Similarly, in Deep Ensembles, a high estimate of epistemic uncertainty is not always caused by a lack of data. For example, when there was significant noise in the data, finding the optimal fit became more challenging, which might also result in a larger discrepancy in the predictions of Deep Ensembles, and hence accidentally lead to an overestimation of epistemic uncertainty (Additional file 1: Fig. S17). Therefore, the uncertainty derived from Deep Ensembles should be interpreted with care, and further method improvement may be required.

We note that molecules with low molecular uncertainty can sometimes contain atoms with large atomic uncertainty. For example, some of the atomic uncertainties shown in Fig. 9A are larger than the corresponding molecular uncertainty. In the atom-based model, the molecular property is calculated as the sum of the atomic property values (Eq. 7), so the variance of the molecular property equals the sum of the atomic variances and the covariances between all pairs of atoms (Eq. 9). Since the covariances between atoms may be negative, the molecular property variance can be lower than the variance of individual atoms. This situation mainly occurs when there are multiple ways to distribute the property contributions among the atoms, which results in low confidence in the atomic contributions but high confidence in the molecule-level prediction. More discussion on this point can be found in the Supporting Information (Additional file 1: Fig. S16).

Conclusions

In this study, we propose an atom-based uncertainty quantification method for deep learning-based molecular property prediction. The atom-based model can learn the property contributions of atoms and the associated aleatoric and epistemic uncertainties. Our experiments suggest that the atomic aleatoric uncertainty can help to identify the types of species whose data are potentially associated with significant noise, and the atomic epistemic uncertainty can help to identify the chemical structures with which the model is unfamiliar. Given the explainability and transparency of the model, one can be aware not only of the potential failure of a prediction but also of the reasons why the prediction may fail through its atomic uncertainties. Moreover, we introduce a post-hoc calibration method to fine-tune the overestimated aleatoric uncertainty of ensemble models. The improved quality of the aleatoric uncertainty is indicated by the reduction of the ECE and ENCE for a wide range of molecular property prediction tasks.