We can quantify the uncertainty of the parameter 𝜃 by its posterior distribution p(𝜃|x) given the observed dataset x = x0. The posterior distribution is obtained by Bayes’ theorem as \(p(\theta |x^0) = \frac {\pi (\theta ) p(x^0|\theta )}{m(x^0)}, \) where π(𝜃), p(x0|𝜃) and \(m(x^0) = \int \limits \pi (\theta )p(x^0|\theta )d\theta \) are, respectively, the prior distribution on the parameter 𝜃, the likelihood function, and the marginal likelihood. If the likelihood function could be evaluated, at least up to a normalizing constant, then the posterior distribution could be approximated by drawing a representative sample of parameter values from it using (Markov chain) Monte Carlo sampling schemes (Robert and Casella, 2005). Unfortunately, the likelihood function induced by the volcanic eruption model is analytically intractable. In this setting, approximate Bayesian computation (ABC) (Lintusaari et al. 2017) offers a way to sample from an approximate posterior distribution and opens up the possibility of sound statistical inference on the parameter 𝜃. In this paper we focus only on parameter estimation/calibration and uncertainty quantification, but we stress that ABC also readily allows parameter hypothesis testing and model selection.
Approximate Bayesian Computation (ABC)
The fundamental ABC rejection sampling scheme iterates the following steps:
1. Draw 𝜃 from the prior π(𝜃).

2. Simulate a synthetic dataset xsim from the simulator-based model \({\mathcal{M}}(\theta )\).

3. Accept the parameter value 𝜃 if d(xsim,x0) < γ; otherwise, reject 𝜃.
See Fig. 3 for a visualization of the above algorithm.
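As an illustration, the scheme can be written in a few lines of Python. This is a minimal sketch in which prior_rvs, simulate and distance are hypothetical placeholders for the prior sampler, the forward simulator \({\mathcal{M}}(\theta)\) and the metric d:

```python
import numpy as np

# Minimal sketch of the ABC rejection scheme above; prior_rvs, simulate
# and distance are placeholders for the problem-specific prior sampler,
# forward simulator and metric.
def abc_rejection(prior_rvs, simulate, distance, x_obs, gamma, n_accept):
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_rvs()                    # 1. draw theta from the prior
        x_sim = simulate(theta)                # 2. simulate a synthetic dataset
        if distance(x_sim, x_obs) < gamma:     # 3. accept if d(x_sim, x_obs) < gamma
            accepted.append(theta)
    return np.array(accepted)
```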
Here, the metric d(xsim,x0) on the data space measures the closeness between xsim and x0. The accepted (𝜃,xsim) pairs are thus jointly sampled from a distribution proportional to π(𝜃)pd,γ(x0|𝜃), where pd,γ(x0|𝜃) is an approximation to the likelihood function p(x0|𝜃):
$$ p_{d, \gamma}(x^0|\theta) = \int p(x^{\text{sim}}|\theta) \mathbb{K}_{\gamma}(d(x^{\text{sim}},x^0)) dx^{\text{sim}}, $$
(1)
where \(\mathbb {K}_{\gamma }(d(x^{\text {sim}},x^0)) \) is in this case a probability density function proportional to the indicator \(\mathbb {1}\left (d(x^{\text {sim}},x^{0})<\gamma \right )\).Footnote 1
Footnote 1. Besides this choice for \(\mathbb {K}_{\gamma }(d(x^{\text {sim}},x^{0}))\), which has been exploited in several ABC algorithms (for instance Beaumont, 2010; Drovandi and Pettitt, 2011; Del Moral et al. 2012; Lenormand et al. 2013), ABC algorithms relying on different choices exist, for instance a density proportional to \(\exp (-d(x^{\text {sim}},x^{0})/\gamma )\) in simulated-annealing ABC (SABC) (Albert et al. 2015). In general, \(\mathbb {K}_{\gamma }(\cdot )\) needs to be a probability density function with a large concentration of mass near 0, where the parameter γ controls the amount of concentration (the smaller γ, the more concentrated the density). This guarantees that, in principle, the above approximate likelihood converges to the true one as γ → 0. Of course, decreasing the threshold increases the computational cost, as fewer simulations will be accepted.
More advanced algorithms than the simple rejection scheme detailed above are possible, for instance ones based on Sequential Monte Carlo (Del Moral et al. 2012; Lenormand et al. 2013), in which a population of parameter-data pairs is considered at a time and evolved over several generations, while γ is decreased towards 0 at each generation to improve the approximation of the likelihood function, so that one can approximately sample from the true posterior distribution. Alternative statistical methods for calibrating models from observations exist; however, the ABC framework has the advantage of both being applicable to stochastic models and providing the user with a rigorous uncertainty quantification. For instance, methods based on Gaussian Process (GP) emulation, with subsequent use of the emulators for calibration, have been shown to work well (O’Hagan, 2006), but mostly for deterministic models. Note also that some efforts to combine the versatility of ABC with the computational savings of GP emulation have started appearing in the literature; see for instance Meeds and Welling (2014), in which a GP is used to emulate the simulator and the need for new model runs is determined according to the uncertainty of the emulator, although this relies on an ad-hoc algorithm. See also Wilkinson (2014) and Gutmann and Corander (2016) for examples of using GPs to emulate, respectively, the likelihood function and the discrepancy function. Another possibility is the use of an Ensemble Kalman Filter approach (Iglesias et al. 2013) to estimate model parameters from an observation, but this does not provide an estimate of the uncertainty.
For the inference of parameters of the volcanic eruption model, we choose here the adaptive population Monte Carlo approximate Bayesian computation (APMCABC) algorithm, proposed in Lenormand et al. (2013), based on its suitability for high performance computing systems (Dutta et al., 2017b). At the first step of this algorithm, Nsample-many parameter values are randomly drawn from the prior distribution and the value of γ is decreased adaptively depending on the pseudo-data simulated from the model using those randomly sampled parameter values. At each subsequent step, we produce Nsample-many parameter values approximately distributed according to pd,γ(𝜃|x0), for the γ value adapted at the previous step, and again decrease γ depending on the new samples. This procedure is continued for Nstep steps or until some stopping criterion is reached. We note that the adapted γ values are strictly decreasing across steps and converge to zero, thereby improving the approximation to the posterior distribution. We finally note that this algorithm is very well suited to parallelization, as at each step we always need to run the same number of forward simulations from the model; therefore, we can simply use a number of samples equal to the available workers, or choose the number of workers to allocate according to the number of samples we want to use. A simplified sketch of the algorithm is given below.
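The following is a simplified Python sketch of the APMCABC loop; the full algorithm of Lenormand et al. (2013), as implemented in ABCpy, additionally tracks importance weights and uses them in the perturbation kernel. prior_rvs (assumed to return a 1D parameter vector), simulate and distance are again hypothetical placeholders:

```python
import numpy as np

# Simplified sketch of the APMCABC loop (Lenormand et al. 2013); importance
# weighting and prior support checks are omitted for brevity.
def apmcabc(prior_rvs, simulate, distance, x_obs,
            n_sample=100, alpha=0.5, acc_cutoff=0.03, n_step=20):
    n_keep = int(alpha * n_sample)
    thetas = np.array([prior_rvs() for _ in range(n_sample)])
    dists = np.array([distance(simulate(t), x_obs) for t in thetas])
    for _ in range(n_step):
        keep = np.argsort(dists)[:n_keep]        # best alpha-fraction survives
        thetas, dists = thetas[keep], dists[keep]
        gamma = dists.max()                      # adaptively decreased threshold
        # resample survivors and perturb them to propose new parameters
        sigma = 2.0 * np.var(thetas, axis=0)
        idx = np.random.choice(n_keep, size=n_sample - n_keep)
        proposals = thetas[idx] + np.random.randn(n_sample - n_keep,
                                                  thetas.shape[1]) * np.sqrt(sigma)
        new_dists = np.array([distance(simulate(t), x_obs) for t in proposals])
        acc_rate = np.mean(new_dists < gamma)    # fraction of new particles below gamma
        thetas = np.vstack([thetas, proposals])
        dists = np.concatenate([dists, new_dists])
        if acc_rate < acc_cutoff:                # stopping criterion
            break
    keep = np.argsort(dists)[:n_keep]
    return thetas[keep]
```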
Distance Learning
Traditionally, the distance between xsim and x0 is defined by summing Euclidean distances over all possible pairs composed of one simulated and one observed datapoint in the corresponding datasets. Recently, distances for ABC have also been defined through the accuracy of a classifier discriminating between xsim and x0 (Gutmann et al. 2017) or through the Wasserstein distance (Bernton et al. 2019), under the assumption that the datapoints in each dataset are independent and identically distributed and present in large numbers in both xsim and x0. We note, however, that we have only one datapoint in the observed dataset for a volcanic eruption field study and, due to the very expensive simulation model, we can afford only a few datapoints in the simulated dataset. Hence, we concentrate here on defining distances through the Euclidean distance when there is only one datapoint in each of xsim and x0.
While performing ABC for inference, problems may arise when the data x is high-dimensional: the number of simulations needed before a simulated dataset falls close enough to the observation increases with the dimension of the data space. Therefore, a common practice in the ABC literature is to define d as the Euclidean distance between lower-dimensional summary statistics S : xsim↦S(xsim), so that d(xsim,x0) < γ is replaced by
$$ d(S(x^{\text{sim}}),S(x^{0}))<\gamma $$
in Eq. 1, boiling down to obtaining an approximation to the following likelihood function:
$$ p_{d, \gamma}(S(x^0)|\theta) = \int p(x^{\text{sim}}|\theta) \mathbb{K}_{\gamma}(d(S(x^{\text{sim}}),S(x^0))) dx^{\text{sim}}, $$
(2)
so that now (𝜃,xsim) are jointly sampled from a distribution proportional to π(𝜃)pd,γ(S(x0)|𝜃) when performing ABC inference.
Reducing the data to suitably chosen summary statistics may also yield inference that is more robust to noise in the data. Moreover, if the statistics are sufficient, the above modification provides a consistent posterior approximation (Didelot et al. 2011), meaning that we are still guaranteed to converge to the true posterior in the limit γ → 0. As sufficient summary statistics are not known for most complex models, the choice of summary statistics remains a problem (Csilléry et al. 2010), and they have previously been chosen in a problem-specific manner (Blum et al. 2013; Fearnhead and Prangle, 2012; Gutmann et al. 2018). For the volcanic eruption model, the output y cannot be easily transformed into summary statistics S(y), as there is a complex spatial dependence between the tephra deposited at the different locations. Hence, we consider here two possible ways of learning a distance directly between two datasets x1 and x2 rather than between extracted summary statistics. The first entails constructing a Mahalanobis distance,
$$ d_{M}(x_{1},x_{2}) = \sqrt{(x_{1} - x_{2})^{T}M(x_{1}-x_{2})} $$
(3)
where M is a d × d positive semi-definite matrix.
The second approach instead uses a neural network to non-linearly transform the dataset into a new space; this is usually referred to as deep metric-learning and is a well-developed field in the computer vision literature (Ge, 2018). The learned distance is the Euclidean distance between the learned embeddings:
$$ d_{NN}(x_{1},x_{2}) = ||g_{w}(x_{1}) - g_{w}(x_{2}) ||_{2}, $$
(4)
where gw(⋅) denotes the transformation applied by the network with weights w and ||⋅||2 denotes the L2 norm.
In both cases, we aim to learn a distance function between data pairs approximating, in the best possible way, the Euclidean distance between the pair of parameters that generated them. Using a very good approximation of the Euclidean distance between the pair of parameters would be highly beneficial for ABC, as in this way the algorithm would be able to accept a simulated parameter value if and only if it is actually close to the parameter value generating the observation.
This intuition can be better explained by first considering a deterministic model for which the map 𝜃↦x is bijective. In this case, it is theoretically possible to learn a distance in data space that is exactly the same as the distance in parameter space. It would be in fact enough to apply the inverse model to the data, getting the parameter values generating them, and then compute the distance between the latter (although we stress that even finding the inverse of a deterministic model may be infeasible in practice). Note that, in this setting, the true posterior distribution of the parameters given the observation is degenerate in a single point, which would be a Dirac delta function at the parameter value generating the observation itself. Therefore using the above learned distance would be optimal in the ABC inference scheme, as the accepted values of 𝜃 would actually concentrate around the parameter value generating the observation and, as γ → 0, we would get back the Dirac delta function.
For the case of a deterministic model with a non-bijective map 𝜃↦x, the previous justification no longer holds, as a given observation could have been generated by more than one parameter value. It is therefore theoretically impossible to build a distance function between a pair of data samples that has the same value as the Euclidean distance between the parameters generating them (the ‘true distance’). However, a reasonable model would generate a given observation for parameter values that are relatively close to each other, e.g., constituting a closed (and relatively small) patch in parameter space. Therefore, excluding unlikely scenarios, we argue that the distance learning approach would still provide meaningful information, as it would find some approximation of the true distance.
Finally, for the more general case of a stochastic model, the same argument still holds. In this case, the map 𝜃↦x is again non-bijective; moreover, due to random noise, two observations generated from the same parameter value are likely to be at a positive distance according to the learned measure. However, we argue that finding the distance function closest to the true distance is still useful, as it captures the part of the stochasticity in the data that depends on the parameters. In fact, we expect two samples generated from the same parameter value to be assigned a smaller (even if non-zero) distance than two samples generated by far-apart parameter values. This heuristic justification still lacks theoretical guarantees and rigor; as this goes beyond the scope of this work, we leave the investigation of this aspect for future work.
We further note that, for both distances introduced above, learning the distance function corresponds to learning a transformation of the data. This can be immediately seen for the neural network based distance, where we consider the Euclidean distance between the transformed data using the transformation x↦gw(x). In the Mahalanobis distance case, instead, it is sufficient to recall that for each positive semidefinite matrix M there exists a square matrix L such that M = LTL. Therefore, we can write (3) in the following way:
$$ \begin{array}{@{}rcl@{}} d_{M}(x_{1},x_{2}) &=& \sqrt{(x_{1}-x_{2})^{T} L^{T} L (x_{1}-x_{2})} = \sqrt{(L(x_{1}-x_{2}))^{T} L (x_{1}-x_{2})}\\ &=& \| L(x_{1}-x_{2})\|_{2}, \end{array} $$
(5)
from which it is clear that the above corresponds to learning the transformation x↦Lx and then computing the Euclidean distance between the transformed data.
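As a quick numerical check of Eq. 5, the following self-contained sketch generates a random positive definite M and verifies that the Mahalanobis distance coincides with the Euclidean distance between the transformed data:

```python
import numpy as np

# Numerical check of Eq. 5: for M = L^T L, the Mahalanobis distance equals
# the Euclidean distance between the transformed data x -> Lx.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
M = A.T @ A                          # positive (semi)definite by construction
L = np.linalg.cholesky(M).T          # upper-triangular factor with M = L^T L
x1, x2 = rng.standard_normal(5), rng.standard_normal(5)
d_M = np.sqrt((x1 - x2) @ M @ (x1 - x2))
d_L = np.linalg.norm(L @ (x1 - x2))
assert np.isclose(d_M, d_L)
```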
However, we stress that our focus is different from the usual approaches to learning summary statistics described in Section 3.3; in fact, we are directly motivated by the distance measure between pairs of samples while, to the best of our knowledge, summary statistics learning is usually unrelated to the distance measure; see for instance Prangle (2015) for a review. For this reason, distance learning techniques consider several samples at a time (pairs, triplets or possibly even more), while summary statistics learning techniques mostly consider each (parameter, data) sample separately (see for instance the linear regression technique by Fearnhead and Prangle (2012)).
Learning the distance from the data
We now discuss practical ways to learn the matrix M and the weights of the network. Following the discussion in the previous section, we work under the assumption that the geometry induced in data space by these distances should be similar to the geometry induced in the corresponding parameter space by the Euclidean distance (dE).
We proceed therefore in the following way: we simulate a set of n datasets {x1,…,xn} from n corresponding parameters {𝜃1,…,𝜃n}. In order to capture information about the geometry of the parameter space, we define two complementary sets of pairwise similarity constraints \(\mathbb {S} = \lbrace (x_{i}, x_{j})\ |\ x_{i} \text { and } x_{j} \text { are similar}\rbrace \) and dissimilarity constraints \(\mathbb {D} = \lbrace (x_{i}, x_{j})\ |\ x_{i} \text{ and } x_{j} \text{ are dissimilar}\rbrace \), where xi and xj are considered similar if dE(𝜃i,𝜃j) < 𝜖 and dissimilar otherwise, for some 𝜖 > 0. A sketch of this construction is given below.
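The constraint sets can be built directly from the parameter values; in this sketch, build_constraints is a hypothetical helper returning the index pairs of \(\mathbb{S}\) and \(\mathbb{D}\), with 𝜖 set to a percentile of the pairwise parameter distances (the 10th percentile is the choice used later in the text):

```python
import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist, squareform

# A pair (x_i, x_j) is similar iff d_E(theta_i, theta_j) < eps, with eps a
# chosen percentile of the pairwise parameter distances.
def build_constraints(thetas, quantile=10):
    d_theta = squareform(pdist(thetas))                  # d_E(theta_i, theta_j)
    eps = np.percentile(d_theta[np.triu_indices_from(d_theta, k=1)], quantile)
    S, D = [], []
    for i, j in combinations(range(len(thetas)), 2):
        (S if d_theta[i, j] < eps else D).append((i, j))  # index pairs
    return S, D
```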
Learning Mahalanobis distance:
We now describe how to learn a Mahalanobis distance, as in Eq. 3, under the above similarity and dissimilarity constraints. This setup falls within a well-developed field of research in metric-learning (Suárez et al. 2018). Here, we consider an l1-penalized log-determinant regularization on M (Ravikumar et al. 2011), which reduces the above distance learning problem to an l1-penalized log-det optimization problem for M:
$$ \underset{M}{\min}~\text{tr}(M_{0}^{-1}M) - \log \det M +\lambda \sum\limits_{i \neq j}|M_{ij}| + \eta \sum\limits_{i,j=1}^{n} \left( {x_{i}^{T}}Mx_{i} - {x_{i}^{T}}Mx_{j}\right) K_{ij} $$
(6)
such that \(M \succeq 0\) (i.e., M is a positive semidefinite matrix) and
$$ \begin{array}{@{}rcl@{}} K_{i,j} = \left\{\begin{array}{lll} +1, & \text{if}\ (x_{i}, x_{j}) \in \mathbb{S} \\ -1, & \text{if}\ (x_{i}, x_{j}) \in \mathbb{D}. \end{array}\right. \end{array} $$
(7)
In Eq. 6, the first term pushes the matrix M to be similar to M0, which can be thought of as a prior on the final M. The second term is a spectral regularization on the matrix, while the third enforces sparsity in the off-diagonal elements of M, with λ controlling the amount of sparsity. Finally, the fourth term encodes the information coming from the similarity and dissimilarity sets; the trade-off between this term and the previous ones is tuned by η. This algorithm is called Sparse Distance Metric Learning (SDML) (Qi et al. 2009).
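As a sketch, the fit can be performed with the metric-learn package used later in the text; the exact constructor arguments and their correspondence to η, λ and M0 depend on the package version, so the parameter mapping here is our assumption:

```python
import numpy as np
from metric_learn import SDML

# Hypothetical sketch: X is the (n, d) array of simulated datasets and
# S, D the similarity/dissimilarity index pairs built above. We assume
# balance_param and sparsity_param play the roles of eta and lambda, and
# prior='covariance' that of the sample-covariance-based M0.
pairs = np.array([[X[i], X[j]] for i, j in S + D])
y = np.array([1] * len(S) + [-1] * len(D))   # +1 similar, -1 dissimilar
sdml = SDML(prior='covariance', balance_param=0.15, sparsity_param=0.01)
sdml.fit(pairs, y)
M = sdml.get_mahalanobis_matrix()            # learned PSD matrix M of Eq. 3
```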
Deep metric-learning:
For the second approach, we learn the weights of the neural network by considering the contrastive (Hadsell et al. 2006) and triplet (Schroff et al. 2015) losses, defined on the same similarity/dissimilarity constraints as above. The learned distances will be called, correspondingly, the contrastive loss distance and the triplet loss distance. The contrastive loss considers all possible pairs of samples and penalizes a large embedding distance for similar samples while, for dissimilar ones, it penalizes them for being too close, pushing them to be further apart than a fixed margin α. Specifically, we can write it in the following form:
$$ \begin{array}{@{}rcl@{}} L &=& \frac{2}{n(n-1)} \sum\limits_{i=1}^{n} \sum\limits_{j=i+1}^{n}\\ &&\left\{ y_{ij} \cdot ||g_{w}(x_{i}) - g_{w}(x_{j}) ||_{2}^{2} + (1- y_{ij}) \cdot [\alpha - ||g_{w}(x_{i}) - g_{w}(x_{j}) ||_{2}]_{+}^{2}\right\},\\ \end{array} $$
(8)
where \( [\cdot ]_{+} = \max \limits (0,\cdot ) \) and where \( y_{ij} = 1 \iff (x_{i}, x_{j}) \in \mathbb {S} \), \( y_{ij} = 0 \iff (x_{i}, x_{j}) \in \mathbb {D} \).
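To make the loss concrete, here is a sketch in PyTorch (the framework is our assumption; the paper does not specify one), using the embedding architecture described further below (72 inputs, hidden layers 100/80/40, 15-dimensional output, ReLU):

```python
import torch
import torch.nn as nn

# Embedding network g_w, matching the architecture described in the text.
g_w = nn.Sequential(
    nn.Linear(72, 100), nn.ReLU(),
    nn.Linear(100, 80), nn.ReLU(),
    nn.Linear(80, 40), nn.ReLU(),
    nn.Linear(40, 15),
)

def contrastive_loss(x_i, x_j, y_ij, alpha=1.0):
    """x_i, x_j: batches of paired samples; y_ij: 1 if similar, 0 if dissimilar."""
    d = torch.norm(g_w(x_i) - g_w(x_j), dim=1)              # embedding distances
    similar_term = y_ij * d.pow(2)                          # pull similar pairs together
    dissimilar_term = (1 - y_ij) * torch.clamp(alpha - d, min=0).pow(2)
    return (similar_term + dissimilar_term).mean()          # Eq. 8 over the batch
```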
The triplet loss instead works on three samples at a time: an anchor, a positive, which is deemed similar to the anchor, and a negative, which is on the contrary dissimilar. Essentially, the loss pushes the network to find an embedding such that the distance between the anchor and the negative is larger than the distance between the anchor and the positive plus a margin defined a priori. Denoting by \((x_{a}^{(i)}, x_{p}^{(i)}, x_{n}^{(i)})\) the anchor, positive and negative of the i-th triplet, and by N the number of all possible triplets built in this way, we can write the loss as:
$$ L = \frac{1}{N} \sum\limits_{i=1}^{N} \left[|| g_{w}(x_{a}^{(i)}) - g_{w}(x_{p}^{(i)})||_{2}^{2} - || g_{w}(x_{a}^{(i)}) - g_{w}(x_{n}^{(i)})||_{2}^{2} + \alpha\right]_{+}, $$
(9)
where \(\alpha \in \mathbb {R}\) denotes again the margin. We optimize this loss with stochastic gradient descent over the parameters of the network, by drawing random triplets.
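A companion sketch of the triplet loss of Eq. 9, reusing the embedding network g_w from the previous snippet:

```python
import torch

# Triplet loss of Eq. 9 over a batch of (anchor, positive, negative) triplets.
def triplet_loss(x_a, x_p, x_n, alpha=1.0):
    d_ap = (g_w(x_a) - g_w(x_p)).pow(2).sum(dim=1)    # ||g(a) - g(p)||^2
    d_an = (g_w(x_a) - g_w(x_n)).pow(2).sum(dim=1)    # ||g(a) - g(n)||^2
    return torch.clamp(d_ap - d_an + alpha, min=0).mean()
```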
While defining the similarity and dissimilarity constraints, 𝜖 was chosen to be the 10th percentile of the pairwise distances between the n parameters {𝜃1,…,𝜃n}, for both SDML and deep metric-learning. To optimize SDML, we use the iterative optimization scheme from Qi et al. (2009) implemented in the metric-learn Python package (de Vazelhes et al. 2019), with M0 chosen to be the sample covariance matrix, η = 0.15, λ = 0.01 and n = 400. For deep metric-learning, we used a 4-layer fully connected network with 72 input neurons and 15 outputs, and hidden layers of size 100, 80 and 40. We used ReLU non-linearities between the layers, α = 1 and stochastic gradient descent for both losses, drawing random pairs or triplets. Note that the size of the embedding (15) was hand-tuned based on pilot runs and sensitivity analysis; a more rigorous data-driven choice of the embedding size is left for future work.
We stress that the network we use is very small compared to those usually considered in computer vision applications, where these techniques were first developed. Another conceptual difference also exists: in computer vision, deep metric-learning techniques are used in a supervised setting, in which every image is assigned a label and similar pairs consist of images of the same class. Our case, instead, is what may be called a weakly-supervised setting, in which the only information we have is the similarity set. Note that, in the former case, \( (x_{1}, x_{2}), (x_{2}, x_{3}) \in \mathbb {S} \implies (x_{1}, x_{3}) \in \mathbb {S}\), while this is not true in the weakly-supervised case.
Please refer to Table 1 for the number of epochs and batch size used for deep metric-learning. At each epoch, we iterate over all samples and draw another random element in the contrastive case, or a random positive and a random negative in the triplet case. For the contrastive loss, as the similar pairs are fewer than the dissimilar ones, a random pair would more likely be dissimilar than similar. In order to improve the training, we therefore sample a similar sample to the considered one with probability p = 0.4, and a dissimilar one with the remaining probability; in this way, the fraction of positive pairs on which the network is trained is larger than it would be by naively sampling another random element.
Table 1 Settings for neural network training
In Fig. 4, we compare the Euclidean distance between the parameters generating the datasets (‘true distance’) with the learned distance functions on the corresponding datasets, namely the Mahalanobis one obtained with the SDML algorithm and the contrastive and triplet loss distances; we also report the Euclidean distance between outputs of the model. The comparison is done in the following way: 400 parameter-simulation pairs were generated, with parameters drawn uniformly on the interval (30,100) [m] for R0 and (100,300) [m/s] for U0. This dataset is split into a training set (with 300 samples) and a test set (with 100 samples). We learn the distances on the training set, and then compute all the distances between a chosen element x0 of the test set and the 99 remaining samples in the same test set (‘reference samples’); we then plot the learned distances in parameter space, by using the corresponding parameter value for each observation. x0 was simulated using 𝜃0 = (173.87 m/s, 84.55 m). We see that the minimum values of the distances are much more concentrated around the true parameter value for the contrastive and triplet loss distances than for the SDML and Euclidean ones. Note also that the neural network based distances are able to partially reproduce the behavior of the distance between the true parameter values. However, it is not clear from this visualization which of the contrastive and triplet losses performs better. We therefore perform a more rigorous comparative study in the next Section, in order to determine the better of the two deep metric-learning techniques and to evaluate different choices of 𝜖.
As a side remark, we note that the number of possible pairs and triplets grows respectively quadratically and cubically with the number of training samples. This implies that, in the triplet case, the number of triplets seen by the network during training is smaller than the total number of possible triplets, for our chosen number of epochs. However, the network is still capable of reproducing the behavior of the true distance, and further training did not seem to produce any improvement. Of course, these techniques become extremely inefficient when the number of samples is large; techniques have been developed so that the network focuses on hard pairs and triplets only. However, as our training set is quite small (300 samples), we do not discuss these in detail here, and refer to Liu et al. (2019) and Hermans et al. (2017).
Comparison Using Kullback-Leibler Divergence
In order to obtain a quantitative comparison between the different distance learning techniques discussed above and different choices of 𝜖, we estimate the Kullback–Leibler (KL) divergence between a distribution induced on parameter space by the learned distance function (with respect to some reference point) and the distribution induced in the same way by the true distance. Specifically, we first learn a distance with one of the methods described above on the training set, and then compute the distances of all reference samples in the test set with respect to another simulated dataset x0 corresponding to parameter 𝜃0 (the ‘observation’), as done in the previous Section. Then, after scaling the distances to [0,1] in the considered region, we consider a Gibbs density defined on the reference values of the parameters 𝜃i as \( p(\theta _{i}) \propto e^{-\beta \cdot d(x_{i}, x_{0})^{2}} \), where d(xi,x0) is the learned distance function; we assume that this distribution exists on the whole parameter space (neglecting the fact that the map from 𝜃 to x is stochastic), but that we can evaluate it only on the reference points. We also consider the distribution defined by the true distance between the reference parameter values and the observed one: \( p^{*}(\theta _{i}) \propto e^{-\beta \cdot d_{E}(\theta _{i}, \theta _{0})^{2}/c} \), where c is a constant rescaling the distance to [0,1] in the considered region. Now, as we can evaluate the learned distance only on the reference points, we estimate the KL divergence using an importance sampling approach, described below; we apply this technique to the same set of n = 400 simulations with the same train-test split discussed above, whose parameters were drawn independently and uniformly on the interval (30,100) [m] for R0 and (100,300) [m/s] for U0.
Recall now the definition of the KL divergence:
$$D_{KL}(P||P^{*}) = \int p(\theta) \log \left( \frac{p(\theta)}{p^{*}(\theta)} \right) d\theta= \int q(\theta) \frac{p(\theta)}{q(\theta)} \log \left( \frac{p(\theta)}{p^{*}(\theta)} \right) d\theta, $$
where we denoted as q the density according to which the parameters are drawn (uniform in our case), and where P (respectively P∗) denotes the distribution with density p (p∗). As we do not know the normalization constants of the above densities, we need to estimate them from the data. We define therefore the unnormalized densities \(p(\theta ) = \tilde p(\theta ) /Z\) and \(p^{*}(\theta ) = \tilde p^{*}(\theta ) /Z^{*}\). Then, we can estimate the divergence by:
$$ \begin{array}{@{}rcl@{}} &&\hat D_{KL}(P||P^{*}) = \sum\limits_{i=1}^{n} w_{i} \cdot \log \left( \frac{\tilde p(\theta_{i})/ \hat Z}{\tilde p^{*}(\theta_{i})/ \hat Z^{*}} \right),\\ &&w_{i} = \frac{p(\theta_{i})/q(\theta_{i})}{{\sum}_{j=1}^{n} p(\theta_{j})/q(\theta_{j})} = \frac{\tilde p(\theta_{i})/q(\theta_{i})}{{\sum}_{j=1}^{n} \tilde p(\theta_{j})/q(\theta_{j})}, \quad \theta_{i} \sim q, \end{array} $$
where \(\hat Z = \frac {1}{n} {\sum }_{i=1}^{n} \frac {\tilde p(\theta _{i})}{q(\theta _{i})}\) is a consistent estimator of Z, and similarly for \(\hat Z^{*} \).
Overall, we are then left with the following consistent estimator:
$$\hat D_{KL}(P||P^{*}) = \frac{1}{n}\sum\limits_{i=1}^{n} \frac{\tilde p(\theta_{i})/q(\theta_{i})}{\hat Z} \cdot \left( \log\frac{\hat Z^{*}}{\hat Z} + \beta\left( \frac{d_{E}(\theta_{i},\theta_{0})^{2} }{c}- d(x_{i},x_{0})^{2}\right) \right) , $$
where we have used the explicit dependence of p and p∗ on the distance function.
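The estimator can be implemented in a few lines. In this sketch, d_learned and d_true are arrays of the learned and true distances of the reference points from the observation (both assumed already rescaled to [0,1], with the factor c absorbed into d_true), and q holds the proposal density evaluated at the 𝜃i:

```python
import numpy as np

# Importance-sampling estimate of the KL divergence between the Gibbs
# densities induced by the learned and true distances.
def kl_estimate(d_learned, d_true, q, beta=1.0):
    p_tilde = np.exp(-beta * d_learned**2)       # unnormalized p(theta_i)
    p_star_tilde = np.exp(-beta * d_true**2)     # unnormalized p*(theta_i)
    Z_hat = np.mean(p_tilde / q)                 # consistent estimate of Z
    Z_star_hat = np.mean(p_star_tilde / q)       # consistent estimate of Z*
    w = (p_tilde / q) / np.sum(p_tilde / q)      # self-normalized weights
    return np.sum(w * (np.log(p_tilde / Z_hat) - np.log(p_star_tilde / Z_star_hat)))
```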
In order to obtain better statistics on the performance of each distance learning technique, we perform leave-one-out cross validation on the test set: after having learned the distance on the training set, we consider each of the samples (xj,𝜃j) in the test set in turn as the observation point in the computation described above, while all the other elements of the test set are taken as the reference set {𝜃1,…,𝜃n} used to estimate the KL divergence. In this way, we obtain statistics of the estimated KL divergence over 100 realizations.
We repeat this evaluation over a range of quantiles defining the similarity set over which the distances are learned. For each quantile value and technique we draw a boxplot, representing the spread of the estimates; the results can be found in Fig. 5. Recall that the KL divergence between two distributions is 0 if and only if they are the same, and can never be negative. Note that the SDML algorithm is quite unstable and was not able to converge for some of the 𝜖 values. Also, the deep metric-learning algorithms are not applicable for quantiles larger than 0.74, as in this case at least one training sample is considered similar to all the other training samples, and the training routines are not designed to operate in this case.
From the results, we can see that the triplet loss performs consistently better than the contrastive one, as the estimated KL divergence spans smaller values. Also, SDML is always worse than both of them, and its performance does not show a clear trend with respect to the quantile. Regarding the deep metric-learning techniques, they capture more information with quite large quantile values, i.e., when a large fraction of all possible sample pairs is considered similar; this result is quite surprising. Finally, we note that the numerical value of the estimated KL divergence depends strongly on the choice of β, but the ranking of the different techniques remains the same; the results in Fig. 5 were obtained with β = 1.
We choose as the best distance learning technique the one achieving the smallest median of the KL estimates obtained over the 100 splits. The best distance is thus found to be the triplet loss one trained using the 60th percentile as the threshold defining the similarity set.
Semiautomatic Summary Statistics Selection
We now compare the results of the distance learning approaches with the semiautomatic summary statistics learning schemes (Fearnhead and Prangle, 2012; Jiang et al. 2017). In this approach, the parameter values are regressed on some function of the corresponding simulation outputs. Namely, one assumes the following model:
$$ \theta = \mathbb{E}(\theta| x) + \epsilon = f_{\beta}(x) + \epsilon, $$
(10)
where 𝜖 is a zero-mean noise and fβ(x) is a function of the data parametrized by β. The authors of Jiang et al. (2017) parametrize fβ(⋅) by a neural network. This regression approach was first introduced in Fearnhead and Prangle (2012) with a linearity assumption on fβ, reducing it to a simple linear regression. We focus here on the neural network formulation, as this was shown to outperform the linear regression by Jiang et al. (2017).
In practice, before performing ABC inference, the procedure amounts to the following steps:
- We simulate Nss data-parameter pairs \((\theta _{i}, x_{i})_{i=1}^{N_{ss}}, \ \theta _{i} \sim \pi (\theta ), \ x_{i} \sim {\mathcal{M}}(\theta _{i})\).
- We then fit the statistical model given by Eq. 10.
- Finally, we fix S(⋅) = fβ(⋅) in the chosen ABC inference algorithm (in Eq. 2).
In Theorem 3 of Fearnhead and Prangle (2012), the authors provide a rationale for the above procedure; namely, they show that, by using \( S(x^0) = \mathbb {E}(\theta | x^0)\) as summary statistics, the posterior mean of the ABC approximate posterior is the best possible estimator of the true parameter value with respect to the quadratic error loss, in the limit of γ → 0 in the ABC inference scheme (2). Of course, the posterior mean with respect to the true posterior \(\mathbb {E}(\theta | x^{0}) \) is not available, and hence the regression approach was proposed.
However, we highlight that the latter can only learn an approximation of the “ideal” summary statistics, so that the theoretical justification is not conclusive. Therefore, we believe that directly focusing on learning the distance, as described in the previous Sections, may actually perform better than the regression approach in Eq. 10 at learning the best summary statistics. Intuitively, we think that the distance learning approach is capable of modeling the correlations between different parameter values, as it relies on considering pairs or triplets of samples at a time. Again, we leave theoretical guarantees of the above intuitions for future work, and simply rely on empirical studies to make our point here.
We fit this model on the same set of parameter-simulation pairs used for the distance learning approach, with the same train-test split. As said above, we use a neural network to parametrize the function fβ(⋅), trained by stochastic gradient descent using the loss corresponding to the regression in Eq. 10:
$$ \frac{1}{N} \sum\limits_{i=1}^{N} ||f_{\beta}(x_{i}) - \theta_{i}||_{2}^{2}. $$
(11)
The neural network is composed of 4 fully connected layers, with 72 input neurons and 2 outputs, and with hidden layers of size 80, 40 and 15, with ReLU non-linearity. We remark that, in this case, the output dimension of the network (i.e. the dimension of the summary statistics) must match the number of parameters. Further details on the training settings may be found in Table 1.
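A sketch of this fitting step, again in PyTorch under our earlier assumption, with the architecture just described (72 → 80 → 40 → 15 → 2); X_train, theta_train and n_epochs are hypothetical names for the training tensors and the setting from Table 1:

```python
import torch
import torch.nn as nn

# Summary statistics network f_beta; the output dimension must match the
# number of parameters (2 here).
f_beta = nn.Sequential(
    nn.Linear(72, 80), nn.ReLU(),
    nn.Linear(80, 40), nn.ReLU(),
    nn.Linear(40, 15), nn.ReLU(),
    nn.Linear(15, 2),
)
optimizer = torch.optim.SGD(f_beta.parameters(), lr=1e-3)  # lr is an assumed value
for epoch in range(n_epochs):
    optimizer.zero_grad()
    loss = (f_beta(X_train) - theta_train).pow(2).sum(dim=1).mean()  # Eq. 11
    loss.backward()
    optimizer.step()
```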
We compared the performance of this technique with the best distance learning approach we were able to find, namely the triplet loss one trained over the similarity set defined using a threshold corresponding to the 60th percentile. In Fig. 6, we show both the distance contour plot for the same observation point used in Fig. 4 and the histogram of the estimated KL divergence over the 100 leave-one-out splits of the test set. For comparison, we also show the histogram of the estimated KL divergence for the Euclidean distance between model outputs. Although the contour plots look very similar, the triplet loss distance is found to slightly outperform the semiautomatic summary selection technique with neural network according to the KL divergence measure.
Having demonstrated the stronger performance of the distance learning approach with the triplet loss, we will focus only on that in performing the subsequent ABC inference.
Computational Considerations
We stress that ABC with distance or summary statistics learning comes at a larger computational cost than directly using the Euclidean distance between model outputs in ABC, in which case no training step would be needed. When applying the learning approach, instead, one needs to generate the training data, which is quite expensive for the model considered here, and then perform the training. Note that, once the training data is generated, the SDML technique requires a much shorter fitting time than the training of the neural network in the other cases. However, in the overall balance, when compared with the data generation time and the ABC inference time, the training step has a much smaller cost no matter the chosen method, and has to be performed only once. Therefore, it does not make sense to prefer one distance learning method over another just because of a shorter training time.
During inference, the computation of the new distance only requires multiplying the output of the mechanistic model by some matrices, for all distance learning techniques (as transforming data with a neural network simply consists of matrix multiplications and the application of element-wise non-linearity functions). Therefore, the impact of distance learning on the computational complexity of the inference is very small, comparable to the use of hand-chosen summary statistics.
In general, the larger computational cost is balanced by a more efficient ABC inference scheme and a better approximation of the true posterior for the same computational budget given to the inference itself. We also remark that our approach can be thought of as a pre-processing technique, as it can be used with any ABC algorithm. Moreover, once the training has been performed, the same learned distance can be exploited for inference on several observations of the same physical process.
Nested parallelization
For inference, we use the Python package ABCpy (Dutta et al. 2017a), which implements some of the most advanced ABC algorithms. The algorithms are implemented such that the computation can be highly parallelized. This is particularly useful for computationally complex models, since the time to solution for all ABC algorithms is dominated by the model’s forward simulation time. To be more precise, the inference algorithms usually need to start forward simulations with many different model parameters to obtain an accurate posterior distribution.
ABCpy’s backends for parallelization are based on the map-reduce principle. In the map phase, a set of parameters is distributed to a cluster of machines (nodes) and each node runs forward simulations on the parameters assigned to it. In the reduce phase, the results are collected from the cluster to a single master node for the next iteration of forward simulations or further processing. Modern cluster nodes usually have multiple cores, and by default ABCpy runs one forward simulation per core. However, if the model supports multi-threading (basic operating system threads), the backend can be configured accordingly.
ABCpy provides two different implementations of the map-reduce backend, one based on Apache Spark and the other based on the Message Passing Interface (MPI) (MPIForum, 2017). These technologies were chosen to cover a broad user base, since Apache Spark is often used in industry while MPI has its user base mainly in academia. Nevertheless, MPI also finds application beyond scientific communities when high-throughput and low-latency communication is required. Sufficiently complex models, for example in the domains of meteorology, finite elements, and fluid dynamics, are often parallelized using MPI.
However, previous versions of ABCpy did not support models that are themselves parallelized using MPI, which is the case for the studied volcano model. The challenge in enabling MPI model support is that MPI code uses an object called an MPI communicator to control communication. In the Apache Spark backend, this communicator is simply not available in standard installations due to the standard system setup. In the MPI backend, the communicator is available but is used by the backend itself, which has to coordinate the parameter distribution and the forward simulations. Thus, we contributed code to ABCpy that enables support for MPI-parallelized models, broadening the field of applications beyond the volcanic model discussed here.
The communication architecture of the nested MPI parallelization is depicted in Fig. 7. Technically, ABCpy creates two types of communicators: the team communicators and the scheduler communicator. Team communicators are used by the forward simulation models as their main communicator, and one process of each team communicator is also part of the scheduler communicator. This allows one process, the scheduler, to provide work to the forward models as long as there are model parameters to explore.
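The following mpi4py sketch illustrates one way such a communicator layout can be created with MPI_Comm_split; it is a schematic illustration of Fig. 7, not ABCpy’s actual implementation, and team_size is an assumed configuration value:

```python
from mpi4py import MPI

world = MPI.COMM_WORLD
team_size = 4  # assumed: number of ranks per forward simulation

# Split the world into teams; each team communicator is handed to one
# MPI-parallelized forward simulation as its main communicator.
team = world.Split(color=world.rank // team_size, key=world.rank)

# Rank 0 of each team additionally joins the scheduler communicator,
# through which the scheduler distributes parameters to the teams.
is_leader = (team.rank == 0)
scheduler = world.Split(color=0 if is_leader else MPI.UNDEFINED, key=world.rank)
```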
Posterior Inference
To draw Z samples approximating the posterior distribution p(𝜃|x0), we keep all the tuning parameters of APMCABC fixed at the default values suggested in the ABCpy package, except the acceptance rate cutoff, which was chosen to be 0.03. Different numbers of steps and samples are used for the inference on the simulated and real data; see Section 4 for details. We consider independent uniform prior distributions with a pre-specified range for each parameter, \(U_{0} \sim U(100,300)\ [m/s]\), \(R_{0} \sim U(30,100)\ [m]\). To explore the parameter space of 𝜃 = (U0,R0), we use a two-dimensional truncated multivariate Gaussian distribution as the perturbation kernel. The APMCABC inference scheme centers the perturbation kernel at the sample it is perturbing and updates the variance-covariance matrix of the kernel based on the samples from the previous step.
For this work, we used the facilities of the Swiss National Supercomputing Centre (CSCS), namely the Piz Daint supercomputer, where each compute node consisted of 36 Intel Broadwell Xeon E5-2695 v4 @ 2.10GHz cores. Each forward MPI simulation was run on one node, resulting in the use of Z cores. With this setup, it is possible to run the APMCABC algorithm with the previously described settings in 2 hours.
Parameter Estimation
Given an observed dataset x0, our main interest is to estimate the corresponding 𝜃. In decision theory, the Bayes estimator minimizes the posterior expected loss \(\mathbb {E}_{p(\theta |x^{0})}({\mathcal{L}}(\theta ,\bullet )|x^{0})\) for a chosen loss function \({\mathcal{L}}\). If we have Z samples \((\theta _{i})_{i=1}^{Z}\) from the posterior distribution p(𝜃|x0), the Bayes estimator can be approximated as:
$$ \begin{array}{@{}rcl@{}} \hat{\theta}_{B}= \underset{\theta}{\arg\min} \frac{1}{Z}\sum\limits_{i=1}^{Z} \mathcal{L}(\theta_{i},\theta). \end{array} $$
(12)
As we consider the quadratic loss function \({\mathcal{L}}(\theta ,\theta ^{\prime }) = \|\theta -\theta ^{\prime }\|^{2}\), the Bayes estimator can be shown to be the posterior mean \( \mathbb {E}_{p(\theta |x^{0})}(\theta |x^{0}) \), which is approximated by \(\hat {\theta }= \frac {1}{Z}{\sum }_{i=1}^{Z} \theta _{i}\). A small numerical sketch is given below.
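A minimal sketch of Eq. 12, with posterior_samples a hypothetical (Z, 2) array of APMCABC draws; for the quadratic loss, the numerical minimizer coincides with the sample mean:

```python
import numpy as np
from scipy.optimize import minimize

# Approximate the Bayes estimator by minimizing the Monte Carlo average of
# the loss over the posterior samples (Eq. 12).
def bayes_estimator(samples, loss):
    objective = lambda theta: np.mean([loss(t, theta) for t in samples])
    return minimize(objective, x0=samples.mean(axis=0)).x

quadratic = lambda t1, t2: np.sum((t1 - t2) ** 2)
# For the quadratic loss, bayes_estimator(posterior_samples, quadratic)
# agrees (up to numerical tolerance) with posterior_samples.mean(axis=0).
```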