1 Introduction

Hydrologic analyses typically rely on a single conceptual-mathematical model. Yet hydrologic environments are open and complex, rendering them prone to multiple interpretations and mathematical descriptions. Adopting only one of these may lead to statistical bias and underestimation of uncertainty. Thus, hydrologists have developed several approaches to weigh and average predictions generated by alternative models (Neuman 2003; Neuman and Wierenga 2003; Ye et al. 2004; Poeter and Anderson 2005; Beven 2006; Refsgaard et al. 2006).

Bayesian model averaging (BMA) (Draper 1995; Kass and Raftery 1995; Hoeting et al. 1999) provides an optimal way to combine the predictions of several competing models and to assess their joint predictive uncertainty. Hoeting et al. (1999) describe BMA by noting that if Δ is a quantity one wants to predict given a discrete set of data D, then its posterior distribution is

$$ p(\Delta \mid \mathbf{D}) = \sum_{i=1}^{K} p(\Delta \mid M_i, \mathbf{D})\, p(M_i \mid \mathbf{D}) $$
(1)

where K is the number of models considered and \( p(\Delta \mid \mathbf{D}) \) is the average of the posterior distributions \( p(\Delta \mid M_i, \mathbf{D}) \) under each model, weighted by the posterior model probabilities \( p(M_i \mid \mathbf{D}) \). The posterior probability of model \( M_i \) is given by Bayes’ rule

$$ p(M_i \mid \mathbf{D}) = \frac{p(\mathbf{D} \mid M_i)\, p(M_i)}{\sum_{j=1}^{K} p(\mathbf{D} \mid M_j)\, p(M_j)} $$
(2)

where \( p(\mathbf{D} \mid M_i) \) is the integrated likelihood of model \( M_i \). All probabilities are implicitly conditional on the set of models being considered. The posterior mean and variance of \( \Delta \) are (Draper 1995)

$$ E[\Delta \mid \mathbf{D}] = \sum_{i=1}^{K} E[\Delta \mid M_i, \mathbf{D}]\, p(M_i \mid \mathbf{D}) $$
(3)
$$ \mathrm{Var}[\Delta \mid \mathbf{D}] = \sum_{i=1}^{K} \mathrm{Var}[\Delta \mid \mathbf{D}, M_i]\, p(M_i \mid \mathbf{D}) + \sum_{i=1}^{K} \left( E[\Delta \mid \mathbf{D}, M_i] - E[\Delta \mid \mathbf{D}] \right)^2 p(M_i \mid \mathbf{D}) $$
(4)
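A minimal sketch of how Eqs. 3 and 4 combine per-model predictions, assuming per-model means, variances and weights are already available (the numbers below are purely illustrative):

```python
import numpy as np

def bma_mean_variance(means, variances, weights):
    """Posterior mean and variance of a prediction under BMA, Eqs. 3-4.

    means, variances : per-model posterior means and variances of Delta
    weights          : posterior model probabilities p(M_i|D), summing to 1
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = np.sum(weights * means)                   # Eq. 3
    within = np.sum(weights * variances)             # within-model variance
    between = np.sum(weights * (means - mean) ** 2)  # between-model variance
    return mean, within + between                    # Eq. 4

# Example with three hypothetical models:
m, v = bma_mean_variance(means=[1.0, 1.2, 0.8],
                         variances=[0.04, 0.09, 0.05],
                         weights=[0.6, 0.3, 0.1])
print(m, v)
```

The second term of Eq. 4 is the between-model variance; it is what allows BMA to report predictive uncertainty larger than that of any single model.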

Neuman (2003) proposed a maximum likelihood (ML) version of BMA (MLBMA) that renders it compatible with ML methods of model calibration (Carrera and Neuman 1986; Hernandez et al. 2003, 2006) even when prior information about the parameters is not available (such information being a prerequisite for BMA). In MLBMA, \( E[\Delta \mid M_i, \mathbf{D}] \) and \( \mathrm{Var}[\Delta \mid \mathbf{D}, M_i] \), the posterior mean and variance of \( \Delta \) under the i-th alternative model, are approximated by \( E[\Delta \mid M_i, \hat{\mathbf{b}}_i, \mathbf{D}] \) and \( \mathrm{Var}[\Delta \mid \mathbf{D}, M_i, \hat{\mathbf{b}}_i] \), where \( \hat{\mathbf{b}}_i \) is a maximum likelihood estimate of the parameters \( \mathbf{b}_i \) of model \( M_i \), obtained by maximizing the likelihood \( p(\mathbf{D} \mid \mathbf{b}_i, M_i) \). In turn, (2) is approximated by (Ye et al. 2004)

$$ p(M_i \mid \mathbf{D}) = \frac{\exp(-0.5\, \Delta IC_i)\, p(M_i)}{\sum_{j=1}^{K} \exp(-0.5\, \Delta IC_j)\, p(M_j)} $$
(5)

where \( \Delta IC_i = IC_i - IC_{\min} \), \( IC_i = KIC_i \) is the Kashyap (1982) information criterion for the i-th model and \( IC_{\min} \) is the minimum value among the models. Alternatively, posterior model weights are sometimes assigned by setting \( IC_i \) equal to information-theoretic criteria (Poeter and Anderson 2005; Ye et al. 2008) such as the Akaike information criterion AIC (Akaike 1974), the modified Akaike information criterion AICc (Hurvich and Tsai 1989) or the Bayesian information criterion BIC (Schwarz 1978); full expressions of these criteria are given in Appendix A. Ye et al. (2008) explain that KIC is the only one of these criteria that validly discriminates between models based not only on the quality of model fit to observed data and the number of model parameters but also on how close the posterior parameter estimates are to their prior values and on the information contained in the observations.

Experience indicates (and our results below confirm) that Eq. 5 tends to assign a posterior probability or model weight of nearly 1 to one model (the best in terms of minimum calibration error) and nearly zero to all others. Tsai and Li (2008) argue that Occam’s window (defined by Raftery (1995) in terms of BIC as \( \Delta BIC_i \le 6 \)), on which (5) is based, may be too narrow to accommodate models that are not the best but are still potentially acceptable. For \( \Delta IC_i > 6 \), (5) yields \( p(M_i \mid \mathbf{D}) < 0.05 \), decreasing exponentially with \( \Delta IC_i \); yet \( \Delta IC_i > 6 \) is common in field problems. As a remedy, they propose a broader variance window obtained by scaling \( \Delta IC_i \) in Eq. 5 by a factor \( \alpha = c/\sqrt{n} \), where n is the number of observations and c is a coefficient that depends on the window size and on a significance level selected subjectively by the analyst.
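The following sketch illustrates Eq. 5 together with the variance-window scaling; `alpha = 1` reproduces Occam’s window, while a value such as 0.078 (used in Sect. 3.1 below) widens it. The IC values are illustrative, not taken from our tables:

```python
import numpy as np

def posterior_model_weights(ic, prior=None, alpha=1.0):
    """Posterior model probabilities from information-criterion values.

    ic    : IC_i values (e.g. KIC, BIC) for the K models
    prior : prior model probabilities p(M_i); uniform if None
    alpha : scaling of Delta IC_i; alpha < 1 widens the window
    """
    ic = np.asarray(ic, dtype=float)
    if prior is None:
        prior = np.full(ic.size, 1.0 / ic.size)
    delta = ic - ic.min()                      # Delta IC_i
    unnorm = np.exp(-0.5 * alpha * delta) * prior
    return unnorm / unnorm.sum()

ic = np.array([100.0, 106.0, 120.0])           # hypothetical KIC values
print(posterior_model_weights(ic))             # Occam's window
print(posterior_model_weights(ic, alpha=0.078))  # variance window
```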

We test the ability of MLBMA, based on both Occam’s and variance windows, to predict air pressure during pneumatic injection tests conducted in complex, highly heterogeneous, unsaturated fractured tuff near Superior, Arizona (Illman et al. 1998; Illman and Neuman 2001). Applying MLBMA to such a complex problem and comparing the performance of Occam’s and variance windows in this context are two key contributions of this paper. First, we use log permeabilities and porosities obtained from single-hole pneumatic packer tests to postulate, calibrate and compare five alternative variogram models of these parameters based on AIC, AICc, BIC and KIC. The variogram models are exponential, exponential with linear drift, power, truncated power based on exponential modes, and truncated power based on Gaussian modes. Relying on KIC and cross-validation, we select the first three of these variogram models for permeability and only the exponential model for porosity. We then adopt the favoured models to parameterize log air permeability and porosity across the site via kriging in terms of their values at selected pilot points and, optionally, at some single-hole measurement locations. For each of the selected variogram models we estimate log air permeabilities and porosities at the pilot points by calibrating a finite volume pressure simulator against two cross-hole pressure data sets; during each cross-hole test, air was injected at a different location and pressure responses were recorded in all other boreholes (Illman et al. 1998). Finally, we compare the abilities of individual models and of MLBMA, based on both Occam’s and variance windows, to predict space–time pressure variations observed during two cross-hole tests in which injection took place at locations different from those employed for calibration.

2 The Apache Leap Research Site

The former University of Arizona Apache Leap Research Site (ALRS) near Superior, Arizona comprises a block of unsaturated fractured tuff measuring 64 × 55 × 46 m (Fig. 1). The site includes sixteen boreholes: three vertical (V1, V2, V3) and thirteen inclined at 45° (X1, X2, X3, Y1, Y2, Y3, Z1, Z2, Z3, W1, W2, W2A, W3). Several pneumatic cross-hole tests were conducted at the ALRS (Illman et al. 1998; Illman and Neuman 2001), of which we selected four, labeled PP4, PP5, PP6 and PP7; the conditions of each test are summarized in Table 1. We used tests PP4 and PP5 for inverse calibration and validated the calibrated models by predicting pressure variations during tests PP6 and PP7. During each test air was injected into a given interval and responses were monitored in 13 relatively short intervals (0.5–2 m) and 24 relatively long intervals (4–42.6 m) shown in Fig. 1. The hydrologic parameters controlling airflow are air permeability k and air-filled porosity ϕ, both attributed largely to air-filled fractures transecting water-saturated porous tuff.

Fig. 1 Borehole arrangement and location of packers during cross-hole tests at ALRS (from Vesselinov et al. 2001a)

Table 1 Cross-hole test conditions at ALRS (Illman et al. 1998)

3 Alternative geostatistical models of air permeability and air-filled porosity

3.1 Log10 k

Ye et al. (2004) used MLBMA to investigate the geostatistical properties of log air permeability k (m2) at the ALRS by postulating several alternative variogram models based on 184 log10 k data obtained via steady-state interpretation of stable pressure data from pneumatic injection tests in 1-m long intervals along six boreholes, V2, W2A, X2, Y2, Y3 and Z2 in Fig. 1 (Guzman et al. 1996). Ye et al. (2004) fitted seven variogram models (power P, exponential E, exponential with first-order drift E1, exponential with second-order drift E2, spherical S, spherical with first-order drift S1, and spherical with second-order drift S2) to this data set using the adjoint state maximum likelihood cross-validation (ASMLCV) method of Samper and Neuman (1989) in conjunction with universal kriging and generalized least squares. They found that the first three models (P, E and E1) consistently dominated in terms of posterior model probability. We expanded their list of best models to include truncated power models based on Gaussian (Tpg) and exponential (Tpe) modes (Di Federico and Neuman 1997; a brief review of these models is given in Appendix B), fitted the variogram models using the same data set and procedure, computed the values of four model selection criteria (AIC, AICc, BIC and KIC) and computed the corresponding posterior model probabilities. Table 2 lists the results of this analysis, where posterior probabilities or (in the case of AIC and AICc) model weights are based on equal prior probabilities \( p(M_k) \) (the neutral choice) for all models. Model E1 is associated with the smallest negative log-likelihood (NLL) value (e.g. Carrera and Neuman 1986) and thus provides the best fit to the data. When using Occam’s window, model ranking varies with the information criterion. Whereas AIC and AICc strongly prefer E1 and P, in that order, over all other models, BIC strongly prefers P and ranks E1 worst because it penalizes the model with more parameters in proportion to \( \ln N_s \) (\( N_s \) being the number of observations; see Appendix A). KIC, on the other hand, shows a slight preference for E1 over P while considering E a not much less promising option. Whereas in terms of NLL the truncated power models, Tpg and Tpe, fit the sample variogram as well as P does (Fig. 2), they are ranked lower by all four model selection criteria due to their larger number of parameters; KIC is the only criterion showing a clear preference for Tpg over Tpe. Alternatively, a variance window of size \( 4\sigma_D \) and a significance level of 5% leads to α = 0.078 and to posterior probabilities that are distributed more evenly among all models, reducing the difference in magnitude between probabilities based on different information criteria.
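For reference, a sketch of two of the variogram shapes involved, assuming their standard textbook forms (parameter values are illustrative, not the fitted ALRS estimates; the drift of model E1 enters the kriging equations rather than the variogram itself, and the truncated power models of Appendix B are omitted here):

```python
import numpy as np

def variogram_exponential(h, sill, a):
    """Exponential model E: gamma(h) = sill * (1 - exp(-h / a))."""
    return sill * (1.0 - np.exp(-h / a))

def variogram_power(h, c, w):
    """Power model P: gamma(h) = c * h**w, with 0 < w < 2."""
    return c * h ** w

h = np.linspace(0.1, 40.0, 5)   # lags in meters, illustrative
print(variogram_exponential(h, sill=1.0, a=10.0))
print(variogram_power(h, c=0.2, w=0.6))
```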

Table 2 ASMLCV results for log10 k
Fig. 2 Variogram models for log10 k

3.2 Log10 ϕ

We conducted a similar geostatistical analysis of 109 log air-filled porosity (log10 ϕ) data obtained by type-curve interpretation of the recovery phase of single-hole tests conducted on a nominal scale of 1 m (Illman 2005). As there appears to be no discernible cross-correlation between the log10 ϕ and log10 k data, we analyzed each set separately. Four alternative variogram models were postulated for log10 ϕ: exponential E, spherical S, and truncated power based on Gaussian (Tpg) and exponential (Tpe) modes. Figure 3 depicts the models fitted to the sample variogram and Table 3 lists the corresponding statistics. In terms of NLL the truncated power models Tpe and Tpg fit the data almost equally well and somewhat more closely than do E and S. Posterior probabilities based on Occam’s window and AIC, AICc and BIC rank the two truncated power models as best. However, KIC ranks E much higher than all other models. Using a variance window of size \( 4\sigma_D \) at a significance level of 5% (α = 0.1) distributes posterior probabilities more evenly among the models but does not change the ranking.

Fig. 3 Variogram models for log10 ϕ

Table 3 ASMLCV results for log10 ϕ

3.3 Predictive capability of variogram models

KIC has been shown theoretically and empirically to have advantages over AIC, AICc and BIC (Ye et al. 2008); KIC-based posterior probabilities from Tables 2 and 3 suggest retaining models P, E1 and E for log10 k and model E for log10 ϕ while eliminating the remaining models from further consideration. We test this choice by analyzing the predictive capability of each variogram model for log10 k and log10 ϕ through log scores of the cross-validation errors in the manner of Ye et al. (2004). Each data set was split into two parts by eliminating the data corresponding to one borehole at a time; ML parameter estimates were then obtained from the remaining data and used to predict the eliminated data. We repeated this procedure for both the log10 k and the log10 ϕ data sets. The log score \( -\ln p(\mathbf{D}^v \mid M_k, \mathbf{D}^c) \) (Volinsky et al. 1997), approximated by \( -\ln p(\mathbf{D}^v \mid M_k, \hat{\mathbf{b}}_k, \mathbf{D}^c) \) (Ye et al. 2008), is a measure of the predictive capability of a model: the lower the predictive log score of model \( M_k \) based on the calibration data set \( \mathbf{D}^c \), the greater the amount of information in the validation data set \( \mathbf{D}^v \) recovered by the model. The log score of a model is given by

$$ -\ln p(\mathbf{D}^v \mid M_k, \hat{\mathbf{b}}_k, \mathbf{D}^c) = \frac{N_v}{2} \ln(2\pi) + \frac{1}{2} \sum_{i=1}^{N_v} \ln \sigma_i^2 + \frac{1}{2} \sum_{i=1}^{N_v} \frac{\left( \hat{D}_i^v - D_i^v \right)^2}{\sigma_i^2} $$
(6)

where \( N_v \) is the number of data points in \( \mathbf{D}^v \), and \( \hat{D}_i^v \) and \( \sigma_i^2 \) are the i-th kriged value and kriging variance, respectively, based on the parameter estimates \( \hat{\mathbf{b}}_k \) for model \( M_k \). Average predictive log scores for log10 k and log10 ϕ are listed in Table 4. For log10 k, models E, E1 and P have average log scores ranging from 47.8 to 49.6, while the log scores of Tpg and Tpe are considerably larger, 53 and almost 70, respectively; for each individual cross-validated borehole, models E, E1 and P consistently have the lowest log scores (except model P for borehole Z2), outperforming Tpg and Tpe. For log10 ϕ, model E has the lowest log score for all cross-validation data sets except borehole V2, outperforming the remaining models; its average log score is 36.2, while models S and Tpg have log scores of about 40 and Tpe has the largest. Based on the KIC posterior probabilities and the log scores we retain only models E, E1 and P to parameterize log10 k and only model E to parameterize log10 ϕ.
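A minimal sketch of Eq. 6, assuming independent Gaussian cross-validation errors with kriged means and variances (all inputs illustrative):

```python
import numpy as np

def predictive_log_score(pred, var, obs):
    """-ln p(D^v | M_k, b_k, D^c) for independent Gaussian errors, Eq. 6."""
    pred, var, obs = map(np.asarray, (pred, var, obs))
    n = obs.size
    return (0.5 * n * np.log(2.0 * np.pi)
            + 0.5 * np.sum(np.log(var))
            + 0.5 * np.sum((pred - obs) ** 2 / var))

print(predictive_log_score(pred=[1.0, 2.1, 0.9],
                           var=[0.04, 0.09, 0.05],
                           obs=[1.1, 2.0, 1.0]))
```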

Table 4 Log scores for cross-validation of variogram models for log10 k and log10 ϕ

4 Calibration of airflow models

Following Vesselinov et al. (2001a, b) we calibrate a finite volume pressure simulator (FEHM; Zyvoloski et al. 1999) against cross-hole pressure data using a parameter estimation code (PEST; Doherty 1994). Additional elements of the calibration process include geostatistical interpolation of log10 k and log10 ϕ via kriging (GSTAT; Pebesma and Wesseling 1998) and a posteriori averaging of pressure at grid nodes along packed-off pressure monitoring intervals. Details of the simulation grid, the airflow equation and its solution can be found in Vesselinov et al. (2001a); here we merely note that the upper boundary condition was set to constant barometric pressure, and that monitoring intervals in which observed pressure showed a clear influence of atmospheric pressure fluctuations were not considered in the analysis.

We parameterize log10 k and log10 ϕ geostatistically and estimate their values by inverse calibration at selected pilot points (de Marsily et al. 1984). We then project these estimates, together with the available 184 1-m scale log10 k measurements, by kriging onto a grid. In the case of y = log10 k the projection is \( y^{*} = \sum_{i=1}^{N_{pp}} \lambda_i y_i + \sum_{j=1}^{N_a} \lambda_j y_j \), where y* is the value at any point within the simulated block, \( y_i \) are unknown values (parameters) at \( N_{pp} \) pilot points, \( y_j \) are known values at \( N_a \) measurement points, and \( \lambda_i \) and \( \lambda_j \) are kriging weights. Following Vesselinov et al. (2001a, b) we set \( N_{pp} = 32 \); 29 pilot points are placed at the centers of pressure monitoring intervals (Fig. 1) and 3 are offset from the center of the injection interval to better represent airflow. Of the 184 1-m log10 k data, 18 correspond to pilot point locations and are included as priors in the manner discussed below; thus \( N_a = 166 \).
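As an illustration of the projection step, the following sketch computes kriging weights and a kriged estimate y* for a single target point, using simple kriging with an exponential covariance in place of the full universal-kriging system actually employed (coordinates and values are hypothetical):

```python
import numpy as np

def exp_cov(d, sill=1.0, a=10.0):
    """Exponential covariance C(d) = sill * exp(-d / a)."""
    return sill * np.exp(-d / a)

def simple_kriging_weights(xy_data, xy_target, sill=1.0, a=10.0):
    # Covariances among data points and between data and target
    d_dd = np.linalg.norm(xy_data[:, None, :] - xy_data[None, :, :], axis=-1)
    d_dt = np.linalg.norm(xy_data - xy_target, axis=-1)
    return np.linalg.solve(exp_cov(d_dd, sill, a), exp_cov(d_dt, sill, a))

# Pilot points and measurement locations are treated alike here; in the
# inversion only the pilot-point values change between iterations.
xy = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
y = np.array([-14.2, -13.8, -15.1])        # hypothetical log10 k values
lam = simple_kriging_weights(xy, np.array([4.0, 4.0]))
print(lam @ y)                             # kriged estimate y*
```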

Inversion entails minimizing the negative log-likelihood criterion (Carrera and Neuman 1986)

$$ NLL(\mathbf{b}) = \frac{\Phi_s}{\sigma_s^2} + \frac{\Phi_p}{\sigma_p^2} + (N_s + N_p) \ln(2\pi) + N_s \ln \sigma_s^2 + \ln |\mathbf{Q}_s^{-1}| + N_p \ln \sigma_p^2 + \ln |\mathbf{Q}_p^{-1}| $$
(7)

where b is a vector of M parameters to be estimated, \( N_s \) is the number of observed state variables, \( N_p \) is the number of prior parameter values, \( \Phi_s = \mathbf{r}_s^T \mathbf{Q}_s \mathbf{r}_s \) is a generalized sum of squared residuals of the state variables, \( \Phi_p = \mathbf{r}_p^T \mathbf{Q}_p \mathbf{r}_p \) is a generalized sum of squared residuals of the parameters, \( \mathbf{Q}_s \) and \( \mathbf{Q}_p \) are the corresponding weight matrices (considered known), and \( \sigma_s^2 \) and \( \sigma_p^2 \) are scalar multipliers (nominal variances, considered unknown) of the covariance matrices \( \mathbf{C}_s = \sigma_s^2 \mathbf{Q}_s^{-1} \) and \( \mathbf{C}_p = \sigma_p^2 \mathbf{Q}_p^{-1} \) of measurement errors associated with state variables and prior parameter values, respectively. Whereas it is possible to consider temporal correlations between pressure measurements in each monitoring interval, we presently treat them as uncorrelated with zero mean and uniform variance. We adopt a similar assumption for the log permeability measurements, disregarding spatial or cross-correlations between any of the data, thereby rendering \( \mathbf{Q}_s \) and \( \mathbf{Q}_p \) diagonal.
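A minimal sketch of Eq. 7 for the identity-weight case adopted here (with \( \mathbf{Q}_s = \mathbf{Q}_p = \mathbf{I} \) the terms \( \ln|\mathbf{Q}^{-1}| \) vanish; residuals and variances are illustrative):

```python
import numpy as np

def nll(r_s, r_p, var_s, var_p):
    """Negative log-likelihood, Eq. 7, with identity weight matrices."""
    phi_s = np.sum(np.asarray(r_s) ** 2)   # state-variable residuals
    phi_p = np.sum(np.asarray(r_p) ** 2)   # prior-parameter residuals
    n_s, n_p = len(r_s), len(r_p)
    return (phi_s / var_s + phi_p / var_p
            + (n_s + n_p) * np.log(2.0 * np.pi)
            + n_s * np.log(var_s) + n_p * np.log(var_p))

print(nll(r_s=[0.1, -0.2, 0.05], r_p=[0.3, -0.1], var_s=0.02, var_p=0.1))
```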

Since \( \sigma_s^2 \) and \( \sigma_p^2 \) are independent of the log10 k and log10 ϕ values (parameters) at the pilot points, minimizing (7) with respect to these parameters is equivalent to minimizing \( \Phi = \Phi_s + \mu \Phi_p \) while treating \( \mu = \sigma_s^2/\sigma_p^2 \) as an unknown. We perform this minimization using the regularization capability of PEST. In regularization mode (Doherty 1994) PEST minimizes \( \Phi_p^{\mu} = \mu \Phi_p \) subject to \( \Phi_s \le \Phi_s^l \) (in practice \( \Phi_s = \Phi_s^l \)), where \( \Phi_s^l \) is typically set by the user to a value slightly higher than the minimum value of \( \Phi_s \) obtained without regularization (i.e., upon setting μ = 0). During each optimization step the program iteratively computes a value of μ (treating it as a reciprocal Lagrange multiplier) which ensures that \( \Phi_s = \Phi_s^l \) and then minimizes \( \Phi_p^{\mu} \). We repeat the process for various \( \Phi_s^l \) until NLL attains its minimum, yielding ML estimates of μ and the pilot point values.

A first-order approximation of the covariance Σ of parameter estimates \( {\hat{\mathbf{b}}} \) is given by (Carrera and Neuman 1986)

$$ \boldsymbol{\Sigma}(\hat{\mathbf{b}}) = \left[ \frac{1}{\sigma_s^2} \mathbf{J}^T \mathbf{Q}_s \mathbf{J} + \frac{\mathbf{Q}_p}{\sigma_p^2} \right]_{\mathbf{b} = \hat{\mathbf{b}}}^{-1} $$
(8)

where J is the Jacobian matrix. If the estimate \( \hat{\mu} \) of μ is optimal (as we take it to be), then ML estimates of the nominal variances are given by \( \hat{\sigma}_s^2 = \Phi_s(\hat{\mathbf{b}})/(N_s - N_p) \) and \( \hat{\sigma}_p^2 = \hat{\sigma}_s^2/\hat{\mu} \). An alternative (not employed here) would be to specify \( \hat{\mu} \), compute \( \hat{\mathbf{b}} \) by minimizing \( \Phi = \Phi_s + \hat{\mu}\Phi_p \), obtain ML estimates of the nominal variances according to \( \hat{\sigma}_s^2 = \Phi_s(\hat{\mathbf{b}})/N_s \) and \( \hat{\sigma}_p^2 = \Phi_p(\hat{\mathbf{b}})/N_p \), recompute \( \hat{\mu} = \hat{\sigma}_s^2/\hat{\sigma}_p^2 \) and repeat the process until NLL attains its minimum (Carrera and Neuman 1986).
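A sketch of Eq. 8 with identity weight matrices, assuming for simplicity that every parameter carries a prior (in our application only 18 of the 32 pilot points do, so the prior term would then be restricted to those rows); the Jacobian is hypothetical:

```python
import numpy as np

def param_covariance(J, var_s, var_p):
    """Sigma(b_hat) = [J^T Q_s J / var_s + Q_p / var_p]^-1 with Q = I."""
    n_p = J.shape[1]
    normal = J.T @ J / var_s + np.eye(n_p) / var_p
    return np.linalg.inv(normal)

J = np.array([[1.0, 0.2], [0.3, 1.1], [0.5, 0.4]])  # hypothetical sensitivities
print(param_covariance(J, var_s=0.02, var_p=0.1))
```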

Elsewhere we have tested three approaches to the calibration of airflow models with and without prior information (Morales-Casique et al. 2008). Here we focus on the use of prior information during calibration. We calibrate log10 k and log10 ϕ at 32 pilot points against observed pressures, fixing the variogram parameters at their values from Tables 2 and 3, including the 18 measurements of log10 k at pilot points as priors in \( \Phi_p \), and incorporating the remaining 166 log10 k values in the kriging process. The kriged log10 k field is based on the three alternative variogram models E1, E and P, while the kriging of log10 ϕ is based only on E. We calibrate each model jointly against pressure data from cross-hole tests PP4 and PP5. As noted earlier, we set \( \mathbf{Q}_s = \mathbf{I} \) and \( \mathbf{Q}_p = \mathbf{I} \), where I is the identity matrix. Computed and measured pressures during each test are compared in Figs. 4 and 5. Overall, the calibrated models fit the observed data reasonably well in most intervals.

Fig. 4 Pressure buildup (kPa) versus time (days) during cross-hole test PP4. Calibrated response based on each of the selected variogram models

Fig. 5 Pressure buildup (kPa) versus time (days) during cross-hole test PP5. Calibrated response based on each of the selected variogram models

Table 5 shows the results of calibrating our models jointly against pressure data from cross-hole tests PP4 and PP5. In terms of NLL the best fit was obtained with log10 k variogram model P and the worst with model E1. Whereas AIC, AICc and BIC rank the models in this same order, KIC ranks E1 higher than E. Posterior probabilities based on AIC, AICc and BIC are similar, so we list only those corresponding to BIC and KIC. Using Occam’s window leads to a preference for P to the virtual exclusion of the remaining two models, regardless of which criterion is used. Using a variance window (α = 0.049, corresponding to a variance window of size \( 4\sigma_D \) and a significance level of 5%) also leads to a similar preference for P by BIC but a less pronounced preference by KIC. Below we use both sets of posterior probabilities obtained with KIC to test the abilities of individual models, and of MLBMA, to predict pressures observed during cross-hole tests PP6 and PP7.

Table 5 Results of joint calibration of cross-hole tests PP4 and PP5

5 Prediction of pressures during cross-hole tests PP6 and PP7

Air injection during cross-hole tests PP6 and PP7 (Illman et al. 1998) took place into different intervals, and at different rates, than in tests PP4 and PP5 (Table 1). Inverse calibration against pressure data from the latter two tests yielded ML estimates \( \hat{\mathbf{b}} \) of the parameters and a covariance matrix (8) of the corresponding estimation errors. To obtain corresponding statistics of the state variable, in this case air pressure, one must either linearize the flow equation or solve it for numerous random realizations of the parameter vector b about its ML estimate \( \hat{\mathbf{b}} \). We chose the second option and conducted Monte Carlo simulations assuming the estimation error \( (\hat{\mathbf{b}} - \mathbf{b}) \) to be multivariate Gaussian with zero mean and covariance \( \boldsymbol{\Sigma}(\hat{\mathbf{b}}) \) in the vicinity of \( \hat{\mathbf{b}} \). This allowed us to generate random realizations of b using standard methods such as Cholesky factorization \( \boldsymbol{\Sigma}(\hat{\mathbf{b}}) = \mathbf{U}^T \mathbf{U} \) followed by random draws \( \mathbf{b} = \hat{\mathbf{b}} + \mathbf{U}^T \boldsymbol{\zeta} \), where ζ is a vector of standard uncorrelated normal variables (Clifton and Neuman 1982). Following this procedure we generated 150 realizations of the parameter vector and solved the forward problem for each. In some cases the nonlinear solver failed to converge; the corresponding partial results were discarded. Our results are thus based on 119, 67 and 97 MC runs with models E1, E and P, respectively, for test PP6 and on 104, 62 and 92 runs for test PP7. In addition to predicting pressure with individual models, we generated MLBMA predictions via (3) and (4) based on the posterior model probabilities in Table 5 obtained with a variance window.
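A minimal sketch of the realization generator just described: factor \( \boldsymbol{\Sigma} = \mathbf{U}^T\mathbf{U} \) and draw \( \mathbf{b} = \hat{\mathbf{b}} + \mathbf{U}^T\boldsymbol{\zeta} \) (the estimate and covariance below are illustrative two-parameter stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
b_hat = np.array([-14.0, -13.5])                 # hypothetical ML estimates
sigma = np.array([[0.04, 0.01], [0.01, 0.09]])   # hypothetical covariance

U = np.linalg.cholesky(sigma).T     # upper-triangular factor, Sigma = U^T U
# Each row is one realization b = b_hat + U^T zeta (row form: zeta @ U)
realizations = b_hat + rng.standard_normal((150, b_hat.size)) @ U
print(realizations.mean(axis=0))    # should approach b_hat
print(np.cov(realizations.T))       # should approach sigma
```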

Figures 6 and 7 compare predicted pressures, averaged over all MC simulations, against observed pressures for cross-hole tests PP6 and PP7. Each plot includes the average predicted pressure from models E1, E and P plus the MLBMA estimate. For some data records the average predicted pressure is close to the observed data; in other cases the prediction is poor, particularly at the injection interval (Z32 for PP6 and W32 for PP7), where models E and P over-predict pressure by orders of magnitude while model E1 under-predicts it. Prediction is also poor for all models at interval X1 in test PP7 (Fig. 7), where observed pressure shows a large response to injection in interval W32; evidence of this connectivity was absent in calibration tests PP4 and PP5 and thus was not captured in the estimated parameters. We attribute these poor predictions in part to the extreme heterogeneity of the fractured tuff at the site and to our disregard of barometric pressure fluctuations during the tests.

We also predicted pressure for both tests, PP6 and PP7, based on a single model run with the best parameter estimates \( \hat{\mathbf{b}} \). Predicted pressures from a single run constitute a biased estimate of the ensemble mean pressure and provide no information about the variance of the estimate. The results are shown in Figs. 8 and 9 for tests PP6 and PP7, respectively. As before, the prediction is poor at the injection intervals (Z32 for PP6 and W32 for PP7) and at X1 for PP7, but now all models consistently under-predict pressure at those intervals. Table 6 compares both pressure estimates based on the sum of squared errors (SSE). Average predicted pressure based on MC simulations leads to one model clearly outperforming the other two by orders of magnitude, whereas results from a single run show SSEs of the same order of magnitude. Excluding intervals with poor predictions (marked B in Table 6) leaves model P the most accurate in test PP6 and model E1 in test PP7. Results for MLBMA are mixed: excluding records from Z32 in PP6 and from W32 and X1 in PP7, MLBMA ranks second in test PP6, and third (MC simulations) or first (single run) in test PP7. Thus, in terms of SSE the average of model predictions does not outperform the best individual model because the individual models in the collection do not produce very different forecasts (Winter and Nychka 2009).

Fig. 6 Pressure buildup (kPa) versus time (days) during cross-hole test PP6. Predicted results averaged from MC simulations using each of the selected variogram models and MLBMA

Fig. 7 Pressure buildup (kPa) versus time (days) during cross-hole test PP7. Predicted results averaged from MC simulations using each of the selected variogram models and MLBMA

Fig. 8 Pressure buildup (kPa) versus time (days) during cross-hole test PP6. Predicted results for each of the selected variogram models (single model run based on ML estimates \( \hat{\mathbf{b}} \)) and MLBMA

Fig. 9 Pressure buildup (kPa) versus time (days) during cross-hole test PP7. Predicted results for each of the selected variogram models (single model run based on ML estimates \( \hat{\mathbf{b}} \)) and MLBMA

Table 6 Sum of squared errors (SSE)

We evaluate the predictive capabilities of each model and of MLBMA by computing their log scores and predictive coverage. The log score is computed by (6), with \( \hat{D}_i^v \) and \( \sigma_i^2 \) now being the i-th sample mean and variance of predicted pressure based on MC realizations of the parameter estimates \( \hat{\mathbf{b}}_k \) for model \( M_k \). The predictive log score of MLBMA is (Ye et al. 2004)

$$ -\ln p(\mathbf{D}^v \mid \mathbf{D}^c) = -\ln \left\{ \sum_{k=1}^{K} p(\mathbf{D}^v \mid M_k, \hat{\mathbf{b}}_k, \mathbf{D}^c)\, p(M_k \mid \mathbf{D}^c) \right\} $$
(9)
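Because the per-model scores enter Eq. 9 through \( \exp(-s_k) \), direct evaluation underflows for scores of the magnitudes reported below; a log-sum-exp form avoids this. A minimal sketch with illustrative inputs:

```python
import numpy as np

def mlbma_log_score(scores, weights):
    """-ln sum_k exp(-s_k) * p(M_k|D^c), Eq. 9, evaluated stably."""
    scores = np.asarray(scores, dtype=float)   # per-model log scores s_k
    weights = np.asarray(weights, dtype=float) # posterior model weights
    s_min = scores.min()
    return s_min - np.log(np.sum(weights * np.exp(-(scores - s_min))))

print(mlbma_log_score(scores=[120.0, 135.0, 150.0],
                      weights=[0.5, 0.4, 0.1]))
```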

Table 7 lists the predictive log scores of each model and of MLBMA, based on the variance window approach, for both validation tests PP6 and PP7. Overall, model E1 has the lowest log score of all models and MLBMA for both validation tests, despite being ranked second by KIC and third, with 0.1% posterior probability, by BIC (Table 5). The main source of predictive error for model E1 is the injection interval (Z32) in test PP6; for test PP7 the main sources are intervals X1 (large predictive errors) and Z1 (a very small variance, \( \sigma_{Z1}^2 \sim 10^{-9} \), combined with a significant predictive error, which the log score penalizes heavily). For the remaining models and MLBMA the ranking changes with the validation test; MLBMA ranks second in test PP6 and third in test PP7. The largest log-score contributions for MLBMA and models P and E come from the injection intervals (Z32 in test PP6 and W32 in test PP7) and from X1 in test PP7, where these models and MLBMA have large prediction errors (Figs. 6, 7). Excluding poorly predicted intervals (Z32 in PP6 and W32, Z1 and X1 in PP7; results denoted Total B in Table 7), model E1 ranks first in test PP6 and last in PP7; in turn, MLBMA ranks second in PP6 and first in PP7.

Table 7 Predictive log scores for validation tests PP6 and PP7

Another measure of the predictive capability of a model is its predictive coverage, the percentage of observed data that fall within a given prediction interval around the average predicted pressure. Prediction intervals were computed assuming normally distributed errors at a 95% confidence level. Table 8 lists the results, based on 776 observed data for test PP6 and 829 for test PP7. Among individual models, model P has the best predictive coverage for test PP6, while for test PP7 it is second to model E. MLBMA has better predictive coverage than any of the three individual models for test PP7, but is second to last for test PP6. Excluding, as before, the poorly predicted intervals (Z32 in PP6 and W32, Z1 and X1 in PP7) increases the predictive coverage of the models and MLBMA but does not change the rankings.
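A minimal sketch of the coverage computation, assuming Gaussian prediction intervals around the MC mean (arrays are illustrative):

```python
import numpy as np

def predictive_coverage(pred_mean, pred_std, obs, z=1.96):
    """Percentage of observations inside the 95% prediction interval."""
    pred_mean, pred_std, obs = map(np.asarray, (pred_mean, pred_std, obs))
    inside = np.abs(obs - pred_mean) <= z * pred_std
    return 100.0 * inside.mean()

print(predictive_coverage(pred_mean=[1.0, 2.0, 0.5],
                          pred_std=[0.2, 0.3, 0.1],
                          obs=[1.1, 2.5, 0.45]))
```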

Table 8 Predictive coverage for validation tests PP6 and PP7

We conclude this section with some remarks on the predictive capabilities of the individual models. We recall that after calibration model P was ranked best, followed by E1 and E. During validation, E1 provided the predictions closest to the data for both validation tests when all data were considered (Tables 6, 7). However, when the poorly predicted intervals were excluded, P performed best in one test and second best in the other (Tables 6, 7); P also ranked first and second among individual models in terms of predictive coverage during validation (Table 8). This suggests that KIC identified P as a ‘good’ model but that factors not included in the model, such as high-permeability regions not apparent in the calibration data set but manifest in the validation data set, degraded its predictions. In a medium as highly heterogeneous as the ALRS, the dependence of hydrologic parameters on the flow pattern is especially acute.

6 Conclusions

We have shown that it is possible to employ MLBMA with complex models using both Occam’s and variance windows, illustrated how to include prior information in the process, and applied the method to airflow models in unsaturated fractured tuff. We calibrated log10 k and log10 ϕ at selected pilot points against pressures observed in two pneumatic injection tests (PP4 and PP5), including prior information about log10 k. All of the calibrated models reproduced the observed data reasonably well. Use of Occam’s window led to selecting the model with the lowest fitting error with probability close to 1 while disregarding all remaining models. A variance window of \( 4\sigma_D \) gave more evenly distributed posterior probabilities based on KIC; doing the same with AIC, AICc or BIC still led to one model being assigned a posterior probability of about 1.

The results of the calibration were validated against an independent data set obtained from two cross-hole tests (PP6 and PP7) in which injection took place into different borehole intervals than those used for calibration. The best results were obtained with a model ranked second by KIC but very low by AIC, AICc and BIC. We also evaluated the predictive capabilities of MLBMA based on tests PP6 and PP7. Predicted pressures using MLBMA were less accurate than those obtained with some individual models because the individual model with the largest posterior probability was the worst or second worst predictor in both validation cases. In terms of predictive coverage, MLBMA was far superior to any of the individual models in one validation test and second to last in the other.

We attribute these mixed results to the inability of any of our models to capture satisfactorily, with the available data, the complex nature of the ALRS fractured rock system and the pressure distribution within it.