1 Introduction

Hydrologic analyses typically rely on a single conceptual-mathematical model. Yet hydrologic environments are open and complex, rendering them prone to multiple interpretations and mathematical descriptions. Adopting only one of these may lead to statistical bias and underestimation of uncertainty. Thus, hydrologists have developed several approaches to weigh and average predictions generated by alternative models (Neuman 2003; Neuman and Wierenga 2003; Ye et al. 2004; Poeter and Anderson 2005; Beven 2006; Refsgaard et al. 2006).

Bayesian model averaging (BMA) (Draper 1995; Kass and Raftery 1995; Hoeting et al. 1999) provides an optimal way to combine the predictions of several competing models and to assess their joint predictive uncertainty. Hoeting et al. (1999) describe BMA by noting that if Δ is a quantity one wants to predict given a discrete set of data D, then its posterior distribution is

$$ p(\Delta \mid \mathbf{D}) = \sum_{i=1}^{K} p(\Delta \mid M_i, \mathbf{D})\, p(M_i \mid \mathbf{D}) $$
(1)

where K is the number of models considered and \( p(\Delta \mid \mathbf{D}) \) is the average of the posterior distributions \( p(\Delta \mid M_i, \mathbf{D}) \) under each model, weighted by the posterior model probabilities \( p(M_i \mid \mathbf{D}) \). The posterior probability of model \( M_i \) is given by Bayes’ rule

$$ p(M_i \mid \mathbf{D}) = \frac{p(\mathbf{D} \mid M_i)\, p(M_i)}{\sum_{j=1}^{K} p(\mathbf{D} \mid M_j)\, p(M_j)} $$
(2)

where \( p(\mathbf{D} \mid M_i) \) is the integrated likelihood of model \( M_i \). All probabilities are implicitly conditional on the set of models being considered. The posterior mean and variance of \( \Delta \) are (Draper 1995)

$$ E[\Delta \mid \mathbf{D}] = \sum_{i=1}^{K} E[\Delta \mid M_i, \mathbf{D}]\, p(M_i \mid \mathbf{D}) $$
(3)
$$ \mathrm{Var}[\Delta \mid \mathbf{D}] = \sum_{i=1}^{K} \mathrm{Var}[\Delta \mid \mathbf{D}, M_i]\, p(M_i \mid \mathbf{D}) + \sum_{i=1}^{K} \left( E[\Delta \mid \mathbf{D}, M_i] - E[\Delta \mid \mathbf{D}] \right)^2 p(M_i \mid \mathbf{D}) $$
(4)
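A minimal sketch of how Eqs. 3 and 4 combine per-model predictions, assuming per-model means, variances and weights are already available (the numbers below are purely illustrative):

```python
import numpy as np

def bma_mean_variance(means, variances, weights):
    """Posterior mean and variance of a prediction under BMA, Eqs. 3-4.

    means, variances : per-model posterior means and variances of Delta
    weights          : posterior model probabilities p(M_i|D), summing to 1
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = np.sum(weights * means)                   # Eq. 3
    within = np.sum(weights * variances)             # within-model variance
    between = np.sum(weights * (means - mean) ** 2)  # between-model variance
    return mean, within + between                    # Eq. 4

# Example with three hypothetical models:
m, v = bma_mean_variance(means=[1.0, 1.2, 0.8],
                         variances=[0.04, 0.09, 0.05],
                         weights=[0.6, 0.3, 0.1])
print(m, v)
```

The second term of Eq. 4 is the between-model variance; it is what allows BMA to report predictive uncertainty larger than that of any single model.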

Neuman (2003) proposed a maximum likelihood (ML) version of BMA (MLBMA) that renders it compatible with ML methods of model calibration (Carrera and Neuman 1986; Hernandez et al. 2003, 2006) even when prior information about the parameters is not available (such information being a prerequisite for BMA). In MLBMA, \( E[\Delta \mid M_i, \mathbf{D}] \) and \( \mathrm{Var}[\Delta \mid \mathbf{D}, M_i] \), the posterior mean and variance of \( \Delta \) under the i-th alternative model, are approximated by \( E[\Delta \mid M_i, \hat{\mathbf{b}}_i, \mathbf{D}] \) and \( \mathrm{Var}[\Delta \mid \mathbf{D}, M_i, \hat{\mathbf{b}}_i] \), where \( \hat{\mathbf{b}}_i \) is a maximum likelihood estimate of the parameters \( \mathbf{b}_i \) of model \( M_i \), obtained by maximizing the likelihood \( p(\mathbf{D} \mid \mathbf{b}_i, M_i) \). In turn, (2) is approximated by (Ye et al. 2004)

$$ p(M_i \mid \mathbf{D}) = \frac{\exp(-0.5\, \Delta IC_i)\, p(M_i)}{\sum_{j=1}^{K} \exp(-0.5\, \Delta IC_j)\, p(M_j)} $$
(5)

where \( \Delta IC_i = IC_i - IC_{\min} \), \( IC_i = KIC_i \) is the Kashyap (1982) information criterion for the i-th model and \( IC_{\min} \) is the minimum value among the models. Alternatively, posterior model weights are sometimes assigned by setting \( IC_i \) equal to information-theoretic criteria (Poeter and Anderson 2005; Ye et al. 2008) such as the Akaike information criterion AIC (Akaike 1974), the modified Akaike information criterion AICc (Hurvich and Tsai 1989) or the Bayesian information criterion BIC (Schwarz 1978); full expressions of these criteria are given in Appendix A. Ye et al. (2008) explain that KIC is the only one of these criteria that validly discriminates between models based not only on the quality of model fit to observed data and the number of model parameters but also on how close the posterior parameter estimates are to their prior values and on the information contained in the observations.

Experience indicates (and our results below confirm) that Eq. 5 tends to assign a posterior probability or model weight of nearly 1 to one model (the best in terms of minimum calibration error) and nearly zero to all others. Tsai and Li (2008) argue that Occam’s window (defined by Raftery (1995) in terms of BIC as \( \Delta BIC_i \le 6 \)), on which (5) is based, may be too narrow to accommodate models that are not the best but are still potentially acceptable. For \( \Delta IC_i > 6 \), (5) yields \( p(M_i \mid \mathbf{D}) < 0.05 \), decreasing exponentially with \( \Delta IC_i \); yet \( \Delta IC_i > 6 \) is common in field problems. As a remedy, they propose a broader variance window obtained by scaling \( \Delta IC_i \) in Eq. 5 by a factor \( \alpha = c/\sqrt{n} \), where n is the number of observations and c is a coefficient that depends on the window size and on a significance level selected subjectively by the analyst.
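The following sketch illustrates Eq. 5 together with the variance-window scaling; `alpha = 1` reproduces Occam’s window, while a value such as 0.078 (used in Sect. 3.1 below) widens it. The IC values are illustrative, not taken from our tables:

```python
import numpy as np

def posterior_model_weights(ic, prior=None, alpha=1.0):
    """Posterior model probabilities from information-criterion values.

    ic    : IC_i values (e.g. KIC, BIC) for the K models
    prior : prior model probabilities p(M_i); uniform if None
    alpha : scaling of Delta IC_i; alpha < 1 widens the window
    """
    ic = np.asarray(ic, dtype=float)
    if prior is None:
        prior = np.full(ic.size, 1.0 / ic.size)
    delta = ic - ic.min()                      # Delta IC_i
    unnorm = np.exp(-0.5 * alpha * delta) * prior
    return unnorm / unnorm.sum()

ic = np.array([100.0, 106.0, 120.0])           # hypothetical KIC values
print(posterior_model_weights(ic))             # Occam's window
print(posterior_model_weights(ic, alpha=0.078))  # variance window
```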

We test the ability of MLBMA, based on both Occam’s and variance windows, to predict air pressure during pneumatic injection tests conducted in complex, highly heterogeneous, unsaturated fractured tuff near Superior, Arizona (Illman et al. 1998; Illman and Neuman 2001). Applying MLBMA to such a complex problem and comparing the performance of Occam’s and variance windows in this context are two key contributions of this paper. First, we use log permeabilities and porosities obtained from single-hole pneumatic packer tests to postulate, calibrate and compare five alternative variogram models of these parameters based on AIC, AICc, BIC and KIC. The variogram models are exponential, exponential with linear drift, power, truncated power based on exponential modes, and truncated power based on Gaussian modes. Relying on KIC and cross-validation, we select the first three of these variogram models for permeability and only the exponential model for porosity. We then adopt the favoured models to parameterize log air permeability and porosity across the site via kriging in terms of their values at selected pilot points and, optionally, at some single-hole measurement locations. For each of the selected variogram models we estimate log air permeabilities and porosities at the pilot points by calibrating a finite volume pressure simulator against two cross-hole pressure data sets; during each cross-hole test, air was injected at a different location and pressure responses were recorded in all other boreholes (Illman et al. 1998). Finally, we compare the abilities of individual models and of MLBMA, based on both Occam’s and variance windows, to predict space–time pressure variations observed during two cross-hole tests in which injection took place at locations different from those employed for calibration.

2 The Apache Leap Research Site

The former University of Arizona Apache Leap Research Site (ALRS) near Superior, Arizona comprises a block of unsaturated fractured tuff measuring 64 × 55 × 46 m (Fig. 1). The site includes sixteen boreholes: three vertical (V1, V2, V3) and thirteen inclined at 45° (X1, X2, X3, Y1, Y2, Y3, Z1, Z2, Z3, W1, W2, W2A, W3). Several pneumatic cross-hole tests were conducted at the ALRS (Illman et al. 1998; Illman and Neuman 2001), of which we selected four, labeled PP4, PP5, PP6 and PP7; the conditions of each test are summarized in Table 1. We used tests PP4 and PP5 for inverse calibration and validated the calibrated models by predicting pressure variations during tests PP6 and PP7. During each test air was injected into a given interval and responses were monitored in 13 relatively short intervals (0.5–2 m) and 24 relatively long intervals (4–42.6 m) shown in Fig. 1. The hydrologic parameters controlling airflow are air permeability k and air-filled porosity ϕ, both attributed largely to air-filled fractures transecting water-saturated porous tuff.

Fig. 1 Borehole arrangement and location of packers during cross-hole tests at ALRS (from Vesselinov et al. 2001a)

Table 1 Cross-hole test conditions at ALRS (Illman et al. 1998)

3 Alternative geostatistical models of air permeability and air-filled porosity

3.1 Log10 k

Ye et al. (2004) used MLBMA to investigate the geostatistical properties of log air permeability k (m2) at the ALRS by postulating several alternative variogram models based on 184 log10 k data obtained via steady-state interpretation of stable pressure data from pneumatic injection tests in 1-m long intervals along six boreholes, V2, W2A, X2, Y2, Y3 and Z2 in Fig. 1 (Guzman et al. 1996). Ye et al. (2004) fitted seven variogram models (power P, exponential E, exponential with first-order drift E1, exponential with second-order drift E2, spherical S, spherical with first-order drift S1, and spherical with second-order drift S2) to this data set using the adjoint state maximum likelihood cross-validation (ASMLCV) method of Samper and Neuman (1989) in conjunction with universal kriging and generalized least squares. They found that the first three models (P, E and E1) consistently dominated in terms of posterior model probability. We expanded their list of best models to include truncated power models based on Gaussian (Tpg) and exponential (Tpe) modes (Di Federico and Neuman 1997; a brief review of these models is given in Appendix B), fitted the variogram models using the same data set and procedure, computed the values of four model selection criteria (AIC, AICc, BIC and KIC) and computed the corresponding posterior model probabilities. Table 2 lists the results of this analysis, where posterior probabilities or (in the case of AIC and AICc) model weights are based on equal prior probabilities \( p(M_k) \) (the neutral choice) for all models. Model E1 is associated with the smallest negative log-likelihood (NLL) value (e.g. Carrera and Neuman 1986) and thus provides the best fit to the data. When using Occam’s window, model ranking varies with the information criterion. Whereas AIC and AICc strongly prefer E1 and P, in that order, over all other models, BIC strongly prefers P and ranks E1 worst because it penalizes the model with more parameters in proportion to \( \ln N_s \) (\( N_s \) being the number of observations; see Appendix A). KIC, on the other hand, shows a slight preference for E1 over P while considering E a not much less promising option. Whereas in terms of NLL the truncated power models, Tpg and Tpe, fit the sample variogram as well as P does (Fig. 2), they are ranked lower by all four model selection criteria due to their larger number of parameters; KIC is the only criterion showing a clear preference for Tpg over Tpe. Alternatively, a variance window of size \( 4\sigma_D \) and a significance level of 5% leads to α = 0.078 and to posterior probabilities that are distributed more evenly among all models, reducing the difference in magnitude between probabilities based on different information criteria.
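For reference, a sketch of two of the variogram shapes involved, assuming their standard textbook forms (parameter values are illustrative, not the fitted ALRS estimates; the drift of model E1 enters the kriging equations rather than the variogram itself, and the truncated power models of Appendix B are omitted here):

```python
import numpy as np

def variogram_exponential(h, sill, a):
    """Exponential model E: gamma(h) = sill * (1 - exp(-h / a))."""
    return sill * (1.0 - np.exp(-h / a))

def variogram_power(h, c, w):
    """Power model P: gamma(h) = c * h**w, with 0 < w < 2."""
    return c * h ** w

h = np.linspace(0.1, 40.0, 5)   # lags in meters, illustrative
print(variogram_exponential(h, sill=1.0, a=10.0))
print(variogram_power(h, c=0.2, w=0.6))
```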

Table 2 ASMLCV results for log10 k
Fig. 2 Variogram models for log10 k

3.2 Log10 ϕ

We conducted a similar geostatistical analysis of 109 log air-filled porosity (log10 ϕ) data obtained by type-curve interpretation of the recovery phase of single-hole tests conducted on a nominal scale of 1 m (Illman 2005). As there appears to be no discernible cross-correlation between the log10 ϕ and log10 k data, we analyzed each set separately. Four alternative variogram models were postulated for log10 ϕ: exponential E, spherical S, and truncated power based on Gaussian (Tpg) and exponential (Tpe) modes. Figure 3 depicts the models fitted to the sample variogram and Table 3 lists the corresponding statistics. In terms of NLL the truncated power models Tpe and Tpg fit the data almost equally well and somewhat more closely than do E and S. Posterior probabilities based on Occam’s window and AIC, AICc and BIC rank the two truncated power models as best. However, KIC ranks E much higher than all other models. Using a variance window of size \( 4\sigma_D \) at a significance level of 5% (α = 0.1) distributes posterior probabilities more evenly among the models but does not change the ranking.

Fig. 3 Variogram models for log10 ϕ

Table 3 ASMLCV results for log10 ϕ

3.3 Predictive capability of variogram models

KIC has been shown theoretically and empirically to have advantages over AIC, AICc and BIC (Ye et al. 2008); KIC-based posterior probabilities from Tables 2 and 3 suggest retaining models P, E1 and E for log10 k and model E for log10 ϕ while eliminating the remaining models from further consideration. We test this choice by analyzing the predictive capability of each variogram model for log10 k and log10 ϕ through log scores of the cross-validation errors in the manner of Ye et al. (2004). Each data set was split into two parts by eliminating the data corresponding to one borehole at a time; ML parameter estimates were then obtained from the remaining data and used to predict the eliminated data. We repeated this procedure for both the log10 k and the log10 ϕ data sets. The log score \( -\ln p(\mathbf{D}^v \mid M_k, \mathbf{D}^c) \) (Volinsky et al. 1997), approximated by \( -\ln p(\mathbf{D}^v \mid M_k, \hat{\mathbf{b}}_k, \mathbf{D}^c) \) (Ye et al. 2008), is a measure of the predictive capability of a model: the lower the predictive log score of model \( M_k \) based on the calibration data set \( \mathbf{D}^c \), the greater the amount of information in the validation data set \( \mathbf{D}^v \) recovered by the model. The log score of a model is given by

$$ -\ln p(\mathbf{D}^v \mid M_k, \hat{\mathbf{b}}_k, \mathbf{D}^c) = \frac{N_v}{2} \ln(2\pi) + \frac{1}{2} \sum_{i=1}^{N_v} \ln \sigma_i^2 + \frac{1}{2} \sum_{i=1}^{N_v} \frac{\left( \hat{D}_i^v - D_i^v \right)^2}{\sigma_i^2} $$
(6)

where \( N_v \) is the number of data points in \( \mathbf{D}^v \), and \( \hat{D}_i^v \) and \( \sigma_i^2 \) are the i-th kriged value and kriging variance, respectively, based on the parameter estimates \( \hat{\mathbf{b}}_k \) for model \( M_k \). Average predictive log scores for log10 k and log10 ϕ are listed in Table 4. For log10 k, models E, E1 and P have average log scores ranging from 47.8 to 49.6, while the log scores of Tpg and Tpe are considerably larger, 53 and almost 70, respectively; for each individual cross-validated borehole, models E, E1 and P consistently have the lowest log scores (except model P for borehole Z2), outperforming Tpg and Tpe. For log10 ϕ, model E has the lowest log score for all cross-validation data sets except borehole V2, outperforming the remaining models; its average log score is 36.2, while models S and Tpg have log scores of about 40 and Tpe has the largest. Based on the KIC posterior probabilities and the log scores we retain only models E, E1 and P to parameterize log10 k and only model E to parameterize log10 ϕ.
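A minimal sketch of Eq. 6, assuming independent Gaussian cross-validation errors with kriged means and variances (all inputs illustrative):

```python
import numpy as np

def predictive_log_score(pred, var, obs):
    """-ln p(D^v | M_k, b_k, D^c) for independent Gaussian errors, Eq. 6."""
    pred, var, obs = map(np.asarray, (pred, var, obs))
    n = obs.size
    return (0.5 * n * np.log(2.0 * np.pi)
            + 0.5 * np.sum(np.log(var))
            + 0.5 * np.sum((pred - obs) ** 2 / var))

print(predictive_log_score(pred=[1.0, 2.1, 0.9],
                           var=[0.04, 0.09, 0.05],
                           obs=[1.1, 2.0, 1.0]))
```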

Table 4 Log scores for cross-validation of variogram models for log10 k and log10 ϕ

4 Calibration of airflow models

Following Vesselinov et al. (2001a, b) we calibrate a finite volume pressure simulator (FEHM; Zyvoloski et al. 1999) against cross-hole pressure data using a parameter estimation code (PEST; Doherty 1994). Additional elements of the calibration process include geostatistical interpolation of log10 k and log10 ϕ via kriging (GSTAT; Pebesma and Wesseling 1998) and a posteriori averaging of pressure at grid nodes along packed-off pressure monitoring intervals. Details of the simulation grid, the airflow equation and its solution can be found in Vesselinov et al. (2001a); here we merely note that the upper boundary condition was set to constant barometric pressure, and that monitoring intervals in which observed pressure showed a clear influence of atmospheric pressure fluctuations were not considered in the analysis.

We parameterize log10 k and log10 ϕ geostatistically and estimate their values by inverse calibration at selected pilot points (de Marsily et al. 1984). We then project these estimates, together with the available 184 1-m scale log10 k measurements, by kriging onto a grid. In the case of y = log10 k the projection is \( y^{*} = \sum_{i=1}^{N_{pp}} \lambda_i y_i + \sum_{j=1}^{N_a} \lambda_j y_j \), where y* is the value at any point within the simulated block, \( y_i \) are unknown values (parameters) at \( N_{pp} \) pilot points, \( y_j \) are known values at \( N_a \) measurement points, and \( \lambda_i \) and \( \lambda_j \) are kriging weights. Following Vesselinov et al. (2001a, b) we set \( N_{pp} = 32 \); 29 pilot points are placed at the centers of pressure monitoring intervals (Fig. 1) and 3 are offset from the center of the injection interval to better represent airflow. Of the 184 1-m log10 k data, 18 correspond to pilot point locations and are included as priors in the manner discussed below; thus \( N_a = 166 \).
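As an illustration of the projection step, the following sketch computes kriging weights and a kriged estimate y* for a single target point, using simple kriging with an exponential covariance in place of the full universal-kriging system actually employed (coordinates and values are hypothetical):

```python
import numpy as np

def exp_cov(d, sill=1.0, a=10.0):
    """Exponential covariance C(d) = sill * exp(-d / a)."""
    return sill * np.exp(-d / a)

def simple_kriging_weights(xy_data, xy_target, sill=1.0, a=10.0):
    # Covariances among data points and between data and target
    d_dd = np.linalg.norm(xy_data[:, None, :] - xy_data[None, :, :], axis=-1)
    d_dt = np.linalg.norm(xy_data - xy_target, axis=-1)
    return np.linalg.solve(exp_cov(d_dd, sill, a), exp_cov(d_dt, sill, a))

# Pilot points and measurement locations are treated alike here; in the
# inversion only the pilot-point values change between iterations.
xy = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
y = np.array([-14.2, -13.8, -15.1])        # hypothetical log10 k values
lam = simple_kriging_weights(xy, np.array([4.0, 4.0]))
print(lam @ y)                             # kriged estimate y*
```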

Inversion entails minimizing the negative log-likelihood criterion (Carrera and Neuman 1986)

$$ NLL(\mathbf{b}) = \frac{\Phi_s}{\sigma_s^2} + \frac{\Phi_p}{\sigma_p^2} + (N_s + N_p) \ln(2\pi) + N_s \ln \sigma_s^2 + \ln |\mathbf{Q}_s^{-1}| + N_p \ln \sigma_p^2 + \ln |\mathbf{Q}_p^{-1}| $$
(7)

where b is a vector of M parameters to be estimated, \( N_s \) is the number of observed state variables, \( N_p \) is the number of prior parameter values, \( \Phi_s = \mathbf{r}_s^T \mathbf{Q}_s \mathbf{r}_s \) is a generalized sum of squared residuals of the state variables, \( \Phi_p = \mathbf{r}_p^T \mathbf{Q}_p \mathbf{r}_p \) is a generalized sum of squared residuals of the parameters, \( \mathbf{Q}_s \) and \( \mathbf{Q}_p \) are the corresponding weight matrices (considered known), and \( \sigma_s^2 \) and \( \sigma_p^2 \) are scalar multipliers (nominal variances, considered unknown) of the covariance matrices \( \mathbf{C}_s = \sigma_s^2 \mathbf{Q}_s^{-1} \) and \( \mathbf{C}_p = \sigma_p^2 \mathbf{Q}_p^{-1} \) of measurement errors associated with state variables and prior parameter values, respectively. Whereas it is possible to consider temporal correlations between pressure measurements in each monitoring interval, we presently treat them as uncorrelated with zero mean and uniform variance. We adopt a similar assumption for the log permeability measurements, disregarding spatial or cross-correlations between any of the data, thereby rendering \( \mathbf{Q}_s \) and \( \mathbf{Q}_p \) diagonal.
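A minimal sketch of Eq. 7 for the identity-weight case adopted here (with \( \mathbf{Q}_s = \mathbf{Q}_p = \mathbf{I} \) the terms \( \ln|\mathbf{Q}^{-1}| \) vanish; residuals and variances are illustrative):

```python
import numpy as np

def nll(r_s, r_p, var_s, var_p):
    """Negative log-likelihood, Eq. 7, with identity weight matrices."""
    phi_s = np.sum(np.asarray(r_s) ** 2)   # state-variable residuals
    phi_p = np.sum(np.asarray(r_p) ** 2)   # prior-parameter residuals
    n_s, n_p = len(r_s), len(r_p)
    return (phi_s / var_s + phi_p / var_p
            + (n_s + n_p) * np.log(2.0 * np.pi)
            + n_s * np.log(var_s) + n_p * np.log(var_p))

print(nll(r_s=[0.1, -0.2, 0.05], r_p=[0.3, -0.1], var_s=0.02, var_p=0.1))
```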

Since \( \sigma_s^2 \) and \( \sigma_p^2 \) are independent of the log10 k and log10 ϕ values (parameters) at the pilot points, minimizing (7) with respect to these parameters is equivalent to minimizing \( \Phi = \Phi_s + \mu \Phi_p \) while treating \( \mu = \sigma_s^2/\sigma_p^2 \) as an unknown. We perform this minimization using the regularization capability of PEST. In regularization mode (Doherty 1994) PEST minimizes \( \Phi_p^{\mu} = \mu \Phi_p \) subject to \( \Phi_s \le \Phi_s^l \) (in practice \( \Phi_s = \Phi_s^l \)), where \( \Phi_s^l \) is typically set by the user to a value slightly higher than the minimum value of \( \Phi_s \) obtained without regularization (i.e., upon setting μ = 0). During each optimization step the program iteratively computes a value of μ (treating it as a reciprocal Lagrange multiplier) which ensures that \( \Phi_s = \Phi_s^l \) and then minimizes \( \Phi_p^{\mu} \). We repeat the process for various \( \Phi_s^l \) until NLL attains its minimum, yielding ML estimates of μ and the pilot point values.

A first-order approximation of the covariance Σ of parameter estimates \( {\hat{\mathbf{b}}} \) is given by (Carrera and Neuman 1986)

$$ \boldsymbol{\Sigma}(\hat{\mathbf{b}}) = \left[ \frac{1}{\sigma_s^2} \mathbf{J}^T \mathbf{Q}_s \mathbf{J} + \frac{\mathbf{Q}_p}{\sigma_p^2} \right]_{\mathbf{b} = \hat{\mathbf{b}}}^{-1} $$
(8)

where J is the Jacobian matrix. If the estimate \( \hat{\mu} \) of μ is optimal (as we take it to be), then ML estimates of the nominal variances are given by \( \hat{\sigma}_s^2 = \Phi_s(\hat{\mathbf{b}})/(N_s - N_p) \) and \( \hat{\sigma}_p^2 = \hat{\sigma}_s^2/\hat{\mu} \). An alternative (not employed here) would be to specify \( \hat{\mu} \), compute \( \hat{\mathbf{b}} \) by minimizing \( \Phi = \Phi_s + \hat{\mu}\Phi_p \), obtain ML estimates of the nominal variances according to \( \hat{\sigma}_s^2 = \Phi_s(\hat{\mathbf{b}})/N_s \) and \( \hat{\sigma}_p^2 = \Phi_p(\hat{\mathbf{b}})/N_p \), recompute \( \hat{\mu} = \hat{\sigma}_s^2/\hat{\sigma}_p^2 \) and repeat the process until NLL attains its minimum (Carrera and Neuman 1986).
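A sketch of Eq. 8 with identity weight matrices, assuming for simplicity that every parameter carries a prior (in our application only 18 of the 32 pilot points do, so the prior term would then be restricted to those rows); the Jacobian is hypothetical:

```python
import numpy as np

def param_covariance(J, var_s, var_p):
    """Sigma(b_hat) = [J^T Q_s J / var_s + Q_p / var_p]^-1 with Q = I."""
    n_p = J.shape[1]
    normal = J.T @ J / var_s + np.eye(n_p) / var_p
    return np.linalg.inv(normal)

J = np.array([[1.0, 0.2], [0.3, 1.1], [0.5, 0.4]])  # hypothetical sensitivities
print(param_covariance(J, var_s=0.02, var_p=0.1))
```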

Elsewhere we have tested three approaches to the calibration of airflow models with and without prior information (Morales-Casique et al. 2008). Here we focus on the use of prior information during calibration. We calibrate log10 k and log10 ϕ at 32 pilot points against observed pressures, fixing the variogram parameters at their values from Tables 2 and 3, including the 18 measurements of log10 k at pilot points as priors in \( \Phi_p \), and incorporating the remaining 166 log10 k values in the kriging process. The kriged log10 k field is based on the three alternative variogram models E1, E and P, while the kriging of log10 ϕ is based only on E. We calibrate each model jointly against pressure data from cross-hole tests PP4 and PP5. As noted earlier, we set \( \mathbf{Q}_s = \mathbf{I} \) and \( \mathbf{Q}_p = \mathbf{I} \), where I is the identity matrix. Computed and measured pressures during each test are compared in Figs. 4 and 5. Overall, the calibrated models fit the observed data reasonably well in most intervals.

Fig. 4 Pressure buildup (kPa) versus time (days) during cross-hole test PP4. Calibrated response based on each of the selected variogram models

Fig. 5 Pressure buildup (kPa) versus time (days) during cross-hole test PP5. Calibrated response based on each of the selected variogram models

Table 5 shows the results of calibrating our models jointly against pressure data from cross-hole tests PP4 and PP5. In terms of NLL the best fit was obtained with log10 k variogram model P and the worst with model E1. Whereas AIC, AICc and BIC rank the models in this same order, KIC ranks E1 higher than E. Posterior probabilities based on AIC, AICc and BIC are similar, so we list only those corresponding to BIC and KIC. Using Occam’s window leads to a preference for P to the virtual exclusion of the remaining two models, regardless of which criterion is used. Using a variance window (α = 0.049, corresponding to a variance window of size \( 4\sigma_D \) and a significance level of 5%) also leads to a similar preference for P by BIC but a less pronounced preference by KIC. Below we use both sets of posterior probabilities obtained with KIC to test the abilities of individual models, and of MLBMA, to predict pressures observed during cross-hole tests PP6 and PP7.

Table 5 Results of joint calibration of cross-hole tests PP4 and PP5

5 Prediction of pressures during cross-hole tests PP6 and PP7

Air injection during cross-hole tests PP6 and PP7 (Illman et al. 1998) took place into different intervals, and at different rates, than in tests PP4 and PP5 (Table 1). Inverse calibration against pressure data from the latter two tests yielded ML estimates \( \hat{\mathbf{b}} \) of the parameters and a covariance matrix (8) of the corresponding estimation errors. To obtain corresponding statistics of the state variable, in this case air pressure, one must either linearize the flow equation or solve it for numerous random realizations of the parameter vector b about its ML estimate \( \hat{\mathbf{b}} \). We chose the second option and conducted Monte Carlo simulations assuming the estimation error \( (\hat{\mathbf{b}} - \mathbf{b}) \) to be multivariate Gaussian with zero mean and covariance \( \boldsymbol{\Sigma}(\hat{\mathbf{b}}) \) in the vicinity of \( \hat{\mathbf{b}} \). This allowed us to generate random realizations of b using standard methods such as Cholesky factorization \( \boldsymbol{\Sigma}(\hat{\mathbf{b}}) = \mathbf{U}^T \mathbf{U} \) followed by random draws \( \mathbf{b} = \hat{\mathbf{b}} + \mathbf{U}^T \boldsymbol{\zeta} \), where ζ is a vector of standard uncorrelated normal variables (Clifton and Neuman 1982). Following this procedure we generated 150 realizations of the parameter vector and solved the forward problem for each. In some cases the nonlinear solver failed to converge; the corresponding partial results were discarded. Our results are thus based on 119, 67 and 97 MC runs with models E1, E and P, respectively, for test PP6 and on 104, 62 and 92 runs for test PP7. In addition to predicting pressure with individual models, we generated MLBMA predictions via (3) and (4) based on the posterior model probabilities in Table 5 obtained with a variance window.
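A minimal sketch of the realization generator just described: factor \( \boldsymbol{\Sigma} = \mathbf{U}^T\mathbf{U} \) and draw \( \mathbf{b} = \hat{\mathbf{b}} + \mathbf{U}^T\boldsymbol{\zeta} \) (the estimate and covariance below are illustrative two-parameter stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
b_hat = np.array([-14.0, -13.5])                 # hypothetical ML estimates
sigma = np.array([[0.04, 0.01], [0.01, 0.09]])   # hypothetical covariance

U = np.linalg.cholesky(sigma).T     # upper-triangular factor, Sigma = U^T U
# Each row is one realization b = b_hat + U^T zeta (row form: zeta @ U)
realizations = b_hat + rng.standard_normal((150, b_hat.size)) @ U
print(realizations.mean(axis=0))    # should approach b_hat
print(np.cov(realizations.T))       # should approach sigma
```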

Figures 6 and 7 compare predicted pressures, averaged over all MC simulations, against observed pressures for cross-hole tests PP6 and PP7. Each plot includes the average predicted pressure from models E1, E and P plus the MLBMA estimate. For some data records the average predicted pressure is close to the observed data; in other cases the prediction is poor, particularly at the injection interval (Z32 for PP6 and W32 for PP7), where models E and P over-predict pressure by orders of magnitude while model E1 under-predicts it. Prediction is also poor for all models at interval X1 in test PP7 (Fig. 7), where observed pressure shows a large response to injection in interval W32; evidence of this connectivity was absent in calibration tests PP4 and PP5 and thus was not captured in the estimated parameters. We attribute these poor predictions in part to the extreme heterogeneity of the fractured tuff at the site and to our disregard of barometric pressure fluctuations during the tests.

We also predicted pressure for both tests, PP6 and PP7, based on a single model run with the best parameter estimates \( \hat{\mathbf{b}} \). Predicted pressures from a single run constitute a biased estimate of the ensemble mean pressure and provide no information about the variance of the estimate. The results are shown in Figs. 8 and 9 for tests PP6 and PP7, respectively. As before, the prediction is poor at the injection intervals (Z32 for PP6 and W32 for PP7) and at X1 for PP7, but now all models consistently under-predict pressure at those intervals. Table 6 compares both pressure estimates based on the sum of squared errors (SSE). Average predicted pressure based on MC simulations leads to one model clearly outperforming the other two by orders of magnitude, whereas results from a single run show SSEs of the same order of magnitude. Excluding intervals with poor predictions (marked B in Table 6) leaves model P the most accurate in test PP6 and model E1 in test PP7. Results for MLBMA are mixed: excluding records from Z32 in PP6 and from W32 and X1 in PP7, MLBMA ranks second in test PP6, and third (MC simulations) or first (single run) in test PP7. Thus, in terms of SSE the average of model predictions does not outperform the best individual model because the individual models in the collection do not produce very different forecasts (Winter and Nychka 2009).

Fig. 6 Pressure buildup (kPa) versus time (days) during cross-hole test PP6. Predicted results averaged from MC simulations using each of the selected variogram models and MLBMA

Fig. 7 Pressure buildup (kPa) versus time (days) during cross-hole test PP7. Predicted results averaged from MC simulations using each of the selected variogram models and MLBMA

Fig. 8 Pressure buildup (kPa) versus time (days) during cross-hole test PP6. Predicted results for each of the selected variogram models (single model run based on ML estimates \( \hat{\mathbf{b}} \)) and MLBMA

Fig. 9 Pressure buildup (kPa) versus time (days) during cross-hole test PP7. Predicted results for each of the selected variogram models (single model run based on ML estimates \( \hat{\mathbf{b}} \)) and MLBMA

Table 6 Sum of squared errors (SSE)

We evaluate the predictive capabilities of each model and of MLBMA by computing their log scores and predictive coverage. The log score is computed by (6), with \( \hat{D}_i^v \) and \( \sigma_i^2 \) now being the i-th sample mean and variance of predicted pressure based on MC realizations of the parameter estimates \( \hat{\mathbf{b}}_k \) for model \( M_k \). The predictive log score of MLBMA is (Ye et al. 2004)

$$ -\ln p(\mathbf{D}^v \mid \mathbf{D}^c) = -\ln \left\{ \sum_{k=1}^{K} p(\mathbf{D}^v \mid M_k, \hat{\mathbf{b}}_k, \mathbf{D}^c)\, p(M_k \mid \mathbf{D}^c) \right\} $$
(9)
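Because the per-model scores enter Eq. 9 through \( \exp(-s_k) \), direct evaluation underflows for scores of the magnitudes reported below; a log-sum-exp form avoids this. A minimal sketch with illustrative inputs:

```python
import numpy as np

def mlbma_log_score(scores, weights):
    """-ln sum_k exp(-s_k) * p(M_k|D^c), Eq. 9, evaluated stably."""
    scores = np.asarray(scores, dtype=float)   # per-model log scores s_k
    weights = np.asarray(weights, dtype=float) # posterior model weights
    s_min = scores.min()
    return s_min - np.log(np.sum(weights * np.exp(-(scores - s_min))))

print(mlbma_log_score(scores=[120.0, 135.0, 150.0],
                      weights=[0.5, 0.4, 0.1]))
```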

Table 7 lists the predictive log scores of each model and of MLBMA, based on the variance window approach, for both validation tests PP6 and PP7. Overall, model E1 has the lowest log score of all models and MLBMA for both validation tests, despite being ranked second by KIC and third, with 0.1% posterior probability, by BIC (Table 5). The main source of predictive error for model E1 is the injection interval (Z32) in test PP6; for test PP7 the main sources are intervals X1 (large predictive errors) and Z1 (a very small variance, \( \sigma_{Z1}^2 \sim 10^{-9} \), combined with a significant predictive error, which the log score penalizes heavily). For the remaining models and MLBMA the ranking changes with the validation test; MLBMA ranks second in test PP6 and third in test PP7. The largest log-score contributions for MLBMA and models P and E come from the injection intervals (Z32 in test PP6 and W32 in test PP7) and from X1 in test PP7, where these models and MLBMA have large prediction errors (Figs. 6, 7). Excluding poorly predicted intervals (Z32 in PP6 and W32, Z1 and X1 in PP7; results denoted Total B in Table 7), model E1 ranks first in test PP6 and last in PP7; in turn, MLBMA ranks second in PP6 and first in PP7.

Table 7 Predictive log scores for validation tests PP6 and PP7

Another measure of the predictive capability of a model is its predictive coverage, the percentage of observed data that fall within a given prediction interval around the average predicted pressure. Prediction intervals were computed assuming normally distributed errors at a 95% confidence level. Table 8 lists the results, based on 776 observed data for test PP6 and 829 for test PP7. Among individual models, model P has the best predictive coverage for test PP6, while for test PP7 it is second to model E. MLBMA has better predictive coverage than any of the three individual models for test PP7, but is second to last for test PP6. Excluding, as before, the poorly predicted intervals (Z32 in PP6 and W32, Z1 and X1 in PP7) increases the predictive coverage of the models and MLBMA but does not change the rankings.
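A minimal sketch of the coverage computation, assuming Gaussian prediction intervals around the MC mean (arrays are illustrative):

```python
import numpy as np

def predictive_coverage(pred_mean, pred_std, obs, z=1.96):
    """Percentage of observations inside the 95% prediction interval."""
    pred_mean, pred_std, obs = map(np.asarray, (pred_mean, pred_std, obs))
    inside = np.abs(obs - pred_mean) <= z * pred_std
    return 100.0 * inside.mean()

print(predictive_coverage(pred_mean=[1.0, 2.0, 0.5],
                          pred_std=[0.2, 0.3, 0.1],
                          obs=[1.1, 2.5, 0.45]))
```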

Table 8 Predictive coverage for validation tests PP6 and PP7

We conclude this section with some remarks on the predictive capabilities of the individual models. We recall that after calibration model P was ranked best, followed by E1 and E. During validation, E1 provided the predictions closest to the data for both validation tests when all data were considered (Tables 6, 7). However, when the poorly predicted intervals were excluded, P performed best in one test and second best in the other (Tables 6, 7); P also ranked first and second among individual models in terms of predictive coverage during validation (Table 8). This suggests that KIC identified P as a ‘good’ model but that factors not included in the model, such as high-permeability regions not apparent in the calibration data set but manifest in the validation data set, degraded its predictions. In a medium as highly heterogeneous as the ALRS, the dependence of hydrologic parameters on the flow pattern is especially acute.

6 Conclusions

We have shown that it is possible to employ MLBMA with complex models using both Occam’s and variance windows, illustrated how to include prior information in the process, and applied the method to airflow models in unsaturated fractured tuff. We calibrated log10 k and log10 ϕ at selected pilot points against pressures observed in two pneumatic injection tests (PP4 and PP5), including prior information about log10 k. All of the calibrated models reproduced the observed data reasonably well. Use of Occam’s window led to selecting the model with the lowest fitting error with probability close to 1 while disregarding all remaining models. A variance window of \( 4\sigma_D \) gave more evenly distributed posterior probabilities based on KIC; doing the same with AIC, AICc or BIC still led to one model being assigned a posterior probability of about 1.

The results of the calibration were validated against an independent data set obtained from two cross-hole tests (PP6 and PP7) in which injection took place into different borehole intervals than those used for calibration. The best results were obtained with a model ranked second by KIC but very low by AIC, AICc and BIC. We also evaluated the predictive capabilities of MLBMA based on tests PP6 and PP7. Predicted pressures using MLBMA were less accurate than those obtained with some individual models because the individual model with the largest posterior probability was the worst or second worst predictor in both validation cases. In terms of predictive coverage, MLBMA was far superior to any of the individual models in one validation test and second to last in the other.

We attribute these mixed results to the inability of any of our models to capture satisfactorily, with the available data, the complex nature of the ALRS fractured rock system and the pressure distribution within it.