Impact of geostatistical reconstruction approaches on model calibration for flow in highly heterogeneous aquifers

Our study is aimed at assessing the extent at which relying on differing geostatistical approaches may affect characterization of the connectivity of geomaterials (or facies) and, in turn, model calibration outputs in highly heterogeneous aquifers. We set our study within a probabilistic framework, by relying on a numerical Monte Carlo (MC) approach. The reconstruction of the spatial distribution of geomaterials and flow simulations are patterned after a field scenario corresponding to the aquifer system serving the city of Bologna (Northern Italy). Two collections of MC realizations of facies distributions, conditional on available lithological data, are generated through two alternative geostatistically-based techniques, i.e., Sequential Indicator and Transition-Probability simulation. Hydraulic conductivity values of the least- and most-conductive facies are estimated within each MC simulation in the context of a Maximum Likelihood (ML) approach by considering available piezometric data. We provide evidence that the choice of the facies reconstruction technique (1) impacts the degree of connectivity of facies whose proportions are close to the percolation threshold while (2) is not sensibly affecting the connectivity associated with facies whose proportions are much larger than the percolation threshold. By relying on the unique (lithological and hydrological) data-set at our disposal, we also explore the performance of ML-based model identification criteria to (1) discriminate amongst competitive facies reconstruction geostatistical models and (2) quantify the (posterior probabilistic) weight associated with each model. We then show that ML-based model averaging provides estimates of hydraulic heads which are slightly more in agreement with available data when compared to the best-performing realization in the T-PROGS set than considering its counterpart associated with the SISIM-based collection.


Introduction
Probabilistic reconstruction of complex aquifer systems is nowadays considered a common practice to account for uncertainty resulting from our lack of knowledge of highlyheterogeneous subsurface structures and spatial distribution of attributes of aquifer systems. In a stochastic framework, aquifer heterogeneity can be conceptualized by considering system attributes, such as hydraulic conductivity, as random functions. Widely employed geostatistical methods (e.g., Sequential Gaussian Simulation; Deutsch and Journel 1998) describe hydraulic properties as multivariate Gaussian random fields and quantify the degree of aquifer heterogeneity upon relying on a covariance (or variogram) structure that can be inferred from available data. In the presence of sharp interfaces between high and low-conductivity regions, a composite media approach, considering the system as composed by diverse lithological units (or facies), is typically adopted (Guadagnini et al. 2003;Winter et al. 2006). Categorical rather than continuous random fields can be considered in order to represent the distributions of facies. For this purpose, Sequential Indicator Simulation (SISIM; Goovaerts 1997;Deutsch and Journel 1998), which relies on indicator variograms (inferred, e.g., from borehole lithological data), is one of the most extensively applied methods (Deutsch 2006;Felletti et al. 2006;Marini et al. 2019). While being characterized by ease of implementation, this method may not honor volumetric proportions of facies, as inferred from available data, because it tends to underestimate less-prevalent facies (Emery 2004;He et al. 2009), and/or may not fully capture patterns of facies connectivity Kerrou et al. 2008), an issue which is particularly noticeable in three-dimensional systems (Dell' Arciprete et al. 2012Arciprete et al. , 2014). An alternative approach has been embedded in the Transition Probability Geostatistical Simulation approach, T-PROGS (Carle andFogg 1996, 1997). The latter makes use of available lithological/ sedimentological data to evaluate transition probabilities between facies, which are then interpreted through Markov-chain modeling techniques. This approach, as well as other Markov-chain based methods (e.g., Elfeki and Dekking 2001;Li 2007;Langousis et al. 2018), also (1) enables one to include information on spatial juxtapositional tendencies and soft conditioning data and (2) is deemed as more accurate than its covariance-based counterparts to represent volumetric proportions, mean lengths, and connectivity patterns driven by facies distributions (Weissmann et al. 1999;He et al. 2014He et al. , 2015Koch et al. 2014).
Characterization of connectivity is key to quantify the degree of flow heterogeneity and the ensuing early (or late) breakthrough of dissolved chemicals at critical targets (Cvetkovic et al. 2014;Henri et al. 2015;Zinn and Harvey 2003), with direct implications in several contexts, including, e.g., remediation of contaminated sites and environmental risk assessment. Increasing efforts have been devoted to investigate the effects of various geostatistical approaches on connectivity metrics (Bakshevskaia and Pozdniakov 2016;Dell'Arciprete et al. 2012Lee et al. 2007; Mohammadi et al. 2020;Sharifzadehlari et al. 2018;Vassena et al. 2010). Other studies addressed the impact different geostatistical methods on the accuracy of estimated facies distribution (He et al. 2009, Kessler et al. 2013Marini et al. 2019;Park et al. 2007), hydraulic head and flux fields (Lee et al. 2007;Bianchi et al. 2015), or spreading of dissolved chemicals migrating in aquifer systems (Maghrebi et al. 2015;Siirila-Woodburn and Maxwell 2015).
Here, we investigate the extent at which the degree of connectivity resulting from a selected reconstruction method may alter the outcomes of a model calibration process. In this context, we rely on suites of numerical simulations of groundwater flow in the aquifer system serving the city Bologna (Northern Italy), whose internal architecture (in terms of spatial arrangement of geomaterials, or facies) is modeled within a probabilistic Monte Carlo (MC) framework. In this context, we rely on generating collections of equally-probable realizations of spatial distributions of facies conditional on a unique and extensive dataset of lithological information. We rest on the approaches associated with SISIM and T-PROGS and generate two collections (or ensembles) of MC realizations. We then evaluate (1) the variability of connectivity within each of the two sets of facies distributions (hereafter termed intra-ensemble variability), and (2) the extent at which the generation method affects facies connectivity (i.e., the inter-ensemble variability). A groundwater flow model is then calibrated for each MC realization by making use of available piezometric data. Ensuing estimates of hydraulic conductivity, K, are obtained through a Maximum Likelihood (ML) inverse modeling approach (Carrera and Neuman 1986). Our primary goal is to assess whether intra-or inter-ensemble variability of connectivity may have an effect on estimated values of K and their uncertainty. We rely on the ML framework to evaluate the model identification criterion proposed by Kashyap (1982) to (1) rank realizations within each ensemble according to their relative skill to interpret available data and (2) compute the likelihood associated with each realization which is then used for the implementation of a multi-model approach. The latter scheme relies on viewing each of the alternative MC realizations as a candidate hydrogeological model of the investigated site.
An additional goal of our study is the identification of effects that the adopted geostatistical reconstruction method might have on (1) the uncertainty associated with a desired modeling goal and (2) on the predictive performance of the outcomes of a Maximum Likelihood Bayesian Model Averaging (MLBMA; Ye et al. 2004;Neuman 2003), a framework of analysis that has been increasingly adopted to cope with the high degree of uncertainty related to field-scale flow and transport models (see, e.g., Lu et al. 2015;Samani et al. 2018). In essence, MLBMA enables the joint use of multiple candidate models by considering a weighted average of their predictions of the target modeling goal and making use of the ML-based posterior probability of each model. While such an approach might improve our predictive capability, with a more robust assessment of prediction uncertainty with respect to the output of a single model (Ye et al. 2004), it has also been observed (Winter and Nychka 2010) that the average of an ensemble of models can actually provide predictions of higher quality than those associated with best performing single model only if the individual models in the ensemble give rise to a wide range of outcomes.

Site description
The study area lies between the Apennines margin and the alluvial plain surrounding the city of Bologna, in the Emilia Romagna Region, Northern Italy (Fig. 1a). The area is part of the Po River Basin, which is a syntectonic sedimentary wedge forming the infill of the Pliocene-Pleistocene foredeep (Ricci Lucchi 1984).
As shown schematically in Fig. 1b, the basin is structured into three main groundwater bodies: Group A (with elevations ranging between 10 m a.s.l. and 150-200 m a.s.l.; which is, in turn, composed by four sub-units, namely A1, A2, A3 and A4); Group B (from 150-200 m to 300-450 m a.s.l.); and Group C (below 450 m a.s.l.). A recent classification (Regione Emilia-Romagna 2010), in compliance with Directive 2000/60/CE, considers the exposure to anthropogenic impacts and the paleo-geographical signatures of the diverse groups and identifies (1) an upper confined aquifer, formed by A1 and A2; and (2) a lower confined unit, which includes A3, A4, B and C. These aquifers are separated by discontinuous aquitards of variable thickness and composed by diverse lithological units (see also Guadagnini et al. 2004;Short et al. 2010;Molinari et al. 2012).
The domain investigated here extends across a surface of 20 9 23 km 2 in the horizontal plane and from -450 m to 100 m a.s.l. along the vertical direction. Lithostratigraphic information are available at about 1300 boreholes, whose planar location is displayed in Fig. 2a. Borehole depths range between 10 and 600 m (see Fig. 2b). These data allowed identifying four main lithofacies within the area. These correspond to clay, gravel, silt, and sand, whose volumetric proportion (p k , with k = c, g, si, sa, for clay, gravel, silt and sand, respectively) are p c = 52.3%, p g = 28.1%, p si = 13.3%, and p sa = 6.3%. The least-(clay) and the most-(gravel) permeable facies are associated with the two largest p k values, encompassing (in total) about 80% of the aquifer. The area is characterized by an intense anthropic activity, water supplies being drawn mainly from the underlying aquifer. The intense exploitation of groundwater resources has led to the formation of a massive cone of depression superimposed to the mean natural subsurface flow, which is aligned (on average) along a South-West (i.e. from the Apennines) -North-East (i.e., towards the alluvial plain of the Po River) direction. The location of the three major well fields assisting urban water supply (i.e., Borgo Panigale, San Vitale and Tiro a Segno) is illustrated in Fig. 2c together with all pumping wells exploited for industrial/agricultural needs. The total volume of water withdrawn is estimated as 4.3 9 10 7 m 3 /year and encompasses civil (78%), industrial (18%) and agricultural (4%) uses. Figure 2c also includes 20 monitoring wells, at which hydraulic head measurements are available and are employed for groundwater flow inverse modeling (see Sect. 2.4).

Geostatistical reconstruction
The geological structure of the aquifer has been reconstructed by relying on lithostratigraphic data available at 1300 boreholes (see Fig. 2a) and by making use of two geostatistically-based methods, corresponding to (1) a Sequential Indicator simulation (SISIM; e.g., Goovaerts 1997;Remy et al. 2009) and (2) a Transition Probability simulation (T-PROGS; e.g., Carle andFogg 1996, 1997) approach. These techniques have been extensively used to reconstruct spatial distributions of a correlated categorical variable, Z ¼ 1; . . .; n facies È É (n facies being the number of facies/categories in the system). Here we summarize for each method the key steps of the generation procedure.

SISIM
1. We define Z(x) = {1, 2, 3, 4} as the discrete variable indicating the facies that can be found at location x and the indicator function, I k x ð Þ (k = 1, …, 4), such that For each facies, we compute the sample directional variogram, c k;a ðs a Þ, where s a is the separation distance along direction a (with a = x, y, z).

The variogram c
k;a ðs a Þ is interpreted via a Maximum Likelihood (ML) approach with an exponential model (see Eq. 6). 4. Three-dimensional conditional Monte Carlo realizations of the spatial distribution of the four identified facies are obtained by making use of (1) the conditioning data available at 1300 boreholes included in the investigated domain (see Fig. 2a) and (2) the directional variogram model selected at step (3) via a sequential simulation approach (see, e.g., Remy et al. 2009).

T-PROGS
1. We define the transition probability matrix, TðsÞ, with entries t jk ðsÞ ¼ Pr Zðx þ sÞ ¼ k ZðxÞ ¼ j j f g , representing the probability for facies k to be found at (x ? s) conditioned by the presence of facies j at x. 2. Sample transiograms, representing the variation of t jk ðsÞ with s, are evaluated for all possible pairs of categories ðj; kÞ on the basis of all available lithological data. 3. Sample transiograms are interpreted by a Markovchain model as TðsÞ ¼ exp Rs ð Þ, R being the transition rate matrix. Elements r jk of R are inferred from sample transiograms as r jk ¼ ot jk os s!0 . 4. A sequential procedure is applied to assign a category in each unsampled point (see, e.g., Weissmann et al. 1999). 5. Facies distribution resulting after the sequential procedure is adjusted by a simulated quenching algorithm (Carle 1997) to minimize the discrepancy between the resulting experimental transiograms and the theoretical Markov-chain model inferred from the data.
For each geostatistical reconstruction technique, we generate a collection of n = 100 Monte Carlo (MC) realizations of facies distributions. We then investigate the impact of the type of geostatistical simulation approach on the degree of facies connectivity through the set of metrics defined in the following section.

Connectivity metrics
Let X be a three-dimensional domain which is discretized by n tot (grid) cells and X k be the subset of n k cells associated with the k-th facies. Two cells of X k , identified by the coordinates of their centroid (here denoted as x A and x B , respectively) are defined as connected if there is a sequence of neighboring cells (from x A to x B ) completely included in X k . A group of connected cells within X k is denoted as a cluster. Connectivity indicators that are commonly used (e.g., Lee et al. 2007;Hovadik and Larue 2007;Renard and Allard 2013) for the characterization of three-dimensional subsurface systems include: (1) the total number of clusters composing X k , here denoted as N C;k ; (2) n tot , where C max;k is the number of cells in the largest cluster of facies k; and (3) N I;k , corresponding to the number of isolated cells of facies k. Clearly, the connectivity of facies k increases as H k increases and as N C;k and N I;k decrease. Additional insights on facies connectivity can be offered by the analysis of the connectivity function (Renard and Allard 2013; Western et al. 2001), s k;a s a ð Þ. The latter represents the probability for two cells of category/facies k separated by a distance s a along direction a to be connected, i.e., where e a is the unit vector along direction a, ð Þis the number of such pairs which also belong to the same cluster (this condition being expressed by The behavior of the connectivity function, as well as of the other indicators mentioned above, can be interpreted in the framework of percolation theory. The latter was originally formulated for uncorrelated Bernoulli random fields defined on infinite domains (e.g., Stauffer and Aharony 1992 and references therein) and considers a system composed by two phases (e.g., a solid and a void phase) in which, at any node of the grid, the field can be either 1 (void phase) or 0 (solid phase) with probability p and (1p), respectively. A realization of such a system would hence be formed by X 1 , i.e., the set of grid cells where the field is equal to 1, and its complementary set, X 0 . The basic principle of percolation theory is that there is a critical value of p, termed percolation threshold (p t ), such that the probability P for X 1 to form a unique percolating cluster is equal to 1 if p ! p t and P = 0 if p\p t . As highlighted by Stauffer and Aharony (1992), the value of p t decreases as (1) the grid dimension and (2) the number of neighbors of a grid cell increase. For a three-dimensional cubic grid with six neighbors, p t ¼ 31%.
The application of percolation theory to geological settings is not straightforward, since facies distributions (1) are defined on a finite domain and (2) are spatially correlated. As a consequence, the probability of occurrence of a unique percolating cluster of facies k would not be a step function (i.e., P = 0 if p k \p t and P = 1 if p k ! p t ), and would rather increase gradually from 0 to 1 over a range of p k values. As illustrated by Harter (2005) and Hovadik and Larue (2007), this range widens as the analyzed set-up departs from the theoretical conditions, i.e., as the number of grid cells decreases and as the facies correlation length increases. Moreover, spatial correlation enhances connectivity, resulting in a decrease of the value of p k at which percolation may occur. As highlighted by Renard and Allard (2013), percolation theory can also assist characterizing the behavior of the connectivity function on finite grids. For example one can note that (1) when p k \p t then s k;j will rapidly decrease to 0 as the separation distance s increases; and (2) when p k ! p t , s k;j will tend asymptoti- All of these aspects are investigated in Sect. 3.

Groundwater flow model and model calibration
We evaluate three-dimensional steady state groundwater flow for each MC realization of facies distribution upon relying on the numerical code MODFLOW-2005(Harbaugh 2005. On the basis of the available lithostratigraphic information, the system is discretized into n tot = 40 9 46 9 110 cells of uniform size Dx ¼ Dy ¼ 500 m (along the horizontal) and Dz ¼ 5 m (along the vertical). Figure 2d-e illustrate the simulation domain together with the adopted boundary conditions, where noflow is imposed along the south-west boundary (i.e., corresponding to the Apennines margin) and along the bottom boundary, which coincides with the base of unit B (see Fig. 1b). Values of prescribed hydraulic head, h, are set along the remaining lateral boundaries (see Fig. 2d), as inferred from available piezometric data. Recharge, R, at the top of the domain is modeled as where P is rainfall, Q is surface runoff (land-use dependent, evaluated on the basis of the classical Curve Number method), E represents evapotranspiration (evaluated according to a modified Hargreaves method based on temperature data; Hargreaves and Allen 2003), and L quantifies water-pipe losses in the urban areas. The spatial distribution of R employed in the model is depicted in Fig. 2f. A preliminary sensitivity analysis (details not shown) highlights that spatial distribution of hydraulic head is not significantly affected by conductivity values associated with silt, K si , and sand, K sa (the two categories with the smallest volumetric fraction, see Sect. 2.1). Therefore, we set K si = 10 -6 m/s and K sa = 10 -5 m/s (corresponding to intermediate values of conductivity for these geomaterials) and calibrate hydraulic conductivities of clay, K c , and gravel, K g , for each MC realization of geomaterial distribution on the basis of hydraulic head data, h Ã , available at N h ¼ 20 monitoring wells (see Sect. 2.1 and Fig. 2c). We view the diverse MC realizations as a set of alternative/competing hydrogeological models of the investigated site. Model selection (or quality, information, discrimination) criteria have been proposed and compared to rank alternative models (e.g., Ye et al. 2008;Lu et al. 2011;Riva et al. 2011 and references therein). Among these, we focus on the KIC model identification criterion (Kashyap 1982). Upon dropping constant terms, the latter can be evaluated as where N P is the number of model parameters (here, N P-= 2) and J M i is the sum of squared residuals for realization where KIC M min is the minimum value of KIC M i across all n models, and p M i ð Þ is the prior probability of M i . Here, we set p M i ð Þ ¼ 1=n, as all realizations of set M are equally likely. Equation (4) allows evaluating the posterior estimate of hydraulic heads at observation wells as In a similar way, it is also possible to evaluate the posterior probability p S i ; T i jh Ã ð Þ of model i upon jointly considering both ensembles (i.e., T-PROGS and SISIM) while setting KIC M min in Eq. (4) as the minimum between KIC S min and KIC T min and replacing n with 2n.

Results and discussion
A preliminary analysis of the generated fields has been performed by evaluating the mean value, l M k , and standard deviation, r M k , (with k = c, g, si, sa) of facies volumetric proportions within each ensemble M (with M = S, T). The results of this analysis are listed in Table 1. Both ensembles render values of l M k close to those inferred from the conditioning data, i.e., p k . While T-PROGS realizations provide l T k values that almost coincide with p k , SISIM results slightly overestimate clay (percentage error * 3%) and underestimate gravel (percentage error * 5%) proportions. The SISIM ensemble is also characterized by a considerably larger variability (among the realizations, i.e., the intra-ensemble variability) of the facies volumetric proportions as compared to the T-PROGS counterpart, r T k being two orders of magnitude smaller than r S k . Figure 3a, b depict facies distributions across a vertical cross-section in a sample realization generated with SISIM and T-PROGS, respectively. A qualitative comparison of these results reveals that the SISIM simulation is characterized by more fragmented lithological units than what can be observed within its T-PROGS counterpart. We assess quantitatively this aspect by computing the connectivity metrics introduced in Sect. 2.3 for both ensembles. We do so by focusing only on clay and gravel because (1) these two categories constitute about 80% of the aquifer system and (2) the connectivity of facies characterized by the largest/smallest conductivity values has been shown to play a major role in driving field-scale flow and transport processes (Wen and Gomez-Hernandez 1998 and references therein). Figure 4a collects boxplots of N M C;k values and highlights that both facies are considerably more fragmented (i.e., formed by a larger number of clusters) across the SISIM than the T-PROGS realizations. This feature is particularly evident for gravel, values of N S C;g being almost one order of magnitude larger than those of N T C;g . Boxplots of H M k (rescaled by l M k ) are collected in Fig. 4b to provide a depiction of the relative size of the largest cluster with respect to X k . Values of   l S g ranges between 0.53 and 0.75). Moreover, Fig. 4c evidences that gravel can be found in a considerably larger number of isolated cells in the SISIM set, as compared to the T-PROGS ensemble. The SISIM set is characterized by a variability of all connectivity metrics that is slightly larger than that associated with the T-PROGS collection. This result is related to the different degree of variability of facies proportions across the realizations of the two ensembles, and resulting in r S k ) r T k (see Table 1). In summary, all indicators inferred from the analysis of clusters document quantitatively that facies distributions generated upon relying on SISIM are characterized by a lower degree of connectivity than their T-PROGS counterparts. Values of H M k l M k for both ensembles indicate that clay is characterized by a higher degree of connectivity then gravel. These results are consistent with elements from percolation theory, as: (1) a standard variography analysis (see ''Appendix'') suggests that clay and gravel exhibit horizontal and vertical spatial correlation lengths which are larger for the T-PROGS than for the SISIM set; (2) clay has a high probability to form a unique percolating cluster, its volumetric fraction (p c = 52.3%) being significantly larger than the (theoretical) percolation threshold (p t = 31%, see Sect. 2.3). The spatial correlation of gravel contributes to yield a volumetric proportion (p g = 28.1% \ p t ) which is large enough for a percolating cluster to occur. However, consistent with the observation that p g is close to (albeit smaller than) p t , one can note a high variability of the connectivity of gravel from one realizations to the other (i.e., intra-ensemble variability). This issue has been further investigated by evaluating the connectivity function defined in Eq. (1). Figure 5 depicts s M k;a versus the (dimensionless) lag evaluated, within each realization (dotted curves), for clay and gravel along direction a = x, y, z and for both generated sets (i.e.,. M = S, T). Each plot in the figure also includes (1) the ensemble-averaged connectivity function (solid curves) and (2) the asymptotic sills (dashed horizontal lines) computed as s 1 k ¼ C max;k . n k 2 ( ) , according to percolation theory. As expected, the connectivity function of clay is larger than the one of gravel regardless the generation approach. Considering clay, values attained by s S c;a and s T c;a along all directions are quite similar. The curves slightly deviate from the asymptotic value, s 1 c , for large separation distances, an effect which is arguably related to the finite extent of the domain. On the other hand, values of s T g;a (Fig. 5b-d-f) for gravel are considerably larger than their s S g;a counterparts ( Fig. 5a-c-e). This observation is consistent with the results depicted in Fig. 4 and provides additional evidence that gravel is less connected (more fragmented) in SISIM than in T-PROGS realizations. Figure 5 also suggests that gravel connectivity is characterized by a remarkably wider variability than what can be observed for clay for both generation methods, resulting also in a more pronounced deviation of the results from s 1 g . We quantitatively assess this feature by evaluating the integral connectivity scale (Western et al. 2001), defined as Þds a , for each realization of the two sets. The variability of C s k;a within each ensemble is quantified by its coefficient of variation, CV C s k;a . For both ensembles and both facies analyzed, we find that CV C s k;a is (almost) independent of direction (i.e., isotropic) and equal to 0.4% (SISIM) or 0.7% (T-PROGS) for clay and 10.5% (SISIM) or 4.5% (T-PROGS) for gravel.  Figure 6a suggests that the two ensembles are characterized by almost identical mean values of clay conductivity estimates (* 4:6 Â 10 À8 m/s) even as SISIM estimates are generally characterized by larger estimation errors (i.e., wider 95% CIs) than their T-PROGS counterparts. Considering gravel (Fig. 6b),K S g;i is (generally) larger thanK T g;i , mean values z (e-f) axes, evaluated for clay and gravel. Ensemble-averaged connectivity functions (solid curves) and (mean) asymptotic values predicted by percolation theory (red dashed lines) are also reported being equal to 1:3 Â 10 À3 m/s and 6:4 Â 10 À4 m/s, respectively. Results embedded in Fig. 6 are strictly related to the degree of facies connectivity observed for the two ensembles (as clarified by Figs. 4 and 5). Clay is characterized by similar connectivity as well as by similar values of hydraulic conductivity estimates in the two ensembles. Gravel exhibits a considerably smaller degree of connectivity in SISIM than in T-PROGS realizations. Consistently, conductivity estimates,K S g;i , are significantly larger thanK T g;i . Conductivity estimates exhibit a larger intraensemble variability in the SISIM than they do in the T-PROGS set. This aspect is particularly evident for clay, the coefficients of variation CVK S c À Á and CVK T c À Á being equal to 80% and 54%, respectively. The intra-ensemble variability of gravel estimates is smaller, values of CV for both ensembles being about 50%. We observe that the intraensemble variability of conductivity is not related to the intra-ensemble variability of connectivity: for both sets, we This latter result is further supported by inspection of Fig. 7, where estimatesK M k;i are plotted versus the integral connectivity scale evaluated along the mean flow direction, C s k;y . Results of similar quality (not shown) have been obtained by plottingK M k;i versus C s k;x and C s k;z . The effect of variability of connectivity between the two ensembles (inter-ensemble variability) can be inferred by comparing the pattern of the points associated with the SISIM and T-PROGS realizations in Fig. 7: when considering clay (Fig. 7a), the results of the analysis yield two clouds of points which are practically overlapped, whereas the results for gravel (Fig. 7b) related to SISIM are characterized by values of C s g;y andK M g D E which are respectively lower and higher than the corresponding T-PROGS results. When analyzing intra-ensemble effects, one can see that the results of Fig. 7 do not provide a clear indication of correlation between connectivity and hydraulic conductivity within each ensemble.  favor the best T-PROGS realization over its SISIM counterpart. Figure 8 illustrates the posterior probability, p M i jh Ã ð Þ, evaluated for both ensembles according to Eq. 4. It can be observed that the range of variability of p M i jh Ã ð Þ is narrower for the SISIM than for the T-PROGS set, suggesting that the performance of diverse realizations associated with the former approach is more uniform than in the latter case. The best model in the T-PROGS set is associated with a larger posterior probability (34%) than its SISIM counterpart (28%). Figure 9a 5)) obtained by analyzing jointly the two ensembles (Fig. 9c) are very similar to those obtained upon relying solely on the T-PROGS set. Figure 10 allows comparing (in terms of J) the predictive skill of h POST against results of single SISIM (Fig. 10a) and T-PROGS (Fig. 10b) realizations. Posterior model averaging provides slightly more/less accurate predictions than the best-performing single realization in the T-PROGS/SISIM set. This result is related to the observation that, as inferred from Fig. 8, the performance of diverse models in the SISIM ensemble is more uniform that what can be observed with reference to the T-PROGS set. As such, our results provide further documentation that when individual models provide results which are very similar, the performance of model-averaged results is worse than the one of the most skillful single model, whose effectiveness is essentially dampened in the averaging process (see also Winter and Nychka, 2010). The predictive capability in terms of flow-model outputs associated with the two aquifer reconstruction approaches is also tested through a validation procedure, as described in the following. Hydraulic head measurements at three  monitoring wells have been taken out (one at a time) from the calibration dataset to compare the local estimate, h MOD , resulting from the numerical model with the value (h Ã ) observed at these locations. The selection of the monitoring wells to be used for validation is aimed at encompassing different regions of the aquifer. In this sense, we consider the data associated with locations which are close to the major well fields (see Fig. 2c) and corresponding to the shallowest (H1, with z = -21.1 m) and the deepest (H2, with z = -191.7 m) locations and one, farther from the well fields, at an intermediate depth (H3, with z = -104 m). We repeat the process of model calibration considering the resulting data-sets on a subset of 30 realizations from each ensemble. Figure 11 depicts, in the form of box plots, the results obtained at these three selected locations across all realizations of the selected SISIM and T-PROGS subsets. It can be observed that T-PROGS clearly outperforms SISIM for the prediction of head at the H1 monitoring well and slightly outperforms SISIM in the

Conclusions
We explore the extent at which the geostatistical reconstruction method may affect the calibration of hydraulic parameters and the predictive capability of a groundwater flow model. For this purpose, we perform flow simulations on a domain patterned after the Bologna aquifer system by relying on two Monte Carlo ensembles of equally-likely realizations of facies distributions, respectively obtained on the basis of (1) indicator variograms (SISIM) and (2) transition probabilities (T-PROGS) inferred from lithological data. The latter reveal a high degree of heterogeneity in the aquifer, in which the least-(clay) and the most-(gravel) conductive facies represent almost 80% of the total aquifer. We take advantage of a Maximum Likelihood, ML, framework and of model discrimination criteria to (1) calibrate clay and gravel hydraulic conductivities in each realization of the two ensembles, on the basis of available piezometric data; (2) compute the posterior probability associated to each model and (3) evaluate the performance of ML-based model averaging approaches. Our results can be summarized as follows: 1. Clay, whose degree of connectivity is similar between the two ensembles, is also characterized by similar results in terms of calibrated conductivities. 2. Gravel exhibits an appreciably lower degree of connectivity in SISIM realizations compared to T-PROGS ones. As a consequence, hydraulic conductivity estimates of gravel obtained in the SISIM realizations are generally larger than their T-PROGS counterparts.
3. The degree of variability exhibited by connectivity indicators among realizations of the same ensemble seems not to be directly related with the variability of conductivity estimates. 4. The variability of the performance (in terms of sum of squared residuals) of alternative models in the SISIM set is smaller than in the T-PROGS set. This results in smaller variability of the posterior probability of models belonging to the SISIM ensemble respect to T-PROGS one. 5. The best individual model (as identified by model discrimination criteria) as well as the ML-based average model in the T-PROGS set are more skillful (in reproducing hydraulic head data) than their counterparts obtained for the SISIM set.
The findings of this study form the basis upon which one can obtained enhanced understanding of complex flow and transport dynamics. These elements are currently under study and will be the subject of future works.

Appendix
Correlation lengths of clay and gravel have been evaluated in SISIM and T-PROGS sets by means of a three-dimensional variographic analysis. Figure 12 collects ensemble directional indicator variograms, c k;a (with a = x, y, z), of clay and gravel computed over all MC realizations in each set along the x (c k;x , Fig. 12a), y (c k;y , Fig. 12b), and z (c k;z , Fig. 12c) axes. All indicator variograms can be interpreted by exponential models, defined as c k;a ðs a Þ ¼ r 2 where r 2 k is the sill of facies k, which is very close to l M k (1l M k ), being l M k the mean proportion of facies k in ensemble  Table 1); s a and r a respectively are lag and range along direction a. Results in Fig. 12 and in Table 3 highlight that clay and gravel exhibit horizontal and vertical correlation lengths which are larger for the T-PROGS than for the SISIM set. Fig. 12 Ensemble indicator variograms, c k,a , of clay and gravel versus (dimensionless) separation distance along a x, b y, and c z axes computed over all MC realizations in each set. Interpreted models evaluated according to a ML-fit of Eq. 6 are also depicted (dashed lines)