The influence of layer and voxel geological modelling strategy on groundwater modelling results

Reliable groundwater model predictions are dependent on representative models of the geological environment, which can be modelled using several different techniques. In order to inform the choice of the geological modelling technique, the differences between a layer modelling approach and a voxel modelling approach were analyzed. The layer model consists of stratigraphically ordered surfaces, while the voxel model consists of a structured mesh of volumetric pixels. Groundwater models based on the two models were developed to investigate their impact on groundwater model predictions. The study was conducted in the relatively data-dense area Egebjerg, Denmark, where both a layer model and a voxel model have been developed based on the same data and geological conceptualization. The characteristics of the two methodologies for developing the geological models were shown to have a direct impact on the resulting models. The differences between the layer and the voxel models were, however, shown to be diverse and not related to larger conceptual elements, with few exceptions. The analysis showed that the geological modelling approaches had an influence on preferred parameter values and thereby groundwater model predictions of hydraulic head, groundwater budget terms and particle tracking results. A significance test taking into account the predictive distributions showed, that for many predictions, the differences between the models were significant. The results suggest that the geological modelling strategy has an influence on groundwater model predictions even if based on the same geological conceptualization.


Introduction
Realistic groundwater models require a good understanding of the hydrogeological conditions of the subsurface. The conceptual hydrogeological model summarizes the current knowledge about a specific groundwater system describing the dominating processes and the overall physical structure of the system (Gupta et al. 2012), including geological architecture of the subsurface. The conceptual model is often a two-dimensional (2D) sketch based on field data, and expert geological knowledge. In a numerical groundwater model, this conceptual understanding has to be translated into a hydrogeological model, describing the spatial distribution of hydrogeological parameters that controls the storage and movement of groundwater. Integration of geological knowledge is necessary because geological data are often sparse and the subsurface therefore not sufficiently sampled to generate a geologically consistent model only based on the available data (Jessell et al. 2014).
Hydrogeological models can be divided into two main groups. The first group is the 'cognitive' (knowledgedriven) models, where a geologist builds the model manually in a visualization software by integrating all available information. In this work, knowledge regarding both the geophysical methods and geological processes are utilized during interpretation. The modelling approach results in a single model that represents the geologist's best estimation of the geological setting (e.g., Royse 2010;Wycisk et al. 2009). The two most common approaches for manual, cognitive modelling are the layer modelling approach (e.g., Moya et al. 2014;Lekula et al. 2018;Herzog et al. 2003) and the voxel modelling approach (e.g., Jørgensen et al. 2015;Prinds et al. 2020;Stafleu et al. 2011). Layer models consist of layers between interpreted surfaces stacked in stratigraphic order, while voxel models consist of a structured mesh of volumetric pixels (voxels) that are assigned specific lithological units (Turner 2006).
The second group of hydrogeological models is the (data-driven) geostatistical methods, where a number of equally plausible realizations of the geological setting is calculated based on a given set of data, conditions and assumptions. Recently, the use of geostatistics for hydrogeological modelling of areas with geophysical data has been a focus of research using several different geostatistical approaches (e.g., Vilhelmsen et al. 2019; Barfod et al. 2018;Madsen et al. 2021;Gunnink and Siemon 2015). Geostatistical modelling has proven to be relevant in relation to numerical groundwater modelling and research (He et al. 2013;Refsgaard et al. 2012;Feyen and Caers 2006). Most of the geostatistical modelling approaches result in a set of realizations that are defined within voxel grids. This project focuses on the models produced in the knowledge-driven cognitive approach; however, many of the considerations regarding handling of the voxel grid in the groundwater modelling workflow is also relevant and applicable when geostatistical voxel models are used in groundwater modelling.
The voxel model generally allows a higher detail in terms of number of units and shape of geological structures than the layer model. The layer model is here restricted by an initial choice of number and order of continuous layers. The number of layers is constrained due to difficulties in handling a high number of surfaces when interpreting the geology. The voxel model is, on the other hand, restricted by an initial choice of grid cell size and therefore cannot describe the precise location of layer boundaries. This affects the ability to describe geological units with small thicknesses and units, which are pinching out. In this regard, the layer model allows a higher precision in the location of a layer boundary. As the voxel model allows for higher detail in geological variability, the workload associated with a voxel model is significantly higher than the workload associated with the generation of a layer model.
The voxel modelling approach has typically been chosen in studies where highly detailed models are needed (e.g., Høyer et al. 2019) and/or in areas with high geological complexity that are difficult to model satisfactorily with the layer modelling approach . It is assumed that because the heterogeneity of the voxel model is not limited by an initial choice of number of layers, it is more appropriate in areas of complex geology, whereas the layer modelling approach is only appropriate in areas of layered sedimentary strata.
The aim of this study is to investigate the impact of layer and voxel geological modelling techniques on groundwater model predictions. The aim is not to investigate how the resulting geological models translate into a numerical grid of the groundwater model. As a field site, the Egebjerg area, Denmark, is selected. This area is characterized by an abundance of geological and especially geophysical data in a typical glacial geological setting dominated by Quaternary deposits on top of Miocene, Paleogene and Cretaceous deposits. The geological structure is rather complex with several buried valleys, cross cutting each other, glacial tectonic disturbances and a deep tectonic fault zone in the catchment. The landscape is characterized partly by some of the steepest slopes in Denmark, partly by end moraines and dead-ice reliefs. Both a layer model (Andersen and Sandersen 2020) and a voxel model (Jørgensen et al. 2010) have been developed for the area. Groundwater models are developed based on the two different geological models. Results from the groundwater modelling of hydraulic head, flow budget, recharge area for a well field and particle tracking ages are compared for the two geological models. To the authors' knowledge, a comparison study of the impact of the geological modelling approach on groundwater flow predictions has not been presented before.
The remainder of this paper is organized as follows. First, the area in which this study has been performed is presented, then the methodology by which the geological models have been developed and how they are compared is described. The results of the direct comparison of the geological models and the groundwater models of the different geological models are presented. Finally, the results are discussed, as well as the limitations and implications of the study.

Study area and data
The 127-km 2 Egebjerg area is situated in the western part of Denmark (Fig. 1d). The catchment is delineated by the topography, ranging from 0 m above sea level (asl) in the south to 170 m asl in the north (Fig. 1a). The catchment drains towards Lake Nørrestrand and Horsens Fjord in the south. Approximately 2.7 Mm 3 /year of groundwater is abstracted for drinking water supply from Quaternary meltwater sand. Højballegård well field, situated in the middle of the catchment (Fig. 1a) is responsible for about 90% of groundwater abstraction.

Conceptual model
A conceptual sketch of the geology in Egebjerg is shown in Fig. 2. The southern part of the study area is influenced by a deep graben structure (Fig. 1b), which is particularly visible on the surface of the Danian limestone (Top Chalk Group; Fig. 2). According to Lykke-Andersen (1979), the graben structure is related to either salt movement or tectonic movement along faults in the Ringkøbing-Fyn High during the Cenozoic. Low-permeability Paleogene clays and marls that show great thickness, except where tunnel valleys have been eroded into the deposits, overlay the limestone. Miocene deposits are present as erosional remnants in parts of the study area and consist of Miocene sand, silt and clay (Jørgensen et al. 2010

Data used for hydrogeological modelling
Geologic information in the area includes borehole information and geophysics (Fig. 1c). The borehole information originates from a data extraction from November 2009 in the Danish borehole database, Jupiter (Møller et al. 2009). The database contains information from a total of 794 boreholes within the study area. The information regards both the drillings (e.g., drilling method, drilling depth, purpose and age) and the samples (e.g., lithology, stratigraphy and geochemistry).  (Møller et al. 2009). Both AEM and PACES data deliver resistivity information to be used for lithological interpretation. The AEM information is available as smooth and few-layer inversions, whereas the PACES data were inverted as three-layer models.

Methodology
This section first presents the geological modelling methodology. Next, the model setup and the simulation method for the groundwater model are presented.

Hydrogeological models
The layer model and the voxel model are based on the same conceptual model (section 'Conceptual model') and data (section 'Data used for hydrogeological modelling'). They have been constructed using the geological modelling software GeoScene3D (I-GIS 2021), where data are visualized and interpreted. Borehole data are color-coded according to lithology. The seismic line has been depth converted and is shown as a bitmap and the AEM and PACES resistivity information is shown both as onedimensional (1D) soundings and three-dimensional (3D) resistivity grids of the inversion results. Since borehole information at depth is sparse, the interpretation of the deeper parts of the models are mainly based on geophysical data. The PACES data provide the best resolution in the topmost 30 m, while the AEM data are used for interpretation of the layers below.

Layer modelling
The layer model has been constructed following the national guideline (Sandersen et al. 2018). The layer model is constructed as a hydrostratigraphic model, focusing on layers relevant for groundwater modelling, such that hydraulically connected geological units with comparable hydraulic properties are interpreted as the same units. The model contains 14 surfaces from the surface of the Top Chalk Group to the land surface with alternating layers of aquitards and aquifers, respectively. The surfaces Top Chalk Group and Top Paleogene are based on the corresponding units in the voxel model (Jørgensen et al. 2010) and have only been subject to minor modifications.
The data are interpreted along an approximately 1 × 1 km grid of stationary cross sections that are carefully placed in order to show data optimally and such that they cross important geological features. Moreover, moveable cross-sections are used to interpret borehole data between the stationary cross sections. Interpretation points placed at layer boundaries with assigned uncertainties are placed along the profiles. Where possible, the interpretation points are snapped to layer boundaries in the boreholes or resistivity models. The distance between interpretation points is a function of data density. In areas with no data, there is a typical maximum distance of 1 km between the points. The interpretation points are interpolated using kriging with a grid discretization of 100 × 100 m and later adjusted in GeoScene3D such that no layer overlaps occur. Layers are defined as the volume between the upper and the lower bounding interpolated surfaces. Discontinuous units are assigned a thickness of 0.1 m in areas where they are assessed not to be present.

Voxel modelling
The voxel model (Jørgensen et al. 2010) was a manual cognitive voxel model developed in the modelling software GeoScene3D ). The voxel model was constructed as a 3D stratigraphic voxel model, where the geological history of the model area was in focus. Thus, in the stratigraphic model, geological elements, structures, and stratigraphic boundaries were modelled and subdivided according to their origin. The stratigraphic model covers 72 units, where, for instance, meltwater sands in the different buried valleys have their own individual categories. An effort was made to model the infill of the buried valleys individually based on the geological understanding of the tunnel valley generations in the area as mapped by Jørgensen and Sandersen (2009).
Layer boundaries found during layer modelling were interpreted and used to delimit the main units in the model, while voxel modelling tools  were applied to fill the geology between the boundaries. The model covers the interval from 250 m below sea level to the land surface and is discretized in voxels with dimensions of 100 m laterally and 5 m vertically. Attributes such as lithology, stratigraphy and uncertainty were assigned to each voxel in the voxel grid. The voxel modelling tools for selecting the voxels include, e.g., digitizing of polygons on cross sections and in horizontal view and tools to select voxels based on the values in the corresponding resistivity grids .

Translation of geological model structures
When constructing the groundwater model, the grid and the subdivision of geological units are modified. An illustration of the translation of the models is seen in Fig. 3 showing a random profile through the models. The top row represents the original voxel and layer model, whereas the second row represents the modified versions of the two models.
The number of geological units is reduced to six from 14 and 72 geological units in the layer model and voxel model, respectively-Table S1 of the electronic supplementary material (ESM). The justification for reducing the number of units to six is that this is the largest number of common units that will allow for a direct comparison of the lithological units; hence, the parameterization of the modified models is the same. In the layer model, five Quaternary sand units, four Quaternary clay units and two Miocene clay units are respectively grouped together, while the remaining units are described by separate categories. Using a comparable approach, the voxel units are grouped according to these six categories. The Quaternary sand unit is a combination of 61 voxel units and the Quaternary clay is a combination of six voxel units, while the remaining voxel units are described by dedicated categories.
In models LL (layer model-layer grid) and VV (voxel model-voxel grid), the grids of the original models have been preserved in the modified models. To ensure that the differences between the model responses are not an artifact of the model discretization, model LV (layer model-voxel grid) has been developed. Here, the layer model is translated The colors represent different geological units. From the original models to the modified models, grey units and green units are lumped together. Three groundwater models are developed based on the modified geological structures into the grid of the voxel model; hence, the numerical grid in layer model LV and voxel model VV is the same. In this translation, layers with a thickness of less than a voxel grid cell will not be represented (compare model LL to model LV in Fig. 3). Finally, the extent of all models is adjusted to the area of the groundwater model, both vertically and horizontally (section 'Model setup').

Model setup
To evaluate the impact of the chosen geological model structure on groundwater model predictions, a groundwater model is developed for each model structure visualized in the second row of Fig Parameterization The geological models are imported into MODFLOW using the Upstream Weighting package. Each geological unit is parameterized by a horizontal hydraulic conductivity, a vertical anisotropy and a porosity. The vertical anisotropy is set to 3, while the porosity is set to 0.3 for all hydrogeological units.

Model discretization
The horizontal discretization is specified to 100 m × 100 m based on the grid of the geological models. Depending on the applied model structure (Fig. 3) the number of grid layers in the groundwater model is either 83 (LV and VV) or 17 (LL) from 165 to -250 m asl. In the voxel grid (LV and VV), a grid layer may be assigned many different geological units, while in the layer grid (LL), a grid layer is only assigned a single geological unit. The vertical extent is based on the extent of the voxel model. In the voxel grid (LV and VV), the vertical discretization is 5 m. The thickness of the topmost active numerical layer is set to a minimum of 5 m and the top is defined at the land surface. In the layer grid (LL), the vertical discretization is based on the vertical extent of the hydrostratigraphic units of the layer model. Each geological layer represents a single numerical layer in the model except for Paleogene clay, which is subdivided into three numerical layers, and limestone, which is subdivided into two layers, resulting in 17 layers in total. These two geological units are subdivided as they have a relatively large thickness compared to the other units, and initial runs showed that the convergence rate increased by subdividing these units.

Boundary conditions
The extent of the model is based on the configurations of hydraulic head and the distribution of aquifers in the geological voxel model. The eastern and western boundaries are placed perpendicular to isopotential lines and are therefore assumed to represent no-flow boundaries and the same assumption is made for the groundwater divide in the northern area. In the southern part of the model area, at the coast and in Lake Nørrestrand, a head dependent flux boundary is specified through the general head boundary (GHB). The boundary is specified in the top-most active layer with a head specified at 0 m asl. The total model area is 127 km 2 .
The Recharge package (RCH) is used to simulate groundwater recharge to the topmost active cell. The specified recharge to the saturated zone is retrieved from the Danish National Water Resources Model (DK-model) (Henriksen et al. 2003) for the period 2000-2004. The grid size is downscaled from the original 500 m × 500 m to 100 m × 100 m through bilinear resampling to match the discretization of the model.
The Well package (WEL) is used to simulate groundwater abstraction from 50 wells. For the period 2000-2004 the average annual pumping rate was 7,322 m 3 /day (2.7 Mm 3 / year), where 90% is pumped from the Højballegård well field (Fig. 1a).
The Drain package (DRN) is applied to simulate inflow to both streams (DRN-riv, Fig. 4) and subsurface tile drains and smaller ditches (DRN-drn, Fig. 4). The drain cells representing subsurface drains and smaller ditches are specified in all of the topmost active cells. Agricultural land is the dominant land use of the catchment and most fields have subsurface drains installed. The elevation at which the drain becomes active is specified to 1 m below elevation.

Observations
The observation data consist of hydraulic head and stream discharge. Hydraulic head observations are obtained from the Jupiter database (Møller et al. 2009). Observations within 500 m of the edge of the catchment outline or a well field have been excluded, as the model may not give realistic simulations at these locations. Additionally, a few observations of poor quality have been excluded, resulting in 109 observation wells (Fig. 4). Most of the observation wells are screened in the Quaternary units.
A single station located in the southwestern corner of the catchment (Fig. 4) measures stream discharge. The daily average discharge for the period 2000-2004 is 0.68 m 3 /s (21.5 Mm 3 /year).

Simulations
Current practices for groundwater modelling are often based on a single, best-fitting model with a single set of parameters (Barnett et al. 2012;Henriksen et al. 2017). The best-fitting model may be found by auto-calibration in a least-squares sense, e.g., by using the parameter estimation software PEST (Doherty 2015). The optimized parameter values may be accompanied by an estimate of the influence of the parameter uncertainty by assuming that the model behaves linearly for parameter values around the calibrated values and that the uncertainty can be approximated by either normal or lognormal distributions. However, the calibrated set of parameter values is commonly only one of many sets of parameter values that obtain similar performances but could give rise to different predictions.
For the insights gained in this study to be beneficial to current practices, a number of best-fitting models are compared. To ensure a comprehensive comparison of the model structures, an ensemble of best-fitting realizations for each model will be considered rather than a single, arbitrary, bestfitting model for each model structure.
To obtain an ensemble of best-fitting models, a generalized likelihood uncertainty estimation (GLUE) approach is applied (Beven and Binley 1992). GLUE is a Monte Carlo simulation technique that seeks to identify a number of behavioral models. In GLUE, it is assumed that many parameter sets will provide equally good representations of the observed response. Parameter sets are sampled from a prior range of possible values and the model responses are compared to observations. Based on a subjective threshold, behavioral models are separated from nonbehavioral models.
Parameters The values defining the prior parameter distributions are presented in Table 1. Uniform distributions are assumed, described by maximum and minimum values both defined from experience and literature values. These distributions represent the prior understanding of the plausible values of the parameters. For parameters ranging over several orders of magnitude, sampling was performed from log-uniform distributions.

Sampling
The parameter values for the realizations are sampled from the prior parameter distributions using the Latin hypercube approach implemented using LHS class from the open source Python framework pyDOE (Baudin 2013). The Latin hypercube designs obtained from pyDOE are transformed to uniform and log-uniform distributions using the values in Table 1 by applying the classes uniform and log-uniform, respectively, from the open source Python   (Jones et al. 2001). All parameters are sampled and their distributions transformed before the start of the groundwater model simulations.
Parameter values are sampled from the prior parameter space until the predictive distribution of the retained realizations has stabilized for the predictions of interest. An analysis of the stabilization of the predictions of interest (Fig. S1 of the ESM) shows that predictions stabilized in less than 500 retained realizations for all models. Therefore, from the initial parameter samples, the first 500 behavioral realizations are retained.

Thresholds
The threshold values between behavioral and nonbehavioral models are set at a predefined value based on the Danish groundwater modelling guidelines (Henriksen et al. 2017). A threshold on the mean error of hydraulic head, root mean square (RMS) error of hydraulic head, and river observation error-criteria 1, 4 and 6, respectively in Henriksen et al. (2017)-is applied. Table 2 presents the suggested values in the groundwater modelling guideline based on two ambition levels: detailed modelling and screening modelling. Detailed modelling is used in situations where new initiatives of great social significance are to be implemented as well as in planning situations. Screening modelling is less ambitious, used in situations where only rough assessments are needed. Based on initial model runs, it was found that the RMS criteria for detailed modelling would be difficult to achieve. The main reason for this is probably a relatively high degree of heterogeneity, both on the geological model structure with buried valleys, fault zones, and end moraines, and on the hydraulic properties within layers or units that could be expected to show a high degree of variability. Therefore, only the detailed modelling threshold on mean error and river observation error is applied. For the root mean square error, the model screening threshold value is used.

Particle tracking
Based on the retained, behavioral realizations for each model, particle tracking is performed using MODPATH 6 (Pollock 2012). In each topmost active cell, 20 particles are tracked forward to the discharge points. The number of particles is based on initial runs showing that the recharge area to the Højballegård well field stabilizes around 20 particles per cell (Fig. S2 of the ESM). Only particles that obtain a travel time of less than 200 years are considered. A threshold of 200 years is chosen since this value is typically used when determining groundwater recharge areas.
The differences in the particle tracking predictions from model LL and those of models LV and VV may be attributed to the numerical grid (section 'Translation of geological model structures'). The layer grid in LL consists of vertically distorted layers that vary in thickness in space. The grid therefore does not align with the orthogonal coordinate system assumed in MODPATH, which may lead to inaccuracies (Zheng 1994). This is opposed to the grid of models LV and VV that only feature cells of the same thickness; therefore, the particle tracking predictions from models LV and VV will be compared, but the predictions from model LL are also presented.

Statistical significance test
To test whether the predictions of the different models are significantly different, the Kolmogorov-Smirnov statistical test is applied. The null hypothesis is that the two sets of predictions from the two models are samples drawn from the same continuous distribution. For a p value higher than 5%, it is assumed that the null hypothesis cannot be rejected, and thus the two prediction samples are considered to be drawn from the same distribution. The p value is calculated using the stats.ks_2samp class from the open source Python framework SciPy. The results of the analysis will be used as guidance when interpreting the results of the groundwater models.

Hydrogeological models
In this section, the layer and voxel models are compared independently of the groundwater model. The comparison of the geological model structures is based on vertical profiles, the volumes of the lithological units and a cell-to-cell comparison of the lithological units. As the Limestone and Paleogene units have mostly been based on the same interpretation points, the differences between the two models are mainly limited to the Miocene and Quaternary units. Three general differences can be observed (Fig. 5b,c): firstly, layers are more coherent in the layer model than in the voxel model, where the sand appears as smaller isolated bodies, e.g., profile distance 8-12 km; an analysis of the changes in lithology, horizontally and  (Fig. 1a). Vertical exaggeration is 10 vertically, shows that this is general in the entire area (Fig.  S3 of the ESM). Secondly, in the voxel model, a larger number of lithological categories are used, e.g., at profile distance 11 km the geological configuration is generally more detailed. The voxel model enables the definition of local units, while in the layer model the interpretation is limited to predefined units present in the entire area. In other cases, different units are grouped into one category in the voxel model, e.g., at profile distance 12-14 km, where the upper deposits are interpreted as glaciotectonically disturbed mainly sand and clay units in the voxel model, but as separately interpreted Miocene and Quaternary sand and clay units in the layer model. Thirdly, the layer model always follows the same stratigraphic order, whereas the voxel model can deviate from the order-e.g., at profile distance 11.5 km, the voxel model shows glaciotectonically displaced Miocene deposits within the Quaternary layer series, which cannot be resolved in the layer model. Other differences between the profiles include, e.g., the small valley around profile distance 7 km filled with varying layers of mainly sand in the voxel model, but clay in the layer model. These differences between the two models can be attributed to the fact that the models are interpreted by two different geologists.

Relative volumes of the lithological units
The relative proportions of lithologies in the modified model structures (bottom row, Fig. 3) are shown in Fig. 6. Model LL and LV are both based on the layer model and the difference between them is therefore a result of the translation from a layer grid to a voxel grid. The Paleogene clay and the limestone occupies a similar proportion of all models with almost 50 and 20%, respectively. For the Quaternary, the proportion of sand units compared to clay units is slightly higher in the layer models than in the voxel model. The proportion of Miocene units is small in all models, taking up less than 5% of the total volume. However, a significant difference is seen for Miocene units between the layer models and the voxel model. The layer models contain about 50% more of the Miocene units. This difference can mainly be explained by the three tectonic units of mixed layers in the voxel model, which have been interpreted into Quaternary and Pre-Quaternary layers in the layer model but have been subscribed to being solely Quaternary units in the modified VV model.

Cell to cell comparison of model structures
A comparison of the layer model LV and the voxel model VV (Fig. 3, bottom row) is shown in Fig. 7 based on the Quaternary and Miocene lithological units. The count of cells that do not agree between the two models are normalized to the number of grid cells in that column. Paleogene clay and Limestone are excluded from this analysis as these units in the layer model are based on the corresponding units in the voxel model.
The white areas represent zones of good agreement between the layer model and the voxel model, while the green areas represent zones where the geology has been interpreted differently in the two models. Green areas are especially correlated with data sparse areas (Fig. 1c) that are subject to subjective interpretation decisions in both geological modelling techniques. However, differences between the geological models also exist in areas with data, indicating that the modelling technique influences the resulting geological model. The largest difference between the two models is found in the southernmost part of the area in the fault zone and in the glaciotectonic deformed area (Fig. 1b). In the voxel model, these areas have been interpreted as mixed layer zones, while in the layer model these layers have been interpreted as either Miocene or Quaternary sediments, and has no mixed layers defined in the model.

Groundwater models
For each model structure 20,000 realizations have been run and 500 realizations are retained based on the thresholds defined in section 'Thresholds'. In the following, the results of the 500 retained realizations are presented. Further, the Fig. 6 Histogram of percentage of the lithological units of the total volume in models LL, LV and VV. The lithologies in the models have been categorized into six categories: Quaternary sand and clay, Miocene sand and clay, Paleogene clay and limestone. Note that the Paleogene clay takes up about 50% in all three models, plotting beyond the range of the y-axis results of the realization obtaining the lowest RMS error for each model structure is shown corresponding to the model parameterization which would be obtained if traditional calibration was used.

Model performance
The statistical performance of the realizations of the three models are shown in Table 3. The convergence rate is similar for models LV and VV which contain the same grid, while it is lower for model LL, possibly because the grid of model LL is vertically distorted causing a higher rate of convergence problems. The number of models that achieves a performance better than the defined threshold values are, on the other hand, lowest for model VV and highest for model LL. The number of behavioral realizations for model LV is similar to that of model LL, but the frequency of behavioral realizations to converged realizations is lower. The range of mean error of both hydraulic head and river discharge are similar for all models, while the minimum root mean square of hydraulic head is higher for model VV than for the two other models.

Parameters
In Fig. 8  As indicated by the grey line, in most cases hidden by the dark blue line, the same parameter samples have been applied for the three models. However, not all realizations have converged for all models (dark blue line). Especially high values of horizontal conductivity of Quaternary sand (Kh QS) and low values of horizontal conductivity of Miocene clay (Kh MC) and Paleogene clay (Kh PC) have difficulty converging in model LL, which is a pattern echoed to a lesser degree in model LV.
Nonuniform probability densities of the parameter values of the retained realizations (colored bars) indicate a preference or disfavor towards a certain parameter interval. Uniform probability density distributions for converged models indicate no preference.
Conductances and horizontal conductivities of the Miocene units have preserved the uniform shape of the sampled distribution in all models. Similar for all models is also that they prefer lower horizontal conductivities of Paleogene clay (Kh PC) and limestone (Kh LS). The remaining parameters  show a slightly diverse pattern in the models. For the horizontal conductivity of Quaternary sand (Kh QS), slightly higher values are preferred in model VV than in models LL and LV where the distributions are similar. For the horizontal conductivity of Quaternary clay (Kh QC), slightly higher values are preferred in model LV than in model LL and higher still in model VV.
The realization obtaining the lowest RMS error achieves different parameter values for all model structures. Models LL and LV attain higher values for horizontal conductivity of Quaternary sand (Kh QS) than model VV, but lower values for horizontal conductivity of Quaternary clay (Kh QC).
The result of the Kolmogorov-Smirnov tests are shown by the crosses between the rows in Fig. 8. The tests indicate that the distributions of the retained parameter values are significantly different for the horizontal conductivity of Quaternary sand (Kh QS), Quaternary clay (Kh QC) and Miocene sand (Kh MS) between models LL and LV. Between models LV and VV the distributions are significantly different between the horizontal conductivity of Quaternary sand (Kh QS), Quaternary clay (Kh QC) and Miocene clay (Kh MC), whereas for the remaining parameters, the differences in distributions are insignificant.

Hydraulic head
The median hydraulic head of the retained realizations for each model is shown in Fig. 9. Further, the dots show whether the observations are within or outside the simulated range of the retained realizations. A yellow dot shows that the observation is within the simulated range, while the red and blue triangles indicate that the observations are above and below the simulated range, respectively.
The median hydraulic head appears quite similar for all models and the errors are generally distributed in the same manner. A systematic error is found in the middle of the catchment to the east. Generally, the hydraulic head is simulated too low in the southern part but is then simulated too high towards north. None of the models catches the transition that is observed within a relatively short distance. The problem might be explained by errors in the description of the buried valleys in this area, which is relatively complex with crosscutting structures of various ages, infill and dimensions. In addition, in the southeastern part, three observations show a mound in the water table which is not represented in any of the models. In this area, the layers are disturbed by glaciotectonic deformation, and hence challenging to model. The spatially uneven distribution of errors in all of the models could indicate an error in the conceptual understanding of the area.
The results of the significance test of hydraulic head distributions from the 500 retained realizations in all catchment cells between models LV and LV and between models LV and VV is shown in Fig. 10. The significance test shows that for most cells in the catchment the difference between the simulated hydraulic head distributions is significant. The figure also shows that more cells obtain significantly different crosses indicate whether the sample of predictions are expected to be drawn from the same population (two-sample Kolmogorov-Smirnov test p > 0.05 is green, while p < 0.05 is red). The test is performed between predictions of LL and LV and between predictions of LV and VV distributions between models LV and VV than between models LL and LV.

Groundwater budget
Histograms of the groundwater budget terms drain outflow, general head boundary (GHB) outflow and river outflow for the retained realizations of the three models are shown in the three first columns of Fig. 11. For LL and LV, the difference between predictions of drain outflow and the river outflow is not significant, while the difference for GHB outflow is significant. Comparing predictions of models LV and VV, all groundwater budget terms are significantly different. The overall magnitudes of the different groundwater budget terms are similar for the three models, with symmetric histograms peaking around 150, 37 and 65 mm/year for the drain, GHB and river outflow, respectively. However, the variance of the predictions from VV is smaller than for the two other models, which makes the difference between the distributions significant from a statistical point of view. The lowest RMS error realizations (red line) of the three models, however, obtain very similar results for the three groundwater budget terms. Fig. 9 Simulated median hydraulic head of 500 retained realizations of the three models and distribution of errors. Observations within, below and above the simulated range are indicated. Obs. = observation Fig. 10 Kolmogorov-Smirnov test of hydraulic head. The colors indicate whether the sample of predictions are expected to be drawn from the same population (two-sample Kolmogorov-Smirnov test p > 0.05 is green, while p < 0.05 is red). The test is performed between predictions of LL and LV and between predictions of LV and VV

Particle tracking
Histograms of the median travel time and recharge area to the Højballegård well field in the 500 retained realizations of the three models are shown in the last two columns in Fig. 11. Further, vertical red lines indicate the result of the realization obtaining lowest RMS error in the three models. The Kolmogorov-Smirnov test indicates that the predictions of both median travel time and recharge area are significantly different both between models LL and LV and between LV and VV.
The median travel times of models LV and VV are similar in range, and similar in the sense that they are leftskewed and cover the same range of ages from 70 years to 180 years. However, the histogram of median travel time peaks around 100 years for model LV, while it peaks around 80 years for model VV. The histogram of model LL is also left-skewed but the range of ages is lower, covering ages from 90 to 190 years. The peak of the histogram for LL is around 110 years, similar to that of LV. For the realization obtaining the lowest RMS, the median travel time is 180, 125 and 90 years, respectively for models LL, LV and VV.
The histograms of the recharge area of the models are right-skewed with peaks around 6.5 km 2 for all models. Models LL and LV cover almost the same range between 5 and 7 km 2 , while model VV covers a larger range between 4 and 8 km 2 . For the realization obtaining the lowest RMS, the recharge areas are 5.8, 6 and 5.5 km 2 for models LL, LV and VV, respectively.
The recharge area to the Højballegård well field in percentage of the retained realizations is shown in Fig. 12. The recharge area occupies a similar extent in all three models, extending towards the east, almost reaching the boundary for some of the realizations. Also, the recharge area of some realizations (<20%) extends towards the northeast in all models. The black line represents the recharge area in the realization obtaining the lowest RMS error. The realization in model LL obtains a more elongated area than models LV and VV, whose recharge areas are wider and more in proximity to the pumping wells.

Discussion
A geological model based on the layer-based modelling strategy is compared to a geological model based on a voxel modelling strategy and the impact on groundwater model predictions. The two hydrogeological models are based on the same geological conceptualization and data; the differences between the two models are therefore attributed entirely to modelling techniques and the subjective elements related to different geological interpreters.

Fig. 11
Histograms of groundwater budget terms, median particle travel time and recharge area to the Højballegård well field of the 500 retained realizations in models LL, LV and VV. The extent of the boundaries for the groundwater budget terms is shown in Fig. 4. The red lines within the plot represent the value obtained in the realization of each model that obtained the lowest RMS error. Between the rows, red and green crosses indicate whether the sample of predictions are expected to be drawn from the same population (two-sample Kolmogorov-Smirnov test p > 0.05 is green, while p < 0.05 is red).

Dissimilarities between model predictions
The voxel-modelling technique leads to Quaternary sand lenses isolated in the Quaternary clay to a higher degree than the layer modelling approach (Fig. 5). The less continuous sand layers of the voxel model lead to a lower volume of Quaternary sand found in the voxel model (Fig. 6). Given fewer continuous layers and the lower volume of Quaternary sand, higher values of hydraulic conductivity of both Quaternary sand and clay are preferred in model VV compared to those preferred in models LL and LV (Fig. 8) to obtain the same overall horizontal hydraulic conductivity. The translation of the layer model into a voxel grid (Fig. 3) has the same effect, as layers become less continuous. Model LV thereby prefers higher values of the hydraulic conductivity of Quaternary clay (Kh QC) than model LL; hence, the choice of modelling technique has an impact on the preferred hydraulic conductivity values, regardless of the subjective decisions made by different geological interpreters.
The number of models that achieves a better performance than the predefined threshold values are lowest for model VV and highest for model LL (Table 3). This indicates that the prior parameter distributions are most appropriate for model LL and less so for model VV. This is also indicated by the horizontal hydraulic conductivity of Quaternary clay being slightly more constrained in model VV than in the other models (Fig. 8). The differences in parameters in the retained realizations affect the predicted groundwater budget terms and the particle tracking predictions. The more constrained parameters of VV cause less variance of the predicted groundwater budget terms compared to LL and LV (Fig. 11). The higher hydraulic conductivity of model VV than in model LV increases the ratio of conductivity of upper layers to the conductivity of lower layers. Thereby, the higher hydraulic conductivity of model VV leads to a simulated median travel time for model VV that is shorter (Fig. 11).
All predictions using the best fitting models of each model structure obtain quite different results, except for the groundwater budget terms. The differences between the results from the best performing realizations appear to be quite arbitrary.

Similarities between model predictions
A cell-by-cell comparison of the hydrogeological models ( Fig. 7) showed that the differences between the models are scattered, which can be attributed to the fact that the conceptual model is the same. This is with the exception of a fault zone in the southernmost part of the area and a zone of glaciotectonics in the southeast that was not interpreted in the same way in the two models. However, the three models obtained similar spatial distributions of errors (Fig. 9).
A spatially uneven distribution of errors on hydraulic head observations indicate that the same conceptual error was present in all models. Large errors were located in the central part of the catchment to the east as well as in the southeastern corner of the catchment where several observation values fall outside the range of the predicted heads ( Fig. 9). In the central part of the catchment to the east, many buried valleys are present, and the uneven distribution of errors suggests that the correct hydrostratigraphic relationship of the sediments in the valleys have not been found. In the southeastern corner of the catchment, the large errors may be associated with perched water tables in the glaciotectonic complex; however, the uneven distribution of errors may also be because the models are constrained by the same hydraulic head observations and under the same global rejection thresholds. Under a global rejection threshold, large deviations from the observations may be balanced Fig. 12 Recharge area to the Højballegård well field in percentage of the 500 retained realizations in models LL, LV and VV. The black line indicates the recharge area in the realization obtaining the best RMS error in the three models by smaller deviations at other locations. Finally, the fact that all models were based on the same conceptual model probably caused the simulated recharge area to the Højballegård well field to extent to a similar area in all models (Fig. 12).

Implications
The layer-based modelling approach is often thought to be inappropriate for complex geological settings (e.g., Henriksen et al. 2017;Jørgensen et al. 2013;Turner 2006). For the Egebjerg area, a voxel model approach was initially selected because the area was thought to be too complex for a layer model because of the many generations of buried valleys, fault zone and glaciotectonics characterizing the area (Jørgensen et al. 2010). One reason for this is that the layer model is constrained by the stratigraphic order of units initially defined. Based on the results presented here, the models' results cannot substantiate that the voxel model is a better representation of the hydrogeology in the area (Table 3). It can only be concluded that the geological modelling technique does have an impact on predictions of the groundwater model.
The results confirm existing evidence (Rojas et al. 2010;Neuman and Wierenga 2003;Troldborg et al. 2007) of the dominating importance of the conceptual model for groundwater model predictions. The modelling techniques were shown to give rise to differences in interpretation even in areas that are data dense (Fig. 7). However, the differences between the model structures are scattered and do not relate to larger conceptual elements in the area, e.g., the location of a buried valley. This is with the exception of a fault zone in the southernmost part of the area and the zone of glaciotectonic in the southeast, which have been interpreted in two different ways in the geological models because of lack of data and because of different model concepts.
If parameter uncertainty is not considered, the differences between the obtained predictions were even more conspicuous. This corresponds with existing understanding that realizations obtaining equal performance can lead to significantly different predictions (Beven 2006) and that not one set of parameters will represent the "true" parameter set, because all model structures and measurements are associated with uncertainty.

Limitations
The same parameterization was used for both the layer model and the voxel model and some units were therefore lumped together in the translation from geological model to hydrostratigraphic model. In other words, the description is simplified by reducing the number of units and information is therefore lost. Most information was lost in the voxel model where the number of units was reduced from 72 to 6.
The reason for the massive difference in number of units in the two models relate to the purpose with which the models have been developed. The layer model was developed with groundwater modelling in mind and units of comparable hydrological properties have therefore been correlated. The voxel model on the other hand was developed with the geological history in mind. In general, information lost from lumping together units of a stratigraphic model may not be pertinent to groundwater model predictions in general if the units have comparable hydrogeological properties.
The results may have been different if more lithological units were retained in the groundwater models, as structural noise potentially increases by reducing the number of parameters or fixing their value (Doherty and Welter 2010). The goal of this study was, however, to investigate the effect of the layer and voxel modelling technique on the model structure and to isolate this effect; thus, the models were parameterized in the same way. The effect of lumping together different zones of homogeneous hydraulic conductivity have already been investigated (e.g., Engelhardt et al. 2013;Foglia et al. 2007;Poeter and Anderson 2005).
Geologically, the Egebjerg area is very complex, and therefore the voxel approach was initially chosen as the most appropriate. In geologically more simple areas, i.e., undisturbed sedimentary environments, it is expected that the layer model will be able to resolve the geology even better and therefore it is expected that the difference between a layer model and a voxel model, would be even less.
The results presented here are likely to be dependent on the scale of the geological models. The cell size of the geological model will determine how many geological details the geological model can resolve. The precision of the layer model can however be approached in the voxel model with a sufficiently fine vertical discretization. Therefore, it is expected that for a lower vertical discretization in the voxel model, it will be more similar to the layer model. On the other hand, the detail of the layer model is constrained by an initial choice of the number of units to be resolved in the model. The detail of the voxel model can however be approached in the layer model with a larger number of lithological units. Therefore, it is expected that for a higher number of lithological units in the layer model, the layer model will be more similar to the voxel model. Finally, the groundwater model predictions investigated here are all of global nature at catchment scale. It is expected that for groundwater model predictions at more local scale, the details of the hydrostratigraphic models would become more important because differences in the models have less chance of cancelling out over smaller distances.
From a statistical perspective, a Kolmogorov-Smirnov test showed that many of the groundwater model predictive distributions between the models were significantly different. Compared to the difference in predictions caused by the uncertainty of parameters, the differences between the predictions caused by the different geological models, are however limited.

Conclusion
As the hydrogeological conceptual model controls the storage and movement of water and solutes, it is essential to understand how different translation techniques from a conceptual geological model to a hydrostratigraphic model affects groundwater model predictions. The conclusions in this study are based on hydrostratigraphic and groundwater models for the Egebjerg area, Denmark.
Based on the same conceptual geological model and data, two hydrostratigraphic models were constructed. One based on the layer approach and one on the voxel methodology. Groundwater models based on the same parameterization and assumptions were developed for each model as well as for a layer model translated into the voxel model grid. Differences in the hydrostratigraphic models independent of the groundwater models were analyzed and their impacts on groundwater model predictions were investigated.
This study reveals that the differences between the two models are mainly related to the continuity of lithological units caused by differences in the hydrostratigraphic modelling techniques. However, only marginal differences between the volumes of the different lithologies and the overall geological structures were found as the hydrostratigraphic models were based on the same conceptual geological model.
By comparing groundwater modelling predictions based on the different hydrostratigraphic models, this study found that the obtained sensitive model parameters, hydraulic head, groundwater budget and particle tracking results were significantly different from a statistical perspective. When only considering a single best fitting realization of each model structure, the differences in model predictions are even more significant.
This study thereby establishes that the chosen geological modelling technique, being layer-or voxel-based, does have an impact on the groundwater model predictions given the applied scale, geological setting, discretization, and parameterization applied here. The impacts on groundwater model predictions should therefore be considered before choosing the geological modelling technique for groundwater modelling.