Introduction

A conceptual groundwater model is a simplified representation of a groundwater system that captures the key hydrogeological characteristics and processes such as stresses, flow paths, subsurface parameters, and aquifer boundary conditions (Freeze and Cherry 1979; Konikow and Bredehoeft 1992; Brassington and Younger 2010; Anderson et al. 2015; Enemark et al. 2019). The conceptual model acts as a foundation for quantitative analyses of groundwater systems and assists in assessing the availability, sustainability, and potential impacts of groundwater resources to make informed decisions about water allocation, extraction, and monitoring strategies (Doherty and Simmons 2013; Fienen et al. 2013; Jakeman et al. 2016). The most frequent sources of error in model applications are typically related to conceptualization problems and uncertainty surrounding the data (e.g., Konikow and Bredehoeft 1992; Moore and Doherty 2006; Refsgaard et al. 2006, 2012; Gupta et al. 2012; Tian-chyi et al. 2015; Vrugt 2016). Inaccuracies or uncertainties that originate in the conceptual model may propagate through subsequent mathematical models and lead to errors in the final predictions (Gupta et al. 2012).

One approach to support the conceptual model development is the use of screening models (Hunt et al. 1998). Screening models can be developed prior to the development of a numerical model and used to test alternative hypotheses in conceptualization and understanding system dynamics (Anderson et al. 2015). The application of screening models is common in groundwater contamination studies—for example, Ehteshami et al. (1991) used DRASTIC (Aller et al. 1985) as a screening tool to identify areas vulnerable to contamination. Shukia et al. (1998) developed the attenuation factor (simple analytical solutions of transport processes) model to assess the groundwater vulnerability to pesticides, and Willson et al. (2006) defined the source term for the release of dense nonaqueous phase liquids (DNAPLs) to groundwater with a conceptual screening model. Guo et al. (2022) quantified leaching per- and polyfluoroalkyl substances (PFAS) in the vadose zone and mass discharge to groundwater. Other studies, for instance, tested the use of analytic element models to find errors in a complex finite-difference model and developed the analytical solution to identify the damping depth where the flux variation damps to 5% (Hunt et al. 1998; Dickinson et al. 2014). Such examples of the application of screening models in quantitative groundwater research are less common.

Data-driven models also have the potential to be used as screening models. Their application has already been seen as surrogate models (also known as model emulators) to approximate the features and reduce the computational time of complex models (Razavi et al. 2012a, b; Asher et al. 2015). Data-driven models explain the relationship between input (e.g., precipitation, river level) and output variables (e.g., groundwater level time series), based on empirical or statistical relationships (Solomatine et al. 2009; Bakker and Schaars 2019). Data-driven models do not require extensive hydrogeological data and can be built relatively quickly. They have been used to predict groundwater levels (Manzione et al. 2012; Shirmohammadi et al. 2013; Khalil et al. 2015; Lee et al. 2019; Kalu et al. 2022; Sun et al. 2022), to identify the most important stresses (Von Asmuth et al. 2008; Shapoori et al. 2015a; Sahoo et al. 2017; Sartirana et al. 2022), to capture the groundwater regime transition (Obergfell et al. 2019), and to estimate gross recharge, groundwater usage, and hydraulic properties (Peterson and Fulton 2019; Collenteur et al. 2021), among other cases. Recently, their application has been proposed as a support tool for developing a numerical groundwater model (Obergfell et al. 2013; Bakker and Schaars 2019; Zaadnoordijk et al. 2019).

This study focuses on investigating the Grazer Feld Aquifer in southeastern Austria, which holds strategic importance to society as a source of freshwater for agriculture, industry, and human water consumption. The resource is, however, under pressure from potential human overconsumption and the impacts of climate change, which manifests itself disproportionately in this part of the world (Strauss et al. 2013; Gobiet et al. 2014; Maraun et al. 2022). To address these challenges and evaluate these impacts on the aquifer, a quantitative understanding of the groundwater system is required. Given the inherent complexities arising from urbanization and agricultural activities in the area, it is crucial to better understand the stresses affecting the aquifer. In this case study, time series models are used as screening tools to identify the primary stresses and assess how their influence on the head fluctuations varies across the aquifer. In addition, the screening models are used to identify wells where the fluctuations cannot be explained by known stresses, while engaging stakeholders with local expert knowledge to explore potential causes. This helps to find gaps in the understanding of the groundwater system and the available data.

Specifically, the following research questions are addressed:

  • Which stresses are impacting the head fluctuations in the individual monitoring wells and how does their relative contribution vary spatially?

  • For which locations is there currently a lack of understanding and/or data to model the head fluctuations satisfactorily?

  • What are the limitations and opportunities of using data-driven models to support conceptual model development?

Study area and data

The study area is the Grazer Feld Aquifer in southeastern Austria (Fig. 1). The Grazer Feld Aquifer covers the entire city of Graz and stretches ~30 km to the south, with a total area of ~166 km². The mean altitude is 337 m above sea level (masl; Austrian Federal Ministry of Agriculture, Forestry, Regions and Water Management 2019). The groundwater body is situated in an urban and semiurban area. Urban land use dominates in the northern part and occupies 58% of the total aquifer land area, while agricultural fields cover 31.7% of the total area and are concentrated in the southern part. Other land covers include canopy (contributing 9% to the total area) and water bodies (contributing 1.3%). An extensive groundwater monitoring network covers the aquifer (green dots in Fig. 1a) and is maintained by the Provincial Government of Styria.

Fig. 1

Overview map of the study area in EPSG:32633. a Digital elevation model (DEM) with mean groundwater levels, meteorological stations, and groundwater monitoring wells. b Geological map

Hydrogeological setting

Morphologically, the Grazer Feld is a wide terrace consisting of Quaternary sandy gravels. The east and northeast parts of the city of Graz are dominated by a mixture of clays, sands, and gravel (Harum et al. 1997), whereas in the west, the aquifer is bounded by hills consisting of carbonate rocks. The hydrogeological setting of the aquifer is illustrated in Fig. 1. The unconfined aquifer is underlain by a Neogene layer representing the aquiclude. The aquifer thickness increases from the boundaries of the valley towards its central parts. In the Quaternary terraces, the aquifer thickness ranges approximately between 10 and 20 m; however, it generally exceeds 20 m in the Holocene floodplain (Fig. 1), reaching a maximum of 53 m in a channel incised into the Neogene base west of the river Mur (Office of the Provincial Government of Styria 2015). The average depth to the water table varies between 10 and 22 m below the ground surface (bgs) for the lower terrace and is less than 10 m bgs in the floodplain (Giuliani et al. 2012).

Drivers of groundwater head fluctuations

The study area is situated in a temperate climate zone, with warm summers and relatively mild winters. The mean annual precipitation for the period from 2000 to 2019 is 837 mm/year, while the mean annual potential evapotranspiration for 2000–2019, estimated with the Penman–Monteith method (Allen et al. 1998; Vremec et al. 2024), is 803 mm/year. Studies of evapotranspiration in this region indicate that potential evapotranspiration has increased in the last decades compared to earlier time periods (Nolz et al. 2014; Duethmann and Blöschl 2018; Collenteur et al. 2021; Forstner et al. 2021). The groundwater is replenished by recharge from precipitation and surface-water infiltration (Kralik et al. 2014). Local studies estimated annual recharge values between 164 and 698 mm for the period 1971 to 1991 (Fank 1993; Benischke et al. 2002; the Office of the Provincial Government of Styria 2015).

Several surface-water bodies are present in the area and intersect the aquifer. The major surface-water body is the river Mur, flowing through the entire aquifer from the north to the south (Fig. 1). The interaction between the aquifer and the river varies depending on the location (Kralik et al. 2014)—in the north, the river is losing, whereas in the south the river is gaining. This interaction is also influenced by five operating hydropower dams (Table S2 of the electronic supplementary material, ESM). With hydraulic height differences of ~10 m, the construction of these dams modified the groundwater base level across the aquifer, affecting the groundwater dynamics in nearby areas. To successfully model long head time series in the case study area, it is therefore expected that dam developments need to be considered.

Another anthropogenic stress on the aquifer is groundwater pumping. The usage ranges from the individual and municipal drinking water supply to the production of utility water for agricultural irrigation and industrial purposes (Provincial Government of Styria 2014). In the Grazer Feld, the total water withdrawals are estimated at ~17.7 million m³/year, with 66% allocated for drinking water supply (Umweltbundesamt 2021).

Data collection and availability

Data on the stresses affecting the aquifer were collected to the extent possible given the available quantity and quality. These data are made publicly available on a Zenodo repository (Kokimova et al. 2023). An overview of the collected data and data sources is provided below.

Groundwater head data

are obtained from the monitoring network of the Provincial Government of Styria (‘Land Steiermark’). The monitoring network consists of 233 monitoring wells measuring the head fluctuations in the unconfined aquifer. The total well depth varies between 1.7 and 50 m, with a mean of 11 m below the surface, while the groundwater head is, on average, 6.8 m below the surface. The box plots of the total well depth and the depth to the water table are provided in Fig. S1 of the ESM. First, the raw dataset with irregular time steps was resampled to daily values using the Python package ‘ehyd_reader’ (Haas 2020). At this daily timescale, an average of 3398 data points is available over a 60-year period. In this study, only time series with more than 100 monthly recordings in the period between 2005 and 2015 are used, resulting in 144 monitoring wells with time series. Furthermore, the daily interval was resampled to a 14-day interval to reduce the autocorrelation of residuals in the noise model (Collenteur et al. 2021, 2023; Brakenhoff et al. 2022).

Precipitation data

are collected from six stations, shown as meteorological stations in Fig. 1. The records are available at the GeoSphere Austria Data Hub (2022) and the Federal Ministry of Agriculture, Forestry, Regions and Water Management (eHYD). Missing values were filled with the data of the closest station, and the period of daily values was extended from 2000 to 2019 for all the stations. As snow contributes less than 3.3% to the total precipitation, and the mean number of snowfall days is 10 days/year in the Grazer Feld, the effects of snow were not accounted for in the modelling (Prettenthaler 2010; Collenteur et al. 2021). The nearest meteorological station data was used as model input for each monitoring well.

Potential evapotranspiration

is computed from the meteorological data obtained from three stations of the GeoSphere Austria Data Hub at a daily resolution. Station coordinates are provided in Table S1 of the ESM. The Penman–Monteith method was applied using the open-source PyEt Python package (Allen et al. 1998; Vremec et al. 2024). The input data included global radiation, wind speed, relative humidity, mean, maximum, and minimum air temperature at 2 m, and the station’s latitude and elevation. This method was shown to be appropriate for estimating evapotranspiration in the nearby aquifer south of Graz (Klammler and Fank 2014).

Surface-water level data

are obtained for one location on the southern part of the river Mur (within the aquifer) from the eHYD platform (Austrian Federal Ministry of Agriculture, Forestry, Regions and Water Management 2022), shown as river station in Fig. 1. For other surface waters, no data are available. The original time series has a daily time step and is available for 29 years. It is normalized by subtracting the minimum river level from the original values, because only the river level variation is of interest.

Hydropower dam

information was obtained from the report Verbund Hydro Power (2013). Although no time series of river levels are obtained, the first known operating day of the dam, which can be used to model a step trend in the model simulation, is recorded. In this study, the year 2012 from Gössendorf Dam was used (Table S2 of the ESM).

Groundwater abstraction

While information about permissions to abstract groundwater and, in some cases (e.g., for waterworks), lumped sums of the actual withdrawals are available, time series data are not publicly available for the period considered here. In cases where pumping rates are (nearly) constant, the groundwater abstraction will not be a relevant driver of head fluctuations. Yet, in some places, pumping rates may vary both in the agricultural and urban areas, such that omitting this driver of head fluctuations might result in low performance of the time series model. The time series model was used as a screening tool within the context of this study. Low model performance thus may indicate a gap in knowledge (e.g., due to lack of pumping data) that requires further consideration and guides future investigations of the study area (see section ‘The identification of knowledge gaps’).

Methods

Data-driven modeling workflow

Figure 2 illustrates the general modeling workflow for the use of time series models as screening models as applied in this study. The workflow consists of five steps, with the first two dedicated to creating models, while the last three are used to evaluate and select the model with the most appropriate combination of stresses for a single monitoring well. First, based on the available data, four hypotheses on different sets of stresses that potentially influence the observed head fluctuations were developed, illustrated by M1, M2, M3, and M4 and shown in Table 1. In essence, it is hypothesized that the head fluctuations might be explained by different combinations of recharge variations, river level variations, and the presence of a step trend due to damming. After creating and calibrating different models (shown in step 2), the models were classified using a ‘traffic light system’. This involved several stages of selection criteria (shown as Nos. 3, 4, and 5 in Fig. 2) to determine their suitability for further use. These selection criteria are described in detail in section ‘Model evaluation and selection’. The models that ended up in blue, orange, and red boxes were presented to stakeholders for possible improvement and hypothesis adjustment, shown as M5 in Fig. 2.

Fig. 2

The time series modeling workflow applied in this study

Table 1 Overview of the four model structures tested for each monitoring well in the data set and the number of parameters of each model (No. Par.). x indicates that the stress is included in the model structure

Based on the initial analysis of the study area and the available data for stresses, four candidate hypotheses were translated into model structures developed for each monitoring well, as shown in Table 1.

Data-driven modeling

There is an increasing number of data-driven models to choose from to simulate head time series. In this study, lumped-parameter models using impulse response functions were applied (Von Asmuth et al. 2002). Impulse response functions are used to simulate the head response to different stresses (e.g., precipitation, potential evapotranspiration, or river level). Advantages of this type of model include the relative ease to test different model structures (different stresses and processes), and low input data requirements (no need for detailed information on the subsurface). The models are implemented in the open-source software Pastas (Collenteur et al. 2019) and are created through Python scripts. The use of scripts to generate the models enables the modeler to test different model structures in a limited amount of time and in an automated and reproducible way. These scripts are provided on a Zenodo repository (Kokimova et al. 2023).

General model setup

The basic model to simulate the observed heads (h) is written as follows (Von Asmuth et al. 2008):

$$h\left(t\right)=\sum_{m=1}^{M}{h}_{\mathrm{m}}\left(t\right)+d+r\left(t\right)$$
(1)

where \({h}_{\mathrm{m}}\left(t\right)\) is the contribution to the head fluctuations from stress m, d is the model base elevation, M is the number of stresses, and \(r(t)\) are the model residuals. The basic model structure shown in Eq. (1) remains flexible with regard to (1) the number of stresses that are included in the model, and (2) how these stresses are transformed into a contribution to the head fluctuations. This conveniently serves one of the main purposes for which the data-driven models were used in this study, testing and improving different conceptual models.

Impulse response functions

The contribution of each stress to the head fluctuation (hm) is computed by convolution of a time series of the stress (Sm) with an impulse response function (Eq. 2):

$${h}_{\mathrm{m}}\left(t\right)=\underset{-\infty }{\overset{t}{\int }}{S}_{\mathrm{m}}\left(\tau \right){\theta }_{\mathrm{m}}\left(t-\tau \right)\text{d}\tau$$
(2)

where θm is the impulse response function that simulates the response of the head to that specific stress (Sm) occurring at time (τ). The lag between the response at time t and the impulse at time τ is then described by t − τ (Yang and McCoy 2023). A scaled Gamma distribution, in which Γ(n) denotes the Gamma function, is often used to simulate the head response to a stressor:

$$\theta \left(t\right)=A \frac{{t}^{n-1}}{{a}^{n}\Gamma (n)}{\text{e}}^{\frac{-t}{a}}$$
(3)

where A, a, and n are parameters that describe the shape of the response function. In this study, the head response due to river level variations is simulated by convolution of a time series of the observed river level with a scaled Gamma response function.
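As an illustration of Eqs. (2) and (3), the following minimal sketch evaluates the scaled Gamma response function and approximates the convolution integral on a regular time grid. It is not the Pastas implementation; the function names, the midpoint discretization, and the parameter values (A = 0.5, n = 1, a = 20 days, and a constant unit river-level anomaly) are assumptions chosen for illustration only.

```python
import math


def gamma_impulse_response(t, A, n, a):
    """Scaled Gamma impulse response of Eq. (3), evaluated at lag t > 0 (days)."""
    return A * t ** (n - 1) / (a ** n * math.gamma(n)) * math.exp(-t / a)


def simulate_contribution(stress, A, n, a, dt=1.0):
    """Discrete approximation of the convolution in Eq. (2).

    stress: list of stress values on a regular grid with spacing dt (days).
    The response function is sampled at the midpoint of each lag interval
    and multiplied by dt (midpoint rule).
    """
    weights = [gamma_impulse_response((k + 0.5) * dt, A, n, a) * dt
               for k in range(len(stress))]
    contribution = []
    for i in range(len(stress)):
        total = 0.0
        for k in range(i + 1):
            # Stress applied k steps ago, weighted by the response at that lag.
            total += stress[i - k] * weights[k]
        contribution.append(total)
    return contribution


# Hypothetical example: a constant unit river-level anomaly lasting 400 days.
river_anomaly = [1.0] * 400
h_river = simulate_contribution(river_anomaly, A=0.5, n=1.0, a=20.0)
print(round(h_river[0], 3), round(h_river[-1], 3))
```

Under a constant unit stress, the simulated contribution rises from a small initial value towards the value of A, which is why A can be read as the gain of the response function.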

The impact of a stressor on groundwater heads and its spatial distribution within an aquifer can be assessed by examining the shape and gain of the response function. This is particularly insightful when multiple monitoring wells are analyzed, allowing a spatial interpretation. A commonly used property of the response function is its gain, the final head response that is achieved when a constant unit of stress is applied indefinitely (Fig. 3). In Pastas, this property is conveniently captured by the value of parameter A in Eq. (3). Another property is the response time of the system, which here is defined as the time at which 95% of the head response to a stress impulse has occurred (Fig. 3, after Brakenhoff et al. 2022). The response time offers insights into the aquifer’s behavior, indicating whether an aquifer is a fast or slow responding system.

Fig. 3

The step response for the scaled Gamma response function with parameters A = 100, n = 1.5, a = 15 days (adapted from Collenteur et al. 2019)
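The gain and the 95% response time can be checked numerically for the parameter values quoted in the caption of Fig. 3. The sketch below integrates Eq. (3) with a simple midpoint rule; the function names, the integration step, and the assumption of a positive gain A are illustrative choices, not part of the original study or of Pastas.

```python
import math


def gamma_impulse(t, A, n, a):
    """Scaled Gamma impulse response of Eq. (3) at lag t > 0 (days)."""
    return A * t ** (n - 1) / (a ** n * math.gamma(n)) * math.exp(-t / a)


def step_response(t_end, A, n, a, dt=0.01):
    """Head response at t_end to a constant unit stress (integral of Eq. 3)."""
    steps = int(round(t_end / dt))
    return sum(gamma_impulse((k + 0.5) * dt, A, n, a) * dt for k in range(steps))


def response_time(A, n, a, fraction=0.95, dt=0.01, t_max=10000.0):
    """Time (days) at which the step response reaches `fraction` of the gain A.

    Assumes A > 0; a negative gain would require comparing absolute values.
    """
    total = 0.0
    k = 0
    while (k + 1) * dt <= t_max:
        total += gamma_impulse((k + 0.5) * dt, A, n, a) * dt
        k += 1
        if total >= fraction * A:
            return k * dt
    raise ValueError("response time not reached within t_max")


# Parameter values from the caption of Fig. 3.
gain = step_response(1000.0, A=100.0, n=1.5, a=15.0)
t95 = response_time(A=100.0, n=1.5, a=15.0)
print(round(gain, 2), round(t95, 1))
```

For these parameters the step response levels off at the gain A = 100, and the 95% response time is a little under 60 days.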

The stress (Sm in Eq. 2) may also be the result of a subroutine that computes a single stressor from multiple other stresses. A common example of this is the computation of a precipitation excess or a recharge flux from precipitation (P) and potential evapotranspiration (Et). Here, the nonlinear root zone model developed by Collenteur et al. (2021) is applied to compute groundwater recharge. The advantage of this approach over a linear precipitation-excess model is that the recharge and actual evapotranspiration are a function of the water storage in the root zone, causing the response to precipitation and evapotranspiration to become nonlinear. The details on the nonlinear recharge model can be found in Appendix 1. The contribution from precipitation and evapotranspiration is calculated by convolving the computed recharge flux with a scaled Gamma response function.

Contribution of sudden, systematic changes

A special case needing consideration in this study is the effect of hydropower dams, which structurally alter the river levels upstream and downstream of the dam. This was included in the model by using a step trend. The direction of the step trend (up or down) was constrained based on the well location relative to the selected dam. In the implementation in Pastas, a binary time series is first constructed using a Heaviside function; this time series is then used as Sm in Eq. (2):

$$H\left(t\right)=\left\{\begin{array}{ll}0&\text{if }t\leq T_{\mathrm{start}}\\1&\text{if }t>T_{\mathrm{start}}\end{array}\right.$$
(4)

where Tstart is the approximate date when a dam construction was finished (provided by the modeler), and the water level started dropping or increasing. Similar to the other stresses, the time series H(t) is then convolved with an exponential impulse response function, which follows from Eq. (3) for n = 1 and has two parameters, A and a. The advantage of this approach is that it is possible to use different response functions to simulate the head response to a sudden change.
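For the exponential response function (n = 1), the convolution of the Heaviside series in Eq. (4) with Eq. (3) has the closed form A(1 − e^(−(t − Tstart)/a)) for t > Tstart. The sketch below evaluates this expression for a few dates. The function name, the start day within 2012, and the values of A and a are hypothetical; the study reports only the year 2012 for the selected dam.

```python
import math
from datetime import date


def step_trend_contribution(dates, t_start, A, a):
    """Head change from a Heaviside step (Eq. 4) passed through an
    exponential response function (Eq. 3 with n = 1).

    Returns 0 up to and including t_start, and
    A * (1 - exp(-(t - t_start) / a)) afterwards.
    """
    out = []
    for d in dates:
        lag = (d - t_start).days
        out.append(A * (1.0 - math.exp(-lag / a)) if lag > 0 else 0.0)
    return out


# Assumed values for illustration: start day, gain A (m), and time scale a (days).
t_start = date(2012, 1, 1)
dates = [date(2011, 12, 1), date(2012, 1, 1), date(2012, 3, 1), date(2015, 1, 1)]
trend = step_trend_contribution(dates, t_start, A=0.8, a=60.0)
print([round(v, 3) for v in trend])
```

A positive A represents a rise in the head after the dam became operational and a negative A a decline, which corresponds to the constraint on the direction of the step trend described above.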

Noise model

The residual errors of groundwater models simulating head time series often show high autocorrelation. This violates one of the assumptions required to reliably estimate the parameter uncertainties. A common approach to tackle this problem is to apply a noise model to the residual errors, which transforms the correlated residuals into uncorrelated noise. Here, an autoregressive noise model of order one, AR(1), was applied for this purpose (Von Asmuth and Bierkens 2005). The noise model adds one additional parameter (α) to the model.
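A minimal sketch of such a noise model is given below. It assumes the exponential-decay form of the AR(1) model commonly used for irregularly spaced head data, in which the noise is v(t_i) = r(t_i) − e^(−Δt_i/α) r(t_(i−1)); this form, the function name, and the synthetic numbers are assumptions for illustration and are not taken from the text above.

```python
import math


def ar1_noise(residuals, time_steps, alpha):
    """Transform residuals into noise with an AR(1) model for irregular steps.

    residuals: r(t_i), observed minus simulated heads.
    time_steps: dt_i = t_i - t_(i-1) in days, one value per residual after the first.
    alpha: decay parameter (days) of the noise model.
    """
    noise = []
    for i in range(1, len(residuals)):
        weight = math.exp(-time_steps[i - 1] / alpha)
        noise.append(residuals[i] - weight * residuals[i - 1])
    return noise


# Synthetic check: build residuals from known innovations, then recover them.
innovations = [0.02, -0.01, 0.03, 0.00, -0.02]
steps = [14.0] * len(innovations)  # 14-day interval, as used for the head data
alpha = 30.0                       # hypothetical value
residuals = [0.05]
for eps, dt in zip(innovations, steps):
    residuals.append(math.exp(-dt / alpha) * residuals[-1] + eps)
recovered = ar1_noise(residuals, steps, alpha)
print([round(v, 3) for v in recovered])
```

If the residuals indeed follow this AR(1) process, the recovered noise equals the innovations and is free of autocorrelation, which is the property tested in the reliability checks.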

Model calibration

Depending on the model structure (see Table 1), between 9 and 14 parameters needed to be estimated from the head data. An overview of the model parameters is found in Table 2. The 11-year period from 2005 to 2015 was used for calibration, and the 4-year period from 2015 to 2018 served as validation. The model calibration was conducted in a two-step optimization procedure (Collenteur et al. 2021). First, the model parameters were optimized without the noise model. Then, after fixing the parameter that determines the size of the root zone bucket (Sr,max), the model was calibrated again with the noise model using the optimized parameters from the first step as initial parameters. The sum of the weighted squared innovations (SWSI) criterion was used as the objective function, which was minimized with a Levenberg–Marquardt algorithm to estimate the parameters. This criterion is employed here because it can deal with the irregular time intervals between head measurements that are present in the head data. Readers are referred to Von Asmuth and Bierkens (2005) for more details on this criterion and its derivation. Additionally, to show the model’s uncertainty, 95% confidence intervals were estimated using the covariance matrix obtained from the optimization algorithm.

Table 2 Model parameters and their bounds applied for the models. The parameter bounds and fixed values were chosen based on Pastas’ default settings and experiences from previous model applications (Collenteur et al. 2021, 2023)

Model evaluation and selection

The evaluation of the calibrated models followed a three-stage process (shown as boxes 3, 4, and 5 in Fig. 2). This involved (1) checking the model reliability, (2) assessing the goodness-of-fit, and (3) selecting the best model. The procedure was based on an adapted version of the acceptance criteria proposed by Brakenhoff et al. (2022).

In the third step of the workflow (Fig. 2), the models underwent reliability checks (hereafter also referred to as criteria), which are summarized in Table 3. In this study, the reliability of the models was determined by three conditions. First, no significant autocorrelation must be present in the noise, which is checked with the Stoffer-Toloi test for autocorrelation at a significance level of α = 0.01 (Stoffer and Toloi 1992). This allows reliable estimates of the model uncertainties to be obtained. The second condition tests whether the response time does not exceed the length of the calibration period for recharge, or half the calibration period for the river stress. When this condition is not met, it is argued that the calibration period may be too short to accurately estimate the parameters of the response function. The third condition requires the gain to be significantly different from zero, i.e., zero should not be within the 95% confidence interval (Brakenhoff et al. 2022).

Table 3 Model evaluation and selection steps after Brakenhoff et al. (2022). These steps (3, 4, and 5) are shown in the blue area in Fig. 2

The fourth step involved assessing the goodness-of-fit, where a satisfactory model fit is determined if the Kling-Gupta Efficiency (KGE) exceeds 0.6 for calibration and validation periods separately (Kling et al. 2012). This is an arbitrarily selected threshold.
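For reference, the goodness-of-fit metrics can be written out as in the sketch below. It assumes the modified KGE of Kling et al. (2012), which combines the linear correlation (r), the ratio of the means (β), and the ratio of the coefficients of variation (γ); the function names and the example heads are hypothetical.

```python
import math


def _mean(x):
    return sum(x) / len(x)


def _std(x):
    m = _mean(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))


def kge_2012(obs, sim):
    """Modified Kling-Gupta Efficiency: 1 - sqrt((r-1)^2 + (beta-1)^2 + (gamma-1)^2)."""
    mo, ms = _mean(obs), _mean(sim)
    so, ss = _std(obs), _std(sim)
    r = sum((o - mo) * (s - ms) for o, s in zip(obs, sim)) / (len(obs) * so * ss)
    beta = ms / mo                 # bias ratio
    gamma = (ss / ms) / (so / mo)  # variability ratio (coefficients of variation)
    return 1.0 - math.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)


def rmse(obs, sim):
    """Root mean square error, in the units of the heads."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


# Hypothetical heads (masl) for a short period.
observed = [330.0, 330.4, 330.9, 330.5, 330.1]
simulated = [330.1, 330.3, 330.7, 330.6, 330.2]
fit_ok = kge_2012(observed, simulated) > 0.6  # threshold used in this study
print(round(kge_2012(observed, simulated), 3), round(rmse(observed, simulated), 3), fit_ok)
```

In the workflow, the threshold would be applied to the calibration and validation periods separately, and a model counts as a good fit only if both values exceed 0.6.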

If multiple model structures passed the preceding criteria, the fifth step was to select a single model for further analysis. This was done with the selection of the minimum Akaike Information Criterion among the set of models for a single monitoring well for models passing the autocorrelation test (Akaike 1973). If a model failed the Stoffer-Toloi test for autocorrelation, the selection procedure was slightly different. Models that met all other conditions except for autocorrelation were still selected based on KGE > 0.6 for both calibration and validation periods. However, if the model failed any other reliability test, only the basic structure model (which includes only recharge) was selected.

Once the model selection procedure is completed (see Fig. 2), the chosen models are classified into four categories: (1) ‘reliable good fit’ models (category 1) pass all checks, (2) ‘reliable bad fit’ models (category 2) pass all reliability checks but fail at least one of the KGE assessments, (3) ‘semi-reliable good fit’ models (category 3) pass the goodness-of-fit check and two of the reliability checks, while the autocorrelation check is ignored, and (4) ‘unreliable bad fit’ models (category 4) pass neither the goodness-of-fit nor the reliability checks. Such a ‘traffic light system’ provides flexibility for users to choose the appropriate model for the required application. All models are assessed with the root mean square error (RMSE) and the Kling-Gupta Efficiency (KGE) for the calibration and validation periods (Kling et al. 2012; Chai and Draxler 2014).
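The classification rules can be summarized in a few lines of code. The sketch below is a simplified reading of the procedure described above, not the authors' script: the function names and the example AIC values are hypothetical, and the fallback to the recharge-only model structure is not reproduced.

```python
def classify(autocorr_ok, response_time_ok, gain_ok, kge_cal, kge_val, threshold=0.6):
    """Assign one of the four 'traffic light' categories to a calibrated model."""
    good_fit = kge_cal > threshold and kge_val > threshold
    if autocorr_ok and response_time_ok and gain_ok:
        return "reliable good fit" if good_fit else "reliable bad fit"
    if response_time_ok and gain_ok and good_fit:
        # Only the autocorrelation check failed.
        return "semi-reliable good fit"
    return "unreliable bad fit"


def select_by_aic(aic_by_model):
    """Return the name of the model structure with the lowest AIC."""
    return min(aic_by_model, key=aic_by_model.get)


# Hypothetical example for one monitoring well with two reliable candidates.
best = select_by_aic({"M1": -512.3, "M2": -540.8})
category = classify(True, True, True, kge_cal=0.81, kge_val=0.66)
print(best, category)
```

Keeping the reliability flags separate from the KGE values makes it straightforward to report which individual check caused a model to drop into a lower category.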

Stakeholder engagement

A workshop was organized with local stakeholders to understand why certain models could not successfully simulate the head fluctuations with the selected stresses. The stakeholders are employed by the regional governing body of the Province of Styria (‘Land Steiermark’) and are responsible for the groundwater management in the area. These stakeholders also act as experts in the regional and local hydrogeology. The model outputs from model categories ‘reliable bad fit’, ‘semi-reliable good fit’, and ‘unreliable bad fit’ were presented with an interactive map developed with the Python package Folium. The script is available on a Zenodo repository (Kokimova et al. 2023), and an example is provided in Fig. S3 of the ESM.

Results

Model selection and performance

Four models were built for each of the 144 monitoring wells, resulting in 576 models. A single model was selected for each monitoring well following the three-step selection procedure. Table 4 summarizes the number of models after each reliability criterion was applied. A total of 131 models for 80 monitoring wells passed all reliability checks. Among these, 38 monitoring wells had more than one reliable model, and a single model was selected with the AIC. Overall, 39 models had a KGE higher than 0.6 for both the calibration and validation periods, and 41 models had one or both KGE values lower than this threshold. As such, they formed the first two categories, ‘reliable good fit’ and ‘reliable bad fit’ models, respectively.

Table 4 Number of models and monitoring wells passing different acceptance criteria

From the 64 monitoring wells where the models failed the Stoffer-Toloi autocorrelation test, 33 models were selected as ‘semi-reliable good fit’, having a KGE higher than 0.6, and 31 models were ultimately classified as ‘unreliable bad fit’ models, failing both reliability and goodness-of-fit checks. Within this latter group, 11 models were selected with only recharge as stress, because one of the three reliability requirements was not satisfied.

Figure 4 shows the boxplots of the goodness-of-fit metrics (RMSE and KGE) for the 144 selected models and all 576 models, split between calibration (left column) and validation periods (right column). As expected, the selected models show better performance for both periods compared to all 576 models. The average RMSEs of the 144 selected models in the calibration and validation periods are 0.15 and 0.16 m, respectively, compared to 0.21 m for all 576 models in both periods. The median KGE of the 144 models in the calibration period is 0.79 vs. 0.7 for 576 models. In the validation period, the models’ performance is slightly lower, with KGE of 0.66 for 144 models and 0.57 for 576 models. While the median RMSE of all models for the calibration period is the same as for the validation period (0.19 m), the models show higher KGE in the calibration period compared to the validation period (median KGE of 0.72 and 0.58, respectively). The spatial distribution of the KGE of the selected models can be seen in Fig. 5a. The models with lower KGE estimates are predominantly located in the southern region of the aquifer. Conversely, within the northern part, they are concentrated around the city center, in close proximity to the northeastern aquifer boundary. The KGE estimates correspond to the model categorization, presented by blue and red dots in Fig. 5b. These categories, namely ‘reliable bad fit’ and ‘unreliable bad fit’, fail to meet the KGE threshold of 0.6. Spatially, they are more prevalent in the northeastern and southwestern parts of the aquifer.

Fig. 4

Median metrics (with outliers) for calibration and validation periods for all 576 models and the selected 144 models. Black diamonds indicate outliers

Fig. 5

a KGE of selected models, b model categories, and c selected model structures

Identification of stresses and their spatial distribution

Dominant stresses

The majority (104) of the 144 monitoring wells considered in this study are located in urban areas. In particular, the northern part of the aquifer is dominated by the urban area of the city of Graz. In terms of model performance, 26% of the models in the urban areas fall into the category of ‘reliable good fit’ models, 28% are ‘reliable bad fit’ models, 25% are ‘semi-reliable good fit’ models, and 21% are ‘unreliable bad fit’ models (Table S4 of the ESM). In the agricultural areas, located mostly in the southern part of the aquifer and represented by 33 observation wells, there is a greater proportion (36%) of models with ‘reliable good fit’, mainly at the cost of a lower proportion (15%) of models from the category ‘semi-reliable good fit’.

The number of models for each model structure per model category is summarized in Table 5. The spatial distribution of these models is depicted in Fig. 5c, where models categorized as ‘unreliable bad-fit’ are delineated by transparent colors with red borders. As expected, recharge plays an important role in explaining the observed head fluctuations as a single driving force and in combination with other stresses. The river level was used as a driving force in 110 of the selected models, and 22 times in combination with a step trend. Spatially, these models are spread all around the aquifer from the north to the south, as shown in Fig. 5c. This indicates the presence of interaction between the groundwater and the river Mur in large parts of the aquifer.

Table 5 Number of selected models per category and model structure

For 29 models, recharge was used as the only driving force to simulate the heads. Within these 29 models, 11 belong to ‘unreliable bad-fit’ models. These sites with recharge as the only stress are predominantly located at some distance from the river and in the southern part of the aquifer, presented with dark blue dots in Fig. 5c. This spatial distribution indicates a relatively weaker river–aquifer interaction not only with an increasing distance from the river but also towards the southern section of the aquifer.

The model with influences of recharge, river, and step trend was selected as the most appropriate model for 22 wells. More than half of them were categorized as reliable models. The spatial distribution is presented as brown dots in Fig. 5c, which are mostly concentrated in the middle of the aquifer (y coordinate: 5,203,000–5,207,000, x coordinate: 534,000–538,000), upgradient from the Gössendorf hydropower dam (shown as the third orange star from the top) that was completed in 2012. In some of these models, particularly if they represent locations distant from the river and are assessed as ‘semi-reliable’ or ‘unreliable’, the step trend might compensate for unknown processes. The inclusion of the step trend together with recharge (but without river influence) was needed to simulate the heads at five sites, which are shown with orange dots in Fig. 5c. The visual distinction between a model with recharge only and a model with both recharge and a step trend is depicted for two cases in Fig. S2 of the ESM. Four of these five models are located close to the dams, while one model is 2.5 km away. The inclusion of the step trend in the latter model might be associated with other human activities, as discussed in section ‘The identification of knowledge gaps’.

Spatial variability of the gain

For the Grazer Feld, examples of the gains of the response functions are illustrated in Fig. 6 for the recharge gain (b, c) and river gain (f, g). Figures 6d and 6h illustrate the spatial distribution and the influence of the stresses based on the ‘reliable good fit’ models.

Fig. 6

Recharge and river gains of response functions: a Boxplots of recharge gain of the response function for land cover type; b, c and f, g Step response functions with 95% confidence interval and response times; e The range of the river gain of the response function depending on the distance from the river; d and h Spatial distributions of recharge and river gains

A distinct pattern for the recharge gain can be discerned in the Grazer Feld. Several models in the southern part of the aquifer are characterized by large head changes in response to recharge. These sites predominantly coincide with agricultural lands, whereas sites with lower recharge gains tend to be located in urban areas. Recharge gain across different land covers is presented in Fig. 6a. Two examples with recharge gains illustrate small and large simulated head responses depicted in Fig. 6b,c, respectively. The observation well 336818 with a recharge gain of 0.2 m is located in an urban area in the north, whereas well 318808 with a recharge gain of 1 m is associated with an agricultural area in the south. The first model considers recharge and river in the model structure, whereas the second model includes only the recharge.

A spatial division corresponding to recharge but with a reversed pattern exists for the river gain. The stronger response of simulated groundwater heads to river level fluctuation is found in the northern part of the aquifer. In the southern part, the impact of the river is less pronounced (see Fig. 6h). The second aspect that influences the head response to the river stress is the distance to the river. Most of the models with large river gains (parameter A > 0.5 m) are located within 2 km from the river (Fig. 6e). The river gain of model 332726 is estimated at 0.8 m, and the observation point is located approximately at a distance of 0.58 km from the river, whereas the river gain of the response function for model 315036 is estimated at 0.2 m, and the observation point is ~1.5 km from the river and located in the south of the aquifer.
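The role of the gain can be illustrated with a step response: the head change that follows a sustained unit increase in a stress and that levels off at the gain. The sketch below assumes an exponential response function purely for simplicity; the response function shape used for the Grazer Feld models is not stated in this section, and the time constant `a` is a hypothetical value. Only the gains of 0.2 m and 1 m are taken from the text.

```python
import math

def step_response(t, gain, a):
    """Head change (m) at time t (days) after a sustained unit step in the stress.

    Assumes an exponential response with time constant a (days); the curve
    approaches `gain` as t grows.
    """
    return gain * (1.0 - math.exp(-t / a))

def response_time(a, fraction=0.95):
    """Time (days) at which the step response reaches `fraction` of the gain."""
    return -a * math.log(1.0 - fraction)

# Gains from the text; the time constant of 50 days is an assumption.
small_gain_head = step_response(365.0, gain=0.2, a=50.0)  # urban well example
large_gain_head = step_response(365.0, gain=1.0, a=50.0)  # agricultural well example
t95 = response_time(50.0)
```

With the same time constant, the two wells reach their plateau equally fast; only the size of the head change differs, which is what the maps of recharge and river gain compare.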

Stakeholder engagement

The discussion with the stakeholders revealed human impacts for 18 locations and natural causes for three models out of the presented 105 models. It also provided an explanation for the presence (or absence) of certain stressors in unexpected locations (Table S3 of the ESM). For example, a severe drop in the heads occurring in early 2006 in four wells neighboring the Graz Airport was caused by temporary pumping organized by the local waterworks company (Fig. S4 of the ESM). Among these four models, two were classified as an ‘unreliable bad fit,’ one as a ‘semi-reliable good fit,’ and one as a ‘reliable bad fit’. Another example explains the sudden decrease of heads in the observation well 315044 (Fig. 7). Here, a pumping test was conducted for an underground karst spring in 2007, resulting in a sudden head decline.

Fig. 7

Simulation vs. observations of model 315044_110. The sudden head decline observed around 2007 (between red-dashed lines) is attributed to a pumping test

In the south, even the models situated near the river did not incorporate the river stress, suggesting the aquifer–river interaction is weaker in the south than in the central and northern parts of the aquifer. The stakeholders supported the hypothesis of a weaker aquifer–river interaction in the southern part of the aquifer attributed to the presence of a Neogene impermeable riverbed. This impermeable layer has the potential to disconnect groundwater from surface water.

In summary, for 21 monitoring wells the reasons underlying unsuccessful simulations under the existing combination of stressors were found (Table S3 of the ESM), thus supporting the identification of local human impacts or other stressors that were previously not recognized. The use of data-driven models offered a low-level approach to discussing the groundwater head data and conceptual model of the Grazer Feld aquifer with the stakeholders.

Discussion

Improvement of the aquifer understanding

The data-driven models used in this study helped to identify the driving forces stressing the aquifer and how their impacts vary spatially. In the northern and central part of the aquifer, the river plays a crucial role, indicating that any impact on the river such as hydropower plants, flood protection, and channel modification measures, might lead to changes in aquifer dynamics and need to be considered in future models. However, despite the relatively lower impact of recharge from precipitation on head fluctuations within this region of the aquifer, its contribution to the overall water balance is expected to be important and thus will have to be considered, for example, in water resources assessments or groundwater models of this aquifer. Therefore, it is crucial not to disregard the absolute value of recharge when considering the dynamics of the aquifer system.

In the southwest part of the aquifer, recharge emerges as the dominant stress. This finding highlights the need for accurate quantification of recharge particularly in this area. The reason for this dominance is attributed to the channel structure at the base of the aquifer, particularly the presence of a ridge that separates the groundwater in this area from the river Mur (Fank 2011). The data-driven models demonstrate that this structural feature must be adequately represented in the conceptual and numerical models to ensure their reliability and accuracy.

The models corresponding to the ‘semi-reliable good fit’ category provide a good fit to the data but fail to pass the Stoffer-Toloi autocorrelation test. Autocorrelation in the noise of these models hampers the statistical evaluation (e.g., uncertainty analysis) of the results. The higher tendency towards autocorrelated noise in the northern, urban part of the study area might indicate that despite the good fit the models do not adequately represent all processes governing the head fluctuations in these areas. Remarkably, the reliable good-fit models in the northern part appear to be aligned along the river, whereas the semi-reliable or unreliable models are found at some distance from the river (Fig. 5b). This suggests that the potentially inadequate process representation is not associated with the river, but possibly with the complexity of urban recharge processes (Lerner 1990; Foster et al. 1999; Barron et al. 2013) or other anthropogenic impacts, e.g., related to construction activities (see the example given in section ‘Stakeholder engagement’).

In addition to the autocorrelation in the noise addressed in the previous paragraph, the autocorrelation of the head observations themselves also deserves consideration. The autocorrelation function (ACF) of head observations at a lag of 1 month characterizes the response of the groundwater system. Figure 8a shows how the ACF at a lag of 1 month changes with the distance of the monitoring well from the river. Head observations located further away from the river exhibit a stronger autocorrelation. Furthermore, this pattern is linked to the classification of models, as ‘semi-reliable’ or ‘unreliable’ models show a slightly higher median ACF than the ‘reliable’ models (Fig. 8b). Thus, the models mostly succeed in representing the dynamic head fluctuations resulting from the interaction with the river but appear to be challenged by the more inert behavior of head observations at a distance from the river.
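The ACF value discussed above can be obtained as the sample autocorrelation at lag 1 of a head series with one value per month. The sketch below is a generic estimator under that assumption (regular monthly values, no gaps); how the study resampled the observations is not described in this section, and the two example series are synthetic.

```python
def acf_at_lag(series, lag=1):
    """Sample autocorrelation of a regularly spaced series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return num / denom

# Synthetic monthly series, not data from the study.
inert_heads = list(range(12))   # slowly drifting heads, strong persistence
flashy_heads = [1, -1] * 4      # heads that flip from month to month
acf_inert = acf_at_lag(inert_heads)
acf_flashy = acf_at_lag(flashy_heads)
```

A value close to 1 indicates the inert behavior described for wells far from the river, whereas lower values indicate heads that change quickly from one month to the next.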

Fig. 8

The autocorrelation function (ACF) of head observations and its relation to the distance from river and model category. a Scatterplot of ACF vs. the distance from river. b Boxplots of ACF vs. model category. Black diamonds indicate outliers

The identification of knowledge gaps

The analysis of the time series models provided valuable insights into the dominant stressors and their spatial distribution within the Grazer Feld study area. Yet, the models were classified as “reliable” at only 80 of the 144 sites, and among these, only 39 models provided a good fit to the data. The majority of observation wells are located in urban areas, where recharge processes are complicated, for example, by sealed surfaces and the resulting fast runoff processes or artificial infiltration. Moreover, groundwater dynamics are likely affected by various human impacts, as discussed further on. Thus, the result of the model classification, on the one hand, highlights the challenges posed by an urban environment and, on the other hand, underlines the benefit of using a screening model that supports the identification of knowledge gaps in such environments.

As shown in section ‘Stakeholder engagement’, the causes of low model performance were sometimes resolved through discussion with stakeholders. In particular, deviations between modelled and observed heads apparent in Fig. 7 and Fig. S4 of the ESM were attributed to temporary pumping activities known to the stakeholders. However, head fluctuations at some locations remain unexplained. These knowledge gaps can likely be attributed to unknown pumping and construction activities, or other local influences (e.g., the aforementioned urban recharge processes) not captured by the model structure. Since the aquifer is heavily used for public water supply and agricultural and industrial purposes, one of the major obstacles to explaining head fluctuations can be linked to the lack of pumping data. The model calibration can partially compensate for the effects of unknown drivers. However, this can result in the model being ‘right for the wrong reason’, leading to a decrease in performance during the validation period, which may partially explain the observed decrease in KGE values from calibration to validation (Fig. 4).

A cluster of models, namely the ‘reliable bad fit’ and ‘unreliable bad fit’ models (shown as blue and red dots in Fig. 5b), is concentrated along the southern boundary. The stakeholders suggested that certain locations within this cluster might be influenced by human activities related to land-use changes, although precise details remain unknown. Additionally, other model outcomes could be affected by boundary conditions such as the Neogene impermeable layer.

However, it should be noted that not all poorly performing models can be explained solely by engaging stakeholders. In some cases, errors in data measurements or data collection techniques (potentially caused by station changes) could lead to unexplained fluctuations. Moreover, other surface-water levels, for which no data are available, may influence parts of the aquifer and lead to poor performance.

Construction activities have not been taken into account in the models, which is an additional factor to consider. In certain cases of groundwater fluctuations, especially in urban areas, stakeholders have attempted to establish a connection between the construction of underground parking garages and abrupt changes in groundwater heads. Nevertheless, exploring the influence of construction activities requires detailed information about the timing, location, and magnitude of these activities, which is rarely available.

Furthermore, the presence of geological heterogeneity and possible complex hydrogeological features within the aquifer may contribute to unexplained fluctuations. Investigating the influence of hydrogeological complexities in lithology, permeability, and hydraulic conductivity can provide insights into unexplained fluctuations and improve the accuracy of the conceptual model. In this study, the evaluation of existing investigation reports and stakeholder engagement supported the identification of such features (e.g., the aforementioned channel structure of the base of the aquifer).

Limitations and opportunities of data-driven models as screening models

The conceptual model can be viewed as a hypothesis of how the groundwater system under study functions (Anderson et al. 2015). Depending on the system, it is often possible to develop different conceptual models, or in other words, different hypotheses of how a groundwater system works (Enemark et al. 2019). An important aspect of the conceptual model relates to the stresses that are thought to drive the observed groundwater dynamics (National Research Council 2001; Merz 2012). These hypotheses regarding stresses can be built in a top-down approach, from simple to complex model structures (Shapoori et al. 2015b). In the current study, the hypotheses are shown via M1, M2, M3, and M4 in Fig. 2. The use of the workflow allowed the identification of the areas under recharge and river influences, for example, the effect of recharge in the southwest of the aquifer, the aquifer–river interaction in the northern part, and its disconnected nature in the south.

Depending on the model purpose, not only different types of data-driven models but also different model selection procedures can be applied (Peterson and Western 2014; Zaadnoordijk et al. 2019; Brakenhoff et al. 2022). With different types of data-driven models and objectives, the workflow might need to be altered because of the limitations or opportunities that come with these models. The criteria for what constitutes a good fit or a reliable model are chosen by the modeler and may depend on the purpose of the modeling. When none of the available model hypotheses result in successfully modeled head observations, it is possible to engage stakeholders to adjust the hypotheses and include the missing information.

The use of the ‘traffic light’ classification of data-driven models allows zooming in to head observations and finding potential areas with knowledge gaps. Moreover, categorized models can be applied for different purposes depending on the acceptable assumptions. For instance, to estimate the standardized groundwater level index (SGI), the reliability criteria may not affect the results; hence, the use of ‘semi-reliable’ and ‘unreliable’ models can be appropriate (Bloomfield and Marchant 2013). However, reliability checks are important as they consider the parametric uncertainty of a response function that is used to accurately estimate the system’s response to various stressors. For example, in cases where recharge estimates may be utilized, model parameter uncertainty quantification is crucial (Collenteur et al. 2021). Another use of the ‘traffic light system’ might be relevant to the preparation of calibration data sets against which a numerical groundwater model can be calibrated. These data sets could, for example, be head time series of good-fit models that are cleaned from outliers (as identified by the data-driven models), or even entirely new data that characterize the groundwater system, such as the step response (Bakker et al. 2008). One possible advantage of such an approach could be that the groundwater model is calibrated to less noisy data—for example, the observed groundwater data may be generated by stresses and processes that are not (or cannot be) included in the groundwater model, and the data-driven models may help to detect such data points. Another advantage may be that less time is required to calibrate the model and perform uncertainty analysis.
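The four categories used throughout this paper can be expressed as a simple decision rule. The sketch below is an illustration under assumptions, not the study's implementation: it takes the KGE threshold of 0.6 from the results, treats all reliability checks (e.g., the autocorrelation test on the noise) as a single hypothetical boolean input, and assumes that a KGE equal to the threshold counts as a good fit. Which period's KGE is compared with the threshold is not restated in this section.

```python
def classify_model(kge, passes_reliability_checks, kge_threshold=0.6):
    """Assign one of the four model categories named in the text.

    kge: goodness of fit of the selected model.
    passes_reliability_checks: hypothetical flag summarizing the reliability
        criteria (e.g., no significant autocorrelation in the noise).
    """
    good_fit = kge >= kge_threshold  # assumption: threshold value itself passes
    if good_fit:
        if passes_reliability_checks:
            return "reliable good fit"
        return "semi-reliable good fit"
    if passes_reliability_checks:
        return "reliable bad fit"
    return "unreliable bad fit"

# Example: a well-fitting model whose noise is still autocorrelated.
category = classify_model(kge=0.79, passes_reliability_checks=False)
```

Keeping the fit and the reliability checks as separate inputs makes it straightforward to relax one of them for a given purpose, such as accepting ‘semi-reliable’ models when only the simulated heads are needed.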

While the proposed workflow helps to identify important stresses, the approach depends on the quantity and quality of the data representing those stresses and the groundwater head fluctuations. As such, including pumping data might explain the head fluctuation in a larger number of monitoring wells. Unfortunately, time series of groundwater withdrawals are rarely available, not only in the given case but also for many aquifers worldwide (Condon et al. 2021; Brookfield et al. 2024). One of the methods to address this data gap involves the use of data-driven techniques proposed by Yu et al. (2023), who applied empirical orthogonal function (EOF) analysis and the Hilbert-Huang transform (HHT) to extract high-frequency head variations related to pumping activities. These head variations are subsequently used to obtain operational periods and their pumping rates. The study of Yang and McCoy (2023) employed the aggregated permitted withdrawals as a surrogate series to represent pumping activities, which revealed underestimations and correlation to the increased groundwater extraction during drought conditions.

Additionally, data-driven models are commonly thought to lack physical interpretation (Lees 2000; Todini 2007; Young et al. 2007; Solomatine and Ostfeld 2008; Reichstein et al. 2019). Some methods exist to tackle this issue with other types of data-driven models, for example, the combination of data-driven models with process-based models (Li et al. 2022) or with data assimilation methods (Chang and Zhang 2019), and the integration of domain knowledge and physical principles (Reichstein et al. 2019; Soriano et al. 2021; Depina et al. 2022; Shadab et al. 2023). Attempts to identify aquifer properties, such as transmissivity and storativity, through lumped-parameter and time series models were made by Olin (1995), Shapoori et al. (2015a), Lewis et al. (2016), and Yu et al. (2023), among others. Further investigation of the relation between the Pastas models’ parameters and aquifer properties may broaden the application of these models for the improvement of groundwater conceptual models.

Conclusions

In this paper, an adapted workflow is proposed to use data-driven models as screening models to characterize a groundwater system. The workflow allows the testing of different conceptual hypotheses to identify the best plausible combination of stresses for a point location in an automated and reproducible way. Special attention was given to the inclusion of stakeholders to identify unknown stressors on the aquifer. The model screening scheme allows the classification of all models into four categories based on a ‘traffic light’ system. This system may help in identifying the ways models can be used. The approach was tested on 144 groundwater observation wells located in the Grazer Feld Aquifer, Austria. The time series model applied here was found to improve the understanding of the impact of different stressors on the groundwater head fluctuation in the Grazer Feld Aquifer. The main findings are:

  • Recharge contributes to a higher head increase in the south of the aquifer than in the northern part.

  • The river plays an important role in the aquifer with a north–south division opposite to that of recharge. The groundwater heads respond with larger fluctuation to river stress in the northern part. The second important feature is the distance to the river. Monitoring wells located closer to the river show a stronger impact. Both the model results and stakeholder input suggest the presence of a disconnection between the aquifer and the river in the southeastern portion of the aquifer due to the impermeable river bed.

  • A step trend is present in the area and needs to be considered for the locations in the vicinity of dams. On the other hand, an example with a step trend not being linked to dams may indicate the presence of other processes.

  • The use of data-driven models provided a fast, low-effort approach to involve stakeholders during the initial stages of the groundwater model conceptualization and enabled data-supported discussions.

Thus, the time series modeling not only revealed significant (and nonsignificant) stresses affecting the groundwater dynamics for point locations, but also their spatial pattern. Improved aquifer understanding of hydrological stresses and aquifer boundary conditions is the main contribution of the proposed time series models workflow as a screening tool. Depending on the modeling purpose, the workflow may need further adaptation when used with other types of data-driven models or model selection criteria.