Baseline groundwater monitoring for shale gas extraction: definition of baseline conditions and recommendations from a real site (Wysin, Northern Poland)

Public concerns have been raised regarding the use of hydraulic fracturing for shale gas extraction and its potential impact on the environment. The absence of baseline monitoring data in the US experience has been identified as a major issue. Here, results from a 6-month groundwater baseline monitoring study near an active shale gas pad in northern Poland are presented. The data collected in dedicated boreholes include groundwater samples analysed for inorganic constituents, dissolved gases and stable isotopes (δ²H and δ¹⁸O), and downhole temperature and conductivity measurements at 15-min intervals. A robust statistical analysis combined with an estimation of data uncertainty helps to identify spatial and temporal variability within the datasets. As a result, baseline conditions are defined using confidence intervals around the mean on a per-well basis, and these will serve as a future reference for this site. The groundwater chemical composition is similar to regional background levels and typical of Quaternary aquifers in the region. It is also consistent with previous baseline monitoring carried out by the Polish Geological Institute. Only manganese and bromide occur in groundwater at concentrations above Polish drinking water standards. Based on this work, the paper provides some recommendations for future baseline monitoring and identifies areas for future research, such as the use of statistics for high-frequency datasets.


Introduction
Exploitation of shale gas by hydraulic fracturing ('fracking') gained its controversial status after allegations that many drinking water wells in the vicinity of US shale gas sites had suffered detrimental changes in abstracted groundwater quality (Howarth et al. 2011; Schnoor 2012). Systematic reviews have largely failed to substantiate the most extreme claims, although there are a few situations in the published literature where well integrity failure has been identified or is strongly suspected to have been the cause of impairment to groundwater quality (Bair et al. 2010; Darrah et al. 2014; Jackson et al. 2013a; US EPA 2015). The lack of baseline data in the USA led to difficulties in identifying the existence, magnitude and cause of alleged groundwater quality changes (Vidic et al. 2013). Nevertheless, uncritical reporting of groundwater contamination allegations triggered public opposition in Europe at the time the first shale gas drilling permits were awarded (Williams et al. 2015). In 2011, hydraulic fracturing at Preese Hall 1 near Blackpool, the first UK shale gas well, was suspended after a 2.3-magnitude induced earthquake (Huw et al. 2014). This event drew further public attention to the shale gas industry and its use of hydraulic fracturing. As a consequence, a more cautious approach has been taken in Europe as compared to the USA. Several potential impacts have been identified (Mair et al. 2012), and the need for baseline monitoring prior to any shale gas activity has been widely acknowledged by the scientific community (Jackson et al. 2013b; Mair et al. 2012). Baseline monitoring will allow the detection of changes and trigger corrective actions from the operator if necessary or, in the absence of changes, reassure the public regarding the safety of shale gas operations. To date, very few published baseline studies have been carried out prior to hydraulic fracturing. Most focused on dissolved methane concentrations (Bell et al. 2017; Humez et al. 2016; Moritz et al. 2015; Schloemer et al. 2016; Siegel et al. 2015), which is often insufficient for a contamination diagnosis (Lefebvre 2017), and/or considered baseline conditions at regional scale (Harkness et al. 2017; Rhodes and Horton 2015; Sloto 2014).
Electronic supplementary material: The online version of this article (https://doi.org/10.1007/s11600-019-00254-w) contains supplementary material, which is available to authorized users.
In Europe, Poland is one of the countries with the largest estimated reserves (PGI-NRI 2012; US EIA/ARI 2013), located within the Lower Palaeozoic Baltic-Podlasie-Lublin Basins. Poland was the leader of shale gas exploration until the end of 2016 when efforts ceased, largely due to disappointing production rates from test wells (Cantoni 2018). During the period 2010-2016, a total of twenty-eight horizontal wells were hydraulically stimulated (Pyssa 2017), of which seven sites were independently monitored by the Polish Geological Institute (English acronym PGI-NRI; Konieczyńska et al. 2015), producing limited baseline data, e.g. one single-event sampling for water bodies and 1.5 months of background seismicity at the Łebień site (Konieczyńska et al. 2011). One particular site (Wysin in Pomerania, northern Poland, within the Baltic Basin) was subject to an intense international monitoring study within the framework of the EU Horizon 2020 SHEER project. The overall goal of the project was to develop best practice in order to understand, prevent and mitigate environmental impacts and risks associated with shale gas activities. To achieve this objective, the first multidisciplinary continuous monitoring effort at an instrumented shale gas site in Europe was undertaken, covering seismicity, air quality and groundwater before, during and after fracking. Together with the British Geological Survey's (BGS) subsequent monitoring efforts at two planned shale gas sites in England (Smedley et al. 2015), these programmes provide the only known examples of long-term baseline monitoring prior to planned shale gas fracking.
Baseline monitoring started progressively from July 2015 (air quality), and by the end of 2015 the entire monitoring network was functional. The baseline monitoring lasted until hydraulic fracturing was carried out in June and July 2016. The duration of operational and post-fracking monitoring varied from a few months (seismicity) to 18 months (groundwater) (more details available in López-Comino et al. 2018). This paper focuses on the results from the baseline groundwater monitoring at site level, using a network of four dedicated boreholes. The objectives of the paper are threefold: (1) the quality of the shallow drinking water aquifer located near the Wysin shale gas pad is assessed using dedicated boreholes, and the monitoring results are then compared with the regional background levels and previous local baseline monitoring to evaluate the suitability of the boreholes for monitoring (Jackson and Heagle 2016); (2) a systematic statistical approach to small and continuous datasets is combined with the estimation of data uncertainties to assess spatial and temporal variability and provide a consistent framework for robust data analysis; (3) unbiased baseline conditions at local scale are defined and could serve as future reference levels to identify potential groundwater contamination from shale gas activities. The paper concludes with some recommendations arising from this work.

Study area
The Stara Kiszewa shale gas concession area is located in the Pomerania province of northern Poland, about 50 km south-west of Gdansk city (Fig. 1). It covers an area of 981.5 km². Within the Stara Kiszewa concession area, exploration works targeting hydrocarbons have been carried out since the 1960s. Seismic surveys were carried out in the region during two campaigns in 2004.

Geology of the region
The Pomerania province is located within the Baltic Basin, considered to be one of the most promising shale gas basins in Europe (PGI-NRI 2012; US EIA/ARI 2013). It has a simple geological structure which has experienced relatively little tectonic deformation. The rock sequence runs from the Palaeozoic to the Mesozoic periods. The prospective target rocks for shale gas are the Lower Palaeozoic Middle Silurian Wenlock (typically 400 m thick) and Lower Silurian Llandovery Formations (typically less than 100 m thick), which both contain dark grey to black organic shales that commonly exhibit strong gas shows in exploration wells (Brownfield et al. 2015).
The geology of the sub-surface was confirmed by the drilling of the vertical Wysin-1 borehole for shale gas exploration. The Wysin-1 borehole was drilled in 2013 to a final depth of 4040 m, after penetrating a 54.5-m-thick section of Middle Cambrian rocks (Makos 2014). These rocks comprise black mudstones and clays with interbeds of fine-grained quartz sandstones. They are overlain by 30 m of Ordovician rocks comprising marl, claystone and shale belonging to the Prabuty formation. The Silurian succession is more substantial, about 1900 m in thickness, including 35 m of the Llandovery Formation shales. The rest of the sequence includes, in order of succession, about 400 m of Permian, 600 m of Triassic, 300 m of Jurassic and 550 m of Cretaceous strata. The sequence is completed by 235 m of Tertiary and Quaternary sediments (Konieczyńska et al. 2014, 2015). Subsequent to the initial vertical Wysin-1 exploration well, two additional boreholes, with laterals intended for hydraulic fracturing, were drilled from the same pad (Wysin-2H and Wysin-3H) during autumn 2015. Wysin-2H reaches a true vertical depth (TVD) of 3865 m with a target formation of the Lower Silurian Wenlock Formation strata. Wysin-3H reaches a TVD of 3974 m targeting the Ordovician age strata.

Hydrogeology of the region
The regional groundwater system is known as the Baltic Artesian Basin (BAB). It forms a complex multi-layered hydrogeological system that covers about 480,000 km². The thicknesses and permeabilities of the successive aquifer and aquitard layers are variable. The main aquifers are present in Cambrian, Upper Ordovician-Upper Silurian, part of Devonian (where present), part of Carboniferous (where present), part of Upper Permian, Lower-Middle Jurassic, Cretaceous, Paleogene, Neogene and Quaternary age strata (Virbulis et al. 2013). The Upper Cretaceous system forms the predominant regional fresh water aquifer (Sadurski 1986). Fresh water is also found in the most recent sediment layers, especially the Pleistocene aquifer, one of the richest groundwater sources in Poland (Jaworska-Szulc 2009; OGI n.d.). The whole system is mostly confined apart from the unconfined/leaky confined Quaternary system. The permeability of the aquitards allows some exchange of groundwater between the layers of active flow (Jaworska-Szulc 2009; Sadurski 1986; Virbulis et al. 2013).
The shale gas concession is located in a predominantly rural region where the population relies on groundwater for drinking water and irrigation water for agriculture. The hilly landscape's geomorphology is a result of the last glaciations (the presence of the Scandinavian continental glacier), combined with the erosive and accumulative activity of the rivers. Locally, the Quaternary aquifer is considered to be multi-layered and is commonly conceptualised with three recognised aquifers separated by two aquitards (Konieczyńska et al. 2015; OGI n.d.). The uppermost aquifer is associated with river valley sediments. This shallow aquifer has only local significance as a potable water supply due both to its limited thickness and deteriorated water quality (Konieczyńska et al. 2014, 2015). A discontinuous layer of silty clay separates this aquifer from the middle one and offers partial protection from the contaminated superficial aquifer (Konieczyńska et al. 2015). The middle aquifer forms the main water resource due to its easy accessibility (15-50 m below ground level), its sandy/gravelly composition and its good water quality (OGI n.d.). The third aquifer is largely unexploited due to its depth (given the presence of the second aquifer) and is therefore poorly understood due to scarcity of data (Konieczyńska et al. 2015), but is thought to be supplied by infiltration from the upper aquifers (Konieczyńska et al. 2014). In terms of the EU Water Framework Directive, parts of the Gołębiewo Intra-Moraine groundwater body (groundwater body No. 30; EEA 2014) are located within the shale gas concession, including the area of the drilling pad (Konieczyńska et al. 2015). However, the groundwater protection area is located outside the concession (OGI n.d.).

Fig. 1 Location of the shale gas pad, the groundwater monitoring boreholes from this study and the sampling points from the Polish Geological Institute's baseline monitoring (Konieczyńska et al. 2014)

About 800 m south of the Wysin shale gas pad lies the Wietcisa River, which flows from west to east, towards the Wisła (Vistula), and acts as a local discharge zone for the Quaternary aquifer. The Rutkownica River, a tributary to the Wietcisa, is located 2 km east of the pad and also acts as a local discharge zone (Fig. 1; Konieczyńska et al. 2015). Consequently, the main shallow groundwater flow direction is north-south, with lateral flow in the east and west directions. In addition to the rivers, the area is characterised by the presence of lakes, waterlogged areas in endorheic depressions and seasonal or permanent wetlands (OGI n.d.).

Monitoring plan
The monitoring strategy is based on the compilation of existing data, mainly from the Polish Geological Institute, the environmental impact assessment undertaken prior to drilling, and the literature review presented in the introduction.
Prior to the SHEER groundwater monitoring network, four monitoring boreholes were installed in the vicinity of the pad and are believed to be part of the operator's monitoring plan. Unfortunately, access to these data was not granted. The SHEER groundwater monitoring network also consists of four boreholes, whose locations were selected based on a hydrogeological model developed by the PGI-NRI (Konieczyńska et al. 2014). The final locations were constrained by landowners' permission to access their land. These boreholes were drilled in the main Quaternary aquifer in November 2015 (Fig. 1). Their characteristics are summarised in Table 1 of the Electronic Supplementary Material.
Drilling of the monitoring boreholes suggests that the commonly accepted three-layer conceptual model is significantly oversimplified. The Quaternary multi-layer aquifer is highly heterogeneous, as is often the case with fluvioglacial sediments, and is characterised by the presence of lower-permeability layers or lenses interbedded within highly permeable layers. Simplified geological cross sections based on these boreholes and other available borehole logs in the vicinity of the shale gas pad can be found in Gunning et al. (2017).
The groundwater baseline monitoring started about 1 month after the installation of the monitoring boreholes, in December 2015. The baseline monitoring period lasted 6 months during which four sampling rounds were completed. Emphasis was placed on the characterisation of the aquifer system and its background concentrations in terms of inorganic compounds, dissolved gases and isotopic signatures. Inorganic constituents are comparatively easy to sample and analyse. Their fate within the sub-surface is well known with a few of them behaving conservatively (e.g. chloride). They may be good indicators of surface spills reaching the aquifer, one of the most common sources of contamination at shale gas sites (US EPA 2016). Much less is known about organic compounds (Gordalla et al. 2013;Kahrilas et al. 2015;Luek and Gonsior 2017;Stringfellow et al. 2014).

Continuous monitoring of temperature and electrical conductivity
Downhole probes (CTD-Divers, Schlumberger) were installed in each borehole in December 2015, at the midpoint of the screened interval (Fig. 1 in ESM). They record absolute pressure (non-vented sensor), temperature and specific electrical conductivity (at 25 °C) at 15-min intervals. In this paper, the focus is on the groundwater quality. Water level data processing, including correction for barometric effects and interpretation, is fully described in Gunning et al. (2017).

Temperature was measured with an accuracy of ± 0.1 °C and a resolution of 0.01 °C. Accuracy and resolution were, respectively, ± 1% and 0.1% of the reading for the electrical conductivity.

Groundwater sampling
As part of the baseline monitoring, a total of four sampling campaigns were carried out between December 2015 and June 2016, before the fracking started. The wells were purged using a submersible pump (GRUNDFOS® model SQE-2-85) placed a few metres below the water level to ensure good-quality purging (Fig. 1 in ESM). Samples were taken after pumping out three wellbore volumes and once the physico-chemical parameters had stabilised. The physico-chemical parameters (temperature, pH, specific electrical conductivity, dissolved oxygen and oxidation-reduction potential) were measured using a multiparameter probe (model YSI Professional Plus) which was calibrated each day.
A 1-l plastic bottle was filled up to the neck for major ions, minor and trace element analyses. In addition to the laboratory measurement, total alkalinity was also determined in the field using a HACH® kit. Each titration was performed with 0.16 N sulphuric acid on 100 ml of freshly collected sample, to which a bromocresol green-methyl red indicator powder pillow was added. Samples for dissolved gas analyses were collected in 20-ml brown glass vials filled without bubbles and closed by a septum cap. Following this same sampling procedure, water stable isotope (δ¹⁸O and δ²H) samples were collected in duplicate or triplicate in 15-ml glass vials filled without bubbles. The bottle caps were taped to minimise losses by evaporation. All samples were kept at 4 °C during storage and transport.
Dissolved gas analyses (methane, ethane, propane and ethene) were undertaken by the Concept Life Sciences laboratory in Manchester, UK, using gas chromatography coupled to a flame ionisation detector (GC-FID).
Water stable isotope analyses were undertaken at the SUERC laboratories in East Kilbride, UK. For δ¹⁸O analysis, each sample was over-gassed with a 1% CO₂-in-He mixture for 5 min and left to equilibrate for a further 24 h. A sample volume of 2 ml was then analysed using standard techniques on a Thermo Scientific Delta V mass spectrometer set at 25 °C. Final δ¹⁸O values were produced using the method established by Nelson (2000). For δ²H analysis, sample and standard waters were injected directly into a chromium furnace at 800 °C (Donnelly et al. 2001), with the evolved H₂ gas analysed online via a VG Optima mass spectrometer. Final values for both δ¹⁸O and δ²H are reported as per mil (‰) variations from the Vienna Standard Mean Ocean Water (V-SMOW) in standard delta notation. In-run repeat analyses of water standards (international standards V-SMOW and GISP-Greenland Ice Sheet Precipitation, and internal standard Lt Std) gave a reproducibility better than ± 0.3‰ for δ¹⁸O and ± 3‰ for δ²H.
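For reference, the standard delta notation mentioned above is conventionally written as follows, where R denotes the heavy-to-light isotope ratio (²H/¹H or ¹⁸O/¹⁶O). The symbol names are chosen here for illustration.

```latex
\delta\ (\text{‰}) = \left( \frac{R_{\mathrm{sample}}}{R_{\mathrm{V\text{-}SMOW}}} - 1 \right) \times 1000
```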

Duplicates and blanks
To ensure good-quality sampling and laboratory analyses, sequential duplicates represent 10% of the total number of analysed samples (1 duplicate every 10 samples), and field and transport blanks represent 5% (1 of each type for every 20 samples). Blank preparation, field handling and interpretation can be found in Gunning et al. (2017). This procedure was applied throughout the monitoring programme. As a result, only one pair of duplicates was taken during baseline monitoring (GW1; March 2016). Additional duplicates and blanks were taken during the operational monitoring period.

Uncertainties of laboratory and downhole measurements
Measurement uncertainties (u) arising from analyses ($s^2_{\mathrm{analytical}}$) and sampling ($s^2_{\mathrm{sampling}}$) were estimated using an empirical approach based on a statistical model (Eq. 1; Ramsey and Ellison 2007). The complexity of the empirical method depends on the number of sampling teams involved in the sampling process and the number of sampling protocols followed. Here, only one sampling team is involved using the same protocol (exceptions detailed below). As a result, the 'duplicate' method was applied, which allows estimation of a combined analytical and sampling uncertainty. By this method, only the precision component of each uncertainty ($s^2_{a,\mathrm{precision}}$ and $s^2_{s,\mathrm{precision}}$ in Eqs. 2 and 3) is evaluated and the bias component ($s^2_{a,\mathrm{bias}}$ and $s^2_{s,\mathrm{bias}}$ in Eqs. 2 and 3) is assumed to be negligible. The theory behind the duplicate method together with examples can be found in Grøn et al. (2007), JCGM (2008), Ramsey and Ellison (2007) and Witczak et al. (2006). More details on the calculation of uncertainties applied to this study are available in Montcoudiol et al. (2018).
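Eqs. 1 to 3 are not reproduced in this text. A plausible form, assuming the usual variance decomposition of the empirical approach and using only the symbols named in the paragraph above, would be:

```latex
\begin{align}
u = s_{\mathrm{meas}} &= \sqrt{s^2_{\mathrm{sampling}} + s^2_{\mathrm{analytical}}} \tag{1} \\
s^2_{\mathrm{analytical}} &= s^2_{a,\mathrm{precision}} + s^2_{a,\mathrm{bias}} \tag{2} \\
s^2_{\mathrm{sampling}} &= s^2_{s,\mathrm{precision}} + s^2_{s,\mathrm{bias}} \tag{3}
\end{align}
```

The symbol $s_{\mathrm{meas}}$ and the equation numbering are assumptions made for this sketch. With the bias terms neglected, the duplicate method estimates the sum of the two precision variances.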
It is assumed that measurement uncertainties are constant during the entire monitoring programme. Therefore, uncertainties are estimated by using all duplicates collected over the course of the 2-year monitoring programme (representing 10% of the collected samples). A total of five duplicate pairs are available (Table 3 in ESM). Grøn et al. (2007) recommend using a minimum of eight sets of duplicates to obtain a reliable estimate. Fewer can result in an overestimation of the uncertainty (Ramsey and Ellison 2007).
The same method is used for isotope data, for which a large number of duplicates/triplicates is available during the baseline monitoring programme (8 duplicates and 8 triplicates). The 'duplicate' method was modified as necessary when applied to triplicate samples as explained in Montcoudiol et al. (2018). The method provides the uncertainty of one measurement. Rules of error propagation were applied to estimate the uncertainty for the average duplicate/triplicate result.
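As an illustration of the idea, the sketch below estimates a combined sampling and analytical uncertainty from duplicate pairs using relative differences, then propagates it to the mean of n replicates. It is a simplified variant, not necessarily the exact calculation used in this study (see Montcoudiol et al. 2018); the function names, the coverage factor of 2 and the example values are assumptions.

```python
import math


def duplicate_uncertainty(pairs, coverage=2.0):
    """Relative expanded uncertainty (%) from duplicate pairs.

    Assumes the relative standard deviation is estimated as
    sqrt(sum(d_i^2) / (2n)), where d_i is the relative difference
    of pair i. This is a simplified form of the duplicate method.
    """
    if not pairs:
        raise ValueError("at least one duplicate pair is required")
    squared = []
    for first, second in pairs:
        mean = (first + second) / 2.0
        if mean == 0:
            raise ValueError("pair mean is zero; relative difference undefined")
        squared.append(((first - second) / mean) ** 2)
    rsd = math.sqrt(sum(squared) / (2 * len(pairs)))
    return coverage * rsd * 100.0


def uncertainty_of_mean(u_single, n_replicates):
    """Propagate the uncertainty of one measurement to the mean of n replicates."""
    return u_single / math.sqrt(n_replicates)


# Hypothetical duplicate concentrations (mg/l), not data from this study.
example_pairs = [(9.0, 11.0)]
expanded = duplicate_uncertainty(example_pairs)
```

When all duplicate pairs are identical, the function returns zero, which mirrors the situation where rounding hides the real variability.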
Uncertainty for temperature and conductivity measurements is estimated using the manufacturer's specifications provided in "Continuous monitoring of temperature and electrical conductivity" section. The resolution is the smallest change that the sensor can register and therefore bounds the best precision that can be obtained; in practice, the precision is usually coarser than the resolution. The accuracy relates to the systematic error or bias in the measurement.

Ion balance error
The ion balance error (IBE) was calculated for each analysis. Results of electrical balances are compiled in Table 2 of the ESM. They are mostly between 5 and 10%. Additional discussion regarding these results is available in Gunning et al. (2017). For duplicates, the sample with the IBE closest to zero was retained for further data analysis and interpretation.
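A minimal sketch of an ion balance calculation is given below, assuming the common definition IBE = 100 × (Σcations − Σanions) / (Σcations + Σanions) with sums in meq/l. The exact formula used in this study is not stated here, and the ion list and sample values are illustrative only.

```python
# Molar mass (g/mol) and charge for a few major ions (illustrative subset).
IONS = {
    "Ca": (40.078, 2),
    "Mg": (24.305, 2),
    "Na": (22.990, 1),
    "K": (39.098, 1),
    "HCO3": (61.017, -1),
    "Cl": (35.453, -1),
    "SO4": (96.06, -2),
}


def ion_balance_error(conc_mg_l):
    """Return the ion balance error (%) for concentrations given in mg/l."""
    cations = 0.0
    anions = 0.0
    for ion, mg_l in conc_mg_l.items():
        molar_mass, charge = IONS[ion]
        meq_l = mg_l / molar_mass * abs(charge)  # convert mg/l to meq/l
        if charge > 0:
            cations += meq_l
        else:
            anions += meq_l
    if cations + anions == 0:
        raise ValueError("no ions supplied")
    return 100.0 * (cations - anions) / (cations + anions)


def closest_to_zero(duplicates):
    """Pick the duplicate analysis whose IBE is closest to zero."""
    return min(duplicates, key=lambda sample: abs(ion_balance_error(sample)))
```

The helper `closest_to_zero` reflects the selection rule for duplicates described above.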

Statistical analysis
A series of statistical tests were performed on the data to define the baseline conditions at the site. These are listed in Table 1, in which α-levels from the US EPA (2009) are included. Full description of the tests including their underlying assumptions is available in Helsel and Hirsch (2002) and US EPA (2009).

Basic assumptions
Normality, statistical independence and equality of variance are the basic assumptions for a number of tests and are tested first. Independence might be achieved when sampling on an occasional basis, but this is certainly not the case for data collected by the downhole probes at short intervals (e.g. 15-min intervals). Although physical independence does not guarantee statistical independence (but makes it more likely), physical independence of the measurements is first estimated for these datasets by calculating a minimum time between measurements to ensure that distinct volumes of groundwater are measured. This method, based on Darcy's law, is outlined in Chou (2004) and US EPA (2009). The normality and equality of variance tests are complemented by a graphical method, i.e. boxplots from Tukey (1977). Spatial variability and temporal variability are tested by using the Kruskal-Wallis test (Kruskal and Wallis 1952) and its reverse version, respectively. Despite the small size of the datasets (3 groups of size 5 or less or 4 or more groups of size 4 or less per group; Helsel and Hirsch 2002), the large-sample approximation is used due to the presence of ties (samples with identical concentrations) within the data, with the exception of strontium for which an exact test is computed (Helsel and Hirsch 2002).
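The physical independence check based on Darcy's law can be sketched as follows. The assumption here is the usual form in which the minimum time between measurements equals the well diameter divided by the average linear (seepage) velocity, v = K·i/nₑ. The parameter values in the example are hypothetical and are not the site values.

```python
def seepage_velocity(hydraulic_conductivity, gradient, effective_porosity):
    """Average linear groundwater velocity (m/day) from Darcy's law."""
    if effective_porosity <= 0:
        raise ValueError("effective porosity must be positive")
    return hydraulic_conductivity * gradient / effective_porosity


def min_time_between_measurements(hydraulic_conductivity, gradient,
                                  effective_porosity, well_diameter):
    """Time (hours) for groundwater to travel one well diameter.

    hydraulic_conductivity in m/day, well_diameter in m.
    """
    velocity = seepage_velocity(hydraulic_conductivity, gradient,
                                effective_porosity)
    if velocity <= 0:
        raise ValueError("velocity must be positive")
    return well_diameter / velocity * 24.0


# Hypothetical values: K = 10 m/day, i = 0.01, n_e = 0.25, diameter = 0.1 m.
hours = min_time_between_measurements(10.0, 0.01, 0.25, 0.1)
```

A lower hydraulic conductivity lengthens the minimum time in direct proportion, which is why a less permeable location needs a longer interval.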

Confidence intervals
Baseline conditions are defined by calculating confidence intervals. Ideally, a nonparametric approach would be used. However, due to the limited amount of data, nonparametric intervals cannot be defined with a sufficient level of confidence (i.e. 87.5% with n = 4 vs. 95% or more). Parametric confidence intervals are based on assumptions of independent and identically distributed measurements. In other words, there should be no outliers, the measurements are statistically independent, there is no trend (no temporal variability) and no spatial variability, and the data are approximately normally distributed (US EPA 2009). A parametric approach is therefore used, after testing for normality, independence and the presence of spatial and temporal variability (Table 1).
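The sketch below shows a parametric two-sided 95% confidence interval around the mean for a small per-well dataset, together with one reading of the 87.5% figure: the confidence level of the interval [min, max] for the median, 1 − 2·(0.5)ⁿ, which gives 87.5% for n = 4. The tabulated Student's t critical values are standard; the function names and example data are assumptions.

```python
import math
import statistics

# Two-sided 95% Student's t critical values, indexed by degrees of freedom.
T_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
         6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def mean_confidence_interval(values):
    """Return (lower, upper) 95% confidence limits around the mean."""
    n = len(values)
    df = n - 1
    if df not in T_975:
        raise ValueError("sketch supports 2 to 11 observations only")
    mean = statistics.mean(values)
    half_width = T_975[df] * statistics.stdev(values) / math.sqrt(n)
    return mean - half_width, mean + half_width


def nonparametric_level(n):
    """Confidence level of [min, max] as an interval for the median."""
    return 1.0 - 2.0 * 0.5 ** n


# Hypothetical concentrations from four sampling rounds in one well.
lower, upper = mean_confidence_interval([1.0, 2.0, 3.0, 4.0])
```

With four sampling rounds per well, the nonparametric level cannot exceed 87.5%, which is the motivation for the parametric approach described above.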

Assessing normality
In general, the mean and the median are similar, but they are not always located in the middle of the boxplot box (representing the interquartile range), suggesting some skew in the data distribution (Fig. 2). Some extreme values are present although not identified as outliers (e.g. Na in GW1, HCO₃⁻ in GW2 and Mn in GW1). The outliers observed for temperature and conductivity data result from the impact of well-purging during groundwater sampling (Figs. 2 and 3 in ESM).
Results from the Shapiro-Wilk test (Shapiro and Wilk 1965) indicate that in most cases the hypothesis of normal distribution cannot be rejected, considering the α-level of 0.1 recommended by Helsel and Hirsch (2002) and the US EPA (2009) to maximise the utility of the test for small datasets (n < 10; Fig. 2). The parameters and wells where this hypothesis is rejected are those exhibiting small concentrations with limited variations (e.g. Mg, Na, K and Ba) or with one sample with a very different concentration (visible in the boxplots in Fig. 2). In the latter case, a departure from normality is plausible. As a result, for inorganic parameters and isotope ratios, the more robust nonparametric tests are used where possible and normality is assumed when a sufficient level of confidence cannot be reached (e.g. confidence intervals).
Normality is not tested for the large datasets obtained with the downhole probes (n > 16,000), for which the central limit theorem (Pólya 1920) applies. The departure from normality (tailing) is largely due to the impact of sampling on groundwater conditions and is not representative of the background conditions. Independence of data at 15-min intervals is discussed in the following subsection.

Statistical independence
For inorganic constituents and isotope ratios, statistical independence with respect to temporal variability is confirmed by the results of the rank von Neumann ratio test, for which the p values are larger than the 0.01 α value. However, no explicit adjustment of the ratio has been developed in the presence of ties among ranks, resulting in very approximate p values.
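A minimal sketch of the rank von Neumann ratio is shown below, assuming the usual definition v = Σ(Rᵢ − Rᵢ₊₁)² / (n(n² − 1)/12) on the ranks of the time-ordered series, with tied values given average ranks. Critical values or p values would come from published tables and are not computed here.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning the average rank to ties."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2.0 + 1.0  # mean of positions start..end
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def rank_von_neumann_ratio(series):
    """Rank von Neumann ratio of a time-ordered series."""
    n = len(series)
    if n < 2:
        raise ValueError("at least two observations are required")
    ranks = average_ranks(series)
    numerator = sum((ranks[i] - ranks[i + 1]) ** 2 for i in range(n - 1))
    return numerator / (n * (n * n - 1) / 12.0)
```

Ratios well below 2 point to positive serial correlation (for example, a steady trend), whereas values near 2 are consistent with independence.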
Data at 15-min intervals (i.e. temperature and conductivity) are not statistically independent with respect to time. Based on data available prior to drilling of the monitoring boreholes, physical independence is estimated to be obtained within 9 h. Additional data from drilling confirm this first estimate except for GW2 where the local hydraulic conductivity is significantly lower (Table 2). For this borehole, the minimum time would be around 46 h (~ 2 days).
Results from the rank von Neumann ratio test show that statistical independence is reached over 2-week or 3-week time intervals for most parameters. It must, however, be borne in mind that the p values in Table 2 are approximate due to the presence of ties and therefore are only indicative.
The p values are also sensitive to the impact of sampling on the values (i.e. selection of the initial data point from a total of 1920 possibilities), especially for the conductivity in GW1, which shows a longer impact from the sampling (up to 10 days; Fig. 3 in ESM). The new datasets at 3-week intervals are checked again for normality. The hypothesis of normality is rejected for conductivity in GW1 and GW3 with an α-level of 0.1.

Equality of variance
The examination of the boxplots suggests significant differences between the variances of the wells (Fig. 2) except for arsenic, strontium and isotope ratios. From the Levene's test results, significant departure from equality of variance is observed for potassium and sodium only […]. For temperature and conductivity data, the boxplots suggest similar variance (Fig. 2). However, because the Kruskal-Wallis test assumes statistical independence […] (Table 3), and similar variance between the datasets is assumed.

Table 2 Assessing physical and statistical independence. Prior information comes from shape files transmitted by the Polish Geological Institute. Hydraulic conductivity values result from the interpretation of pumping test data (GT 2015). Hydraulic gradients are estimated based on the results of a preliminary groundwater numerical model (Gunning et al. 2017)

Spatial and temporal variability
Results from the Kruskal-Wallis test (spatial variability) show that only bicarbonate and fluoride (and possibly calcium) concentrations and δ²H ratios have p values higher than the chosen α-level (Table 3). This means that these parameters are not affected by spatial variability and additional statistical tests could be run on an inter-well basis. For the other parameters, statistical tests on a per-well basis are required. The reverse Kruskal-Wallis test for temporal variability only works for parameters with no significant spatial variability; otherwise, it may overshadow the temporal variability (as is the case in this study; Table 3). Of these four parameters, only the δ²H isotope ratios show significant temporal variability.
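The Kruskal-Wallis test with the large-sample (chi-square) approximation and a correction for ties can be sketched in plain Python as follows. Groups are wells for the spatial test; for the 'reverse' version, groups would be sampling rounds. The function names and example data are assumptions, and this sketch is not the software used in the study.

```python
import math


def _average_ranks(values):
    """Rank values from 1..n, assigning the average rank to ties."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution (integer df)."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0 - 1e-9:  # recurrence Q(a+1, y) = Q(a, y) + y^a e^-y / Gamma(a+1)
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def kruskal_wallis(groups):
    """Return (H, p) using the tie-corrected large-sample approximation."""
    values = [v for group in groups for v in group]
    n = len(values)
    ranks = _average_ranks(values)
    total, pos = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[pos:pos + len(group)])
        total += rank_sum ** 2 / len(group)
        pos += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    correction = 1.0 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    h /= correction
    return h, chi2_sf(h, len(groups) - 1)


# Hypothetical concentrations for three wells, three rounds each.
h_stat, p_value = kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

A p value below the chosen α-level indicates that at least one group differs from the others.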

Uncertainties
Maximum uncertainties on the concentrations (at 95% confidence level) are in the 20-25% range for calcium, chloride, fluoride and manganese (Table 3 in ESM). For comparison, maximum uncertainties in the EU Directive 98/83/EC on the quality of drinking water (EC 1998) are 30% for arsenic and 20% for fluoride (Cl, Mn, Na, SO₄²⁻ and conductivity have been removed in the new directive proposal; EC 2018).
Estimates of uncertainty are made more difficult by a change of analytical method for alkalinity determination (from December 2016), affecting the last duplicate. Based on the entire dataset, it is believed that the new technique has a better analytical precision: before December 2016, analytical results appear to be rounded to the nearest ten, and after December 2016, to the nearest unit. This has some effect on the results of the duplicate method. Table 3 of the Electronic Supplementary Material shows that samples analysed by the initial technique have the same alkalinity values, whereas the last pair of duplicates analysed with the new method presents a difference of 3 mg/l. From the same date, the filtration and acidification for cations and metals were carried out in the field instead of the laboratory. However, this change seems to have had an insignificant impact on the duplicate concentrations.
Uncertainties cannot be defined when all the sets of duplicates have the same concentrations. This is the case for magnesium, potassium and barium. This is due to the precision of the analytical results, which have been rounded to the nearest unit, and suggests that sampling does not add uncertainty larger than the precision of the analysis results. Maximum theoretical uncertainties can be assessed by taking the rounding into consideration. With the initial analytical technique for alkalinity determination, the concentration uncertainty is ± 5 mg/l, which is not accounted for in the calculation of uncertainty. Taking into account the maximum difference between two duplicates (9 mg/l), the uncertainty could be up to 9.5% (at 95% confidence level). In the same fashion, maximum uncertainty could be 25%, 16%, > 100% and 22% for arsenic (± 0.05 µg/l), barium (± 0.5 µg/l), potassium (± 0.5 mg/l) and magnesium concentrations (± 0.5 mg/l), respectively.
Uncertainties of δ2H and δ18O isotope ratios calculated using the 'duplicate' method are very similar to the reproducibility given by the laboratory (Table 4 in ESM). The calculated uncertainties combine analytical and sampling precision, whereas reproducibility only accounts for analytical precision. Results demonstrate that sampling does not add more uncertainty to the isotopic ratios than analysis. Reproducibility data are used as a proxy for uncertainties of δ2H and δ18O isotope ratios.
For the temperature and conductivity measurements, the precision is assumed to be 10 times the resolution. Therefore, precision and accuracy are the same: ± 0.1 °C and ± 1% of the reading for temperature and conductivity, respectively. Combined, the total uncertainty of temperature and conductivity is ± 0.14 °C and ± 1.4% of the reading, respectively (at 95% confidence level).
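The combined values quoted above are consistent with adding precision and accuracy in quadrature (root sum of squares). The short sketch below reproduces the arithmetic under that assumption; the function name is illustrative only.

```python
import math


def combine_in_quadrature(*components):
    """Root-sum-of-squares combination of independent uncertainty components."""
    return math.sqrt(sum(c * c for c in components))


# Precision and accuracy are taken as equal, as stated in the text.
temperature_total = combine_in_quadrature(0.1, 0.1)    # degrees Celsius
conductivity_total = combine_in_quadrature(1.0, 1.0)   # % of the reading

print(round(temperature_total, 2), round(conductivity_total, 1))  # 0.14 1.4
```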

Chemical characterisation of the aquifer
All the data handling and processing have been extensively discussed in Gunning et al. (2017). Only the main results are presented in this section.

Major, minor and trace elements
The groundwater is characterised by low mineralisation and is of Ca-HCO₃ water type (Fig. 3). Low mineralisation is also corroborated by the measurements of groundwater conductivity, varying within the range of 440-500 µS/cm (Fig. 3 in ESM). Generally speaking, the groundwater quality is similar in the four monitoring boreholes, i.e. concentrations and conductivity values are of the same order of magnitude, suggesting good hydraulic connectivity between them. From a statistical point of view (Kruskal-Wallis test, α = 0.05), only bicarbonates, fluoride and possibly calcium concentrations are similar in the four wells (Table 3), whereas for the other parameters the concentrations in at least one well differ significantly from the other wells. When considering the uncertainties in concentrations, the groundwater chemistry shows limited temporal variability during the baseline monitoring (Fig. 4). The only exceptions are bicarbonate and sulphate concentrations, which appear to be significantly different from one sampling event to another. This is not statistically confirmed for bicarbonates since the uncertainties for alkalinity are likely to be underestimated for the reasons discussed in the "Uncertainties" section.
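For readers unfamiliar with the Kruskal-Wallis test, the following is a minimal sketch of the H statistic for four wells, compared with the chi-square critical value for three degrees of freedom at α = 0.05 (7.815). The chi-square approximation is rough for samples this small, the well data are hypothetical, and this is not the processing code used in the study.

```python
from collections import Counter

CHI2_CRIT_DF3_ALPHA05 = 7.815  # chi-square critical value, 3 degrees of freedom


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the usual correction for ties."""
    pooled = [x for group in groups for x in group]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n_total * (n_total + 1)) * total - 3.0 * (n_total + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1.0 - ties / (n_total ** 3 - n_total)
    return h / correction if correction > 0 else 0.0


# Hypothetical concentrations (mg/l), four sampling events in four wells.
differing_wells = [[21, 22, 23, 24], [25, 26, 27, 28],
                   [29, 30, 31, 32], [33, 34, 35, 36]]
similar_wells = [[21, 25, 29, 33], [22, 26, 30, 34],
                 [23, 27, 31, 35], [24, 28, 32, 36]]

for name, wells in (("differing", differing_wells), ("similar", similar_wells)):
    h = kruskal_wallis_h(wells)
    print(name, round(h, 3), "reject H0" if h > CHI2_CRIT_DF3_ALPHA05 else "keep H0")
```

The first dataset, in which each well occupies its own concentration range, exceeds the critical value; the second, in which the wells overlap completely, does not.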
Most minor and trace elements have concentrations near or below the detection limit (Table 2 in ESM). Notable exceptions are manganese and strontium (both about 100 times the detection limit), fluoride (concentrations quite variable), barium (concentrations at least 10 times the detection limit) and arsenic (detected in all samples). Traces of boron, chromium, copper, iron, nickel, antimony and selenium were found on some occasions. The remaining elements were systematically below their respective detection limits. Temporal variations of minor and trace elements in significant concentrations are shown in Fig. 4. Considering uncertainties, strontium concentrations show some significant temporal variability. It is interesting to note that the temporal variations for strontium follow the same trend in the four wells and are similar to sulphate concentration variations. Spatial variation is also quite obvious for strontium, barium and arsenic (less obvious for manganese due to larger uncertainty), with at least one well having a different concentration. This confirms the results from the Kruskal-Wallis test (Table 3).
With respect to compliance with Polish drinking water standards, manganese concentrations exceed the standard by two to three times (Fig. 4; Table 2 in ESM). The detection limit for bromide (0.05 mg/l) is five times the Polish drinking water standard (0.01 mg/l). Bromide was detected on several occasions, suggesting that bromide concentrations for the rest of the samples are close to the detection limit and likely above the Polish drinking water standard. All other elements have concentrations below Polish drinking water standards.

Temperature and specific conductivity
The stability of the groundwater chemistry is confirmed by the continuous monitoring of the specific conductivity and temperature. The specific conductivity was fairly constant during the baseline monitoring period and was similar in all the monitoring boreholes (varying between 440 and 500 µS/cm; Fig. 2 in ESM). Similarly, the temperature was also fairly constant over time in all boreholes, ranging from 7.85 to 8.20 °C (Fig. 3 in ESM). This is confirmed by the results of the Kruskal-Wallis test on sub-datasets at three-week intervals (Table 3). The impact of purging and sampling was clearly visible on the specific conductivity for all monitoring boreholes except GW3 (Fig. 2 in ESM). The effect was variable from borehole to borehole: a systematic decrease by 60-70 µS/cm was observed in GW1, whereas a systematic increase could be seen in GW4. The purging and sampling also affected the temperature records, with a systematic increase by 0.03 to 0.04 °C in GW2 for instance (Fig. 3 in ESM). Although the aquifer appears globally homogeneous in terms of groundwater quality across all monitoring boreholes, these small changes on sampling appear to reflect 'new' groundwater being drawn into the well during sampling and illustrate the results from the Kruskal-Wallis test (Table 3). Prior to the pump being switched on, the sensor 'sees' a volume of water which may have been resident in the well for some time, or which has migrated slowly through the well screen under natural head gradients. As soon as the pump is switched on, this volume is rapidly replaced by a new flow of groundwater from the aquifer, migrating under a high induced head gradient (and possibly encompassing a more three-dimensional flow regime, i.e. water from above and below the screened horizon as well as groundwater within the screened horizon).

Dissolved gases
Dissolved gases were not detected during the baseline monitoring, possibly as a result of high detection limits (e.g. 40 µg/l for methane; Table 2 in ESM).

δ2H and δ18O isotope ratios
No GNIP (Global Network of Isotopes in Precipitation; IAEA/WMO 2016) station is present in northern Poland that could provide a comparison for the groundwater data. Instead, interpolated monthly average data (IAEA 2016;Terzer et al. 2013) for the study area were used as a proxy for the local meteoric water line (LMWL; Gunning et al. 2017). The LMWL has a similar equation to the global meteoric water line (GMWL) defined by Craig (1961) (Fig. 5).
Data were in the same range for all monitoring boreholes, with δ2H ratios varying from − 71 to − 60‰ and δ18O ratios from − 10.3 to − 9.3‰ (Table 4 in ESM). Significant temporal variability was detected by the Kruskal-Wallis test for δ2H ratios (p value = 0.01; Table 3) and was confirmed when considering the uncertainties (in particular for GW2 and GW3; Fig. 4 in ESM). δ18O values were slightly less negative in GW2 than in the other boreholes (Table 4 in ESM). This difference appears to be significant (Fig. 4 in ESM) and was detected by the Kruskal-Wallis test (p value = 0.02; Table 3). With regard to the uncertainties of the yearly interpolated data, all these differences appear to be relatively insignificant (Fig. 5). They are likely to simply result from some natural variation of the recharge and complex recharge pathways.
Fig. 4 Temporal variation in concentration for major elements, and minor and trace elements in significant concentrations. Error bars: uncertainties calculated by the 'duplicate' method with a 95% confidence level. For all duplicate sets with the same concentrations (Mg, K, Ba and As), no uncertainties could be calculated. Limits of detection (LoD) and drinking water standards (DWS) shown when included within the display of the y-axis. For chloride and manganese, LoD not visible due to their low value (0.01 mg/l and 1 μg/l, respectively)
The data plot close to the LMWL, within the confidence interval of the interpolated LMWL (Fig. 5). The data confirm that the recharge occurs under current climatic conditions, as is expected for aquifers hosted in Quaternary sediments deposited during the last glaciation. The data actually form a small cluster near the interpolated annual mean isotope signature for precipitation, as often observed under temperate climatic conditions (Clark and Fritz 1997).
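To make the comparison with a meteoric water line concrete, the sketch below evaluates the global meteoric water line of Craig (1961), δ2H = 8 δ18O + 10, and the offset of a sample from it. Only the GMWL is used because the LMWL equation is not reproduced in this section; the sample values are hypothetical and merely chosen within the ranges reported above.

```python
def gmwl_d2h(d18o):
    """delta-2H (per mil) predicted by the global meteoric water line (Craig 1961)."""
    return 8.0 * d18o + 10.0


def offset_from_gmwl(d2h, d18o):
    """Vertical distance (per mil, in delta-2H) of a sample from the GMWL."""
    return d2h - gmwl_d2h(d18o)


# Hypothetical sample within the reported ranges
# (delta-2H from -71 to -60 per mil, delta-18O from -10.3 to -9.3 per mil).
sample_d18o, sample_d2h = -9.8, -68.0

print(round(gmwl_d2h(sample_d18o), 1))
print(round(offset_from_gmwl(sample_d2h, sample_d18o), 1))
```

A small offset, of the order of the measurement uncertainty, indicates a sample plotting on the line; a full assessment would also need the confidence interval of the interpolated LMWL shown in Fig. 5.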

Confidence intervals around the mean
Results from the previous statistical tests show some small departure from normality, mostly similar variances and spatial variations between the wells, and limited temporal variability. As a consequence, data from the different wells cannot be pooled together and confidence intervals have to be defined on a per-well basis (intra-well; Helsel and Hirsch 2002;US EPA 2009). Due to the limited number of data, parametric confidence intervals were defined instead of nonparametric intervals, with a confidence level of 95%. In this study, concentrations of naturally occurring inorganic parameters were measured under background conditions. The concentrations and isotope ratios appear to fluctuate around an average value. Therefore, assuming normality would be reasonable for most datasets (US EPA 2009). Nonparametric confidence intervals with an appropriate level of confidence can only be defined for temperature and conductivity measurements (Table 4).
The width of the confidence intervals around the mean reflects the presence of temporal variability or the occurrence of one anomalous value that might have been accounted for as an outlier if a larger dataset were available (e.g. Na and Mg in GW1). In that case, this corroborates the departure from normality identified for sodium in GW1 and from equality of variance (Fig. 2 and Table 3). Confidence intervals show their limitation for parameters demonstrating variability and that are close to their detection limit, e.g. arsenic in GW3 with a negative value for its lower limit.
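The behaviour described above can be reproduced with a minimal sketch of a parametric (Student's t) confidence interval around the mean. The tabulated two-sided 95% critical values are standard; the function name and the concentrations are hypothetical and do not come from Table 4.

```python
import math
import statistics

# Two-sided 95 % critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def mean_confidence_interval(values):
    """Return (lower, upper) limits of the 95 % t-based interval around the mean."""
    n = len(values)
    if n < 2 or (n - 1) not in T_CRIT_95:
        raise ValueError("this sketch supports 2 to 11 values")
    mean = statistics.mean(values)
    half_width = T_CRIT_95[n - 1] * statistics.stdev(values) / math.sqrt(n)
    return mean - half_width, mean + half_width


# Hypothetical data from four sampling events in one well.
stable_parameter = [48.0, 49.0, 47.0, 48.0]       # mg/l, little variability
near_detection_limit = [0.1, 0.1, 0.6, 0.1]       # ug/l, one anomalous value

print(mean_confidence_interval(stable_parameter))
print(mean_confidence_interval(near_detection_limit))
```

In the second dataset a single anomalous value inflates the standard deviation enough to push the lower limit below zero, which is physically meaningless for a concentration.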
[Figure caption, continued] Each coloured point represents one sample (average of duplicates or triplicates). For the sake of clarity, error bars for uncertainty on the measurement are only displayed in the legend. Interpolated monthly and yearly data come from IAEA (2016) and Terzer et al. (2013)
Taking the minimum and maximum values as lower and upper limits, nonparametric confidence intervals have been defined for temperature and conductivity with a confidence level of ~ 99%. As a result, they are slightly wider than their parametric counterparts. Parametric confidence intervals at the 99% confidence level (not shown here) are similar to the nonparametric ones for temperature (in all wells) and conductivity in GW3 and GW4. They are slightly narrower for the conductivity in GW1 and GW2. The lower and upper limits of the nonparametric intervals are the minimum and maximum values in the datasets at this level of confidence. With such a limited amount of data, nonparametric intervals are sensitive to the dataset values and are not as robust as the parametric ones.
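The link between sample size and the confidence attached to a minimum-maximum interval can be sketched as follows. This assumes, as in standard order-statistic theory, that the interval is read as an interval for the population median and that the n observations are independent, giving a confidence of 1 − 2(1/2)ⁿ. The number of sub-datasets used in the study is not stated in this section, so the values of n below are illustrative only.

```python
def min_max_confidence(n):
    """Confidence that [min, max] of n independent samples brackets the median.

    Assumption: continuous distribution and independent observations; the
    interval fails only if all n values fall on the same side of the median.
    """
    if n < 2:
        raise ValueError("at least two observations are required")
    return 1.0 - 2.0 * 0.5 ** n


for n in (4, 6, 8):
    print(n, min_max_confidence(n))
```

Under these assumptions, four sampling events cannot reach 95% confidence, whereas eight independent values give about 99%, which is consistent with nonparametric intervals being practical only for the larger temperature and conductivity datasets.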

Comparison between uncertainty and confidence intervals
The width of the confidence intervals around the mean is compared with the uncertainty on the measurements where both are defined (Table 4). Three cases can be distinguished with the width of confidence intervals similar to, higher than or lower than uncertainties (illustrated in Fig. 6).
The width of the confidence interval is similar to the estimated uncertainties for sodium (except in GW1), chloride (except GW3), fluoride (GW2 and GW4), δ18O and specific conductivity (GW2 and GW4). In these cases, the temporal variations reflect the uncertainties on the measurements rather than true temporal variations. Therefore, these parameters do not show any significant temporal variations.
Confidence intervals wider than uncertainties can be explained by the presence of anomalous values, e.g. sodium and manganese in GW1 (Fig. 6). In the case of alkalinity, although the uncertainties might have been underestimated, the maximal theoretical uncertainties are still much less than the width of the confidence interval (with the exception of GW1). In other cases (sulphate, fluoride-GW1 and GW4; strontium, δ2H and specific conductivity-GW1), it suggests the presence of temporal variations within the wells. This comparison confirms what could be qualitatively inferred from Fig. 4 and Fig. 4 of the ESM.
Uncertainty can be wider than the confidence intervals, e.g. calcium, chloride (GW3), manganese (except GW1), temperature and conductivity (GW3). For calcium and manganese, the uncertainty could be affected by the limited number of duplicates and therefore be overestimated. In particular, for calcium, one pair of duplicates has a large concentration difference. Manganese concentrations are expressed with two significant digits (rounded to the nearest ten for concentrations above 100 µg/l and to the nearest unit for concentrations below 100 µg/l). The impact on the uncertainty is similar to that described for alkalinity ("Uncertainties" section). These parameters are essentially constant during baseline monitoring.
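The three cases distinguished in this section can be expressed as a simple decision rule. The sketch below is an illustration only: it compares the half-width of the confidence interval with the absolute measurement uncertainty, and the 25% tolerance used to decide that the two are 'similar' is a hypothetical choice, not a criterion taken from this study.

```python
def compare_ci_to_uncertainty(ci_lower, ci_upper, mean, rel_uncertainty_pct,
                              tolerance=0.25):
    """Classify the CI half-width against the measurement uncertainty.

    rel_uncertainty_pct is the relative uncertainty (%, 95 % confidence);
    tolerance is a hypothetical band within which both are called similar.
    """
    uncertainty = abs(mean) * rel_uncertainty_pct / 100.0
    if uncertainty == 0:
        raise ValueError("uncertainty must be non-zero")
    ratio = (ci_upper - ci_lower) / 2.0 / uncertainty
    if ratio > 1.0 + tolerance:
        return "wider"      # anomalous value or true temporal variation
    if ratio < 1.0 - tolerance:
        return "narrower"   # essentially constant; uncertainty may be overestimated
    return "similar"        # variation explained by measurement uncertainty


# Hypothetical parameters with a mean of 10 mg/l and 10 % uncertainty.
print(compare_ci_to_uncertainty(9.0, 11.0, 10.0, 10.0))
print(compare_ci_to_uncertainty(6.0, 14.0, 10.0, 10.0))
print(compare_ci_to_uncertainty(9.8, 10.2, 10.0, 10.0))
```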

Comparison with other regional studies
Pruszkowska and Malina (2008) defined groundwater background quality levels for the Kashubian Lake District, based on the analyses of groundwater from ~ 1400 wells between 1990 and 2005. With regard to their study, the Wysin site is [...] (Table 5). The concentration range observed at the Wysin site during the baseline monitoring study is mostly within background levels defined by Pruszkowska and Malina (2008). Calcium concentrations at Wysin tend towards the upper limit of the background concentrations, with concentrations in GW4 sitting just outside the limit (Fig. 7). The range of calcium concentrations in Pruszkowska and Malina (2008)'s study extends to 185 mg/l. Only the range of sodium concentrations is available, with concentrations up to 16 mg/l. The 20 mg/l value in GW1 is certainly unusual and appears to be an outlier.
Table 4 Parametric (P) confidence intervals (CI) around the mean with 95% confidence level for different groundwater quality parameters. Nonparametric (NP) intervals around the mean are given for temperature and conductivity measurements. Unique values indicate that the concentrations were constant during baseline monitoring. Uncertainties at 95% (converted to mg/l) are given for comparison. For alkalinity, theoretical uncertainty is in parentheses (see "Uncertainties" section). Alkalinity unit is mg/l CaCO₃
Isotope data from the Kashubian Lake District are available in Pruszkowska and Malina (2008), for seven wells located 30 km north of the Wysin site. Three additional analysis results are presented in Kachnic and Kachnic (2010), in wells located up to 50 km south of the Wysin site. The δ2H and δ18O isotope signatures observed during the baseline monitoring are within the range measured by Pruszkowska and Malina (2008). The range from Kachnic and Kachnic (2010) is narrower due to the limited number of data points, although they are from a wider area, and is within the range observed in this study (Table 5).
The δ2H and δ18O signatures at the Wysin site are not depleted with respect to the weighted mean for precipitation, indicating that snowmelt does not form an essential part of the recharge, contrary to northern parts of the BAB (Raidla et al. 2016). Such depletion would also be enhanced by low recharge rates due to evapotranspiration processes in the summer (recharge pattern typical of temperate climates). The absence of depletion with respect to the weighted mean for precipitation suggests complex recharge processes due to the semi-confinement of the aquifer, the depth of the sampling points, the distance from the potential recharge area and the groundwater velocity resulting in a good mixing of groundwater (Clark and Fritz 1997).
Fig. 6 Uncertainties on the measurements (grey error bars) and parametric confidence intervals at 95% (grey dashed lines) around the mean (straight line) for a selection of parameters (Cl, Na, Sr, Ba, Mn and specific conductivity) considered as indicators of potential contamination by shale gas activities. For conductivity, the median (all data) is shown instead of the mean. Same colour scheme as in previous figures

Baseline conditions at Wysin
The PGI-NRI carried out a baseline monitoring programme in September 2012, prior to any drilling at the shale gas site (Konieczyńska et al. 2014). As part of this programme, 12 water samples were collected (location of the sampling points shown in Fig. 1). The dug wells and the manual soundings tap into the shallow unconfined/perched aquifer. Of the two springs, one is considered to be representative of the groundwater quality of the main aquifer (Konieczyńska et al. 2014). For comparison with this study, only the samples for the drilled boreholes and the above-mentioned spring are considered to be related to the main aquifer level. Historical data are available for one of the drilled boreholes (from 1976 at the time of the drilling). Analysis data are also available for a new well drilled nearby in 2014. Data are compiled in Table 6.
Fig. 7 Comparison of the baseline data from this study with natural regional background concentrations (grey shaded areas) from Pruszkowska and Malina (2008) and 2012 baseline data (black line) from Konieczyńska et al. (2014)
Major ion concentrations in this study are in the same range as those from the PGI baseline (Fig. 7) and the historical data. Minor and trace elements with concentrations largely above their respective limits of detection, i.e. strontium, manganese and barium, are also in the same range (Fig. 7). For those close to the limits of detection, concentrations are highly variable between the two baselines (Table 6). This is not surprising, since low concentrations are quite variable within the same well. Additionally, the use of different laboratories with different limits of detection is likely to have a greater impact on low concentrations near the limit of detection.

Conclusions and recommendations
This paper presents the results from one of the first baseline monitoring programmes in the context of shale gas development in Europe. The monitoring network consists of four dedicated boreholes. Due to project time constraints, the baseline monitoring only lasted 6 months during which four sampling events took place.
Statistics used to analyse the data revealed their limitations for application to such small datasets, in particular for the use of robust nonparametric tests. Some limitations can be overcome by quantifying the uncertainty stemming from analysis and sampling. Comparison of the uncertainties with confidence intervals around the mean helps to better identify spatial and temporal variations for small datasets. From a statistical point of view, it is recommended to start baseline monitoring as early as possible in order to allow the collection of at least six statistically independent samples evenly distributed during the course of a year. This will enable the use of statistics as an unbiased and robust method to better characterise baseline conditions (e.g. temporal variability) following the framework outlined in this paper. It is also recommended that a robust quality control scheme is designed and implemented at the outset of the baseline monitoring programme in order to quantify data uncertainty. Like small datasets, large datasets from downhole probes present some challenges for the use of statistics. This is an area that requires further investigation considering their widespread use and the possibility of using remote sensing capabilities for live monitoring. Further work is also required on how to identify quality changes from baseline conditions (for both types of datasets), not only from a scientific point of view but also from the regulators' perspective.
The groundwater composition around the shale gas pad is typical of recently infiltrated water hosted in Quaternary sediments (Ca-HCO₃ water type). The groundwater is characterised by a low mineral content and stable temperature. The groundwater quality is good, complying with most requirements in terms of drinking water quality, except for manganese and bromide. The groundwater composition is relatively homogeneous around the drilling site and is similar to background levels in other aquifers in the region. The baseline groundwater quality at the monitoring boreholes is therefore representative of the regional groundwater quality, and the monitoring boreholes are deemed suitable for their task. However, median concentrations are statistically different between the boreholes for most parameters. Therefore, with the aim of detecting future potential contamination by shale gas development or other anthropogenic activities, baseline conditions should be defined for each borehole for these parameters. The groundwater composition showed limited temporal variability for most parameters, which is accounted for when defining the confidence intervals around the mean. For parameters with no detected spatial and temporal variations, the datasets from all boreholes could be pooled together to obtain a larger dataset and define more robust baseline conditions at site level.
Comparison with the baseline carried out by the PGI-NRI in 2012 shows similar groundwater concentrations. Since the PGI-NRI baseline sampled different wells from those of this study, and only once, it is difficult to reach conclusions on any impacts from the shale gas activities prior to hydraulic fracturing itself (i.e. from the drilling of vertical and horizontal wells in the period 2012-2015). While the focus of this study has been on inorganic constituents (apart from dissolved gases), the PGI-NRI baseline analysed the samples for a range of organic contaminants. Some organic contaminants relevant to the shale gas industry should be included during baseline monitoring. The analytical suite could be reduced during operational monitoring to reduce costs and be expanded again if any suspicion arises.