## Abstract

Agricultural nitrate pollutants infiltrate into the subsurface and contaminate groundwater. The redox environment in the subsurface is important for the natural removal of nitrate by denitrification. Detailed knowledge of the redox conditions is needed in order to make better-targeted nitrogen regulations for farmers. However, unveiling three-dimensional (3D) redox architectures is challenging because one only observes redox conditions in boreholes. Therefore, this work proposes a combination of towed transient electromagnetic resistivity (tTEM) geophysical surveys and redox boreholes to model 3D redox architecture stochastically. The tTEM survey reveals the geological structure in high resolution. However, the tTEM survey and redox boreholes are often non-colocated. To address this issue, geostatistical simulations are performed to generate multiple resistivity data colocated with redox boreholes. Then, a statistical learning method, namely multinomial logistic regression, is leveraged to predict multiple 3D redox architectures given the uncertain surrounding resistivity structures. In this way, the aggregated prediction of multiple redox architectures has a higher prediction accuracy than a redox prediction model with interpolated resistivity. The trained statistical model can also identify significant resistivity structures for redox predictions. An inverse problem has also been formulated to better match the redox borehole data. In summary, the proposed workflow models 3D resistivity and redox architecture jointly, aggregates to a highly accurate redox architecture, and provides important resistivity structures for domain experts. The highly accurate redox architecture supports a better agricultural regulation decision.

## Résumé

Les polluants agricoles de type nitrates s’infiltrent dans le sous-sol et contaminent les eaux souterraines. L’environnement redox dans la subsurface est important pour l’élimination naturelle des nitrates par dénitrification. Une connaissance détaillée des conditions redox est nécessaire afin de mieux cibler les réglementations concernant les apports azotés pour les agriculteurs. Cependant, révéler les architectures redox tridimensionnelles (3D) est un défi car on n’observe les conditions redox que dans les forages. Par conséquent, ce travail propose une combinaison de levés géophysiques de résistivité électromagnétique transitoire remorqués (tTEM) et de forages redox pour modéliser de façon stochastique l’architecture redox 3D. Le levé tTEM révèle la structure géologique en haute résolution, mais les forages redox sont souvent non colocalisés au niveau du levé tTEM. Pour répondre à cette question, des simulations géostatistiques sont effectuées pour générer des données de résistivité multiples colocalisées avec des forages de redox. Ensuite, une méthode d’apprentissage statistique, à savoir la régression logistique multinomiale, est utilisée pour prédire de multiples architectures redox 3D compte tenu des structures de résistivité environnantes incertaines. Ainsi, la prédiction agrégée de plusieurs architectures redox a une précision de prédiction plus élevée qu’un modèle de prédiction redox avec une résistivité interpolée. Le modèle statistique entraîné peut également identifier les structures de résistivité significatives pour les prédictions de redox. Un problème inverse a également été formulé pour obtenir une meilleure concordance avec les données de forage redox. En résumé, le processus proposé modélise conjointement la résistivité 3D et l’architecture redox, intègre une architecture redox très précise et fournit d’importantes structures de résistivité aux experts du domaine. L’architecture redox très précise permet une meilleure décision en matière de régulation agricole.

## Resumen

Los contaminantes agrícolas de nitrato se infiltran en el subsuelo y afectan a las aguas subterráneas. El entorno redox del subsuelo es importante para la eliminación natural del nitrato por desnitrificación. Se necesita un conocimiento detallado de las condiciones redox para que los agricultores puedan establecer normativas más específicas sobre el nitrógeno. Sin embargo, es difícil revelar arquitecturas redox tridimensionales (3D) porque sólo se observan las condiciones redox en los pozos de sondeo. Por lo tanto, este trabajo propone una combinación de estudios geofísicos de resistividad electromagnética transitoria y sondeos de redox para modelar estocásticamente la arquitectura 3D de redox. El estudio tTEM revela la estructura geológica en alta resolución, pero el estudio tTEM y los sondeos redox no suelen estar localizados. Para abordar este problema, se realizan simulaciones geoestadísticas para generar múltiples datos de resistividad colocados con sondeos de redox. A continuación, se aprovecha un método de aprendizaje estadístico, en particular la regresión logística multinomial, para predecir múltiples arquitecturas 3D de redox dadas las estructuras inciertas de resistividad circundantes. De este modo, la predicción agregada de múltiples arquitecturas redox tiene una mayor precisión de predicción que un modelo redox de predicción con resistividad interpolada. El modelo estadístico entrenado también puede identificar estructuras de resistividad significativas para las predicciones de redox. También se ha formulado un problema inverso para ajustar mejor los datos de sondeo redox. En resumen, el flujo de trabajo propuesto modela conjuntamente la resistividad 3D y la arquitectura redox, se agrega a una de alta precisión y proporciona importantes estructuras de resistividad para los expertos en la materia. La arquitectura redox de alta precisión apoya una mejor decisión de regulación agrícola.

## 摘要

农业硝酸盐污染物渗入地下并污染地下水。地下的氧化还原环境对硝酸盐通过反硝化自然去除很重要。为了更好地制定农民的有针对性的氮素管理规定，需要详细了解氧化还原条件。然而，揭示三维氧化还原结构是具有挑战性的，因为只能在钻孔中观察到氧化还原条件。因此，本研究提出了一种将拖曳式非稳定电磁电阻率（tTEM）地球物理测量和氧化还原钻孔相结合的方法，用于随机建模三维氧化还原结构。tTEM测量显示高分辨率地质结构，但tTEM测量和氧化还原钻孔通常是非共位的。为了解决这个问题，进行地统计模拟以生成多个与氧化还原钻孔共位的电阻率数据。然后，利用一种统计学习方法，即多项式逻辑回归，根据周围电阻率结构的不确定性，预测多个三维氧化还原结构。通过这种方式，多个氧化还原结构的总预测具有比内插电阻率的氧化还原预测模型更高的精度。训练好的统计模型还可以确定氧化还原预测中的显著电阻率结构。还提出了更好匹配氧化还原钻孔数据的逆问题。总之，所提出的工作流程联合建模三维电阻率和氧化还原结构，共生成一个高精度的氧化还原结构，并为领域专家提供重要的电阻率结构。高精度的氧化还原结构支持更好的农业管理决策。

## Resumo

Poluentes de nitratos agrícolas se infiltram no subsolo e contaminam as águas subterrâneas. O ambiente redox na subsuperfície é importante para a remoção natural de nitrato por desnitrificação. O conhecimento detalhado das condiçōes redox é necessário para fazer regulamentaçōes de nitrogênio mais bem direcionadas para os agricultores. No entanto, desvendar arquiteturas redox tridimensionais (3D) é um desafio porque só se observam condiçōes redox em furos. Portanto, este trabalho propōe uma combinação de levantamentos geofísicos de resistividade eletromagnética transitória (tTEM) e furos redox para modelar a arquitetura redox 3D estocasticamente. A pesquisa tTEM revela a estrutura geológica em alta resolução, mas a pesquisa tTEM e os poços redox geralmente não estão localizados. Para resolver esse problema, simulaçōes geoestatísticas são realizadas para gerar dados de resistividade múltipla coalocados com poços redox. Em seguida, um método de aprendizado estatístico, ou seja, regressão logística multinomial, é aproveitado para prever múltiplas arquiteturas redox 3D, dadas as estruturas de resistividade circundantes incertas. Desta forma, a previsão agregada de múltiplas arquiteturas redox tem uma precisão de previsão maior do que um modelo de previsão redox com resistividade interpolada. O modelo estatístico treinado também pode identificar estruturas de resistividade significativas para previsōes redox. Um problema inverso também foi formulado para melhor corresponder aos dados do poço redox. Em resumo, o fluxo de trabalho proposto modela resistividade 3D e arquitetura redox em conjunto, agrega uma arquitetura redox altamente precisa e fornece estruturas de resistividade importantes para especialistas do domínio. A arquitetura redox altamente precisa apoia uma melhor decisão de regulamentação agrícola.

### Similar content being viewed by others

### Explore related subjects

Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.

## Introduction

Groundwater is a substantial water resource used especially to meet the need for drinking water and agricultural production. Application of fertilizers and manure in agriculture benefits crops but becomes a main cause of excess nitrate in the root zone. This becomes problematic when the nitrate leaches to groundwater and is transported further to surface or coastal waters. To meet the requirements of the European Union Water Framework Directive (Commission of the European Communities 2000), Denmark needs an additional 40\(\%\) nitrogen (N) reduction in application to soils (Refsgaard et al. 2014). Many other European countries (i.e. Netherlands, Belgium, etc.; PBS, NewsHour 2022) have proposed a government-level plan to significantly reduce fertilizer use across the country. As a result, some farmers will not be able to continue their businesses, leading to societal conflict. Lower agricultural productivity will pose a food security challenge. This work aims to contribute to meeting this challenge by creating a better method of predicting subsurface denitrification potential.

Nitrate leaching from the root zone can be reduced naturally through denitrification. Denitrification reduces nitrate through pyrite oxidation or organic matter reactions (as electron donor) in reducing or reduced conditions (Appelo and Postma 2004). Agricultural fields have differing capacities to reduce nitrate because of the spatially heterogeneous geological structures and water/nitrate pathways in the subsurface (Hansen et al. 2009). Therefore, spatially differentiated regulation at the local agricultural scale based on a detailed nitrate-removal map becomes a suitable management strategy and would meet both the reduction target effectively and mitigate the economic consequence for farmers (Hansen et al. 2017; Refsgaard et al. 2007). Such regulations, often named targeted nitrogen regulations, will be critical to mitigating conflicts between regulators and farmers.

To make effective targeted nitrogen regulation decisions, one needs to investigate the spatially heterogeneous subsurface redox conditions. In the context of nitrate reduction, the different redox conditions of the subsurface can be categorized into three groups: (1) oxic zone; (2) N-reducing zone; and (3) reduced zone (Wilson et al. 2018; Kim et al. 2019; Madsen et al. 2021). In oxic conditions, where oxygen is present, nitrate is not reduced. N removal only starts at the N-reducing zone. The reduced zone is a nitrate-free zone with low/negative redox potential. In the simplest case of vertical flow water and gas, redox conditions evolve from oxic at the land surface and to reducing then reduced conditions with increasing depth.

However, aquifer systems in Denmark are dominated by previous glacial activity, resulting in a highly heterogeneous subsurface structure with alternating gravel, sand, and clay layers (Houmark-Nielsen 2004; Høyer et al. 2013). These complex geological structures cause the redox condition to alternate frequently in one borehole, instead of having simple transitions from oxic to reduced conditions with increasing depth. The latter indicates that the redox potential zones are heterogeneously distributed in the subsurface. In the last two decades, electromagnetic (EM) methods have been shown to be a valuable tool to explore spatial heterogeneity of the subsurface, providing more detailed three-dimensional (3D) geological structures than information limited to boreholes. The tTEM method, a newly developed, towed, ground-based transient electromagnetic system with a lateral resolution of around 10 m and vertical resolution of 2–3 m, plays a new role in the mapping of higher-resolution 3D geological structure over a few hectares (Auken et al. 2019). However, the measurement locations in a tTEM survey and redox zones are often not colocated (Fig. 1). Farmers would not like to have randomly placed boreholes in the middle of their fields. This poses a challenge to the integration of a tTEM survey data with redox measurements, and to the use of tTEM data to predict redox conditions. Another challenge is that there is no direct physical relationship between the redox condition and a single colocated resistivity measurement from a geophysical survey. Spatially distributed tTEM data provide more information on spatial geological structures and water/nitrate pathways. Kim et al. (2019) interpret subsurface geological structures from tTEM data and delineates three different types of redox architecture given by non-colocated redox measurements and geological structures. Their expert interpretation emphasizes the complexity of local redox architectures but does not provide a full 3D redox architecture in their studied catchments. In Madsen et al. (2021), geological mapping experts delineate a joint training image of subsurface geological and redox architectures. Stochastic 3D redox architectures are simulated using multiple-point geostatistics with a single interpreted training image. Neither of these approaches actually investigates, statistically and quantitatively, the correlation between redox conditions and tTEM data, mostly because that correlation is not quantifiable when data are not colocated.

The aim of this work is to generate stochastic 3D redox architectures from observed tTEM data and non-colocated redox boreholes using statistical methods, without delineating redox architectures or training images. Resistivity data at the redox boreholes are unobserved. A common approach is to use interpolation of the resistivity data, to create resistivity colocated with the borehole data. Two problems occur with this approach: (1) only a single deterministic interpolation is created and (2) the interpolation at the borehole will have less variance than the actual tTEM data. The latter is important because it will result in a biased assessment of the relationship between redox and tTEM (Wang et al. 2020). To address this lack of variance, this work simulates multiple tTEM resistivity realizations, thereby creating multiple realizations of tTEM data at the redox boreholes. Multinomial logistic regression models are then trained to predict multiple redox architectures with these simulated resistivity data as predictors and aggregate multiple redox architectures to produce one robust architecture. Hypothesis testing is then performed to understand which resistivity structures are important to redox condition predictions.

## Study area and data

### Site description

The study area is an agricultural field in the west part of Javngyde catchment (5 km\({}^2\)), Jutland, Denmark (Fig. 1). The topography of this area ranges from 90 to 110 MASL (Kim et al. 2019). The Javngyde catchment is situated in the glacial deposit formed during the Weichselian glaciation (Houmark-Nielsen 2004; Høyer et al. 2013). Glacial activity brought subsurface heterogeneity with mixed clayey and coarse-grained materials.

### Geophysical data: electromagnetic survey

The electromagnetic survey conducted at the west Javngyde site (Figs. 1b and 2a) used tTEM—a towed transient electromagnetic system for detailed 3D imaging of the subsurface (Auken et al. 2019). This tTEM system is specifically designed for detailed investigation of the top 70 m of the subsurface. This transient electromagnetic (TEM) system, including a transmitter and a receiver, is often towed by an all-terrain vehicle. Measured data are processed and inverted using the Aarhus Workbench software package (Auken et al. 2015) with spatially constrained 1D smooth models (Viezzoli et al. 2009). The Javngyde survey was carried out in many agriculture fields (Fig. 1b) but still has gaps in areas with restrictions or crop vulnerability (Kim et al. 2019). The line spacing of this tTEM survey was 20–30 m, and the vertical resolution was 2–5 m, increasing with depth. The Methods section describes how the tTEM data were simulated within the 3D area of the black box depicted in Fig. 1b to fill the gaps in the survey. The input for the geostatistical simulations is the tTEM data assigned to the closest 3D grid node. The simulation grids have a resolution of 20 m \(\times\) 20 m \(\times\) 1 m. Figure 2a shows the tTEM data in the simulation grids.

### Geochemical data: redox conditions measured in boreholes

In the west Javngyde area, redox conditions are interpreted by both sediment color (Hansen and Thorling 2008; Hansen et al. 2016) and water chemistry (Hansen et al. 2011). Red, orange, yellow, and combinations of these sediment colors suggest oxic conditions. Grey, olive and blue colors are reduced conditions. Mixed colors ranging between the oxic and reduced colors suggest N-reducing conditions. In terms of water chemistry, oxic conditions are defined by \(\ge 1\) mg/L dissolved oxygen (DO) and \(\ge 1\) mg/L nitrate (\(\mathrm{NO}_{3}^{-}\)), N-reducing is defined with \(< 1\) mg/L DO and \(\ge 1\) mg/L nitrate, reduced is defined with \(< 1\) mg/L DO, \(< 1\) mg/L nitrate and \(> 0.2\) mg/L iron. Both borehole sediment information and groundwater geochemistry data were obtained from the national well database (JUPITER) managed by the Geological Survey of Denmark and Greenland (Hansen and Pjetursson 2011). These two sources of redox condition data are combined and presented in the simulation grids shown in Fig. 2b. There are 11 boreholes and 549 redox measurements. It was noticed that there are multiple redox shifts within one borehole, possibly shifting from oxic to reduced conditions and then back to oxic with increasing depth. More detailed interpretations of the redox architectures can be found in Kim et al. (2019).

## Methods

### Overview

This section first provides an overview workflow (shown in Fig. 3) to predict 3D redox architectures, then provides more detail.

### Step 1: Geostatistical simulation

Many 3D realizations of tTEM resistivity are generated in the full 3D grid. These geostatistical simulations generate the colocated tTEM data around redox boreholes.

### Step 2: Statistical prediction

Each resistivity realization at and surrounding the redox borehole area as predictors to perform logistic regression is used. The resistivity surrounding each borehole is the predictor for the categorical variable: the redox condition (oxic, reducing and reduced). Because there are multiple realizations, multiple logistic regressions are obtained. All predictions from these logistic regressions need to be aggregated into a single prediction of the probability of redox conditions. To do so, the multiple probabilities are aggregated using a log-ratio approach. This single probability model is used to generate a single redox architecture classification with three redox categories: reduced, reducing, oxic.

### Step 3: Final correction

The resulting categorical model is not fully conditioned to the exact observation of redox conditions in the boreholes. To achieve exact conditioning, the ensemble smoother is used to condition these hard data. At this step, the ensemble smoother is applied to multiple logistic regression results before the log-ratio aggregation.

These three steps generate multiple predictions of 3D redox architectures and one single aggregated redox prediction. Each redox architecture is jointly sampled with each resistivity realization.

### Geostatistical simulation of tTEM: trend analysis and residual simulation

The common geostatistical model of dividing variations in tTEM into a trend and a residual component was used (Eq. 1).

\(Z(\textbf{x})\) is the full grid resistivity simulation, including both trend and residual. \(m(\textbf{x})\) is a deterministic but non-stationary trend function. \(R(\textbf{x})\) is then the stochastic and stationary residual. A trend is a large-scale variation. It is clear from Fig. 2a that a vertical trend exists in the model: the shallower area has lower resistivity, and the deeper area has higher resistivity.

To model this trend, radial basis functions (RBF) are used (Wright 2003). The RBF interpolation is to fit a linear combination of functions *f* centered at all observed data locations (Eq. 2) and interpolate values at unobserved locations. Hence, using radial basis functions is a good choice for interpolation with abundant observed data.

\(f(||\textbf{x}_{\alpha }-\textbf{x}||)\) is a function *f* of the distance between any location \(\textbf{x}\) and location with data \(\textbf{x}_{\alpha }\). Then the function \(f(||\textbf{x}_{\alpha }-\textbf{x}||)\) is a radial basis function near \(\textbf{x}_{\alpha }\). In this study, the multiquadric function is used:

\(\epsilon\) takes the average distance between locations with data \(\textbf{x}_{\alpha }\). Then \(\hat{m}(\textbf{x})\) is a linear combination of all radial basis functions at the locations with data. \(\omega _{\alpha }\) in Eq. (2) is estimated using linear least squares.

The deterministic trend is modeled \(\hat{m}(\textbf{x})\) by taking an average of many uncertain RBF interpolations \(\hat{m}_k(\textbf{x}), k = 1,...,B\). Each RBF interpolation \(\hat{m}_k(\textbf{x})\) has \(n = 500\) randomly sampled resistivity data out of all data with the number of \(N = 121,277\). The random sampling and interpolation is repeated \(B = 100\) times. The final deterministic trend is the mean of *B* different interpolations. In this way, solving linear systems for each RBF interpolation is fast. The average smooths out small-scale variations in each interpolation and keeps the large-scale trend.

The residual component is modeled as stochastic. It is assumed to have a zero mean and follow a Gaussian random function. To simulate this random function, an estimation of the residual variogram is needed. This residual variogram is calculated by subtracting the trend model from the actual tTEM data, then sequentially simulated data can be achieved at unobserved locations \(\textbf{x}\) using the Gaussian random function. Each simulated value is conditioned on all observed data and all previously simulated data using sequential Gaussian simulation (SGSIM) (Deutsch and Journel 1998).

The uncertainty of the residual resistivity data can be quantified at the unobserved locations using \(N_{\text {SGSIM}} = 100\) SGSIM simulations: \(R_1, R_2, ..., R_{N_{\text {SGSIM}}}\). Then the trend *m* is added back rendering multiple full resistivity realizations in 3D: \(Z_1,Z_2, ..., Z_{N_{\text {SGSIM}}}\). Because simulations are performed instead of interpolations, \(Z_1,Z_2, ..., Z_{N_{\text {SGSIM}}}\) have realistic spatial variations in the original observed data.

### Statistical learning: multinomial logistic regression

To predict the redox architecture, the statistical relationship between resistivity and 3D redox architecture is learned for each resistivity realization, and \(N_{\text {SGSIM}}\) redox predictions are aggregated from \(N_{\text {SGSIM}}\) resistivity fields into a single robust redox architecture.

#### Multinomial logistic regression for redox architectures

Each resistivity realization at and surrounding the redox borehole area was used as a predictor to perform logistic regression. In this way, spatially heterogeneous geological structures become the predictors. The surrounding geological structures are important to decide whether oxidants can reach the redox measurement grid and change the redox condition. The exact spatial extent of the surrounding area is determined by sensitivity analysis (Fenwick et al. 2014; Park et al. 2016) in section “Predicted redox architectures using simulated colocated tTEM data”. Note that one logistic regression is performed for one resistivity realization; therefore, \(N_{\text {SGSIM}}\) logistic regressions are trained and \(N_{\text {SGSIM}}\) stochastic redox predictions are obtained.

Multinomial logistic regression (Böhning 1992) uses a linear predictor function to predict the probability of different categories. The sum of probabilities being oxic, reducing, and reduced should be 1. Therefore, if \(K = 3\) possible outcomes, it is possible to fit \(K-1 = 2\) independent linear prediction functions:

\(Y^i\) are the redox condition for each location *i*. \(X^i\) include the lower dimensional representations of surrounding resistivity data: the first ten principal components (\(>90\%\) variance) from principal component analysis (Wold et al. 1987) and the depth to the surface for that particular location *i*. If the radius for the surrounding resistivity is 140 m \(\times\) 140 m \(\times\) 4 m, then in total the result is [(140 m/20 m)\(\times\)2 + 1] \(\times\) [(140 m/20 m)\(\times\)2 + 1] \(\times\) [(4 m/1 m)\(\times\)2 + 1] = 2,025 resistivity values in high dimensions. Therefore to reduce this high dimension, principal component analysis is used.

Regression coefficients \(\beta _1\) and \(\beta _2\) in Eqs. (4) and (5) are estimated using maximum likelihood given observed redox conditions. The* p*-values for each coefficient in \(\beta _1\) and \(\beta _2\) help to understand if the null hypothesis for each principal component predictor is false or not. If the* p*-values are small (\(<0.05\)) for the majority of \(N_{\text {SGSIM}}\) logistic regressions that are trained, then it is assumed that the corresponding principal components are statistically significant for redox condition predictions.

#### Aggregation of multiple predictions to obtain a single robust redox architecture

Each resistivity simulation in \(Z_1, Z_2,..., Z_{N_{\text {SGSIM}}}\) predicts one 3D redox architecture \(Y_1, Y_2, ..., Y_{N_{\text {SGSIM}}}\), in terms of the probability for different redox conditions at location *i*:

All the different probabilities of redox architectures are aggregated into one single robust prediction and the mean and standard deviation of predicted probabilities after log-ratio transformation are estimated (Eqs. 6 and 7). It seems straightforward to use the arithmetic mean; however, it is not possible to use addition/subtraction or multiplication/division operations on probabilities. A simple addition of two probabilities might result in an unwanted probability larger than 1.

It is possible to replace oxic with reducing to obtain the mean and standard deviation of \(\ln (\frac{P(Y_j^i= \text {reducing})}{P(Y_j^i =\text { reduced})})\).

However, these mean and standard deviations are still in log-ratio spaces. To obtain the uncertainty in redox probability spaces, the mean and standard deviation from log-ratios are back transformed to probabilities using an additional softmax transformation:

### Final inversion: ensemble smoother on redox architectures

The previous statistical learning provides a mean and standard deviation for the 3D redox condition probability. However, the resulting categorical model is not fully hard conditioned to observed redox boreholes. In this section, a linear inverse problem (Tarantola 2005; Linde et al. 2015) is solved for categorical redox data to better perform hard conditioning.

To formulate an inverse problem, log-ratios of the entire redox architectures

\(j = 1,2, ..., N_{\text {SGSIM}}\) are treated as the model \(\textbf{m}\),

and log-ratios at the observed redox boreholes

\(i \in\) observed redox locations are treated as the data \(\textbf{d}\). Then the linear forward model is:

where* G* is a matrix with 1 at observed locations and 0 at no redox data locations. Note that in this inverse problem, \(N_{\text {SGSIM}}\) logistic regression predictions are used instead of the aggregated mean. The observed data \(\textbf{d}_{\text {obs}}\) is the log-ratio of observed redox conditions. If an oxic condition has been observed, instead of having a certain probability 1 for \(P(Y_{\text {obs}}^i = \text {oxic})\), in practice, the probability is relaxed as \(P(Y_{\text {obs}}^i = \text {oxic}) = 0.98, P(Y_{\text {obs}}^i = \text {reducing}) = 0.01,P(Y_{\text {obs}}^i = \text {reduced}) = 0.01\). Then, the observed data would not have infinite or undefined values.

Local principal component analysis is performed (Wang et al. 2022) to reduce the dimensionality of \(\textbf{m}\) , while principal component analysis is performed to reduce the dimensionality of \(\textbf{d}\). Then the ensemble smoother (Eq. 10) (Emerick and Reynolds 2013) is used to update \(\textbf{m}\) given \(\textbf{d}_{\text {obs}}\) in low-dimensional spaces. The local principal component analysis focuses on the local redox conditions’ variance of near redox boreholes instead of the variance that is far away, which is suitable for local redox condition updates by redox boreholes.

where \(\textbf{C}_{\text {md}}\) is the cross-covariance matrix between \(\textbf{m}\) and \(\textbf{d}\), \(\textbf{C}_{\text {dd}}\) is the covariance matrix of \(\textbf{d}\).

Then after the inversion, there are multiple log-ratio models \(\textbf{m}_j^{\text {pos}}\) better conditioned on redox borehole data. The same aggregation in section “Aggregation of multiple predictions to obtain a single robust redox architecture” can be applied to the log-ratios again to have one single mean and standard deviation.

## Results

In this section, the simulation and prediction results of stochastic resistivity data and redox architectures on 3D simulation grids are presented, and the aggregation results for multiple redox predictions, following the three steps in the proposed workflow (Fig. 3). The total number of simulation grids is \(119 \times 105 \times 89 = 1,112,055\). Each grid has a resolution of 20 m \(\times\) 20 m \(\times\) 1 m.

### Geostatistical simulation of resistivity data

Figure 4 shows the trend analysis using the radial basis function (RBF) interpolation methods in section “Multinomial logistic regression for redox architectures”. Figure 4a–c shows the realizations of the RBF interpolation with \(n = 500\) sampled resistivity data. Figure 4d is the deterministic trend: taking the mean of \(B = 100\) RBF interpolations with a subset of samples. The residual in Fig. 4e has no significant vertical trend: no large-scale vertical pattern such as increasing resistivity with increasing depth. Now the deterministic trend estimation covers the location of all redox boreholes (Fig. 4f).

Empirical variograms in Fig. 5 also present the effect of the trend removal. Before the trend removal, Fig. 5a has an upward trend in the variogram on the horizontal plane and an upward-downward trend in the vertical direction. After the trend removal, Fig. 5b has a bounded spherical variogram with fitted range = 100 m \(\times\) 100 m \(\times\) 12 m and sill = 0.0085.

An SGSIM was run on the residual resistivity for \(N_{\text {SGSIM}} =\) 100 times. Figure 6a,b has two SGSIM residual realizations adding to the deterministic trend (Fig. 4d). Now there are 100 stochastic colocated resistivity data for the redox boreholes. Figure 6c,d shows the mean and standard deviation of 100 SGSIM simulations, while Fig. 6e is the spatial interpolation result using kriging (Cressie 1990) and indicates that the kriged resistivity is smoother than the SGSIM realizations.

### Predicted redox architectures using simulated colocated tTEM data

Firstly, empirical relationships of the colocated data (Fig. 7) are explored. The median of the colocated resistivity in all simulations increases from reduced to oxic, and the median of the depth to the surface decreases from reduced to oxic. These two boxplots in Figure 7 show that the tTEM and the depth to the surface provide information on the redox conditions. The statistical learning method to predict redox conditions are further detailed in the following.

### Sensitive spatial extent

In section “Statistical learning: multinomial logistic regression”, it is proposed to use the surrounding resistivity data to predict the redox architectures. Now a method for determining the size of the surrounding area can be presented. Multiple sensitivity analyses to the top ten principal components of the resistivity data are preformed: from the smallest extent, a single colocated resistivity value at the redox borehole location (vertical length and horizontal length = 0 m in Fig. 8a) to a large extent with the radius of 220 m \(\times\) 220 m \(\times\) 7 m. The sensitivity analysis method used is the distance-based generalized sensitivity analysis method (DGSA) (Fenwick et al. 2014; Park et al. 2016). The sensitivity measurement from the DGSA method quantifies the differences between the empirical cumulative density function of principal component scores in different redox conditions. The spatial extent 140 m \(\times\) 140 m \(\times\) 4 m (Fig. 8b) is the most sensitive one with the highest median sensitivity measurement of the top 10 principal components.

### Prediction from multinomial logistic regression

Figure 9a,b is two redox architecture predictions from the resistivity realizations in Fig. 6a,b. Columns 3–5 in Fig. 9 are the prediction probability (P) for each redox condition. More reducing or oxic zones are near the surface or in the area with high resistivity values. Additionally, all 100 probabilities of every stochastic prediction are aggregated using Eqs. (6), (7), and (8). Figure 9c presents the mean of aggregated probabilities \(\mu _{\text {reduced}}\), \(\mu _{\text {reducing}}\) and \(\mu _{\text {oxic}}\), and the final prediction using these aggregated probabilities. Figure 9d presents the standard deviation of aggregated probability \(\sigma _{\text {reduced}}\), \(\sigma _{\text {reducing}}\) and \(\sigma _{\text {oxic}}\) in the last three columns. The first two columns in Fig. 9d are the standard deviation on the categorical redox architectures. Redox architectures (Fig. 9e) are also predicted using the interpolated kriged resistivity.

The accuracy of redox predictions is compared given resistivities from (1) spatial interpolation (kriging) and (2) spatial simulations (SGSIM) using a confusion matrix and precision-recall curves. The confusion matrix summarizes the performance of a classification algorithm. Each row of the matrix represents the samples in an actual redox class, while each column represents the samples in a predicted redox class. This confusion matrix helps to understand what types of errors are being made in the predictions. The precision-recall curve is a graph with precision values, true positive/(true positive + true negative), on the y-axis, and recall values, true positive/(true positive + false positive), on the x-axis. It shows the trade-off between precision and recall for different probability thresholds. Both the confusion matrix and precision-recall curves (Fig. 10a,b) show that the aggregated performance from SGSIM is better than kriging. The unrealistic smoothness from the spatial interpolation biases the learned relationship between realistic resistivities and redox conditions. Note that a larger area under the precision-recall curve represents both higher recall and higher precision on redox condition predictions. There is a higher recall and higher precision for the reduced condition than the reducing or oxic conditions.

The z-direction variograms both on the observed redox condition and on the predicted redox conditions are also calculated (Fig. 11). The predictions have lower sills/variances in the variogram compared to observed redox conditions. The ranges of observed and predictions are similar, around 20 m. There is a large nugget effect (around 0.2) in the observed variogram; therefore, the predictions are smoother than the observed redox boreholes, but do reproduce the spatial correlation/range in the redox borehole. This smoothness is not surprising: tTEM is a non-invasive geophysical survey, which provides high-resolution (vertical resolution 2–5 m) subsurface information without drilling many boreholes. However, tTEM still does not contain very fine details compared to the borehole measurements. The tTEM survey provides enough spatial correlations, and these spatial correlations have been captured in the statistical learning process. Adding the nugget effect 0.2 to the predictions and thereby shifting the entire variogram up by 0.2, these two variograms are similar. In the future, an additional random effect can be added to the predicted redox conditions.

### Significant resistivity structures

There are 10 principal components of the surrounding resistivity data, and in total there are 100 SGSIM realizations. The goal is to understand which surrounding resistivity structures are important to redox predictions.

Therefore, 1,000 principal component structures are first clustered into 18 representative clusters using K-Means clustering (Hartigan and Wong 1979). The Silhouette index (Rousseeuw 1987; Fig. 12a) was chosen to number the clusters and present 18 different clusters in the 2D multi-dimensional scaling space (Kruskal and Wish 1978; Fig. 12b). Each cluster contains many similar 140 m \(\times\) 140 m \(\times\) 4 m principal component structures around the redox prediction location. Figure 13 shows the mean of each cluster (10 most statistically significant structures out of 18). Here the statistical significance is defined not only with* p*-values \(<0.05\), but this* p*-value is evaluated across all 100 simulations to ensure the robustness of the* p*-value. All structures with at least a 75% probability of having the* p*-values \(<0.05\) in all 100 trained logistic regression models are presented.

Figure 13 has the probability of* p*-values \(< 0.05\) for each resistivity structure shown in 100 multinomial logistic regressions. The first row has two similar principal components but with different signs. Those two are the mean of resistivity. Both of them are statistically significant with* p*-values \(<0.05\) in all 100 multinomial logistic regressions. The second and fourth columns have the boxplot of logistic regression coefficients \(\beta _1\) and \(\beta _2\) in Eqs. (4) and (5). The first structure (Fig. 13a) with the negative mean has negative coefficients for both reducing and oxic log-ratios. Therefore, in the west Javngyde area, the lower the mean of the resistivity, the lower the probability of being reducing or oxic. Figure 13b has a positive mean and positive coefficients; thus, the higher the mean resistivity, the higher the probability of being reducing/oxic.

There are more interesting resistivity structures in Fig. 13. Figure 13c,d has horizontal resistivity gradients, whereby part c has more high resistivity (high chance of sand) on top of the center location, and part d has lower resistivity (high chance of clay) on top of the center location. The oxidants might go through high-resistivity and high-permeability units such as sand; therefore, the Fig. 13c structure has a higher probability of being reducing/oxic, while Fig. 13d structure has a lower probability of being reducing/oxic. Figure 13e,g,i might indicate a glaciotectonic-thrusted redox architecture (Kim et al. 2019). The complex oxidant pathways more likely increase the probability of being oxic (Fig. 13e,g), but it might also represent the impermeable clayey layer which blocks the oxidants (Fig. 13i) and have a lower oxic probability. Another predictor variable in the logistic regression is the prediction location depth to the surface. This depth variable is statistically significant \(P(p<0.05) = 1\) in all 100 multinomial logistic regressions (Fig. 13k). The probability of being reducing decreases when the location is deeper. The oxic probability also decreases more than the reducing probability with increasing depth.

Overall, it will be site specific, but the statistical learning between resistivity and geochemical data will be a good tool to predict possible redox structures of the subsurface automatically, without manual delineation. In the future, combining domain expert knowledge with statistical learning insights and better predict redox architectures could also be explored.

### Final correction by solving the inverse problem

The accuracy of the logistic regression prediction could further be improved upon by using inversion methods. After updating using the ensemble smoother, the confusion matrix and the precision-recall curve (Fig. 10c) show higher accuracy than before. Figure 14a,b shows, in one two-dimensional (2D) section before and after correction, the reduced probability and predicted redox zones in realizations 1 and 2, respectively. The location of this 2D section is shown in Fig. 14e. After the final correction, the predicted redox architectures have a better match on the redox borehole. The updated aggregated mean (Fig. 14c,f) matches most of the data with minor mismatches near the redox shifting area. The uncertainty of the redox architecture predictions has also been reduced (Fig. 14d,g), especially at the redox borehole locations. Stochastic 3D redox architectures after final inversion are shown in Fig. 3.

## Discussion

In this study, the statistical relationship directly between the geostatistical simulated resistivity data and the measured redox condition is learned and the non-colocated challenge are overcome with geostatistical simulations of tTEM data. The investigated west Javngyde area is around 5 km\(^2\). Another area that is far away from Javngyde might have different statistical relationships. In that case, it is possible to have either the horizontal location inputs as one of the predictors, or different statistical models can be fit for different subcatchments.

Multinomial logistic regression, which is a linear predictor function for the log-ratios, has also been used. In practice, many other classification algorithms are suitable for this study: random forest, support vector machine, or even neural networks with softmax activation functions in the last layer, etc. The classification algorithms provide a statistical learning tool to identify the most important geological structures for redox conditions.

The final inversion step is optional. Ground truth redox conditions only comes from the borehole measurements; therefore, targeted nitrogen regulations need 3D redox architectures with a better redox borehole match. If the intended goal is to understand the statistically important geological structures or scientific processes, the previous statistical prediction has already provided the answer, so no additional inversion step is needed.

## Conclusion

This work models 3D redox architectures through the statistical learning between colocated towed transient electromagnetic resistivity (tTEM) data and redox measurements. The uncertain colocated tTEM resistivity is simulated using geostatistical simulations. The resistivity predictors are compared that come from spatial interpolation and spatial simulation. Results show that the simulation method provides more realistic predictors and achieves a higher redox prediction accuracy than the interpolation method. A higher redox prediction accuracy is essential for a more accurate and effective targeted nitrogen regulation. The statistically significant resistivity structures for redox predictions were analyzed. Those structures agreed most with the expert interpretations in Kim et al. (2019). The multinomial logistic regression method identifies important structures automatically from hypothesis testing. It is concluded that the proposed methodology provides a statistical and automatic mapping tool for 3D redox architecture modeling and uncertainty quantification of the redox condition in the unseen subsurface through statistical learning. In the future, expert knowledge and statistical learning can be combined to better predict redox structures for future detailed agricultural regulation.

## References

Appelo CAJ, Postma D (2004) Geochemistry, groundwater and pollution. CRC, Boca Raton, FL

Auken E, Christiansen AV, Kirkegaard C, Fiandaca G, Schamper C, Behroozmand AA, Binley A, Nielsen E, Effersø F, Christensen NB et al (2015) An overview of a highly versatile forward and stable inverse algorithm for airborne, ground-based and borehole electromagnetic and electric data. Explor Geophys 46(3):223–235

Auken E, Foged N, Larsen JJ, Lassen KVT, Maurya PK, Dath SM, Eiskjær TT (2019) tTEM: a towed transient electromagnetic system for detailed 3D imaging of the top 70 m of the subsurface. Geophysics 84(1):E13–E22

Böhning D (1992) Multinomial logistic regression algorithm. Ann Inst Stat Math 44(1):197–200

Commission of the European Communities (2000) Directive 2000/60/EC of the European Parliament and of the Council of 23 October 2000 establishing a framework for Community action in the field of water policy. Office for Official Publications of the European Communities, The Hague

Cressie N (1990) The origins of kriging. Math Geol 22(3):239–252

Deutsch AG, Journel CV (1998) GSLIB: Geostatistical Software Library and users guide. Oxford University Press, Oxford, UK

Emerick AA, Reynolds AC (2013) Ensemble smoother with multiple data assimilation. Comput Geosci 55:3–15

Fenwick D, Scheidt C, Caers J (2014) Quantifying asymmetric parameter interactions in sensitivity analysis: application to reservoir modeling. Math Geosci 46(4):493–511

Hansen A, Refsgaard J, Olesen J, Børgesen C (2017) Potential benefits of a spatially targeted regulation based on detailed n-reduction maps to decrease n-load from agriculture in a small groundwater dominated catchment. Sci Total Environ 595:325–336

Hansen B, Thorling L (2008) Use of geochemistry in groundwater vulnerability mapping in Denmark. GEUS Bull 15:45–48

Hansen B, Thorling L, Dalgaard T, Erlandsen M (2011) Trend reversal of nitrate in Danish groundwater: a reflection of agricultural practices and nitrogen surpluses since 1950. Environ Sci Technol 45(1):228–234

Hansen B, Sonnenborg TO, Møller I, Bernth JD, Høyer A-S, Rasmussen P, Sandersen PB, Jørgensen F (2016) Nitrate vulnerability assessment of aquifers. Environ Earth Sci 75(12):1–15

Hansen J, Refsgaard J, Ernstsen V, Hansen S, Styczen M, Poulsen R (2009) An integrated and physically based nitrogen cycle catchment model. Hydrol Res 40(4):347–363

Hansen M, Pjetursson B (2011) Free, online Danish shallow geological data. Geus Bull 23:53–56

Hartigan JA, Wong MA (1979) Algorithm as 136: a k-means clustering algorithm. J R Stat Soc Ser C Appl Stat 28(1):100–108

Houmark-Nielsen M (2004) The Pleistocene of Denmark: a review of stratigraphy and glaciation history. Dev Quat Sci 2:35–46

Høyer A-S, Jørgensen F, Piotrowski J, Jakobsen P (2013) Deeply rooted glaciotectonism in western Denmark: geological composition, structural characteristics and the origin of Varde hill-island. J Quat Sci 28(7):683–696

Kim H, Høyer A-S, Jakobsen R, Thorling L, Aamand J, Maurya PK, Christiansen AV, Hansen B (2019) 3D characterization of the subsurface redox architecture in complex geological settings. Sci Total Environ 693:133583

Kruskal JB, Wish M (1978) Multidimensional scaling. Number 11. Sage, Thousand Oaks, CA

Linde N, Renard P, Mukerji T, Caers J (2015) Geological realism in hydrogeological and geophysical inverse modeling: a review. Adv Water Resour 86:86–101

Madsen RB, Kim H, Kallesøe AJ, Sandersen PB, Vilhelmsen TN, Hansen TM, Christiansen AV, Møller I, Hansen B (2021) 3d multiple-point geostatistical simulation of joint subsurface redox and geological architectures. Hydrol Earth Syst Sci 25(5):2759–2787

Park J, Yang G, Satija A, Scheidt C, Caers J (2016) DGSA: a MATLAB toolbox for distance-based generalized sensitivity analysis of geoscientific computer experiments. Comput Geosci 97:15–29

PBS, NewsHour (2022) Dutch government angers farmers with new emission goals. https://www.pbs.org/newshour/world/dutch-government-angers-farmers-with-new-emission-goals. Accessed 21 April 2023

Refsgaard A, Jacobsen T, Jacobsen B, Ørum J-E (2007) Integrated modelling of nitrate loads to coastal waters and land rent applied to catchment-scale water management. Water Sci Technol 56(1):39–47

Refsgaard JC, Auken E, Bamberg CA, Christensen BS, Clausen T, Dalgaard E, Effersø F, Ernstsen V, Gertz F, Hansen AL et al (2014) Nitrate reduction in geologically heterogeneous catchments: a framework for assessing the scale of predictive capability of hydrological models. Sci Total Environ 468:1278–1288

Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

Tarantola A (2005) Inverse problem theory and methods for model parameter estimation. SIAM, Philadelphia

Viezzoli A, Auken E, Munday T (2009) Spatially constrained inversion for quasi 3D modelling of airborne electromagnetic data: an application for environmental assessment in the lower Murray region of South Australia. Explor Geophys 40(2):173–183

Wang L, Kitanidis PK, Caers J (2022) Hierarchical Bayesian inversion of global variables and large-scale spatial fields. Water Resour Res 58(5):e2021WR031610

Wang Z, Yin Z, Caers J, Zuo R (2020) A Monte Carlo-based framework for risk-return analysis in mineral prospectivity mapping. Geosci Front 11(6):2297–2308

Wilson S, Close M, Abraham P (2018) Applying linear discriminant analysis to predict groundwater redox conditions conducive to denitrification. J Hydrol 556:611–624

Wold S, Esbensen K, Geladi P (1987) Principal component analysis. Chemom Intell Lab Syst 2(1–3):37–52

Wright GB (2003) Radial basis function interpolation: numerical and analytical developments. University of Colorado, Boulder, CO

## Funding

Our computations were made possible by Microsoft Education Azure cloud credits, which were awarded to Lijing Wang through Stanford Data Science.

## Author information

### Authors and Affiliations

### Corresponding author

## Ethics declarations

### Conflict of interest

The authors declare that they have no conflict of interest.

## Additional information

Published in the special issue “Geostatistics and hydrogeology”.

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

## About this article

### Cite this article

Wang, L., Kim, H., Hansen, B. *et al.* Statistical modeling of three-dimensional redox architecture from non-colocated redox borehole and transient electromagnetic data.
*Hydrogeol J* **31**, 1475–1490 (2023). https://doi.org/10.1007/s10040-023-02640-7

Received:

Accepted:

Published:

Issue Date:

DOI: https://doi.org/10.1007/s10040-023-02640-7