Machine learning based approach for the interpretation of engineering geophysical sounding logs

In this paper, a set of machine learning (ML) tools is applied to estimate the water saturation of shallow unconsolidated sediments at the Bátaapáti site in Hungary. Water saturation is directly calculated from the first factor extracted from a set of direct push logs by factor analysis. The dataset observed by engineering geophysical sounding tools as special variants of direct-push probes contains data from a total of 12 shallow penetration holes. Both one- and two-dimensional applications of the suggested method are presented. To improve the performance of factor analysis, particle swarm optimization (PSO) is applied to give a globally optimized estimate for the factor scores. Furthermore, by a hyperparameter estimation approach, some control parameters of the utilized PSO algorithm are automatically estimated by simulated annealing (SA) to ensure the convergence of the procedure. The result of the suggested ML-based log analysis method is compared and verified by an independent inversion estimate. The study shows that the PSO-based factor analysis aided by hyperparameter estimation provides reliable in situ estimates of water saturation, which may improve the solution of environmental end engineering problems in shallow unconsolidated heterogeneous formations.


Introduction
Borehole geophysical data contain valuable information about the physical characteristics of the investigated subsurface (Everett 2013). The measured raw data can be processed and turned into petrophysical parameters by several different approaches. The simplest is the deterministic modeling, which is often based on a single empirical equation that transforms an observed variable into a petrophysical parameter (e.g., shale volume estimation along boreholes based only on the natural gamma-ray intensity log). A more advanced way to process geophysical data is inverse modeling (Zhdanov 2015). This approach combines numerical mathematics and optimization theory to derive the physical parameters of the investigated geological structures from the measured data. Geophysical inverse problems can be solved by linearized or global optimization methods (e.g., simulated annealing (SA), genetic algorithm (GA), particle swarm optimization (PSO)). However, in all cases, geophysical inversion is based on a scientific understanding, a set of equations (i.e., response functions) that describe the relationship between the observed data and the petrophysical parameters of the subsurface.
Inverse problems are solved by updating a starting model iteratively until the synthetic data calculated on the model fits the measured data (Menke 2012). This optimization task is usually solved by linearized methods (e.g., the least squares method). These provide a fast and satisfactory solution given that there is a good initial (starting) model. However, the application of these methods to large scale (multivariate) problems is often problematic due to the complexity of the objective function. During the minimization of the objective function (that measures the misfit between the measured and calculated data) these linearized methods use a gradient-based search, which means that they stabilize in the nearest local minimum of the objective function.
The above-mentioned global optimization methods use random search instead, therefore are capable to get out of local minima of the objective function. Thus, they can provide a reliable and convergent solution independently of the chosen starting model. For this reason, global optimization techniques are widely used in geophysics (Sen and Stoffa 2013), e.g., full-waveform Rayleigh-wave inversion (Xing and Mazzotti 2019), inversion of magnetotelluric data (Wang et al. 2012), two-dimensional inversion of magnetic data (Liu et al. 2018).
One of their disadvantages is that their computational requirements far exceed those of linearized methods (Kaikkonen and Sharma 2001), but due to the rapid development of computing, this is no longer a problem nowadays. Furthermore, it is worth mentioning that based on a single program run, global optimization methods cannot provide information on the error of parameter estimation like linearized methods. However, with different hybrid solutions, one can combine linearized methods with global methods to create efficient, two-phase algorithms. In general, complex inverse problems can be initialized by some global method to avoid the algorithm from stabilizing in a local optimum of the objective function. Then, when the procedure is close enough to the optimal solution, it can be switched to a linearized method so that the computation of estimation error of derived petrophysical parameters becomes possible and the runtime of the algorithm is reduced as well. Such hybrid solutions were suggested by Soupios et al. (2011) for seismic inversion, Chunduru et al. (1997) for 2D resistivity profiling data and by Szabó and Dobróka (2020) for the non-linear well logging inverse problem.
Because these global methods do not use linearization, they do not require supplementary information (e.g., derivatives or a good starting model), but their convergence is greatly influenced by their control (hyper) parameters (e.g., combination and parameters of genetic operators in GA, initial temperature, cooling schedule and parameter perturbation in SA or inertia weight and learning factors in PSO). However, through hyperparameter estimation, it is possible to automatically select the optimal values for these parameters with a secondary optimization algorithm, thus guaranteeing the convergence of the procedure. Based on a similar principle, the efficiency of geophysical inversion methods can also be increased by hyperparameter estimation. For example, the value of parameters otherwise treated as constants (e.g., zone parameters in case of 1 3 well logging inversion) in the response functions can be optimized by incorporating an additional program loop (Dobróka et al. 2016).

Machine learning tools used in geophysics
In addition to inversion-based data processing methods, another approach is to use machine learning (ML) from the toolbox of artificial intelligence (Dramsch 2020). The aim of ML is to create such algorithms that can improve their own effectiveness by utilizing the experience gained during their operation. Such as artificial neural network, deep learning, cluster analysis or fuzzy logic. The frequently used regression analysis (linear, non-linear or logistic) is also considered an ML tool. The use of these tools is gaining ground in the processing and interpretation of geological and geophysical data in recent years (Caté et al. 2017). Some examples from the literature are seismic interpretation (Wang et al. 2018), seismology (Kong et al. 2019), fault detection (Araya-Polo et al. 2017), electrical resistivity tomography (Vu and Jardani 2021) and well logging inversion . The use of these methods is advantageous in the absence of the previously mentioned scientific background (mathematical relationship does not necessarily exist between the petrophysical parameters and observed data) or the theoretical description is so complex (e.g., extremely long runtime) that we must disregard the exact description.
In recent decades a wide variety of machine learning tools have been developed. To select the one that is most suitable for solving a given problem, it is crucial to understand how they work. Therefore, it is best to categorize the different ML methods based on their learning style. This way we can sort them into three main categories. The most commonly applied tools are based on supervised learning. In this case there is exact information (inputs and outputs) for training the algorithm. In addition to the input data, we also know what kind of results we can expect. Thus, supervised learning-based methods provide exact parameters (i.e., numerical labels) such as porosity or categorical labels (e.g., rock type). Some of the most often used ML approaches based on supervised learning are regression analysis and classifiers.
In case of unsupervised learning, there is no exact information in the dataset regarding the output. Meaning that it uses data without labels and therefore the procedure has to group the observed data by finding structures within the dataset. A frequently applied tool that uses unsupervised learning is cluster analysis, which classifies the input data based on some distance metric. Clustering is often applied on geophysical datasets, e.g., for rock typing based on wireline logging data (Ali and Sheng-Chang 2020). Dimension reduction methods also utilize the unsupervised learning approach (e.g., factor analysis or principal component analysis). Big datasets are often difficult to handle therefore it can be advantageous to reduce the size (dimensionality) of the problem. By keeping most of the information contained in the original dataset (i.e., statistical sample), the same phenomenon can be described with fewer variables. Thus, the new variables contain the essential features of the investigated object as well as possible new properties that cannot be measured directly. Furthermore, by removing the error factors, it can even be used to improve the signal-to-noise ratio. In factor analysis, a large number of interrelated or independent variables are replaced by a smaller number of uncorrelated variables, where the resulting new variables cannot be measured directly. The applicability of the dimensionality reduction methods comes from the fact that the newly extracted variables often show a strong correlation with different petrophysical parameters, thus they can be used e.g., for lithological classification (Puskarczyk et al. 2019) or for quantitative estimation of petrophysical parameters (Szabó 2011).
The last category of ML methods based on learning style is the semi-supervised approach. A combination of the previous two cases, it can be used when only part of the database contains information about the output (i.e., labeled data) while some of the data does not have all the necessary information available (i.e., unlabeled data) for supervised training of the system. Semi-supervised learning ensures that the latter data is not wasted and contributes in some way to the design of the system. An example of a semi-supervised based system can be found in Li et al. (2019) for lithology recognition using a generative adversarial network.
In the following sections, an ML-based system is suggested for direct-push log analysis, which enables the estimation of water saturation in shallow unconsolidated heterogeneous formations.

Water saturation estimation in the shallow subsurface by factor analysis
A direct push (DP) logging technology named as engineering geophysical sounding was developed in Hungary (Fejes and Jósa 1990) based on cone penetration testing (CPT). Besides cone resistance and sleeve friction, DP logging can also measure the same parameters routinely recorded by wireline logs including gamma-ray intensity, bulk density, neutron porosity and resistivity. It enables the characterization of the shallow unconsolidated subsurface with high vertical resolution down to approximately 50 m with a highly mobile equipment. These measurements are done by advancing steel rods into the ground without drilling, therefore there is less disturbance of the subsurface, which is also advantageous considering geophysical measurements. The different DP technologies can be used to solve problems of contamination mapping, environmental risks assessment and ground-water investigations (Dietrich and Leven 2006). By processing direct push logs, one can derive quantitative information about the composition of shallow unconsolidated sediments, such as clay content, porosity or water saturation (Balogh 2016). In this paper, an ML-based statistical approach is developed for the quantitative analysis of direct-push logs.
Factor analysis, as mentioned before is an unsupervised ML tool for describing several measured quantities with fewer uncorrelated variables. In case of DP logging data, the measured logs are the input variables, which are processed jointly to extract new factors. Here, the derived factors can be looked at as factor logs, which can be related to petrophysical parameters by regression analysis . In this study, water saturation (S w ) of shallow unconsolidated formations is estimated based on the first factor log extracted from a direct push logging dataset.
As a preliminary step, DP logs need to be standardized to serve as input for factor analysis. Then they are collected into a matrix D, where individual columns contain the recordings of different logging tools. In this paper, the processed logging tools are the natural gamma-ray intensity, GR (cpm), cone resistance, RCPT (MPa), bulk density, DEN (g/cm 3 ), neutron porosity, NPHI (v/v) and resistivity, RES (ohmm).
where K is the number of applied DP logging tools and N shows the number of measured depth points in the given sounding hole. The solution of factor analysis is based upon the decomposition of data matrix D where F is an N-by-M matrix of factor scores (i.e., factor logs) where M denotes the number of computed factors (M < K), L is a K-by-M matrix of factor loadings, which shows the correlation relationship between the measured DP logs and the newly extracted factors and E is an N-by-K error matrix. Based on the model of factor analysis in Eq. (2), the derived factors are essentially the weighted sums of the measured direct push logs. The first column of matrix F (i.e., the first factor) describes most of the data variance of the measured dataset and therefore generally bears the most significance for data interpretation. By assuming that the factors are linearly independent, the correlation matrix is given by where is a diagonal matrix of specific variances that is independent of the common factors. For the estimation of the factor loadings in matrix L, Jöreskog (2007) suggested the following non-iterative formula where is the diagonal matrix of the first M number of sorted eigenvalues of the sample covariance matrix S, denotes the matrix of the first M number of eigenvectors, I is the identity matrix, U is an arbitrary M-by-M orthogonal matrix and θ is an adequately chosen constant that is to be slightly smaller than 1.
Once factor loadings are available, the factor scores can be estimated by a maximum likelihood method (Bartlett 1937) As an advanced approach, in this study, factor scores are estimated by means of global optimization. We currently utilize the metaheuristic particle swarm optimization (PSO) for giving an estimate to the factors scores. This PSO-based solution of factor analysis is referred to as FA-PSO (Abordán and Szabó 2018). In this approach, factor analysis is treated as an inverse problem, thus Eq. (2) must be reformulated where the standardized measured DP logs are represented as a KN length column vector d, factor loadings are given in a NK-by-NM matrix ̃ , factor scores are also gathered in a column vector f of MN length and e is the KN length residual vector (Szabó and Dobróka 2018). To initialize this metaheuristic solution, measured DP logs are first collected into the column vector d, then the matrix of factor loadings ̃ can be estimated by Eq. (4). For getting more meaningful factors, the varimax rotation is applied on the loadings (Kaiser 1958). Then these factor loadings are kept constant, while the optimal values of factor scores f is approximated by PSO. For measuring discrepancy, the following L 2 norm based objective function is used, which is minimized to estimate the optimal values of factors scores where d i (m) and d i (c) are the ith standardized measured and calculated direct push logging data, respectively. The calculated data comes from the multiplication ̃ from Eq. (6), while d stores the measured DP logs in the same equation. The before mentioned multiplication of factor loadings and factor scores permits the calculation of synthetic DP logs, which can be looked at as the solution of the forward problem.
For solving this optimization problem PSO is applied which is a global optimization method inspired by the social behavior of animals. The original method was developed by Kennedy and Eberhart (1995). PSO is often applied for its relatively low computational requirements and easy implementation compared to other optimization methods such as the genetic algorithm (Holland 1975). It is a population based technique where each particle (i.e., possible solution) searches for the optimal solution by adjusting its own position in the search space by taking into account its own best position and the whole swarm's best position in every iteration step. This searching mechanism of PSO is governed by Eqs. (8-9). Having an n-dimensional optimization problem, the ith particle's position in the search space can be represented by the vector x i = (x i1 , x i2 , …, x in ) T and its velocity by vector v i = (v i1 , v i2 , …, v in ) T . Every particle of the swarm has a memory of its best position, which is continuously updated during the iterations and is stored in vector p i = (p i1 , p i2 , …, p in ) T for the ith particle. The position and velocity update equations are where t = 1,…, T is the current iteration step, T is the last iteration and i = 1,2,…, S shows the particle index and S is the population size. In Eq. (9) r 1 and r 2 denote random variables uniformly distributed in 0 to 1 and w is an inertia weight (Shi and Eberhart 1998) that was introduced to balance between global and local search. Vector g stores the very best position found by the swarm until the current iteration step. It is continuously updated in each iteration and helps the swarm to find the global optimum of the objective function. Acceleration factors c 1 and c 2 are positive constants, where c 1 is the cognitive scaling parameter and c 2 is the social scaling parameter, both generally set as 2 (Kennedy and Eberhart 1995).
Since PSO is a metaheuristic method, the chosen control (hyper-) parameters have a great effect on its performance (Zhang et al. 2005). To increase the reliability of the searching mechanism in finding the global optimum, the automatic selection of acceleration factors c 1 and c 2 is developed. It is carried out in an additional program loop with the help of simulated annealing (Metropolis et al. 1953) as depicted in Fig. 1.
For the outer program loop, we initialize the values of both c 1 and c 2 hyperparameters as 1 and then let SA select the optimal values automatically for the current optimization problem. In every SA iteration step their value is adjusted by adding a small b parameter to both values. Parameter b is randomly generated in each iteration from − b max to b max , where b max is also reduced iteratively as b max = b max × ε, where ε is a constant smaller than 1.
Once the new c 1 and c 2 parameters are selected, PSO is run to find the optimal values of factor scores f. If the difference in energy (ΔE) in two successive iteration steps of the SA program loop according to the objective function in Eq. (7) is negative (i.e., PSO was more effective with the new hyperparameters), then the current values of parameters c 1 and c 2 are accepted and the iterations are continued. If the difference in energy is positive (i.e., new control parameters decreased the efficiency of PSO), then the accepting probability of the new c 1 and c 2 is given by P a = exp(− ΔE/T * ), where T * is the current temperature of the system. Temperature is reduced logarithmically according to T (new) = T 0 /lg(1 + q) as suggested by Geman and Geman (1984), where q is the number of iterations computed so far and T 0 is the starting temperature of the system. The new hyperparameters are accepted only if a random number generated from 0 to 1 is smaller than P a . This mechanism of accepting worse solutions prevents SA from being stuck in a local optimum near the starting model and thus allows for the optimal selection of parameters c 1 and c 2 for PSO.

One-dimensional application
To test the suggested hyperparameter estimation assisted factor analysis, a direct push logging dataset is used that was measured in Bátaapáti, Southwest Hungary. The dataset contains a total of 12 sounding holes with the natural gamma-ray intensity, GR (cpm), cone resistance, RCPT (MPa), bulk density, DEN (g/cm 3 ), neutron porosity, NPHI (v/v) and resistivity, RES (ohmm) logs measured in the upper 20-28 m of unconsolidated loessysandy layers. The sounding holes are located along a 550 m long profile approximately 50 m from each other.
First, a one-dimensional application is shown for sounding hole 7 (SH7) where logging data is available for the interval of 0.5-27.7 m. To start off the procedure, DP logging data is standardized and then factor loadings are estimated by Eq. (4). The rotated factor loadings by the varimax algorithm are shown in Table 1. The first factor explains 71% of the total data variance, while the other 29% is explained by the second factor.
According to the computed factor loadings, the first factor correlates most with the bulk density, neutron porosity and resistivity logs, while the second factor is most influenced by the cone resistance and natural gamma-ray intensity logs. Then these factor loadings remain unchanged for the remainder of the procedure. To give an estimate to the factor scores f by PSO, first, a random population of 60 particles (each representing Fig. 1 Flowchart of the FA-PSO method aided by hyperparameter estimation a solution candidate for vector f) is generated with uniform distribution within the search space previously defined by solving Eq. (5). In this instance, the limits of factor scores are set as − 5 to 5. The inertia weight w for Eq. (9) is initialized as 1 and then adjusted in each iteration based on w (new) = w (old) × α, where α is a damping factor set to 0.99, while hyperparameters c 1 and c 2 are automatically selected by SA in the outer program loop. SA is run for 50 iteration steps to find the optimal values of c 1 and c 2 (Fig. 2).
The generated hyperparameters are plugged into PSO in each SA iteration step to test their effectiveness. With the currently set c 1 and c 2 parameters, PSO is run for 2000 iterations to given an estimate to the optimal values of factor scores in vector f. During the PSO runs, all 60 particles are adjusted according to Eqs. (8)-(9), and the objective function defined in Eq. (7) is recalculated with the new values of factor scores f in every iteration step. Figure 2 on the right depicts the final data distance of each PSO run with the corresponding c 1 and c 2 control parameters (Fig. 2 on the left). It can be seen that after approximately 15 iterations, SA finds the optimal c 1 and c 2 parameters and thereafter their value somewhat stabilizes and the data distance reached by PSO does not decrease any further. The optimal values of c 1 and c 2 in this case are found to be 1.54 and 2.29, respectively, which somewhat differ from the default values of 2 and 2, respectively (Kennedy and Eberhart 1995). Once the pre-defined number of iteration steps is reached, the final values of the factor scores (represented by the particle with the lowest data misfit) are accepted as the final solution. In this instance, the final data distance reached by PSO with the automatically selected hyperparameters at the end of the procedure was 0.45. Then the first factor (F 1 ) can be used to estimate the water saturation (S w ) of the penetrated formations by regression analysis (Szabó et al. 2012). In this paper, an exponential relationship is assumed between the first factor log and water saturation  where a, b and c are area specific regression coefficients. As a reference for regression analysis the S w result log taken from the complete quality controlled inversion (Drahos 2005) was applied. Figure 3 on the left depicts the exponential relationship between the first factor log and water saturation along sounding hole 7 based on Eq. (10). For this DP logging dataset, the regression coefficients with 95% confidence bounds are found to be a = 0. The results of the FA-PSO method in SH-7 are depicted in Fig. 4 along with the input direct push logs. The first five tracks contain the measured logs, track 6 contains the first (purple) and second factor (red) logs and the last track shows the water saturation estimates by inverse modeling (green) and by the FA-PSO method (blue) utilizing the first factor log.
It can be seen that the two water saturation estimates are almost identical, which validates the applicability of factor analysis for the processing of direct push logs and can serve as an independent tool for estimating water saturation in the shallow unconsolidated subsurface.
(10) S w = ae (bF 1 ) + c, Fig. 3 The exponential relationship between the first factor and water saturation in SH-7 (on the left), water saturation derived by local inversion and factor analysis (on the right) 1 3

Two-dimensional application
The processing of direct push logs by factor analysis can also be carried out in two-dimensions, which permits the simultaneous processing of DP logs recorded in neighboring sounding holes. Thus, a 2D section of water saturation can be estimated in one interpretation phase from the DP logs recorded in multiple sounding holes . First, gather the measured DP logs in vector d (h) defined in Eq. (6) from the hth hole (h = 1,2,…,H). Then the model of factor analysis can be extended for multiple sounding holes as where ̃ (h) denotes the matrix of factor loadings and f (h) denotes the vector of factor scores computed for the hth sounding hole. Since there are N h number of logged depth points in the hth sounding hole, the total number of depth points is N * = N 1 + N 2 + … + N H . The matrix of factor loadings in Eq. (11) is calculated similarly to the one-dimensional case by Eq. (4), and the optimal factor scores are estimated by PSO. Here it should be noted that the two dimensional case differs from the one dimensional case, because here measured DP logs are processed jointly from several sounding holes together while assuming that the same factor loadings are applicable for the whole exploration area. Once the factor logs are derived and are related to water saturation by regression analysis, they can be interpolated Fig. 4 The results of the FA-PSO method in SH-7 1 3 (e.g., by kriging) between the sounding holes to derive the map of water saturation for the investigated area.
To test this two-dimensional approach for water saturation estimation by factor analysis, data is collected from 12 sounding holes (SH-1 to SH-12), which are located along a 550 m long profile, approximately 50 m apart from each other. The natural gamma-ray intensity, GR (cpm), cone resistance, RCPT (MPa), bulk density, DEN (g/cm 3 ), neutron porosity, NPHI (v/v), resistivity, RES (ohmm) logs are all available along the penetrated sounding holes to serve as the input of the two-dimensional factor analysis. The total number of DP logging data from the 12 sounding holes combined is 15,500. By extracting 2 factors, the rotated factor loadings are given in Table 2. The first factor describes 72% of the total data variance, while the remaining 28% is described by the second factor.
The calculated factor loadings for the full dataset of all 12 sounding holes are essentially the same as the factor loadings estimated only in SH-7 (Table 1). The first factor correlates best with the bulk density, neutron porosity and resistivity logs, while the second factor is most influenced by the cone resistance and natural gamma-ray intensity logs. The remainder of the two-dimensional procedure is identical to the one-dimensional case. Factor loadings in Table 2 are fixed, and then by solving Eq. (5) for the factor scores, the boundaries of the search space for PSO is defined as − 5 to 30. Due to the larger dataset and thus increased number of unknowns, PSO here requires 7500 iterations to find the optimal values of the factor scores by utilizing 60 particles. Hyperparameters c 1 and c 2 are again automatically selected by SA in 50 iteration steps. The minimal data distance reached by PSO in finding the optimal values of factor scores by utilizing the SA derived hyperparameters is depicted in Fig. 5 on the right for all 50 iteration steps with the corresponding c 1 and c 2 parameters which are shown in Fig. 5 on the left.
It can be seen that after approximately 12 iterations, SA finds the optimal values of c 1 and c 2 , thereafter their value somewhat stabilizes and the data distance reached by PSO does not decrease any further. The optimal values of c 1 and c 2 are found to be 1.33 and 2.13, respectively. The final data distance reached by PSO with these parameters is 0.50. Then the resultant first factor log can be related to the water saturation of the investigated area by regression analysis based on Eq. (10) as seen in Fig. 6 on the left. As a reference for regression analysis, the quality checked local inversion derived water saturation values are used. In this instance, the regression coefficients with 95% confidence bounds are found to be a = 0  Fig. 6 water saturation estimated by local inversion and factor analysis is plotted, where the high correlation coefficient (r = 0.97) indicates that the two estimates are nearly linearly proportional and thus shows the reliability of the two dimensional factor analysis for water saturation estimation.
By interpolating the estimated water saturation logs between the sounding holes, one can derive the map of the water saturation of unsaturated formations along the processed profile  Fig. 7). Similarly, the results of local inversion estimates can also be interpolated between the holes, which is depicted in Fig. 8. The logs are located at approximately 50 m apart, sounding hole 7 processed in the previous section is located at 300 m. The layers with different water saturations can be easily recognized on the derived maps and thus can help with site characterization. The good correlation between the two estimates confirm the applicability of the two-dimensional factor analysis for water saturation estimation from direct push logs.  The exponential relationship between the first factor and water saturation in sounding holes 1-12 (on the left), water saturation derived by local inversion and the two dimensional factor analysis (on the right)

Conclusions
The paper presents the results of a machine learning-based log analysis method for direct-push logging data. The suggested factor analysis based approach is shown to effectively reduce the dimension of direct push logging datasets and the newly extracted factor logs by the FA-PSO method can be used to estimate water saturation along arbitrary long sounding hole intervals. By incorporating more than one sounding hole into the procedure, even two-dimensional water saturation profiles can be derived by simultaneously processing DP logs from neighboring holes. The result of the presented method is also verified by independent inversion based estimates of water saturation. It is also shown that simulated annealing is capable to automatically select some of the hyperparameters of the utilized particle swarm optimization algorithm. Thus, the uncertainty of the optimization algorithm can be reduced, since the manual selection of the c 1 and c 2 parameters is no longer necessary. The suggested machine learning tool assures