Introduction

Integrating rock facies classification into formation permeability modeling, given core measurements and well log interpretations, is a crucial step in reducing the uncertainty of reservoir characterization (Xu et al. 2012). Rock facies classification improves the relationship between permeability and porosity and thereby allows the petrophysical properties in noncored intervals to be estimated efficiently (Lee and Datta-Gupta 1999). The discrete facies sequence is either described from core measurements (lithofacies) or clustered from well logging data (electrofacies) (Al-Mudhafar and Bondarenko 2015; Lee and Datta-Gupta 1999; Nashawi and Malallah 2009; Tang et al. 2004). The classification procedure starts by modeling the discrete facies distribution as a function of well logging data over the limited intervals where facies are known. Based on that model, the facies distribution is then predicted for all depth intervals of the well and for other wells that have no facies measurements.

Many algorithms have been adopted for lithofacies/electrofacies classification, such as linear discriminant analysis (Al-Mudhafar 2014, 2015a, b; Lee and Datta-Gupta 1999; Rafik and Kamel 2016), multinomial logistic regression (Al-Mudhafar 2014; Tang et al. 2004), neural networks (Avseth and Mukerji 2002; Tang 2008; Wong et al. 1995), kernel support vector machine (Al-Mudhafar 2015a, b, 2017a, b), and principal component analysis (Adoghe et al. 2011). All these algorithms predict the discrete and continuous probability distributions of facies.

In permeability modeling and prediction, many algorithms have been adopted to predict permeability from core measurements and/or well logging records, in addition to rock facies. These algorithms include multiple linear regression (Dahraj and Bhutto 2014; Mohaghegh et al. 1997; Xue et al. 1997), generalized additive modeling (Al-Mudhafar and Mohamed 2015; Al-Mudhafar and Bondarenko 2015; Lee et al. 2002; Rafik and Kamel 2016), multivariate adaptive regression splines (Al-Mudhafar and Al-Khazraji 2016; Xie 2008), neural networks (Lee and Datta-Gupta 1999; Lee et al. 2002; Mohaghegh et al. 1997), fuzzy logic (Nashawi and Malallah 2009), and support vector regression (Al-Anazi and Gates 2011).

In this paper, probabilistic neural networks (PNNs) and the generalized boosted regression model (GBM) were employed for lithofacies classification and core permeability estimation, respectively.

Research methodology

Linking reservoir parameters measured at distinct scales is a complex procedure because geological systems generally behave nonlinearly. Therefore, it is essential to consider nonlinear algorithms to model the response parameters, such as permeability or rock facies, as a function of independent variables (predictors), for instance well log interpretations. Since the data come from different sources with scales ranging from a few inches, as in core measurements, to a few feet, as in well logging data, it is important to look for the most efficient approach to model these different data sources. In this paper, we introduce an efficient workflow that integrates probabilistic neural networks (PNNs), as a nonlinear facies classification algorithm, into the generalized boosted regression model (GBM), as a nonlinear modeling algorithm, for core permeability modeling and prediction. To the best of the author’s knowledge, the GBM algorithm has not previously been adopted, at least in the petrophysical property modeling literature, to model core permeability as a function of well logging and facies data.

PNN is an implementation of a statistical algorithm called kernel discriminant analysis in which the operations are organized into a multilayer feedforward network with four layers: input, pattern, summation, and output. GBM is a recent data mining technique that has shown considerable success in predictive accuracy and that can optionally constrain the relationship between the response and individual predictors to be monotonic. In particular, PNN was employed to model the lithofacies sequence in order to predict the discrete lithofacies distribution for the entire reservoir thickness, including the missing intervals. The predicted discrete facies distribution was then included as a predictor in the multivariate permeability modeling through the GBM approach. The GBM was employed to build a nonlinear relationship between core permeability and well logging data conditioned on the lithofacies. More specifically, it was essential to model the permeability as a function of well logging data given each rock facies in order to estimate the core permeability in noncored intervals and other wells while preserving the reservoir heterogeneity. The well log interpretations of neutron porosity, shale volume, and water saturation, along with the core measurements of permeability and lithofacies, were obtained for a well in the upper sandstone reservoir of the Zubair formation in the South Rumaila oil field, located in southern Iraq. The general permeability model is given by the following equation:

$$\begin{aligned} y=f(x_{i}) + \epsilon _{i} \end{aligned}$$
(1)

where \(x_{i}\) refers to the independent variables (predictors), y is the expected core permeability, and \(\epsilon _{i}\) is the residual.

To show the efficiency of the GBM algorithm, its performance was compared with that of conventional multiple linear regression (MLR). The root-mean-square prediction error (RMSE) and the adjusted R-square (\(R^{2}_\mathrm{adj}\)) were used as statistical validation measures to quantify the mismatch between the observed core permeability and the values predicted by GBM and MLR. RMSE is the square root of the mean squared difference between the observed and predicted core permeability, and \(R^{2}_\mathrm{adj}\) is the proportion of variance explained by the permeability model, adjusted for the number of predictors:

$$\begin{aligned} \hbox {RMSE}=\sqrt{\frac{1}{n}\sum _{j=1}^{n} ({\hat{f}}_{j}(x_{i})-f_{j}(x_{i}))^2} \end{aligned}$$
(2)
$$\begin{aligned} R^{2}_\mathrm{adj}=1-\frac{(1-R^{2})(n-1)}{n-k-1} \end{aligned}$$
(3)

where \(R^{2}\) refers to the coefficient of determination in simple linear regression, or the coefficient of multiple determination in multiple linear regression, n is the number of measurements and k is the number of independent variables (predictors).
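As a minimal sketch of Eqs. 2 and 3, the two validation measures can be computed as follows. The sketch is written in Python rather than the R used in this study, the permeability values are hypothetical, and the definition of \(R^{2}\) as one minus the ratio of residual to total sum of squares is the standard one and is assumed here rather than taken from the text.

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values (Eq. 2)."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n measurements and k predictors (Eq. 3)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical core permeability values, for illustration only.
observed = [120.0, 45.0, 300.0, 10.0, 210.0, 80.0]
predicted = [110.0, 50.0, 290.0, 15.0, 220.0, 75.0]

r2 = r_squared(observed, predicted)
print("RMSE:", rmse(observed, predicted))
print("R2:", r2)
print("Adjusted R2 (k = 2):", adjusted_r_squared(r2, n=len(observed), k=2))
```

Because the correction factor (n - 1)/(n - k - 1) grows with k, adding a predictor raises \(R^{2}_\mathrm{adj}\) only when it improves \(R^{2}\) enough to offset the lost degree of freedom.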

All multivariate statistical analyses for lithofacies classification and permeability modeling, together with the visualization of results, were implemented in R, an open-source statistical computing language.

Probabilistic neural network

Specht (1990) first introduced the probabilistic neural network (PNN) as an efficient nonlinear classification algorithm. PNN is a type of supervised neural network used in pattern classification and recognition problems. The method was derived from the Bayesian network and a statistical algorithm called kernel Fisher discriminant analysis. In the PNN, the operations are arranged into a multilayer feedforward network that comprises four layers: input, pattern, summation, and decision/output. Figure 1 shows the interconnection between the processing units, or neurons, of each of the four successive layers (Mao et al. 2000).

Fig. 1 General architecture of probabilistic neural networks (PNNs) (Mao et al. 2000)

The PNN classifier was adopted for rock facies classification because it often learns more quickly than other types of neural network classifiers, such as the back-propagation network. In addition to fast training, PNN has other advantages over other neural networks, such as an inherently parallel structure. Training samples can also be added or removed without extensive retraining, and the classifier is guaranteed to converge to an optimal classifier as the size of the representative training set increases (Emary 2008). The entire facies classification procedure with the probabilistic neural network was implemented with the pnn package (Chasset 2015) and the doParallel package (Weston and Calaway 2015) in the R statistical language.
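The four-layer structure described above can be illustrated with a minimal sketch. It is written in Python and is not the pnn R package used in this study; it assumes a Gaussian kernel with a hypothetical smoothing parameter sigma, equal prior probabilities for the facies, and hypothetical training samples of (neutron porosity, shale volume, water saturation).

```python
import math
from collections import defaultdict

def pnn_classify(train_x, train_y, query, sigma=0.1):
    """Classify `query` with a minimal Gaussian-kernel PNN (equal priors assumed)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    # Input layer: the query vector is presented to every pattern unit.
    for x, label in zip(train_x, train_y):
        # Pattern layer: one Gaussian kernel unit per training sample.
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, query))
        # Summation layer: accumulate kernel activations per class.
        sums[label] += math.exp(-sq_dist / (2.0 * sigma ** 2))
        counts[label] += 1
    scores = {label: sums[label] / counts[label] for label in sums}
    # Decision layer: choose the class with the largest average activation.
    return max(scores, key=scores.get), scores

# Hypothetical training samples: (neutron porosity, shale volume, water saturation).
train_x = [
    (0.28, 0.05, 0.20), (0.26, 0.08, 0.25),  # sand
    (0.18, 0.35, 0.50), (0.16, 0.40, 0.55),  # shaly sand
    (0.08, 0.80, 0.95), (0.06, 0.85, 1.00),  # shale
]
train_y = ["sand", "sand", "shaly sand", "shaly sand", "shale", "shale"]

facies, scores = pnn_classify(train_x, train_y, (0.27, 0.06, 0.22))
print(facies, scores)
```

In this sketch, training amounts to storing the samples, which is why samples can be added or removed without retraining; the only tunable quantity is sigma.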

Generalized boosted regression model

The gradient boosting regression model (GBM) is a powerful machine learning tool developed by Friedman (2001, 2002) to capture complex nonlinear functional dependencies. The generalized boosted regression implementation extends Freund and Schapire’s AdaBoost algorithm and J. Friedman’s gradient boosting machine (Freund and Schapire 1997). GBM has been adopted efficiently in many data-driven tasks, with high accuracy in modeling and predicting response variables. In gradient boosting, a more accurate estimate of the response variable is obtained by consecutively fitting new models that reduce the discrepancy between the predicted and observed responses. The main idea of GBM is to construct each new base learner so that it is maximally correlated with the negative gradient of the loss function (Natekin and Knoll 2013). For a continuous response, the loss function can be Gaussian or Laplace, whereas binomial and AdaBoost loss functions are suitable for categorical responses.

The main idea behind loss functions in GBM is to penalize large deviations from the target outputs while giving little weight to small residuals. For a continuous response variable, the appropriate loss function is the squared-error (L2) loss, whose negative gradient with respect to the prediction is proportional to the residual. GBM can therefore be applied as successive residual fitting (Natekin and Knoll 2013). The GBM procedure begins by specifying a differentiable loss function, which quantifies the discrepancy between the observed and predicted response y, and an initial model F, which can be the average of y. Then, at each iteration until convergence, the negative gradients are calculated:

$$\begin{aligned} -g(x_{i})=-\frac{\partial L(y_{i},F(x_{i}))}{\partial F(x_{i})}. \end{aligned}$$
(4)

A regression tree h is then fitted to the negative gradients \(-g(x_{i})\) and added to the current model F. In this paper, the entire gradient boosting regression procedure was implemented with the gbm package in R (Ridgeway 2017).
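The residual-fitting loop described in this section can be sketched as follows. This is a simplified illustration in Python, not the gbm R package used in this study: it assumes the squared-error loss \(L=\frac{1}{2}(y-F)^{2}\), a single predictor, one-split regression trees (stumps) as base learners, and a hypothetical learning rate; the porosity and permeability values are invented for illustration.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_trees=200, learning_rate=0.1):
    """Boost stumps on the negative gradient of the squared-error loss."""
    initial = sum(y) / len(y)  # initial model F: the average of y
    trees = []
    fitted = [initial] * len(y)
    for _ in range(n_trees):
        # For L = (y - F)^2 / 2, the negative gradient (Eq. 4) is the residual y - F.
        residuals = [yi - fi for yi, fi in zip(y, fitted)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        # Update the model with a shrunken step along the fitted tree.
        fitted = [fi + learning_rate * tree(xi) for xi, fi in zip(x, fitted)]

    def model(xi):
        return initial + learning_rate * sum(tree(xi) for tree in trees)
    return model

# Hypothetical porosity (fraction) and permeability values, for illustration only.
porosity = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
permeability = [1.0, 3.0, 10.0, 40.0, 150.0, 500.0]

model = gradient_boost(porosity, permeability)
for xi, yi in zip(porosity, permeability):
    print(xi, yi, round(model(xi), 2))
```

The learning rate trades the number of trees against the step size: smaller steps generally need more iterations to reach the same training error.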

Reservoir and data description

In this paper, the well logging and core data were obtained for a well in the upper sandstone member of the Zubair formation in the South Rumaila oil field, located in southern Iraq. The giant South Rumaila oil field is composed of many oil-producing reservoirs. The Zubair formation is one of these reservoirs; it is represented by the Late Berriasian–Albian cycle and its sediments, which are of Lower Cretaceous age (Al-Mudhafar 2017a, b). The Zubair formation, with a thickness of 280–400 m, encompasses five members from top to bottom: upper shale member, upper sandstone member, middle shale member, lower sand member, and lower shale member. The upper sandstone member is the main pay zone of the South Rumaila oil field (Mohammed et al. 2010). The main pay comprises five sandstone-dominated units separated by two shale units. The shale units act as good barriers that impede vertical migration of the reservoir fluids, except in certain areas where they disappear. The five units are denoted from top to bottom as AB, DJ1, DJ2, LN1, and LN2, and the two shale layers, C and K, lie between AB and DJ1 and between DJ2 and LN1, respectively (Al-Ansari 1993). Figure 2 illustrates the geological column with the lithology description for a well in the South Rumaila oil field.

Fig. 2 Geological lithology column of the formations in South Rumaila oil field (Modified from Al-Ameri et al. 2009 and Mohammed et al. 2010)

The well log interpretations were considered for lithofacies classification and permeability modeling. They include neutron porosity, shale volume, and water saturation, all as a function of reservoir depth. In addition, the measured discrete lithofacies types, which were obtained from core measurements, include sand, shaly sand, and shale. Figure 3 shows the measured lithofacies and well log interpretations as a function of depth.

Fig. 3 Distribution of discrete lithofacies and well log interpretations. The measured lithofacies from core data analyses include sand, shale, and shaly sand only. The available well log interpretations are neutron porosity, shale volume, and water saturation

The total number of measurements in the dataset is 669. Each reading represents the measured property over a half-foot interval of reservoir depth. A summary of the entire dataset is given in Table 1.

Table 1 Summary of well log interpretations and core-based discrete lithofacies distribution for the well under study

Prior to implementing the permeability modeling, it was necessary to set up the cross-validation. The data were sampled and split into two parts: a training subset with 75% of the data and a testing subset with 25%. The models were fitted on the training subset, and predictions were made for the testing subset, which gives more confidence in the external (out-of-sample) predictions of the models. Table 2 shows the head of the sampled training subset, while Table 3 shows the head of the sampled testing subset.

Table 2 Head of the sampled training subset data for the well under study
Table 3 Head of the sampled testing subset data for the well under study
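The 75%/25% split can be sketched as below. The sketch is in Python rather than R; the random seed and the rounding of the training size are assumptions, and integer row indices stand in for the 669 depth readings.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle row indices and split them into training and testing subsets."""
    rng = random.Random(seed)  # fixed seed (an assumption) for a reproducible split
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_train = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test

# Stand-in for the 669 half-foot readings of the well under study.
rows = list(range(669))
train, test = train_test_split(rows)
print(len(train), len(test))  # 502 167 with this rounding rule
```

Every reading ends up in exactly one subset, so the testing error is computed only on data the model did not see during fitting.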

Results and discussion

After setting up the cross-validation, the probabilistic neural network was used to model the discrete lithofacies, coded as distinct integers [1, 2, ..., n], given the aforementioned well log interpretations. The learn and smooth functions in the pnn R package produce the observed and guessed discrete lithofacies distributions. The total percent correct (classification success rate) was calculated automatically from the counts of successfully and unsuccessfully classified points. The resulting success rate in this study was 95.81%, which is accurate enough for the PNN algorithm to be used for further facies prediction. The comparison between the observed and predicted lithofacies is shown in Fig. 4.
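The success rate described here is the percentage of depth points at which the predicted facies equals the observed one. A minimal Python sketch with hypothetical labels (not the output of the pnn package) is:

```python
from collections import Counter

def success_rate(observed, predicted):
    """Percentage of depth points whose predicted facies matches the observed one."""
    hits = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100.0 * hits / len(observed)

def confusion_counts(observed, predicted):
    """Count (observed, predicted) facies pairs to show where failures occur."""
    return Counter(zip(observed, predicted))

# Hypothetical observed and predicted facies, for illustration only.
observed = ["sand", "sand", "shaly sand", "shale", "shale", "shaly sand", "sand", "shale"]
predicted = ["sand", "sand", "shaly sand", "shale", "shaly sand", "shaly sand", "sand", "shale"]

print(success_rate(observed, predicted))  # 87.5 (7 of 8 points correct)
for (obs, pred), count in sorted(confusion_counts(observed, predicted).items()):
    print(obs, "->", pred, ":", count)
```

The pair counts complement the single success rate by showing which facies are confused with each other.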

Fig. 4 Box-plots of the observed and predicted discrete lithofacies by the PNN algorithm

In core permeability modeling, the predicted discrete lithofacies distribution of sand, shaly sand, and shale was included in the generalized boosted regression model (GBM) as a fifth predictor in addition to well depth, neutron porosity, shale volume, and water saturation. The gbm function in the gbm R package requires specifying the gaussian distribution for permeability, which sets the loss function. It also requires the number of iterations and the number of folds for the built-in cross-validation. To ensure more accurate modeling, the maximum number of iterations was set to 50,000 with twofold cross-validation, while the optimal number of iterations is determined automatically by the gbm.perf function. The gbm implementation also calculates the relative influence, which identifies the predictors that most influence the response (core permeability).

The lithofacies distribution has the greatest effect on the core permeability model because the permeability distribution is distinct within each facies (sand, shaly sand, and shale). Water saturation and depth are more influential than neutron porosity and shale volume. Shale volume is less influential than water saturation because the two behave similarly in the data, as both reflect the same rock and fluid characteristics. They are therefore collinear, and one of them should be removed from the regression model. The depth parameter also has a significant effect because it controls the locations of the high and low permeability ranges given the lithofacies within the reservoir thickness, as shown in Fig. 3, where sand is located in the top depth intervals, shale in the middle, and shaly sand, of medium permeability range, in the bottom depth intervals. Figure 5 illustrates the relative influence of the predictors in the GBM-core permeability modeling.
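One simple way to check the collinearity noted above is the Pearson correlation coefficient between the two predictors. The Python sketch below uses hypothetical shale volume and water saturation values; it does not reproduce the dataset of this study, and a correlation check is only one possible diagnostic.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical shale volume and water saturation fractions, for illustration only.
shale_volume = [0.05, 0.08, 0.35, 0.40, 0.80, 0.85]
water_saturation = [0.20, 0.25, 0.50, 0.55, 0.95, 1.00]

print(pearson(shale_volume, water_saturation))
```

A coefficient close to 1 or -1 indicates that the two predictors carry largely the same information, which is the situation in which one of them is usually dropped from a linear regression.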

Fig. 5 Relative influence of the predictors in the GBM-core permeability modeling

The maximum number of iterations for the gbm function was set to 50,000. However, the number of iterations that gave the best fit and the lowest squared-error loss was 49,995, as indicated automatically by the gbm.perf function in the R gbm package and shown in Fig. 6. The gbm.perf function estimates the optimal number of boosting iterations for a gbm object and optionally plots various performance measures (Ridgeway 2017). In Fig. 6, the black and green lines refer to the training and testing Bernoulli deviances, respectively. The iteration (tree) selected for prediction, indicated by the vertical blue dashed line, is the tree that minimizes the testing error on the cross-validation folds (Ridgeway 2017).

Fig. 6 Optimal number of iterations in the GBM algorithm indicated by the vertical blue dashed line, while the black and green lines refer to the training and testing Bernoulli deviances, respectively

The GBM algorithm produced accurate permeability modeling and prediction. The accuracy is evident from the excellent match between the observed and predicted core permeability, as shown in the scatterplot in Fig. 7. The computed root-mean-square prediction error (RMSE) was 28.43, and the adjusted R-square was 0.9953. In Fig. 7, the predicted and observed permeability fall on the same linear trend with negligible misfit. The same can be seen in Fig. 8, which shows the match between the vertical distributions of predicted and observed permeability along the well depth. The adjusted R-square of 0.9953 indicates that the GBM model explains almost all of the variance in the observed core permeability.

Fig. 7 Scatterplot of predicted and observed core permeability given the testing subset by GBM algorithm

Fig. 8 Matching between the entire vertical predicted and observed core permeability distributions by GBM algorithm

Further validation

To show the effectiveness of the generalized boosted regression in core permeability modeling, the same procedure was repeated with conventional multiple linear regression (MLR). The same dataset was sampled and split into training and testing subsets through the cross-validation. The adjusted R-square and RMSE computed for the MLR were 0.9551 and 88.42, respectively. There was a significant difference between the GBM and MLR predictions, as the RMSE for GBM is much lower than for MLR, which reflects a smaller mismatch between the observed and predicted core permeability for the GBM. Figure 9 illustrates the scatterplot of the MLR-predicted against the observed core permeability. Figure 10 depicts the match between the core permeability predicted by MLR and the observed measurements for the entire reservoir depth. There is a clear mismatch between the MLR-predicted and observed permeability for many depth intervals.

Fig. 9 Scatterplot of predicted and observed core permeability given the testing subset by MLR algorithm

Fig. 10 Matching between the entire vertical predicted and observed core permeability by MLR algorithm

Summary and conclusions

In order to predict the lithofacies and core permeability at the noncored intervals of a reservoir while preserving its heterogeneity, a powerful workflow that includes lithofacies in the permeability modeling should be adopted. That workflow was implemented in this paper through probabilistic neural networks (PNNs) for lithofacies classification and the generalized boosted regression model (GBM) for permeability modeling. More specifically, the discrete core-measured lithofacies of sand, shaly sand, and shale were modeled as a function of the well log interpretations of neutron porosity, shale volume, and water saturation through the PNN algorithm. Then, the predicted discrete lithofacies distribution and the well log interpretations were included in the GBM model for the core permeability modeling.

The PNN classification algorithm led to accurate facies classification, as the total success rate exceeded 95%. There was also an excellent match between the predicted and observed discrete distributions of lithofacies. That reflects the ability of the PNN algorithm to capture nonlinear relationships between data from various sources and scales.

The GBM algorithm created a nonlinear relationship, which is necessary for nonlinear geological systems, between the core permeability and the predictors of well log interpretations and lithofacies. The nonlinear modeling resulted in an excellent match between the observed and predicted core permeability for all depth intervals of the reservoir, with a very small prediction error. Additionally, the efficiency of the GBM results was validated by repeating the permeability modeling procedure with conventional multiple linear regression (MLR). The match between the predicted and observed permeability was much more accurate for GBM than for MLR, as the root-mean-square prediction error was lower and the adjusted R-square was higher for GBM than for MLR.

To validate the results of the presented algorithms, cross-validation was set up prior to the permeability modeling by sampling and splitting the data into training and testing subsets. The models were fitted on the training subset, and predictions were then made for the testing subset in order to obtain an external prediction from the same dataset and reduce the prediction uncertainty.

Finally, a precise prediction of rock facies improves the well logging–permeability relationships and thereby yields adequate permeability modeling that preserves reservoir heterogeneity. Additionally, integrating facies characterization into the permeability modeling results in accurate identification of the vertical and spatial facies distribution and improves the petrophysical property distribution, for better overall reservoir characterization.