Introduction

Mudstones are widely occurring siliciclastic sedimentary rocks that act as source, cap and reservoir rocks in hydrocarbon systems (Aplin and Macquaker 2011). They contain hidden organic-rich sweet spots and shale gas reservoirs that are favorable for petroleum production. Mudstone reservoirs are complex unconventional geological systems and pose challenges to conventional lithofacies interpretation techniques. Extraction of hydrocarbons from these unconventional resources requires accurate information about the formation lithofacies, their association with the petrophysical properties of the reservoir rock and their spatial distribution (Spain et al. 2015). Conventionally, subsurface mudstone facies are recognized through qualitative analysis using core analysis, geomechanical spectroscopy logs, Rock–Eval pyrolysis, etc. (Bhattacharya et al. 2016). This conventional methodology is inconvenient, tedious and expensive, and it requires considerable domain expertise.

Recognition of subsurface lithofacies is a much-researched topic and remains a challenging problem because of the uncertainty associated with reservoir measurements (Chaki et al. 2015; Bhattacharya et al. 2016). Quantitative modeling of lithofacies is essential to assess the potential of unconventional hydrocarbon reservoirs lying in mudstone formations. It also helps to understand the diagenetic and depositional burial history of these reservoirs (Aplin and Macquaker 2011). Quantitative information about the underlying mudstone lithofacies can be extracted from conventional well logs, which record the physical properties of rocks along the reservoir depth. Several logging techniques such as wireline, logging-while-drilling, measurement-while-logging, etc., are utilized to generate a wide variety of well logs for measuring petrophysical properties of reservoir rock (Anifowose et al. 2015, 2017). Various researchers have reported that logging data are nonlinear, high dimensional, complex and noisy owing to the heterogeneity and spatial distribution of reservoir properties (Kocberber and Collins 1990; Chaki et al. 2015; Bhattacharya et al. 2016; Tewari and Dwivedi 2018a). Therefore, manual identification of geological lithofacies from well logs is an impractical and tedious job even for expert field engineers. Thus, advanced predictive machine learning models are suggested for extracting the lithofacies information from well logs.

Several machine learning models have been proposed to extract facies information for conventional reservoirs using well logs data. However, only a few research works are available for unconventional mudstone reservoirs. Machine learning paradigms utilized for quantitative lithofacies modeling of mudstone lithology are limited to unsupervised and supervised classifiers (Qi and Carr 2006; Ma 2011; Wang and Carr 2012; Anifowose et al. 2015; Bhattacharya et al. 2016). Aplin and Macquaker (2011) published a comprehensive review of mudstone lithology. They also studied the roles played by mudstone, viz. as an organic-rich source rock, cap rock and reservoir rock, from generation to storage of hydrocarbons. Li and Schieber (2017) conducted a detailed study of the mudstone facies of the Henry Mountain Region of Utah. Qi and Carr (2006) employed an artificial neural network (ANN) model for the identification of carbonate lithofacies in Southwest Kansas from well logs data. Wang and Carr (2012) applied discriminant analysis, ANN, support vector machine (SVM) and fuzzy logic techniques for lithofacies modeling of the Appalachian basin, USA. They utilized core and seismic data along with well logs to develop a 3-D model of shale facies at the regional scale. Avanzini et al. (2016) implemented unsupervised cluster analysis for lithofacies classification to identify hidden productive sweet spots in the Barnett Shale formation. Bhattacharya et al. (2016) compared the performance of unsupervised and supervised machine learning models for mudstone facies present in the Mahantango-Marcellus and Bakken Shales, USA. Bhattacharya et al. (2019) applied SVM for the identification of shale lithofacies of the Bakken formation in North Dakota, USA. Table 1 contains a summary of important published research works related to mudstone lithofacies classification utilizing machine learning techniques.

Table 1 Summary of some important published research works related to mudstone lithofacies classification utilizing machine learning techniques

All the above-mentioned research works are based on single supervised or unsupervised classifiers. However, it has been shown that the performance of single classifiers can be improved using hybrid computational models such as multiple classifier systems, committees of machines, composite systems, etc. (Dietterich 2000; Skurichina and Duin 2001). Multiple classifier systems, such as ensemble methods, can extract more valuable information from raw sensor data. They combine the decisions of several classifiers for classification and regression tasks. The ensemble approach can be categorized into two types: (a) homogeneous ensemble methods (HoEMs) such as bagging, random forest, rotation forest, random subspace, etc., and (b) heterogeneous ensemble methods (HEMs) such as voting, stacking, etc. HoEMs combine several hypotheses generated in feature space by base classifiers of an identical type (e.g., a cluster of hundreds of SVMs). In HEMs, different types of classifiers are utilized to generate and combine diverse hypotheses to achieve the maximum possible prediction accuracy for the existing feature space. It has been shown that heterogeneity in base classifiers helps to develop more reliable, robust and generalized classifier models (Sesmero et al. 2015). Ensemble methods have the capability to handle complex, nonlinear, multidimensional and imbalanced petroleum data (Tewari and Dwivedi 2018b; Anifowose et al. 2015; Dietterich 2000; Skurichina and Duin 2001). However, ensemble methods have not been thoroughly explored and applied in the petroleum research domain. Only limited applications of the ensemble approach can be found in the petroleum domain, as explained briefly in the next section of the paper.

The performance of HEMs, namely voting and stacked generalization ensembles, along with four other popular classifiers, has been studied for the recognition of mudstone lithofacies in the paper. This research work is primarily focused on the construction of diverse base classifiers contained in HEMs that can actually outperform single classifiers and HoEMs for lithofacies identification. A comparative study has been performed among HEMs and also with four contemporary classifiers for the prediction of lithofacies. The suitability of HEMs has also been evaluated for the identification of lithofacies. The complications associated with the development of HEMs have been discussed in this research work. Four popular supervised classifiers, viz. multilayer perceptron (MLP), SVM, gradient boosting (GB) and random forest (RF), have been combined in HEMs as base classifiers to provide more accurate and generalized results. The performance of these classifiers has been evaluated using Kansas oil-field data with proper parameters optimization in their stable search ranges. Validation curve and grid search algorithm have been applied for parameters tuning to achieve maximum classification accuracy. The contribution of each input well logs, for the pattern recognition of mudstone lithofacies, has been studied using Relief algorithm. Relief algorithm is utilized for the attribute selections owing to its capability of identifying discriminatory information. Overall, this research work assesses the pattern recognition competency of HEMs for complex mudstone lithofacies.

Heterogeneous ensemble methods

Heterogeneous ensembles are lesser-known intelligent algorithms in the petroleum domain. They generate several different hypotheses in feature space using diverse base classifiers and combine them to achieve the most accurate results possible. Sesmero et al. (2015) showed that diverse classification hypotheses in feature space are essential for the development of reliable, robust and generalized ensemble classifiers. Thus, HEMs are investigated in this paper for solving the multiclass lithofacies recognition problem in the quest for higher classification accuracy. Extraction of valuable lithofacies information from well logs data is a challenging task even for intelligent HoEMs. Spatial distribution and heterogeneous behavior of the hydrocarbon reservoir properties contribute to complexity, nonlinearity and uncertainty in all types of sensor-based measurements (Chaki et al. 2015; Bhattacharya et al. 2016). Also, no standard tools or techniques are currently available that can measure reservoir heterogeneity and its influence on other reservoir properties, well logs, drilling data, etc. Therefore, HEMs are employed for recognition of subsurface lithofacies in this paper.

Related works

Ensemble methods are a less frequently applied methodology in the oil and gas industry. In the petroleum domain, HoEMs are mostly implemented for drilling parameter estimation and reservoir characterization. Only a few applications of HEMs are found in the existing literature. Santos et al. (2003) utilized a neural net ensemble for the recognition of underlying lithofacies. Gifford and Agad (2010) applied collaborative multiagent classification techniques for the identification of lithofacies. Masoudi et al. (2012) integrated the outputs of Bayesian and fuzzy classifiers to recognize productive zones in the Sarvak Formation. Anifowose et al. (2015) utilized the stacked generalization ensemble for enhancing the prediction capability of supervised learners for reservoir characterization. Anifowose et al. (2017) reviewed the applications of ensemble methods and suggested that ensemble methods are suitable for solving problems of the oil and gas industry. Bestagini et al. (2017) implemented the random forest ensemble for lithofacies classification of Kansas oil-field data. Xie et al. (2018) published a comparative study on the performance of HoEMs for the recognition of lithofacies. Tewari and Dwivedi (2018b) compared five HoEMs for the identification of lithofacies. Bhattacharya et al. (2019) applied ANN and random forest algorithms to predict daily gas production from an unconventional reservoir. Tewari et al. (2019) applied HoEMs for the prediction of reservoir recovery factors. HEMs have given higher classification performance than single supervised classifiers as well as HoEMs in several engineering fields such as remote sensing (Healey et al. 2018), prostate cancer detection (Wang et al. 2019), load forecasting (Ribeiro et al. 2019), wind speed forecasting (Liu and Chen 2019), etc. Therefore, HEMs are investigated in this paper to identify subsurface lithofacies. The two types of HEMs utilized for the identification of lithofacies are briefly explained below.

Stacked generalization ensemble

Stacked generalization ensemble is popularly known as stacking (Wolpert 1992). It combines decisions of different base classifiers in a single-ensemble architecture. Different base classifiers search the feature space with their diverse perspectives to find the maximum possible accurate hypotheses for a given classification task (Anifowose et al. 2015). The classification outcomes of base classifiers are combined together by a meta-classifier to provide final classification result. The combination of base classifiers’ outcomes is decided by the meta-classifier algorithm. Figure 1 shows the conceptual architecture of the stacking ensemble utilized for quantitative lithofacies modeling in this study.

Fig. 1 Conceptual architecture of the stacking ensemble utilized for quantitative lithofacies modeling in this study

A stacking ensemble can also be created by merging the decisions of similar base classifiers that have different parameter values. The selection of the base and meta-classifier combination is always a matter of concern during the design of a stacking ensemble architecture, and it is difficult to design the most suitable configuration of classifiers in a large feature space. Wolpert (1992) showed that the stacking ensemble reduces the generalization error by decreasing the bias and variance errors associated with data. Initially, input data are split into training and testing datasets. The training dataset is then split into K equal-sized subsets, as in the K-fold cross-validation technique. Base classifiers are trained on (K − 1) subsets, while the Kth subset is retained as a validation set. After training with (K − 1) subsets, base classifiers are individually tested with the Kth validation subset and also with the testing data. The outcomes of each base classifier on the validation and test datasets act as the new training and testing data, respectively, for the meta-classifier. The meta-classifier is thus trained with the prediction outcomes on the validation set and the actual values of the target variable.
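The stacking procedure described above can be illustrated with a minimal Python sketch. It is not the configuration used in this study: the synthetic data, the three base classifiers, the five folds and the logistic regression meta-classifier are all assumptions chosen for brevity. It also uses a common variant in which each base classifier is refitted on the full training set before predicting the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_predict,
                                     train_test_split)
from sklearn.svm import SVC

# Synthetic stand-in for well-log data (assumption: 9 features, 3 classes).
X, y = make_classification(n_samples=600, n_features=9, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Illustrative base classifiers with untuned parameters.
base_classifiers = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    SVC(probability=True, random_state=0),
]
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

meta_train, meta_test = [], []
for clf in base_classifiers:
    # Out-of-fold probabilities: every training sample is predicted by a
    # model fitted on the other K - 1 folds (the held-out validation idea).
    meta_train.append(cross_val_predict(clf, X_train, y_train, cv=folds,
                                        method="predict_proba"))
    # Variant: refit on the whole training set to build test meta-features.
    meta_test.append(clf.fit(X_train, y_train).predict_proba(X_test))

meta_train = np.hstack(meta_train)  # new training data for the meta-classifier
meta_test = np.hstack(meta_test)    # new testing data for the meta-classifier

# Assumed meta-classifier: multinomial logistic regression.
meta_classifier = LogisticRegression(max_iter=1000)
meta_classifier.fit(meta_train, y_train)
stack_accuracy = meta_classifier.score(meta_test, y_test)
print(f"Stacking test accuracy: {stack_accuracy:.3f}")
```

Using out-of-fold predictions as meta-features keeps the meta-classifier from being trained on outputs that the base classifiers produced for samples they had already seen.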

Voting ensemble

A voting ensemble combines the decisions of different base classifiers for a given classification or estimation task. It provides flexibility in combination strategies so that the maximum possible classification accuracy can be achieved. Unlike the stacking ensemble, it does not utilize a learning algorithm to combine the predictions of the base classifiers. Instead of a meta-classifier, two combination schemes can be implemented for merging the decisions of several classifiers to predict the class labels of test samples, namely the majority vote rule (hard voting) and the average of predicted confidence probabilities (soft voting) (Kittler et al. 1998). In hard voting, class labels of test samples are decided by the majority voting rule. Every base classifier individually assigns a class label to a given test sample during the testing phase, and the final class of the test sample is the label assigned to it by the largest number of base classifiers. On the other hand, the soft voting strategy initially assigns a weight to each base classifier. During the testing phase, every base classifier generates prediction probabilities of each test sample belonging to the various classes. These probabilities are then multiplied by the weights assigned to the respective base classifiers and averaged. Each test sample is finally assigned to the class that achieves the highest average confidence probability. Mathematically, the soft voting technique classifies data samples as the argmax (argument of the maxima) of the weighted sum of the assigned probabilities (Kittler et al. 1998; Kuncheva 2004). Figure 2 shows the conceptual architecture of the voting ensemble utilized for quantitative lithofacies modeling in this study.
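The difference between the two combination rules can be seen in a small numerical sketch. The probability values and classifier weights below are hypothetical and serve only to show that hard and soft voting may disagree on the same sample.

```python
import numpy as np

# Hypothetical class probabilities from three base classifiers for two test
# samples and three classes; shape (n_classifiers, n_samples, n_classes).
probas = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],  # classifier 1
    [[0.4, 0.5, 0.1], [0.1, 0.3, 0.6]],  # classifier 2
    [[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]],  # classifier 3
])
weights = np.array([2.0, 1.0, 1.0])  # illustrative classifier weights

# Hard voting: each classifier casts one vote for its most probable class,
# and the class with the most votes wins.
votes = probas.argmax(axis=2)  # shape (n_classifiers, n_samples)
n_classes = probas.shape[2]
hard_pred = np.array([
    np.bincount(votes[:, j], minlength=n_classes).argmax()
    for j in range(votes.shape[1])
])

# Soft voting: weighted average of the class probabilities, then argmax.
avg_proba = np.average(probas, axis=0, weights=weights)
soft_pred = avg_proba.argmax(axis=1)

print("hard voting:", hard_pred)  # [1 2]
print("soft voting:", soft_pred)  # [0 2]
```

For the first sample, two of the three classifiers vote for class 1, so hard voting selects class 1. The more heavily weighted first classifier is confident in class 0, which gives class 0 the highest weighted average probability (0.475 against 0.425) under soft voting.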

Fig. 2 Conceptual architecture of voting ensemble utilized for quantitative lithofacies modeling in this study

Experimental evaluation

In this paper, two HEMs were utilized for quantitative lithofacies modeling. The primary goal of this research work was to achieve higher classification performance using the HEMs approach. HEMs were trained and tested on real-field well logs data along with other popular classifiers, namely RF, MLP, SVM and GB. All the ensemble methods were implemented in Python using the Scikit-learn library. Figure 3 portrays a portion of the geological lithofacies setting of the Deforest well in the Kansas oil and gas field.

Fig. 3 Well logs of Kansas hydrocarbon-producing Deforest well (KGS database)

A brief description of Kansas field

The Kansas region is mainly composed of sedimentary rocks with a maximum thickness of 2850 m. A large number of unconformities occur in the Kansas region, with the sedimentary strata representing 15–50% of the post-Precambrian period (Merriam 1963). Northeastern Kansas is covered by Pleistocene glacial deposits. A thick layer of Mesozoic rock is present in the western Kansas region. Mesozoic rock layers are mainly made up of limestones, chalks, sandstones, marine shales and nonmarine shales. The Panoma and Hugoton fields in western Kansas comprise large natural gas-producing reservoirs. The Pennsylvanian and Permian systems are the most extensive rock structures and contain bedded rock salts in several layers. The pre-Pennsylvanian system in Kansas contains dolomites and marine limestones layered alternately with sandstones and shales. The Precambrian basement is composed mainly of quartzite, granite and schist. Permian strata contain the carbonate reservoirs that produce the majority of natural gas. In 1992, Mississippian strata produced 43% of the cumulative hydrocarbon production of the Kansas field, out of which 19% contributed to cumulative oil production (Newell 1987a). The numerous unconformities in the Kansas region aid the trapping and migration of petroleum. The basal Pennsylvanian in Kansas holds large hydrocarbon deposits along its length. A detailed description of the petroleum geology of the Kansas region can be found in Newell (1987a, b), Merriam (1963), Adler et al. (1971) and Jewett and Merriam (1959). Manual interpretation of the well logs data of such a large hydrocarbon-producing region is time-consuming and costly. Therefore, automatic detection and identification of subsurface lithofacies using machine learning algorithms is highly desirable to minimize cost and time.

Data description

The well logs data used for the development of ensemble methods were obtained, for research purposes only, from the Kansas Geological Survey (KGS) Web site (Kansas 2009), which hosts a very large well logs data repository. The downloaded digital well logs (“Las” files) contain 13,000 data samples, out of which 3425 samples were extracted belonging to nine different lithofacies, viz. dolomitic wackestone (DW) (1015), clay (CL) (320), dolomitic mudstone (DM) (240), dolomitic sandstone (DS) (455), siltstone (SS) (85), dolomitic packstone (DP) (265), carbonate mudstone (CM) (520), packstone (PS) (465) and wackestone (WS) (60); the number in parentheses after each abbreviation is the sample count of that lithofacies. These lithofacies serve as class labels for the classification of well logs data. The downloaded “Las” files belong to the Paradise A, Deforest and Strahm wells in the Kansas field. These files also contain information about the mineral contents and lithofacies prevailing in these wells. The “Las File Viewer” app was downloaded from the KGS Web site to visualize the geological settings and facies of these wells. Table 2 contains the range of the different well logs acting as input predictor variables. These logs and the recorded information about different reservoir properties are used in the pattern recognition of lithofacies. The well logs are also accessible in the form of digital log (.csv) files. These files contain null or missing values in the data samples, generated by calibration loss of logging sensors, faulty logging tools, etc. Figure 4 shows the generalized workflow for the HEMs to recognize the subsurface lithofacies.

Table 2 The statistical description of input well logs data

Fig. 4 A generalized conceptual workflow for the HEMs to recognize the subsurface lithofacies

Data preprocessing

Diaz et al. (2018) suggested that the preprocessing of petroleum data, such as resampling, normalization, noise filtering, attribute selection, etc., helps to improve the classification or estimation accuracy of intelligent algorithms. Initially, resampling of well logs data was done to eliminate samples containing null, garbage and missing values. After resampling, the input data were normalized to reduce the impact of larger values on the smaller values of predictor variables. Input data can be normalized as given below.

$$X_{{i,{\text{norm}}}} = \frac{{X_{i} - X_{\text{Min}} }}{{X_{\text{Max}} - X_{\text{Min}} }}$$
(1)

where $X_{\text{Max}}$ and $X_{\text{Min}}$ are maximum and minimum values of the respective predictor variables. Equation (1) represents the Min–Max normalization technique that ensures all the input variables are equally scaled. Min–Max normalization is preferred over other scaling techniques especially when input data distribution does not follow Gaussian or normal distribution. It is highly beneficial for those machine learning algorithms that involve distance calculation or optimization in their internal mathematical architectures such as ANNs, K-means clustering, SVM, etc. (Mustaffa and Yusof 2010; Shalabi et al. 2006; Suarez-Alvarez et al. 2012).
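Equation (1) can be applied column by column to a matrix of log readings, as in the following minimal sketch. The function name and the sample values are hypothetical; the guard for constant columns is an added assumption to avoid division by zero.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column of X to the range [0, 1] following Eq. (1)."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = x_max - x_min
    span[span == 0] = 1.0  # constant columns map to 0 instead of dividing by zero
    return (X - x_min) / span

# Hypothetical readings of two logs with very different numeric ranges.
logs = np.array([[20.0, 5.0],
                 [80.0, 15.0],
                 [140.0, 30.0]])
scaled = min_max_normalize(logs)
print(scaled)
```

After scaling, both columns span the interval [0, 1], so the log with the larger numeric range no longer dominates distance calculations.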

Noise filtering

Noise filtering of well logs data was done to minimize the effects of noise during the pattern recognition of lithofacies, because noise degrades the pattern recognition ability of intelligent classifiers. Tewari and Dwivedi (2019) studied the influence of noise levels on the performance of supervised classifiers and reported its damaging effects. Several denoising techniques are available in the petroleum and geophysics literature, such as low-pass filter, high-pass filter, Savitzky–Golay, wavelet denoising, moving average, Gaussian, etc. The Savitzky–Golay (SG) smoothing filter has been found to be a suitable and widely utilized noise filtering technique for geophysical data (Baba et al. 2014). It is a digital smoothing filter that fits a polynomial of degree n to the data within a moving window by the linear least squares method and preserves the signal tendency through convolution (Baba et al. 2014). High peaks (spikes) in the well logs were treated as noise components and eliminated by smoothing the log waveforms with SG filters. The degree of the polynomial fitted to the well logs was found to vary from 5 to 13. Figure 5 shows four important well logs with their original and denoised waveforms. Fifteen input well logs were smoothed using the SG filter and passed to the Relief algorithm for the selection of important well logs.
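SG smoothing of a single log can be sketched with SciPy's `savgol_filter`. The synthetic signal, the noise level, the window length of 31 samples and the polynomial degree of 5 are assumptions for illustration and are not the settings used in this study.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
depth = np.linspace(0.0, 1.0, 400)            # normalized depth axis
clean = np.sin(2 * np.pi * 3 * depth)         # smooth synthetic "log"
noisy = clean + rng.normal(0.0, 0.3, depth.size)

# Fit a degree-5 polynomial in a sliding 31-sample window (assumed values);
# window_length must be larger than polyorder.
denoised = savgol_filter(noisy, window_length=31, polyorder=5)

mse_before = np.mean((noisy - clean) ** 2)
mse_after = np.mean((denoised - clean) ** 2)
print(f"MSE before: {mse_before:.4f}, after: {mse_after:.4f}")
```

A larger window or a lower polynomial degree smooths more aggressively, at the risk of flattening thin-bed responses that carry genuine lithological information.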

Fig. 5 Four original well logs shown with their denoised waveforms using SG smoothing filter

Attribute selection

Important well logs were selected to decrease the dimensionality of data by removing the redundant logs. The high dimensionality of logs data increases computational cost and time during the pattern recognition of lithofacies. Several attribute selection paradigms are available in the literature, such as forest-of-trees-based attribute selection, univariate feature selection, the Relief algorithm, etc. Overlapping lithofacies can only be identified by recognizing the discriminatory information contained in the attributes or logs, and this information plays a decisive role in classification tasks. Therefore, the Relief algorithm was applied for attribute selection owing to its capability of identifying discriminatory information. The Relief algorithm recognizes conditional dependencies and correlations among the attributes or predictor variables and the lithofacies (Jia et al. 2013; Farshad and Sadeh 2014; Urbanowicz et al. 2017). It assigns a weight and a rank to every predictor variable depending upon its relevance for the pattern recognition of lithofacies. Figure 6 shows the input well logs with predictor importance weights plotted on the y-axis and ranks on the x-axis. The well logs having negative weights were removed as they made no contribution to the pattern recognition process. NPOR, GR, DPOR, SP, MI, MN, SPOR, DT and RILD were the important well logs contributing to the identification and recognition of mudstone lithofacies, as shown in Fig. 6.
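The weighting idea behind Relief can be sketched in a few lines of NumPy. The sketch implements the basic single-neighbor form, which uses one nearest hit and one nearest miss per sample. It is not the multiclass ReliefF variant, and the study does not state which implementation was used. The function name and synthetic data are hypothetical.

```python
import numpy as np

def relief_weights(X, y):
    """Basic Relief: a feature gains weight when it differs on the nearest
    sample of another class (miss) and loses weight when it differs on the
    nearest sample of the same class (hit). X is assumed scaled to [0, 1],
    and every class is assumed to contain at least two samples."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    for i in range(n_samples):
        dist = np.abs(X - X[i]).sum(axis=1)  # Manhattan distance to sample i
        dist[i] = np.inf                     # exclude the sample itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))
        miss = np.argmin(np.where(~same, dist, np.inf))
        weights += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return weights / n_samples

# Synthetic check: feature 0 tracks the class label, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=150)
informative = y / 2.0 + rng.normal(0.0, 0.05, size=150)
noise = rng.uniform(0.0, 1.0, size=150)
X = np.column_stack([informative, noise])
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

w = relief_weights(X, y)
print("Relief weights:", w)
```

Features whose weights are close to zero or negative carry little discriminatory information, which corresponds to the criterion used above for discarding well logs.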

Fig. 6 Available well logs arranged according to their predictor importance weights assigned by Relief algorithm for pattern recognition of lithofacies

Data partition

The processed input data were further divided into training and testing sets using a cross-validation technique. Three cross-validation techniques, namely K-fold, leave-one-out and hold-out, are popular in the machine learning domain for the generation of training and testing datasets. The K-fold cross-validation (K-FCV) technique with K = 10 was utilized in this research work for splitting the processed input data into training and testing subsets. The tenfold cross-validation (10-FCV) technique has been reported to have the minimum variance error as compared to other cross-validation techniques (Kohavi 1995). Cross-validation helps to minimize the chances of overfitting and underfitting of models (Xie et al. 2018). The input well logs data were randomly divided into K subsets during K-FCV. (K − 1) subsets were used for training the intelligent models and the Kth subset for testing them. This was repeated until each subset had been selected once as the testing set. The final accuracy of every machine learning model was obtained by averaging the accuracies of the K iterations.
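A minimal Scikit-learn sketch of the tenfold procedure is given below. The synthetic data and the untuned random forest are stand-ins chosen for illustration, not the data or settings of this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the processed well-log data (assumption).
X, y = make_classification(n_samples=500, n_features=9, n_informative=6,
                           n_classes=3, random_state=0)

# Randomly divide the samples into K = 10 subsets; each subset serves once
# as the testing set while the other nine are used for training.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

mean_accuracy = scores.mean()  # final accuracy: average over the 10 folds
print(f"10-fold accuracy: {mean_accuracy:.3f} (std {scores.std():.3f})")
```

For strongly imbalanced classes, `StratifiedKFold` could be substituted for `KFold` so that every fold preserves the class proportions; whether stratification was used in this study is not stated.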

Optimization of model parameters

The optimum value of every model parameter must be determined during the training phase so that models generalize to unseen data samples. The grid search algorithm, one of the popular tuning paradigms in the petroleum domain (Tewari et al. 2019a), was utilized for tuning model parameters to achieve maximum classification accuracy on unseen test data samples. Models with optimally tuned parameters were saved to classify unseen test data samples and were also evaluated for their generalizability and reliability. Machine learning models can always become overfitted or underfitted during pattern recognition. A separate validation score test was therefore conducted to examine the overfitting and underfitting tendency of the intelligent models. The validation curve was utilized to narrow the search range for various parameters. It clearly illustrates the overfitting and underfitting regions of the respective classifiers as a specific parameter varies. In an underfitting state, both training and validation scores are low, whereas an overfitting state results in a high training score and a low validation score. The parameter search range is bounded by the lower and upper limits of a stable region, in which no dramatic variation in training and validation scores takes place. However, the model still needs an optimization algorithm that explores the stable search range to find the best possible values of the model parameters. The search range and optimum values for various model parameters are given in Table 3. Figures 7 and 8 show the validation curves of GB and RF classifiers for four important parameters, namely Estimators, Min_samples_split, Max_depth and Min_samples_leaf. Figure 9a, b shows the validation curves of SVM for the regularization constant (C) and gamma (γ) versus accuracy score.
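The two-stage tuning described above, a validation curve followed by a grid search, can be sketched with Scikit-learn as follows. The synthetic data, the candidate parameter values and the 0.02 tolerance used to define a "stable" range are illustrative assumptions and do not reproduce the ranges of Table 3.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, validation_curve

# Synthetic stand-in for the processed well-log data (assumption).
X, y = make_classification(n_samples=400, n_features=9, n_informative=6,
                           n_classes=3, random_state=0)

# Step 1: validation curve for one parameter (maximum tree depth).
depths = [1, 2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=30, random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5, scoring="accuracy")
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)

# Heuristic (assumption): keep depths whose mean validation score lies
# within 0.02 of the best one as the stable search range.
stable = [d for d, v in zip(depths, mean_val) if v >= mean_val.max() - 0.02]

# Step 2: exhaustive grid search restricted to the stable range.
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid={"max_depth": stable, "min_samples_leaf": [1, 2, 4]},
    cv=5, scoring="accuracy")
grid.fit(X, y)
best_params = grid.best_params_
print("stable depths:", stable)
print("best parameters:", best_params)
```

A large gap between `mean_train` and `mean_val` at a given depth indicates overfitting, while low values of both indicate underfitting.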

Table 3 The search range and optimized values of model parameters obtained through grid search algorithm on input well logs data

Fig. 7 Validation curves of gradient boosting classifier to identify stable search range for four primary model variables. (a) Number of estimators, (b) learning rate, (c) minimum samples required at a leaf node and (d) minimum samples required for splitting the internal node

Fig. 8 Validation curves for RF classifier to identify stable search range for four primary model parameters. (a) Number of estimators, (b) maximum depth of tree, (c) minimum samples required for splitting the internal node and (d) minimum samples required at the leaf node

Fig. 9 Validation curves for SVM classifier to identify stable search range for two primary model parameters. (a) Penalty cost parameter for misclassified error samples and (b) kernel coefficient of RBF

The optimal settings of MLP parameters were determined through several computational trials to obtain the maximum possible classification accuracy. The speed of convergence and the training loss of the MLP determine its classification performance during the training and testing phases. Figure 10 compares several learning strategies available for training the MLP classifier using plots of training loss versus iterations. The initial learning rate of the MLP was set at 0.001 for lithofacies prediction. The MLP utilized the Adam solver for weight optimization because of its fast convergence and low training loss for large data, as depicted in Fig. 10. The MLP updates its parameters iteratively during training using the partial derivatives of the loss function with respect to those parameters. The cross-entropy loss function was utilized to measure the error between the predicted class probabilities and the true lithofacies labels.
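A comparison of training strategies of the kind shown in Fig. 10 can be sketched with Scikit-learn's `MLPClassifier`, which records the training loss per iteration in `loss_curve_` for the `adam` and `sgd` solvers. The synthetic data, the single hidden layer of 32 units and the iteration limit are assumptions. Only the initial learning rate of 0.001 is taken from the text.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the processed well-log data (assumption).
X, y = make_classification(n_samples=500, n_features=9, n_informative=6,
                           n_classes=3, random_state=0)

loss_curves = {}
for solver in ("adam", "sgd"):
    mlp = MLPClassifier(hidden_layer_sizes=(32,),   # assumed architecture
                        solver=solver,
                        learning_rate_init=0.001,   # value stated in the text
                        max_iter=300, random_state=0)
    mlp.fit(X, y)  # may warn if max_iter is reached before convergence
    loss_curves[solver] = mlp.loss_curve_  # cross-entropy loss per iteration

for solver, curve in loss_curves.items():
    print(f"{solver}: {len(curve)} iterations, final loss {curve[-1]:.4f}")
```

Plotting each entry of `loss_curves` against the iteration index gives curves comparable in form to those of Fig. 10.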

Fig. 10 Comparison among diverse learning strategies for MLP classifier for prediction of lithofacies

Performance evaluation

The performance of the optimally tuned intelligent models was evaluated on testing data using five statistical performance indicators, namely recall, precision, f1-score, accuracy and the Matthews correlation coefficient (MCC). Accuracy was used to evaluate the overall classification performance of each classifier for the recognition of lithofacies. This indicator is recommended only for balanced data and becomes unreliable with an uneven class distribution. Precision and recall also act as performance metrics for classifiers, and a good classifier should maximize both. The f1-score is the harmonic mean of precision and recall and is mostly used in information retrieval domains. However, for highly imbalanced data, these performance indicators may give misleading results. Therefore, the test performance of each classification model is also evaluated using the MCC, which is considerably less sensitive to data imbalance. The performance indicators used for the evaluation of machine learning models are given below.

$${\text{Precision = }}\frac{{{\text{True}}\,{\text{Positive}}\, ( {\text{TP)}}}}{{{\text{True}}\,{\text{Positive}}\, ( {\text{TP)}} + {\text{False}}\,{\text{Positive}}\, ( {\text{FP)}}}}$$
(2)
$${\text{Recall = }}\frac{{{\text{True}}\;{\text{Positive}}\, ( {\text{TP)}}}}{{{\text{True}}\,{\text{Positive}}\,({\text{TP}}) + {\text{False}}\,{\text{Negative}}\, ( {\text{FN)}}}}$$
(3)
$$f 1 {\text{-score}} = 2 \times \frac{{{\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}}$$
(4)
$${\text{Accuracy}} = \frac{{{\text{Correctly}}\;{\text{classified}}\;{\text{data}}\;{\text{samples}}}}{{{\text{Total}}\;{\text{number}}\;{\text{of}}\;{\text{data}}\;{\text{samples}}}}$$
(5)
$${\text{MCC}} = \frac{{{\text{SC}} \times {\text{TS}} - \sum\nolimits_{k}^{K} {{\text{PC}}_{k} \times {\text{TC}}_{k} } }}{{\sqrt {\left( {{\text{TS}}^{2} - \sum\nolimits_{k}^{K} {{\text{PC}}_{k}^{2} } } \right) \times \left( {{\text{TS}}^{2} - \sum\nolimits_{k}^{K} {{\text{TC}}_{k}^{2} } } \right)} }}$$
(6)

where TP is the number of samples of the target lithofacies that are correctly classified, FP is the number of samples of other lithofacies that are incorrectly classified as the target lithofacies, and FN is the number of samples of the target lithofacies that are incorrectly classified as other lithofacies. In MCC, $\text{TC}_k$ is the number of times class k truly occurred, SC is the number of correctly classified data samples, TS is the total number of data samples, $\text{PC}_k$ is the number of times class k was predicted, and K is the number of classes.
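Equations (2)–(6) can all be evaluated from a multiclass confusion matrix, as in the following sketch. The function name and the 3 × 3 matrix are hypothetical; rows are taken as true classes and columns as predicted classes.

```python
import numpy as np

def multiclass_metrics(conf):
    """Evaluate Eqs. (2)-(6) from a confusion matrix in which conf[i, j]
    counts samples of true class i predicted as class j. Every class is
    assumed to be predicted at least once, so no division by zero occurs."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp          # other classes predicted as class k
    fn = conf.sum(axis=1) - tp          # class k predicted as other classes
    precision = tp / (tp + fp)          # Eq. (2), per class
    recall = tp / (tp + fn)             # Eq. (3), per class
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (4)
    ts = conf.sum()                     # TS: total number of samples
    sc = tp.sum()                       # SC: correctly classified samples
    pc = conf.sum(axis=0)               # PC_k: times class k was predicted
    tc = conf.sum(axis=1)               # TC_k: times class k truly occurred
    accuracy = sc / ts                  # Eq. (5)
    mcc = (sc * ts - (pc * tc).sum()) / np.sqrt(
        (ts ** 2 - (pc ** 2).sum()) * (ts ** 2 - (tc ** 2).sum()))  # Eq. (6)
    return precision, recall, f1, accuracy, mcc

# Hypothetical confusion matrix for three lithofacies with 10 samples each.
conf = np.array([[8, 2, 0],
                 [1, 6, 3],
                 [0, 1, 9]])
precision, recall, f1, accuracy, mcc = multiclass_metrics(conf)
print("precision:", precision.round(3))
print("recall:   ", recall.round(3))
print("f1-score: ", f1.round(3))
print(f"accuracy: {accuracy:.3f}, MCC: {mcc:.3f}")
```

In this example, 23 of the 30 samples lie on the diagonal, so the accuracy is 23/30 ≈ 0.767, while the MCC of about 0.653 also accounts for how the errors are distributed across classes.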

Results and discussion

This section discusses the experimental results obtained during the recognition of nine mudstone lithofacies belonging to Kansas oil and gas fields. The performance of stacking and voting ensembles was compared with four popular classifiers, namely GB (Friedman 2001), RF (Ho 1995), SVM (Cortes and Vapnik 1995) and MLP (Windeatt 2006). Stacking and voting are the two HEMs that were implemented to predict the complex lithofacies. Figure 4 depicts a generalized conceptual workflow for HEMs to predict lithofacies of the formations. The performance of HEMs was tested in two separate data-driven experiments. In the first experiment, 10-FCV was performed to split the input data samples into training and testing subsets so that generalized prediction outcomes could be obtained. The performance of each classifier is reported in the form of precision, recall and f1-score for individual lithofacies. Tables 4, 5 and 6 show the precision, recall and f1-score acquired by HEMs and base classifiers for each lithofacies during 10-FCV. Overall, the classification performance of stacking is higher than that of all the other classifiers considered in this study. The voting ensemble ranks second in overall classification performance, as shown in Tables 4, 5 and 6. The GB and RF classifiers give similar performance scores for the identification of mudstone lithofacies, as shown in Table 5. The SVM classifier also maintains good classification performance during 10-FCV for all the lithofacies. MLP is the worst-performing classifier in terms of average precision, average recall and average f1-score, as shown in Table 6. The performances of voting, GB, RF and MLP also fluctuate for the smaller classes, namely SS and WS. However, the stacking and SVM classifiers maintain their performance even for smaller classes, as shown in Tables 4 and 6.
Smaller classes contribute fewer data samples during training and testing of machine learning models. These classes also represent facies with thin layers that are difficult to identify using conventional well log interpretation techniques. WS and SS facies were intentionally included with fewer data samples to magnify the data imbalance conditions that make classification harder even for strong classifiers such as GB, RF and voting. The voting and stacking ensembles use the same base classifiers for the classification of facies; however, stacking performed better than voting, which is attributed to the meta-classifier that learns how to combine the outcomes of the base classifiers.
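The difference between the two ensembles can be sketched with scikit-learn, which is assumed here purely for illustration; the text does not state which software was used. The synthetic data set, the hyperparameter values, hard voting and the logistic-regression meta-classifier are all assumptions of this sketch and do not reproduce the tuned models or the Kansas data of this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for well-log features: 3 imbalanced classes (NOT the study data)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0,
)

# The same four base classifiers feed both ensembles (illustrative settings)
base = [
    ("gb", GradientBoostingClassifier(n_estimators=30, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=30, random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC())),
    ("mlp", make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0),
    )),
]

# Voting: the predicted class is the majority vote of the base classifiers
voting = VotingClassifier(estimators=base, voting="hard")

# Stacking: a meta-classifier is trained on out-of-fold base-classifier outputs
stacking = StackingClassifier(
    estimators=base,
    final_estimator=LogisticRegression(max_iter=1000),  # assumed meta-classifier
    cv=5,
)

# Stratified tenfold cross-validation keeps class proportions in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="f1_macro").mean()
    for name, model in [("voting", voting), ("stacking", stacking)]
}
for name, score in scores.items():
    print(f"{name}: mean macro f1 = {score:.3f}")
```

Macro-averaged f1 is used as the score so that the small class weighs as much as the large ones; which ensemble scores higher on such synthetic data is not guaranteed and says nothing about the results in Tables 4 to 6.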

Table 4 The performance of HEMs after tenfold cross-validation for lithofacies classification
Table 5 The performance of GB classifier and RF ensembles after tenfold cross-validation for lithofacies classification
Table 6 The performance of SVM and MLP classifiers after tenfold cross-validation for lithofacies classification

In the second experiment, a separate test was performed with randomly selected training and testing data samples without 10-FCV. Table 7 depicts the overall performance of every classifier utilized in this study with the processed input data split into a training subset (80%) and a testing subset (20%). Figure 11 illustrates the confusion matrices of HEMs and their base classifiers generated during the testing phase; the testing accuracy for individual lithofacies appears on the diagonal of each matrix. Training and testing classification accuracies of HEMs are higher than those of all other machine learning models utilized in this study, as shown in Table 7. Subsurface layers naturally occur inside formations with uneven thickness and patterns. Therefore, an uneven data distribution was deliberately chosen to represent real-field conditions and to understand the worst- to best-case performance of machine learning classifiers for individual layers under data imbalance. Facies with fewer data points, such as WS and SS, were included to magnify the data imbalance effects. The stacking ensemble has shown great potential to extract lithofacies information from well log data even for smaller classes due to the presence of a meta-classifier in its architecture. It scored 83% accuracy for WS and 94% for SS, which are challenging smaller lithofacies. Layerwise classification accuracy of HEMs along with their base classifiers can be summarized as follows: (a) stacking (67.9–95.8%), (b) voting (58.3–94.1%), (c) GB (58.3–94.1%), (d) RF (41.7–94.6%), (e) SVM (58.3–94.1%) and (f) MLP (0.0–88.7%). For highly imbalanced data, performance indicators (viz. accuracy, precision, recall and f1-score) may give misleading results. Therefore, the testing performance of each classification model is also evaluated using the MCC, which is less sensitive to data imbalance, as shown in Table 7. The MCC scores of the applied models are consistent with their performance on the other indicators. DP has emerged as one of the most challenging subsurface rock layers during the testing phase. In this study, all the classifiers identified data samples related to DP mostly as CM. It is possible that the presence of calcareous mud inside DP led the base classifiers to confuse it with CM. This uncertainty may be reduced by increasing the number of training data samples, which would help in learning discriminating features between similar layers (Fig. 11).
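Reading layerwise accuracy from the diagonal of a confusion matrix, and finding which facies a layer is most often mistaken for, can be sketched as follows. The facies labels are taken from the text, but the counts are invented for illustration and the function name is hypothetical; none of it reproduces the confusion matrices of this study.

```python
def per_class_report(cm, labels):
    """Return {label: (accuracy, most_confused_with)} for each true class.

    cm[i][j] is the number of samples of true class labels[i] that were
    predicted as labels[j]. Per-class accuracy is the diagonal entry
    divided by the row total.
    """
    report = {}
    n = len(labels)
    for i, label in enumerate(labels):
        row_total = sum(cm[i])
        acc = cm[i][i] / row_total if row_total else 0.0
        # Off-diagonal column with the largest count is the dominant confusion
        others = [j for j in range(n) if j != i]
        worst = max(others, key=lambda j: cm[i][j])
        confused_with = labels[worst] if cm[i][worst] > 0 else None
        report[label] = (acc, confused_with)
    return report


# Hypothetical counts (rows: true facies, columns: predicted facies)
labels = ["CM", "DP", "WS"]
toy_cm = [
    [90, 8, 2],
    [14, 24, 2],
    [1, 1, 10],
]

report = per_class_report(toy_cm, labels)
for label, (acc, confused_with) in report.items():
    print(f"{label}: accuracy = {acc:.1%}, most often confused with {confused_with}")
```

In this made-up matrix most misclassified DP samples fall in the CM column, which is the pattern of confusion described above for the DP layer.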

Table 7 Classification accuracy of six machine learning models depicted on training (80%) and testing (20%) datasets for lithofacies classification
Fig. 11 Confusion matrices for different classifiers using 20% of input data as a testing dataset to predict different lithofacies

Conclusions

A rigorous facieswise comparison has been made between stacking and voting ensembles for the detection and identification of lithofacies. Stacking has shown nearly 4% and 2% improvement in test accuracy compared to SVM and RF, respectively. Four popular machine learning algorithms, namely MLP, SVM, GB and RF, have been combined in HEMs as base classifiers to achieve more accurate and generalized results than their individual performances. The individual performance of these classifiers has been evaluated using Kansas oil and gas field data, with parameters optimized within their stable search ranges. Validation curves and the grid search algorithm have been used to tune the model parameters for maximum classification accuracy. The research work carried out in this paper has led to the following conclusions.

  • The performance of HEMs depends upon the selection of efficient base classifiers for the quantitative lithofacies modeling.

  • The validation curve has been found to be an efficient tool for identifying a stable search range for machine learning parameters.

  • Stacking ensemble has shown great potential to extract lithofacies information from well logs data.

  • The training and testing classification accuracies of HEMs have been found to be the highest among all the classifiers used in this study.

  • The DP layer is found to be the most challenging facies among the nine target lithofacies. The stacking ensemble has given the highest individual identification accuracy for all the layers of lithofacies.

  • For the stacking ensemble, the prediction accuracy of individual facies ranges from 67.9 to 95.8% (worst to best testing accuracy), and it attains the maximum overall accuracy (training = 92.78% and testing = 88.32%).

In this study, HEMs have shown their potential for quantitative lithofacies modeling and have outperformed the other classifiers. The results obtained in this study indicate that a combination of diverse base classifiers can lead to higher accuracy and better model generalization. The analysis of results reveals that HEMs are practical models that improve classification accuracy for lithofacies identification compared with the individual base classifiers.