A comparative study of heterogeneous ensemble methods for the identification of geological lithofacies

Mudstone reservoirs demand accurate information about subsurface lithofacies for field development and production. Normally, quantitative lithofacies modeling is performed using well logs data to identify subsurface lithofacies. Well logs data, recorded from these unconventional mudstone formations, are complex in nature. Therefore, identification of lithofacies using conventional interpretation techniques is a challenging task. Several data-driven machine learning models have been proposed in the literature to recognize mudstone lithofacies. Recently, heterogeneous ensemble methods (HEMs) have emerged as robust, reliable and accurate intelligent techniques for solving pattern recognition problems. In this paper, two HEMs, namely voting and stacking ensembles, have been applied for the quantitative modeling of mudstone lithofacies using Kansas oil-field data. The prediction performance of HEMs is also compared with four state-of-the-art classifiers, namely support vector machine, multilayer perceptron, gradient boosting, and random forest. Moreover, the contribution of each well log to the prediction performance of classifiers has been analyzed using the Relief algorithm. Further, validation curve and grid search techniques have also been applied to obtain valid search ranges and optimum values for HEM parameters. The comparison of the test results confirms the superiority of the stacking ensemble over all the other paradigms applied in the paper for lithofacies modeling. This research work is specially designed to evaluate worst- to best-case scenarios in lithofacies modeling. Prediction accuracy of individual facies has also been determined, and the maximum overall prediction accuracy is obtained using the stacking ensemble.


Introduction
Mudstones are widely occurring siliciclastic sedimentary rocks that act as source, cap and reservoir rocks for hydrocarbon systems (Aplin and Macquaker 2011). They contain hidden organic-rich sweet spots and shale gas reservoirs which are favorable for petroleum production. Mudstone reservoirs are unconventional complex geological systems and pose challenges to conventional lithofacies interpretation techniques. Extraction of hydrocarbons from these unconventional resources requires accurate information on formation lithofacies, their association with petrophysical properties of the reservoir rock and their spatial distribution (Spain et al. 2015). Conventionally, qualitative analysis is performed to recognize subsurface mudstone facies using core analysis, geomechanical spectroscopy logs, Rock-Eval pyrolysis, etc. (Bhattacharya et al. 2016). The conventional methodology is inconvenient, tedious and expensive, and requires considerable domain expertise.
Recognition of subsurface lithofacies is a much-researched topic and still a thought-provoking problem due to the uncertainty associated with reservoir measurements (Chaki et al. 2015; Bhattacharya et al. 2016). Quantitative modeling of lithofacies is essential to assess the potential of unconventional hydrocarbon reservoirs lying in mudstone formations. It also helps to understand the diagenetic and depositional burial history of these reservoirs (Aplin and Macquaker 2011). Quantitative information about underlying mudstone lithofacies can be extracted from conventional well logs, which record the physical properties of rocks along the reservoir depth. Several logging techniques such as wireline, logging-while-drilling, measurement-while-logging, etc., are utilized to generate wide varieties of well logs for measuring petrophysical properties of reservoir rock (Anifowose et al. 2015, 2017). Various researchers have reported that logging data are nonlinear, high dimensional, complex and noisy in nature due to the heterogeneity and spatial distribution of reservoir properties (Kocberber and Collins 1990; Chaki et al. 2015; Bhattacharya et al. 2016; Tewari and Dwivedi 2018a). Therefore, manual identification of geological lithofacies from sensory well logs is an impractical and tedious job even for expert field engineers. Thus, advanced predictive machine learning models are suggested for extracting the lithofacies information from well logs.
Several machine learning models have been proposed to extract the facies information of conventional reservoirs using well logs data. However, only a few research works are available for unconventional mudstone reservoirs. Machine learning paradigms utilized for quantitative lithofacies modeling of mudstone lithology are limited to unsupervised and supervised classifiers (Qi and Carr 2006; Ma 2011; Wang and Carr 2012; Anifowose et al. 2015; Bhattacharya et al. 2016). Aplin and Macquaker (2011) published a comprehensive review of mudstone lithology. They also studied the roles played by mudstone, viz. as an organic-rich source rock, cap rock and reservoir rock, from generation to storage of hydrocarbon. Li and Schieber (2017) conducted a detailed study of mudstone facies of the Henry Mountain Region of Utah. Qi and Carr (2006) employed an artificial neural network (ANN) model for the identification of carbonate lithofacies existing in Southwest Kansas from well logs data. Wang and Carr (2012) applied discriminant analysis, ANN, support vector machine (SVM) and fuzzy logic techniques for lithofacies modeling of the Appalachian basin in the USA. They utilized core and seismic data along with well logs to develop a 3-D model of shale facies at the regional scale. Avanzini et al. (2016) implemented unsupervised cluster analysis for lithofacies classification to identify hidden productive sweet spots in the Barnett Shale formation. Bhattacharya et al. (2016) compared the performance of unsupervised and supervised machine learning models for mudstone facies present in Mahantango-Marcellus and Bakken Shale, USA. Bhattacharya et al. (2019) applied SVM for the identification of shale lithofacies of the Bakken formation existing in North Dakota, USA. Table 1 contains a summary of important published research works related to mudstone lithofacies classification utilizing machine learning techniques.
All the above-mentioned research works are based on single supervised or unsupervised classifiers. However, it has been shown that the performance of single classifiers can be improved using hybrid computational models such as multiple classifier systems, committees of machines, composite systems, etc. (Dietterich 2000; Skurichina and Duin 2001). Multiple classifier systems, such as ensemble methods, can extract more valuable information from raw sensory data. They combine the decisions of several classifiers for classification and regression tasks. The ensemble approach can be categorized into two types: (a) homogeneous ensemble methods (HoEMs) such as bagging, random forest, rotation forest, random subspace, etc., and (b) heterogeneous ensemble methods (HEMs) such as voting, stacking, etc. HoEMs combine several hypotheses generated in feature space by identical types of supervised classifiers which are utilized as base classifiers (e.g., a cluster of hundreds of SVMs). In the case of HEMs, different classifiers are utilized to generate and combine diverse hypotheses to achieve the maximum possible prediction accuracy for the existing feature space. It has been shown that heterogeneity in base classifiers helps to develop more reliable, robust and generalized classifier models (Sesmero et al. 2015). Ensemble methods have the capability to handle complex, nonlinear, multidimensional and imbalanced petroleum data (Tewari and Dwivedi 2018b; Anifowose et al. 2015; Dietterich 2000; Skurichina and Duin 2001). However, ensemble methods have not been properly explored and applied in the petroleum research domain. Only limited applications of the ensemble approach can be found in the petroleum domain, as explained briefly in the next section of the paper.
The performance of HEMs, namely voting and stacked generalization ensembles, along with four other popular classifiers, has been studied for the recognition of mudstone lithofacies in the paper. This research work is primarily focused on the construction of diverse base classifiers contained in HEMs that can actually outperform single classifiers and HoEMs for lithofacies identification. A comparative study has been performed among HEMs and also with four contemporary classifiers for the prediction of lithofacies. The suitability of HEMs has also been evaluated for the identification of lithofacies. The complications associated with the development of HEMs have been discussed in this research work. Four popular supervised classifiers, viz. multilayer perceptron (MLP), SVM, gradient boosting (GB) and random forest (RF), have been combined in HEMs as base classifiers to provide more accurate and generalized results.
The performance of these classifiers has been evaluated using Kansas oil-field data with proper parameter optimization in their stable search ranges. Validation curve and grid search algorithms have been applied for parameter tuning to achieve maximum classification accuracy. The contribution of each input well log to the pattern recognition of mudstone lithofacies has been studied using the Relief algorithm. The Relief algorithm is utilized for attribute selection owing to its capability of identifying discriminatory information. Overall, this research work assesses the pattern recognition competency of HEMs for complex mudstone lithofacies.

Heterogeneous ensemble methods
Heterogeneous ensembles are lesser-known intelligent algorithms in the petroleum domain. They generate several different hypotheses in feature space using diverse base classifiers and combine them to achieve the most accurate results possible. Sesmero et al. (2015) showed that diverse classification hypotheses in feature space are essential for the development of reliable, robust and generalized ensemble classifiers. Thus, HEMs are investigated in this paper for solving the multiclass lithofacies recognition problem in the quest for higher classification accuracy. Extraction of valuable lithofacies information from well logs data is quite a challenging task even for intelligent HoEMs. Spatial distribution and heterogeneous behavior of the hydrocarbon reservoir properties contribute to complexity, nonlinearity and uncertainty in all types of sensor-based measurements (Chaki et al. 2015; Bhattacharya et al. 2016). Also, no standard tools or techniques are currently available that can measure the reservoir heterogeneity and its influence on other reservoir properties, well logs, drilling data, etc. Therefore, HEMs are employed for recognition of subsurface lithofacies in this paper.

Related works
Ensemble methods are a rarely applied methodology in the oil and gas industry. In the petroleum domain, HoEMs are mostly implemented for drilling parameter estimation and reservoir characterization. Only a few applications of HEMs are found in the existing literature. Santos et al. (2003) utilized a neural net ensemble for the recognition of underlying lithofacies. Gifford and Agad (2010) applied collaborative multiagent classification techniques for the identification of lithofacies. Masoudi et al. (2012) integrated the outputs of Bayesian and fuzzy classifiers to recognize productive zones in the Sarvak Formation. Anifowose et al. (2015) utilized the stacked generalization ensemble for enhancing the prediction capability of supervised learners for reservoir characterization. Anifowose et al. (2017) wrote a review on the applications of ensemble methods and suggested that ensemble methods are suitable for solving problems of the oil and gas industry. Bestagini et al. (2017) implemented the random forest ensemble for lithofacies classification of Kansas oilfield data. Xie et al. (2018) published a comparative study on the performance of HoEMs for the recognition of lithofacies. Tewari and Dwivedi (2018b) compared five HoEMs for the identification of lithofacies. Bhattacharya et al. (2019) applied ANN and random forest algorithms to predict daily gas production from an unconventional reservoir. HoEMs have also been applied for the prediction of reservoir recovery factors. HEMs have given higher classification performance than supervised classifiers as well as HoEMs in several engineering fields such as remote sensing (Healey et al. 2018), prostate cancer detection, load forecasting (Ribeiro et al. 2019), wind speed forecasting (Liu and Chen 2019), etc. Therefore, HEMs are investigated in this paper to identify subsurface lithofacies. Two types of HEMs utilized for the identification of lithofacies are briefly explained below.

Stacked generalization ensemble
Stacked generalization ensemble is popularly known as stacking (Wolpert 1992). It combines the decisions of different base classifiers in a single ensemble architecture. Different base classifiers search the feature space from their diverse perspectives to find the most accurate hypotheses possible for a given classification task (Anifowose et al. 2015). The classification outcomes of the base classifiers are combined by a meta-classifier to provide the final classification result; the way in which the base classifiers' outcomes are combined is decided by the meta-classifier algorithm. Figure 1 shows the architecture of the stacking ensemble. A stacking ensemble can also be created by merging the decisions of similar base classifiers having different parameter values. The selection of the base and meta-classifier combination is always a matter of concern during the design of a stacking ensemble architecture. It is also difficult to design the most suitable configuration of classifiers in a large feature space. Wolpert (1992) showed that the stacking ensemble reduces the generalization error by decreasing the bias and variance errors associated with data. Initially, input data are split into training and testing datasets. Further, the training dataset is split into K equal subsets, as in the K-fold cross-validation technique. Base classifiers are trained on (K − 1) subsets, while the Kth subset is retained as a validation set. After training with (K − 1) subsets, base classifiers are individually tested with the Kth validation subset and also with the testing data. The outcomes of each base classifier on the validation and test datasets act as new training and testing data for the meta-classifier. The meta-classifier is thus trained with the prediction outcomes on the validation set and the actual values of the target variable.
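The stacking procedure described above can be sketched with scikit-learn's StackingClassifier. This is a minimal illustration under several assumptions, not the configuration used in this study: the data are synthetic stand-ins for well logs, the hyperparameters are arbitrary, and logistic regression is assumed as the meta-classifier because the text does not name one. Note also that scikit-learn trains the meta-classifier on out-of-fold predictions and then refits the base classifiers on the full training set, which may differ in detail from the procedure described here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for well-log predictors (assumption: not the Kansas data).
X, y = make_classification(n_samples=300, n_features=9, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Four heterogeneous base classifiers, as named in the paper; all settings
# below are illustrative assumptions.
base_classifiers = [
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ("mlp", make_pipeline(StandardScaler(),
                          MLPClassifier(max_iter=500, random_state=0))),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# The meta-classifier is fitted on K-fold (here K = 5) out-of-fold predictions
# of the base classifiers together with the true class labels.
stack = StackingClassifier(estimators=base_classifiers,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)
print(f"Stacking test accuracy: {stack_accuracy:.3f}")
```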

Voting ensemble
Voting ensemble combines the decisions of different base classifiers for the given classification or estimation task. It provides flexibility in combination strategies so that the maximum possible classification accuracy can be achieved. Unlike the stacking ensemble, it does not utilize any learning algorithm for the combination of predictions from base classifiers. Two combination schemes can be implemented for merging the decisions of several classifiers, namely the majority vote rule (hard voting) and the average of predicted confidence probabilities (soft voting), to predict the class labels of test samples (Kittler et al. 1998). In place of a meta-classifier, these combining strategies are utilized to combine the outcomes of diverse supervised classifiers. In hard voting, class labels of test samples are decided by the majority voting rule. Every base classifier individually assigns a class label to a given test sample during the testing phase. The final class of the test sample is the label assigned to it by the largest number of base classifiers. On the other hand, the soft voting strategy initially assigns a weight to each base classifier. During the testing phase, each base classifier generates prediction probabilities for every test sample belonging to the various classes. These probabilities are then multiplied by the weight assigned to the corresponding base classifier and averaged. Test samples are finally classified into the class which achieves the highest average confidence probability. Mathematically, the soft voting technique classifies data samples as the argmax (argument of the maximum) of the weighted sum of assigned probabilities (Kittler et al. 1998; Kuncheva 2004). Figure 2 shows the theoretical architecture of the voting ensemble utilized for quantitative lithofacies modeling in this study.
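The difference between the two combination rules can be illustrated with a small numerical sketch. The probabilities and weights below are hypothetical values chosen for illustration, not outputs of the classifiers used in this study.

```python
import numpy as np

# Hypothetical class probabilities from three base classifiers for two test
# samples and three classes; axes are (classifier, sample, class).
probabilities = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],   # classifier A
    [[0.4, 0.5, 0.1], [0.1, 0.6, 0.3]],   # classifier B
    [[0.3, 0.6, 0.1], [0.3, 0.2, 0.5]],   # classifier C
])
weights = np.array([2.0, 1.0, 1.0])       # assumed per-classifier weights

n_classifiers, n_samples, n_classes = probabilities.shape

# Hard voting: each classifier votes for its most probable class and the
# class with the most votes wins.
votes = probabilities.argmax(axis=2)      # shape (n_classifiers, n_samples)
hard_labels = np.array([
    np.bincount(votes[:, i], minlength=n_classes).argmax()
    for i in range(n_samples)
])

# Soft voting: weighted average of the probabilities, followed by argmax.
average_probabilities = np.average(probabilities, axis=0, weights=weights)
soft_labels = average_probabilities.argmax(axis=1)

print("hard voting:", hard_labels)        # [1 1]
print("soft voting:", soft_labels)        # [0 1]
```

For the first sample, two of the three classifiers vote for class 1, so hard voting selects class 1, whereas the confident and more heavily weighted classifier A shifts the weighted average toward class 0 under soft voting.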

Experimental evaluation
In this paper, two HEMs were utilized for quantitative lithofacies modeling. The primary goal of this research work was to achieve higher classification performance using the HEM approach. HEMs were trained and tested on real-field well logs data along with other popular classifiers, namely RF, MLP, SVM and GB. All the ensemble methods were implemented using the Python Scikit-learn library.

A brief description of Kansas field
The Kansas region is mainly composed of sedimentary rocks with a maximum width of 2850 m. A large number of unconformities occur in the Kansas region with the sedimentary strata having 15-50% of the post-Precambrian period (Merriam 1963). Northeastern Kansas is enclosed by Pleistocene glacial deposits. A thick layer of Mesozoic rock is present in the western Kansas region. Mesozoic rock layers are mainly made up of limestones, chalks, sandstone, marine shales and nonmarine shale contents. The Panoma and Hugoton fields, existing in western Kansas, comprise large natural gas-producing reservoirs. Pennsylvanian and Permian systems are the broadest structures of rock, containing bedded rock salts in several layers. The pre-Pennsylvanian system existing in Kansas contains dolomites, marine, limestones layered alternately with sandstones and shales. The Precambrian basement is composed mainly of quartzite, granite and schist. Permian strata contain the carbonate reservoirs that produce the majority of natural gas. In 1992, Mississippian strata produced 43% of cumulative hydrocarbon production of the Kansas field, out of which 19% contributed to cumulative oil production (Newell 1987a). The numerous unconformities in the Kansas region aid the trapping and migration of petroleum. The Basal Pennsylvanian in Kansas has a huge deposition of hydrocarbon along its length. A detailed description of the petroleum geology of the Kansas region can be found in Newell (1987a, b), Merriam (1963), Adler et al. (1971) and Jewett and Merriam (1959). Manual interpretation of well logs data of such a huge hydrocarbon-producing region is time-consuming and costly. Therefore, automatic detection and identification of subsurface lithofacies using machine learning algorithms is highly desirable to minimize cost and time.

Data description
The well logs data used for the development of ensemble methods were obtained from the Kansas Geological Survey (KGS) Web site (Kansas 2009), which hosts a very large well logs data repository. The downloaded digital well logs, "Las" files, contain 13,000 data samples, out of which 3425 samples are extracted belonging to nine different lithofacies, viz. dolomitic wackestone (DW) (1015), clay (CL) (320), dolomitic mudstone (DM) (240), dolomitic sandstone (DS) (455), siltstone (SS) (85), dolomitic packstone (DP) (265), carbonate mudstone (CM) (520), packstone (PS) (465) and wackestone (WS) (60). These lithofacies serve as class labels for the classification of well logs data into their respective lithofacies. The downloaded "Las" files belong to the Paradise A, Deforest and Strahm wells existing in the Kansas field. These files also contain information about mineral contents and lithofacies prevailing in these wells. The "Las File Viewer" app has been downloaded from the KGS Web site to visualize geological settings and facies of these wells. Table 2 contains the range of different well logs acting as input predictor variables. These logs and recorded information about different reservoir properties are used in the pattern recognition of lithofacies.

Data preprocessing
Figure 4 shows the generalized workflow for the HEMs to recognize the subsurface lithofacies. Diaz et al. (2018) suggested that the preprocessing of petroleum data, such as resampling, normalization, noise filtering, attribute selection, etc., helps to improve the classification or estimation accuracy of intelligent algorithms. Initially, resampling of well logs data was done to eliminate samples containing null, garbage and missing values. After resampling, the input data were normalized to reduce the impact of larger values on the smaller values of predictor variables. Input data can be normalized as given below.

$$X_{\text{norm}} = \frac{X - X_{\text{Min}}}{X_{\text{Max}} - X_{\text{Min}}} \qquad (1)$$

where X_Max and X_Min are the maximum and minimum values of the respective predictor variable.
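A minimal sketch of this normalization step is given below, assuming it is the standard min-max scaling applied independently to each predictor variable. The function name and the sample log values are hypothetical.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column (predictor variable) of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Guard against constant columns to avoid division by zero.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X - x_min) / span

# Hypothetical samples with two logs on very different scales.
logs = np.array([[45.0, 0.12],
                 [90.0, 0.30],
                 [150.0, 0.21]])
normalized_logs = min_max_normalize(logs)
print(normalized_logs)
```

After scaling, every log spans the same interval, so logs recorded in large units no longer dominate those recorded in small units.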

Noise filtering
Noise filtering of well logs data was done to minimize the effects of noise during the pattern recognition of lithofacies. Tewari and Dwivedi (2019) studied the influence of noise levels on the performance of supervised classifiers and reported its damaging effects on classifier performance. Several denoising techniques are available in the petroleum and geophysics literature, such as low-pass filter, high-pass filter, Savitzky-Golay, wavelet denoising, moving average, Gaussian, etc. The Savitzky-Golay (SG) smoothing filter has been found to be a suitable and widely utilized noise filtering technique for geophysical data (Baba et al. 2014). It is a digital smoothing filter which fits a polynomial of degree n by the linear least squares method and maintains signal tendency through convolution (Baba et al. 2014). High peaks or spikes in the well logs were treated as noise components and eliminated by smoothing the waveforms with SG filters. Figure 5 shows four important well logs with original and denoised waveforms. The degree of the polynomial fitted to the well logs for smoothing was found to vary from 5 to 13. Fifteen input well logs were smoothed using the SG filter and passed to the Relief algorithm for the selection of important well logs.
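The effect of SG smoothing can be sketched with SciPy's savgol_filter on a synthetic log. The signal, the noise level and the window length of 31 samples are assumptions made for illustration; only the polynomial degree of 5 is taken from the range reported above.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic stand-in for a well log: a smooth trend plus random noise.
rng = np.random.default_rng(0)
depth = np.linspace(0.0, 100.0, 501)
clean_log = 60.0 + 20.0 * np.sin(depth / 8.0)
noisy_log = clean_log + rng.normal(scale=5.0, size=depth.size)

# Fit a degree-5 polynomial in a sliding window of 31 samples
# (window_length must exceed polyorder; both values are assumptions here).
smoothed_log = savgol_filter(noisy_log, window_length=31, polyorder=5)

noisy_rmse = np.sqrt(np.mean((noisy_log - clean_log) ** 2))
smoothed_rmse = np.sqrt(np.mean((smoothed_log - clean_log) ** 2))
print(f"RMSE before smoothing: {noisy_rmse:.2f}")
print(f"RMSE after smoothing:  {smoothed_rmse:.2f}")
```

A wider window or a lower polynomial degree smooths more aggressively, at the risk of flattening thin-bed responses, so both settings would need to be chosen per log.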

Attribute selection
Important well logs were selected to decrease the dimensionality of data by removing the redundant logs. The high dimensionality of logs data increases computational cost. Overlapping lithofacies can only be identified by recognizing the discriminatory information contained in the attributes or logs. Discriminatory information plays a decisive role, especially in classification tasks. Therefore, the Relief algorithm was applied for attribute selection due to its capability of identifying discriminatory information. The Relief algorithm recognizes conditional dependencies and correlations among the attributes or predictor variables and lithofacies (Jia et al. 2013; Farshad and Sadeh 2014; Urbanowicz et al. 2017). It assigns a weight and a rank to every predictor variable depending upon its relevance for the pattern recognition of lithofacies. Figure 6 shows input well logs with predictor importance weights plotted on the y-axis and ranks on the x-axis. The well logs having negative weights were removed as they did not contribute to the pattern recognition process. NPOR, GR, DPOR, SP, MI, MN, SPOR, DT and RILD were the important well logs contributing to the identification and recognition of mudstone lithofacies, as shown in Fig. 6.
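The weighting idea behind Relief can be sketched as follows. This is a simplified version of the basic algorithm (one nearest hit and one nearest miss per sampled instance, Manhattan distance), written for illustration only; the study may have used a different variant such as ReliefF, and the function name and synthetic data are hypothetical.

```python
import numpy as np

def relief_weights(X, y, n_iterations=100, random_state=0):
    """Simplified Relief: reward features that differ on the nearest miss
    (other class) and penalize features that differ on the nearest hit."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Scale features to [0, 1] so per-feature differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    n_samples, n_features = Xs.shape
    weights = np.zeros(n_features)
    for _ in range(n_iterations):
        i = rng.integers(n_samples)
        diffs = np.abs(Xs - Xs[i])        # per-feature differences to sample i
        dist = diffs.sum(axis=1)          # Manhattan distance to sample i
        dist[i] = np.inf                  # exclude the sample itself
        same_class = (y == y[i])
        hit = np.argmin(np.where(same_class, dist, np.inf))
        miss = np.argmin(np.where(~same_class, dist, np.inf))
        weights += (diffs[miss] - diffs[hit]) / n_iterations
    return weights

# Synthetic data: only the first feature determines the class.
data_rng = np.random.default_rng(0)
X = data_rng.uniform(size=(200, 3))
y = (X[:, 0] > 0.5).astype(int)

weights = relief_weights(X, y, n_iterations=200)
ranking = np.argsort(weights)[::-1]       # most relevant feature first
print("weights:", np.round(weights, 3))
print("ranking:", ranking)
```

Features with weights near or below zero carry little discriminatory information in this scheme, which mirrors the removal of negatively weighted logs described above.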

Data partition
The processed input data were further divided into training sets and testing sets using a cross-validation technique. Three cross-validation techniques, namely K-fold, leave-one-out and hold-out, are popular in the machine learning domain for the generation of training and testing datasets. The K-fold cross-validation technique was utilized in this research work for splitting the processed input data into training and testing subsets (K = 10). The tenfold cross-validation (10-FCV) technique has been reported to have minimum variance error as compared to other cross-validation techniques (Kohavi 1995). Cross-validation helps to minimize the chances of overfitting and underfitting of models (Xie et al. 2018). The input well logs data were randomly divided into K subsets during K-FCV. (K − 1) subsets were used for training the intelligent models and the Kth subset for testing them. This was repeated until each subset had been selected once as the testing subset.
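The K-fold partition can be sketched with scikit-learn's KFold. The small array below is a placeholder for the processed well logs; the check at the end counts how often each sample lands in a testing subset.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: 20 samples with 2 predictor variables.
X = np.arange(40).reshape(20, 2)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
test_counts = np.zeros(len(X), dtype=int)
fold_sizes = []

for train_idx, test_idx in kfold.split(X):
    # In each iteration, nine subsets form the training set and one the test set.
    fold_sizes.append((len(train_idx), len(test_idx)))
    test_counts[test_idx] += 1

print("fold sizes (train, test):", fold_sizes[0])
print("times each sample was tested:", test_counts)
```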

Optimization of model parameters
The optimum value of every model parameter must be determined during the training phase so that models can generalize to unseen data samples. The grid search algorithm was utilized for tuning model parameters to achieve maximum classification accuracy on unseen test data samples. Models with optimally tuned parameters were saved to classify unseen test data samples and were also evaluated for their generalizability and reliability. The grid search algorithm is one of the popular tuning paradigms in the petroleum domain (Tewari et al. 2019a). Machine learning models always have the possibility of getting overfitted or underfitted during pattern recognition. A separate validation score test was conducted to examine the overfitting and underfitting tendency of intelligent models. The validation curve was utilized to shrink the search range for various parameters. It clearly illustrates the overfitting and underfitting regions of the respective classifiers as a specific parameter varies. In an underfitting state of the intelligent model, training and validation scores are both normally low, whereas overfitting states result in high training and low validation scores. The parameter search range is bounded by the upper and lower limits of a stable region, in which no dramatic variation in training and validation scores takes place. However, the model still needs an optimization algorithm that explores within the stable search range to find the best possible values of the model parameters.
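The two-stage tuning described above can be sketched with scikit-learn's validation_curve and GridSearchCV. The random forest, the synthetic data and the parameter ranges are assumptions chosen to keep the example small; they are not the search ranges used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, validation_curve

# Synthetic stand-in for the processed well logs.
X, y = make_classification(n_samples=200, n_features=9, n_informative=5,
                           random_state=0)

# Stage 1: validation curve for one parameter, used to locate a stable region
# where training and validation scores do not diverge sharply.
depth_range = [1, 2, 4, 8]
train_scores, valid_scores = validation_curve(
    RandomForestClassifier(n_estimators=20, random_state=0), X, y,
    param_name="max_depth", param_range=depth_range, cv=5)
mean_train = train_scores.mean(axis=1)
mean_valid = valid_scores.mean(axis=1)
for depth, tr, va in zip(depth_range, mean_train, mean_valid):
    print(f"max_depth={depth}: train={tr:.3f}, validation={va:.3f}")

# Stage 2: exhaustive grid search restricted to the (assumed) stable range.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 3]}
search = GridSearchCV(RandomForestClassifier(n_estimators=20, random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
best_params = search.best_params_
print("best parameters:", best_params)
```

A large gap between the training and validation columns for a given parameter value indicates overfitting, while low values in both columns indicate underfitting.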
The search range and optimum values for various model parameters are depicted in Table 3. Figures 7 and 8 show the validation curves of GB and RF classifiers for four important parameters, namely Estimators, Min_samples_split, Max_depth and Min_samples_leaf. Figure 9a, b shows the validation curves of SVM for the regularization constant (C) and gamma (γ) versus accuracy score. The optimal settings of MLP parameters were determined through several computational trials to obtain the maximum possible classification accuracy. The speed of convergence and the training loss of MLP decide its classification performance during training and testing phases. Figure 10 compares several learning strategies available for training the MLP classifier using training loss versus iterations plots. The initial learning rate of MLP was set at 0.001 for lithofacies prediction. MLP utilized the Adam solver for weight optimization because of its fast convergence and low training loss for large data, as depicted in Fig. 10. MLP updates its parameters iteratively during training using the partial derivatives of the loss function. The cross-entropy loss function was utilized to measure the mismatch between the predicted probabilities of data samples belonging to a particular class or lithofacies and their actual labels.
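The MLP settings stated above (Adam solver, initial learning rate of 0.001) can be sketched with scikit-learn's MLPClassifier, which minimizes the cross-entropy (log-loss) function. The hidden layer size, iteration limit and synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for normalized well-log predictors.
X, y = make_classification(n_samples=300, n_features=9, n_informative=6,
                           n_classes=3, random_state=0)
X = MinMaxScaler().fit_transform(X)

mlp = MLPClassifier(hidden_layer_sizes=(32,),    # assumed architecture
                    solver="adam",
                    learning_rate_init=0.001,
                    max_iter=300,                 # assumed iteration limit
                    random_state=0)
mlp.fit(X, y)

# loss_curve_ stores the training loss after each iteration, which is the
# quantity plotted in a training loss versus iterations graph.
loss_curve = mlp.loss_curve_
print(f"initial loss: {loss_curve[0]:.3f}, final loss: {loss_curve[-1]:.3f}")
```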

Performance evaluation
The performance of optimally tuned intelligent models was evaluated with testing data using five statistical performance indicators, namely recall, precision, f1-score, accuracy and the Matthews correlation coefficient (MCC). Accuracy was used to evaluate the overall classification performance of each classifier for the recognition of lithofacies. This indicator is recommended only for balanced data and becomes unreliable with uneven data distribution. Precision and recall also act as performance metrics for classifiers, and every classifier should maximize both for good classification results. The f1-score is the harmonic mean of precision and recall and is mostly used in information retrieval domains. However, in the case of data with high imbalance, these performance indicators may give misleading results. Therefore, the test performance of each classification model is also evaluated using the MCC, which is robust to data imbalance. The performance indicators used for the evaluation of machine learning models are given below.
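$$\text{Precision} = \frac{TP}{TP + FP} \qquad (2)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (3)$$

$$f1\text{-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (4)$$

$$\text{Accuracy} = \frac{\text{Correctly classified data samples}}{\text{Total number of data samples}} \qquad (5)$$

$$\text{MCC} = \frac{SC \times TS - \sum_{k} PC_k \times TC_k}{\sqrt{\left(TS^2 - \sum_{k} PC_k^2\right)\left(TS^2 - \sum_{k} TC_k^2\right)}} \qquad (6)$$

where TP (true positive) is the number of correctly classified data samples of the target lithofacies, FP (false positive) is the number of data samples of other lithofacies incorrectly classified as the target lithofacies, and FN (false negative) is the number of data samples of the target lithofacies incorrectly classified as other lithofacies. In MCC, TC_k is the number of times class k truly occurred, PC_k is the number of times class k was predicted, SC is the number of correctly classified data samples, and TS is the total number of data samples.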
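These indicators can be worked through on a small example. The labels below are hypothetical, and the MCC is computed in its multiclass form from the quantities SC, TS, TC_k and PC_k defined above.

```python
import numpy as np

# Hypothetical true and predicted labels for three lithofacies (0, 1, 2).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 0])
n_classes = 3

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

tp = np.diag(cm)                 # correctly classified samples per class
fp = cm.sum(axis=0) - tp         # predicted as the class but belonging elsewhere
fn = cm.sum(axis=1) - tp         # belonging to the class but predicted elsewhere

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()

# Multiclass MCC from the totals used in the text.
SC = tp.sum()                    # correctly classified samples
TS = cm.sum()                    # total number of samples
TC = cm.sum(axis=1)              # times each class truly occurred
PC = cm.sum(axis=0)              # times each class was predicted
mcc = (SC * TS - (PC * TC).sum()) / np.sqrt(
    (TS**2 - (PC**2).sum()) * (TS**2 - (TC**2).sum()))

print("precision:", np.round(precision, 3))
print("recall:   ", np.round(recall, 3))
print("f1-score: ", np.round(f1, 3))
print(f"accuracy = {accuracy:.2f}, MCC = {mcc:.3f}")
```

Here seven of ten samples are correct, giving an accuracy of 0.70, while the MCC is 36/66, or about 0.545, because it also accounts for how predictions are distributed across classes.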

Results and discussion
This section discusses the experimental results obtained during the recognition of nine mudstone lithofacies belonging to Kansas oil and gas fields. The performance of stacking and voting ensembles was compared with four popular classifiers, namely GB (Friedman 2001), RF (Ho 1995), SVM (Cortes and Vapnik 1995) and MLP (Windeatt 2006). Stacking and voting are two HEMs that were implemented to predict the complex lithofacies. Figure 4 depicts a generalized conceptual workflow for HEMs to predict lithofacies of the formations. The performance of HEMs was tested by two separate data-driven experiments for the prediction of lithofacies. In the first experiment, 10-FCV was performed to split the input data samples into training and testing subsets so that generalized prediction outcomes can be obtained. The performance of each classifier has been reported in the form of precision, recall and f1-score for individual lithofacies. Tables 4, 5 and 6 show precision, recall and f1-score acquired by HEMs and base classifiers for each lithofacies during 10-FCV. Overall, the classification performance of stacking has been found higher than all the other classifiers considered in this study. Voting ensemble has secured second place in terms of overall classification performance as shown in Tables 4, 5 and 6. GB and RF classifiers have given similar performance scores for the identification of mudstone lithofacies as shown in Table 5. SVM classifier has also maintained good classification performance during 10-FCV for all the lithofacies. MLP becomes the worst performing classifier in terms of evaluation metrics, viz. average precision, average recall and average f1-score, as shown in Table 6. It is also found that voting, GB, RF and MLP have fluctuations in their performances for smaller classes, namely SS and WS. However, stacking and SVM classifiers are successful in maintaining their performances even for smaller classes as shown in Tables 4 and 6. 
Smaller classes have contributed fewer data samples during training and testing of machine learning models. These classes also represent facies having thin layers that are difficult to identify using conventional well logs interpretation techniques. WS and SS facies are intentionally included with fewer data samples to magnify data imbalance conditions that make classification harder even for strong classifiers such as GB, RF, voting, etc. Voting and stacking ensembles have utilized the same base classifiers for the classification of facies; however, stacking performed better than voting due to the presence of a meta-classifier for combining the outcomes of base classifiers. In the second experiment, a separate test was performed with randomly selected training and testing data samples without 10-FCV. Table 7 depicts the overall performance of every classifier utilized in this study with processed input data split into a training subset (80%) and a testing subset (20%). The testing accuracy for individual lithofacies is depicted diagonally in confusion matrices. Figure 6 illustrates the confusion matrices of HEMs and their base classifiers generated during the testing phase. Training and testing classification accuracies of HEMs are found to be higher than those of all other machine learning models utilized in this study, as shown in Table 7. Naturally, subsurface layers exist inside the formations with uneven thickness and patterns. Therefore, uneven data distribution has been deliberately chosen to represent real-field conditions. This also provides an opportunity to understand the worst to best possible performance of machine learning classifiers for individual layers under imbalanced data conditions.
The stacking ensemble has shown great potential to extract lithofacies information from well-log data, even for smaller classes, owing to the meta-classifier in its architecture. It scored 83% accuracy for WS and 94% for SS, which are challenging smaller lithofacies. This research work is specially designed to evaluate worst- to best-case scenarios for lithofacies modeling. Layerwise classification accuracy of the HEMs along with their base classifiers can be summarized as follows:
Under highly imbalanced data conditions, performance indicators such as accuracy, precision, recall and f1-score may give misleading results. Therefore, the testing performance of each classification model was also evaluated using the MCC, which is far less sensitive to class imbalance, as shown in Table 7. The MCC scores of the applied models corroborate their performance. DP emerged as one of the most challenging subsurface rock layers during the testing phase: all the classifiers identified data samples belonging to DP mostly as CM. The presence of calcareous mud inside DP may have caused the base classifiers to confuse it with CM. This uncertainty may be reduced by increasing the number of training data samples, which would help the models learn discriminating features between similar layers (Fig. 11).
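The reason accuracy can mislead under imbalance, while the MCC does not, can be seen in a small worked example. The sketch below assumes the standard multiclass generalization of the MCC (the form that Scikit-learn's `matthews_corrcoef` also documents); the exact variant used for Table 7 is not stated here, and the function name and labels are illustrative only.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient.

    MCC = (c*s - sum_k p_k*t_k) / sqrt((s^2 - sum_k p_k^2)(s^2 - sum_k t_k^2))
    with s samples, c correct predictions, t_k true and p_k predicted
    counts of class k. Returns 0.0 when the denominator vanishes.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_counts = Counter(y_true)
    p_counts = Counter(y_pred)

    sum_pt = sum(p_counts[k] * t_counts[k] for k in set(t_counts) | set(p_counts))
    sum_pp = sum(n * n for n in p_counts.values())
    sum_tt = sum(n * n for n in t_counts.values())

    denominator = math.sqrt((s * s - sum_pp) * (s * s - sum_tt))
    if denominator == 0:
        return 0.0
    return (c * s - sum_pt) / denominator


# Made-up imbalanced labels: nine CM samples and a single WS sample.
y_true = ["CM"] * 9 + ["WS"]
y_pred = ["CM"] * 10  # a classifier that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
mcc = multiclass_mcc(y_true, y_pred)
print("accuracy:", accuracy)  # high despite never detecting WS
print("MCC:", mcc)            # zero: no better than a constant guess
```

A classifier that ignores the minority facies still reaches 90% accuracy here, whereas its MCC is zero, which is why the MCC is a useful cross-check on imbalanced facies data.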

Conclusions
A rigorous facieswise comparison has been made between stacking and voting ensembles for the detection and identification of lithofacies. Stacking has shown improvements of nearly 4% and 2% in test accuracy over SVM and RF, respectively. Four popular machine learning algorithms, namely MLP, SVM, GB and RF, have been combined in the HEMs as base classifiers to achieve more accurate and generalized results than their individual performances. The individual performance of these classifiers has been evaluated using Kansas oil and gas field data, with parameters optimized within their stable search ranges. Validation curves and the grid search algorithm have been used to tune the model parameters for maximum classification accuracy. The research work carried out in this paper has led to the following conclusions.
• The performance of HEMs depends upon the selection of efficient base classifiers for quantitative lithofacies modeling.
• The validation curve has been found to be an efficient measure for identifying a stable search range for machine learning parameters.
• The stacking ensemble has shown great potential to extract lithofacies information from well-log data.
• The training and testing classification accuracies of the HEMs are the highest among the classifiers used in this study.
• The DP layer is found to be the most challenging facies among all nine target lithofacies. The stacking ensemble has given the highest individual identification accuracy for all the layers of lithofacies.
• Prediction accuracy of individual facies ranges from 67.9 to 95.8% (worst to best possible testing accuracy), and the maximum overall accuracy (training = 92.78%, testing = 88.32%) is obtained for the stacking ensemble.
In this study, HEMs have shown their potential for quantitative lithofacies modeling and have outperformed the other classifiers. The results obtained in this study establish that a combination of diverse base classifiers leads to higher accuracy and better model generalization. The analysis of the results reveals that HEMs are practical and more accurate models, with a significant improvement in classification accuracy for lithofacies identification compared with the individual base classifiers.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Kindly refer to the user guide of the Scikit-learn Python toolbox for a detailed explanation of these parameters.
| S. no. | Classifiers | Model parameters | Description of models' parameters |
|---|---|---|---|
| 1. | MLP | Activation function | These are nonlinear differentiable mathematical functions that help MLP to correlate complex input variables with response or target variables. It is also essential for backpropagation optimization of weights and error reduction, e.g., "relu," "logistic," etc. |
| | | Solver | It provides weights optimization strategy to MLP, e.g., "lbfgs," "sgd," etc. |
| | | alpha | It is a penalty parameter for misclassification. |
| | | Learning_rate_init | It decides the step size of weights updating and provides initial value to learning rate |
| | | Learning_rate | It decides rate of weights adjustment during error minimization |
| | | Max_iteration | It provides maximum iterations for solver for optimization of weights |
| 2. | SVM | C | It is a regularization or penalty parameter to resolve ill-posed optimization complications and improve generalizability of training SVM model |
| | | Gamma | Coefficient of RBF kernel |
| | | Kernel | It is a special type of mathematical functions that helps to handle nonlinearity and high dimensionality of input data |
| 3. | RF | | |
| | Stacking | Base classifiers | Diverse supervised classifiers are selected as estimators and combined in single stacking ensemble architecture |
| | | Meta-classifier | It is a supervised classifier which processes and combines the outcomes of base classifier as a second computational layer in stacking ensemble architecture |