Adaptive boosting of random forest algorithm for automatic petrophysical interpretation of well logs

The power of Machine Learning is demonstrated for the automatic interpretation of well logs and the determination of reservoir properties, namely volume of shale, porosity, and water saturation, for tight clastic sequences. Random Forest algorithms are reputed for their efficiency. They belong to a class of algorithms called ensemble methods, which are traditionally seen as weak learners but can be transformed into strong performers that promise to deliver highly accurate results. The study area is located offshore Australia in the Poseidon and Crown fields of the Browse Basin, which are gas fields in tight, complex clastic reservoirs. Five wells are used in this study. One well was manually interpreted and subsequently used to develop a machine learning model that predicts the output for the other 4 wells. The basic open-hole logs, namely Natural gamma ray, Resistivity, Neutron Porosity, Bulk Density, and P-wave and S-wave sonic travel-time, are used in the interpretation. One of the wells has a missing S-wave travel-time log, which was also predicted by developing a Random Forest Machine Learning model. The results indicate a robust improvement in performance when the Random Forest algorithm was combined with Adaptive Boosting for interpreting the well logs. The training accuracy using Random Forest alone was 98.21%, but the testing accuracy was 77.62%, which suggested over-fitting by the Random Forest model. Adaptive Boosting of the Random Forest algorithm resulted in an overall training accuracy of 99.40% and an overall testing accuracy of 97.03%, indicating a drastic improvement in performance. The S-wave travel-time log was predicted by preparing a training set consisting of Natural gamma ray, Resistivity, Neutron Porosity, Bulk Density, and P-wave travel-time logs for the 4 wells using Random Forest, which gave a training accuracy of 99.79% and a testing accuracy of 98.54%.
Machine learning algorithms can be successfully applied to interpret well log data in complex sedimentary environments, and their performance can be drastically improved using Adaptive Boosting.


Introduction
The idea of developing an automated methodology for well log interpretation has been a research theme for many years, as it offers the advantage of reduced time and computing power. Fourier and wavelet transforms have been applied to well logs in multi-scale analyses to find hidden correlations between open-hole logs and lithologies through sequence stratigraphy for various wells (Mukherjee et al. 2016; Perez-Muñoz et al. 2013; Srivardhan 2016; Panda et al. 2015). These ideas help in identifying lithofacies from well log signatures. Deep learning with neural networks has been used in the recent past to identify lithologies and to build models that predict clay volume, effective porosity, water saturation, and permeability (Peyret et al. 2019; Gupta and Soumya 2020). These methods have demonstrated good accuracy, and the prediction models developed have been specific to geologic formations. Deep Learning and Neural Network models, however, require more computing than Machine Learning models. The application of Machine Learning (ML) algorithms to well logs has demonstrated considerable utility. They have been used extensively for facies interpretation (Bestagini et al. 2017; Pratama 2018; Alexsandro et al. 2017), synthetic well log generation (Akinnikawe et al. 2018), Rock Physics modelling, and predicting lithologies ahead of the drill bit using LWD logs (Zhong et al. 2019). Well log interpretation has in the past been carried out deterministically (Senosy et al. 2020) using relationships between reservoir properties and well log responses, and more recently with methods such as interval inversion (Szabó et al. 2022; Dobróka et al. 2016), in which the unknown lithological and reservoir properties are determined as an overdetermined problem by developing suitable relationships between well log responses and their causative reservoir properties. In this paper the performance of the Random Forest algorithm is enhanced using the Adaptive Boosting algorithm, which gives automatic, near-perfect interpretation of well logs in tight gas-bearing clastic reservoirs.
The Random Forest is a Machine Learning algorithm and an ensemble learning method that can be used for classification and regression. It is a supervised learning algorithm inspired by Decision Trees, but it is generally found to give superior results because Decision Trees are sometimes found to overfit during training (Hastie et al. 2008; Piryonesi et al. 2020, 2021). It uses the concept of Bagging, or Bootstrap Aggregating, when finalizing the ensemble regressor or classifier, rather than the simple averaging done in Decision Trees. Random Forest has been applied successfully in many areas such as Insurance (Lin et al. 2017), Image Classification (Akar and Güngör 2012), Land Cover and Ecological studies (Kulkarni and Lowe 2016; Mutanga et al. 2012), Medical Science (Sarica et al. 2017), and Consumer Behaviour (Valecha et al. 2018). AdaBoost, short for Adaptive Boosting, is a statistical technique for improving the performance of weak learners and can be applied to other machine learning algorithms to improve the learning rate and avoid overfitting the data. It is commonly used in classification and regression problems. In regression problems the algorithm first fits a regressor to the dataset and then fits additional instances of the regressor to the same dataset, but with different weights assigned in accordance with the error in the prediction. It has been found to improve the performance of the Random Forest algorithm and of other Machine Learning algorithms, as demonstrated in banking (Sanjaya et al. 2020), structural engineering (Feng et al. 2020), cybersecurity (Sornsuwit and Jaiyen 2019), plant species identification (Kumar et al. 2019), and many other areas.
In this study the Random Forest algorithm is applied to predict the reservoir properties of 4 wells using 1 well as a training dataset, and Adaptive Boosting is then used to increase the performance of the Random Forest algorithm. The predictions are compared with manual interpretation of the wells and agree very well. Random Forest is also used to predict the missing Shear-Sonic log in one of the wells, which also gives a good performance.

Geology and description of study area
The Browse Basin is located on the NW Shelf of Australia, with water depths reaching up to 4000 m. The basin was formed during the Early Carboniferous-Early Permian as a series of intracratonic half-graben systems due to the breakup of Gondwana and the formation of the Neo-Tethys Ocean along the northwest Australian margin (Symonds et al. 1994; Struckmeyer et al. 1998). Structural developments during extension established the sub-basin architecture and compartmentalization, which would later control the rate of sedimentation and tectonism up to the Miocene. The evolution of the Browse Basin reflects the breakup of the Gondwana supercontinent and the creation of the Westralian Superbasin, which also includes other basins in NW Australia, namely the Carnarvon, offshore Canning, and Bonaparte Basins (Stephenson and Cadman 1994; Struckmeyer et al. 1998). During the Late Triassic-Early Jurassic, an inversion event correlated with the onset of rifting on the NW Shelf reactivated Paleozoic faults, resulting in partial inversion of the Paleozoic half-grabens and the formation of large-scale anticlinal and synclinal features within their hanging walls (AGSO Browse Basin Project Team 1997). Continued extension in the Early Jurassic resulted in the collapse of numerous Triassic anticlines, which culminated in the Callovian-Oxfordian. Post-rift thermal subsidence commenced in the Mid-Callovian and the basin transitioned into a passive margin. Sedimentation was controlled by eustatic levels, sediment supply from the hinterland, small-scale growth faults, and reactivation of pre-existing faults. Carbonate deposition increased in the Upper Cretaceous, with many fluvio-deltaic sediment systems in the Cenozoic (Poidevin et al. 2015).
The wells Kronos-1, Boreas-1, and Poseidon-2 are located in the Poseidon discovery, while wells Pharos-1 and Proteus-1 are in the adjacent Crown discovery (Fig. 1). Both discoveries are located in the Browse Basin. Their main reservoir is the Upper-Middle Jurassic part of the Plover Formation, which consists of syn-rift fluvio-deltaic clastic sequences. The gas is sourced from the Middle-Lower Jurassic sequences of the Plover Formation, which are fluvio-deltaic claystones. The hydrocarbons are sealed by intraformational shales of the Plover Formation (Rollet et al. 2018). Volcanic activity was prevalent near the discoveries in the Browse Basin during the Jurassic, and possible deposition of volcaniclastics along with the fluvio-deltaic sediments is reported. In this analysis the petrophysical interpretation of the wells has been performed, guided by the regional geological history, in order to determine the reservoir properties Vshale, Porosity, and Water Saturation.

Random forest theory
The Random Forest algorithm is an ensemble technique for creating regression models using Bootstrap Aggregating, or Bagging. Consider a training dataset of size m, $\mathrm{TR} = \{\mathrm{TR}_1, \mathrm{TR}_2, \ldots, \mathrm{TR}_m\}$, with corresponding output $\mathrm{OP} = \{\mathrm{OP}_1, \mathrm{OP}_2, \ldots, \mathrm{OP}_m\}$. The Random Forest algorithm aims to find a function f that predicts each output $\mathrm{OP}_i \in \mathrm{OP}$ from its input $\mathrm{TR}_i$.
In Bagging, the following iteration is repeated B times, indexed by $b = 1, 2, \ldots, B$. At each iteration, m pairs of corresponding input and output samples are drawn at random, with replacement, from TR and OP, and a function $f_b$, the best-fit decision tree for the drawn samples, is created for the regression. The input and output samples that are not drawn in the bootstrap at a given iteration are treated separately as unseen, or out-of-bag, samples. The overall regressor model f′ is the aggregation of all B trees applied to the unseen samples x′, obtained by averaging them as in Eq. 1:
$$f'(x') = \frac{1}{B}\sum_{b=1}^{B} f_b(x') \qquad (1)$$
The mean accuracy of the model f′ during training, for the predicted output $\mathrm{OPred} = \{\mathrm{OPred}_1, \mathrm{OPred}_2, \ldots, \mathrm{OPred}_m\}$, is calculated as $r^2$ (Coefficient of Determination), determined as in Eq. 2:
$$r^2 = 1 - \frac{\sum_{i=1}^{m}\left(\mathrm{OP}_i - \mathrm{OPred}_i\right)^2}{\sum_{i=1}^{m}\left(\mathrm{OP}_i - \mathrm{OP}_{mean}\right)^2} \qquad (2)$$
In the above equation $\mathrm{OP}_{mean}$ is the mean of the output data OP. For testing the model, the same $r^2$ can be determined for the testing samples of the input data. In the present analysis the input data were divided in a ratio of 60% to 40% for Training and Testing. The $r^2$ is used to check the overall performance of the algorithm and is a measure of how well the model fits all the data points. It varies between 0 and 100%, with 100% being the best possible fit for all the data points. It is also a measure of the accuracy of the Machine Learning model.
Fig. 1 The location of the Poseidon and Crown fields along with drilled wells is shown (coordinates taken from https://ihsmarkit.com/products/oil-gas-tools-edin.html)
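The bagging and scoring steps described in this section can be illustrated with a minimal sketch. It is not the implementation used in the study: the one-split stump `fit_stump` is a deliberately small stand-in for a full decision tree, the synthetic single-feature data are invented for illustration, and the out-of-bag bookkeeping is omitted for brevity. The function names and the choice of 25 trees are assumptions. Only the averaging over B bootstrap models, the $r^2$ score, and the 60%/40% split follow the text.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump (a tiny stand-in for a decision tree)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so the right side is never empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # every x identical: fall back to the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bagged_fit(xs, ys, n_trees=25, seed=0):
    """Fit n_trees stumps, each on m pairs drawn with replacement (bootstrap)."""
    rng = random.Random(seed)
    m = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(m) for _ in range(m)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def bagged_predict(trees, x):
    """Average the individual models, as in Eq. 1."""
    return sum(tree(x) for tree in trees) / len(trees)

def r2_score(actual, predicted):
    """Coefficient of determination, as in Eq. 2."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Synthetic, invented data: one input curve and a noisy linear response.
rng = random.Random(42)
xs = [i / 99 for i in range(100)]
ys = [2.0 * x + rng.gauss(0, 0.05) for x in xs]

# 60% / 40% split into training and testing samples.
order = list(range(100))
rng.shuffle(order)
train, test = order[:60], order[60:]

trees = bagged_fit([xs[i] for i in train], [ys[i] for i in train])
r2_train = r2_score([ys[i] for i in train], [bagged_predict(trees, xs[i]) for i in train])
r2_test = r2_score([ys[i] for i in test], [bagged_predict(trees, xs[i]) for i in test])
print(f"training r2 = {r2_train:.3f}, testing r2 = {r2_test:.3f}")
```

A large gap between the two printed scores would be the kind of overfitting symptom that the study reports for Random Forest alone.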

Adaptive boosting
The AdaBoost, or Adaptive Boosting, algorithm was first discussed in 1995 and introduced by Freund and Schapire (1997). It enhances the performance of weak machine learning algorithms, without requiring any prior knowledge, through a multiplicative weight-update technique. Consider the output of a weak learning algorithm f′ as $\mathrm{OP}'_1, \mathrm{OP}'_2, \ldots, \mathrm{OP}'_m$, where the objective of the weak learner is to fit a function f′ between TR and OP by minimizing the least-squares error $(\mathrm{OP} - f'(x'))^2$ with $x' \in \mathrm{TR}$. The error function in adaptive boosting is instead $e^{-\mathrm{OP}\,f'(x')}$, which takes into account that only the sign of the final result is considered. The final error is the product of the errors from each stage, that is $\prod_t e^{-\mathrm{OP}\,f'_t(x')} = e^{-\mathrm{OP}\sum_t f'_t(x')}$, where $f'_t$ denotes the learner added at stage t. At each stage of the iteration the algorithm updates the weights: the segments that tend to increase the error are identified, and their weights are adjusted so that the error is brought down.
When a Random Forest is trained, there is a chance that the model overfits, because the weights assigned to nodes that are seen as strong learners are left unaltered, and only the nodes that are weak learners are iteratively refined and their weights altered to reduce the overall training error. When this model is applied to the testing data, it may overfit, since the weights associated with strong-learning nodes are not altered and the nodes and leaves derived from them tend to become biased. In adaptive boosting, the weights of both strong and weak learners are changed based on performance: iteratively, the weights of strong-learning nodes are reduced and the weights of weak-learning nodes are increased, so that the nodes and leaves derived from them do not carry any bias. This process is repeated so that the overall model has different weights at different nodes depending on their learning rate, and the overall error of the model is also reduced.
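The reweighting idea can be made concrete with a small sketch of one boosting round for regression. The paper does not state which AdaBoost variant was used, so the update below follows the widely used AdaBoost.R2 scheme with a linear loss as an assumption, and it reweights training samples rather than tree nodes. The function name `adaboost_r2_update` and the toy numbers are hypothetical.

```python
def adaboost_r2_update(weights, actual, predicted):
    """One AdaBoost.R2-style reweighting step with a linear loss (assumed variant).

    Returns the new normalized sample weights and the weighted average loss.
    """
    abs_err = [abs(a - p) for a, p in zip(actual, predicted)]
    max_err = max(abs_err)
    if max_err == 0:  # perfect fit: nothing to reweight
        return list(weights), 0.0
    loss = [e / max_err for e in abs_err]  # per-sample loss scaled to [0, 1]
    avg_loss = sum(w * l for w, l in zip(weights, loss))
    if avg_loss >= 0.5:  # learner too weak: boosting would stop at this round
        return list(weights), avg_loss
    beta = avg_loss / (1.0 - avg_loss)  # below 1; smaller means a better learner
    # Well-fitted samples (loss near 0) are multiplied by about beta and shrink,
    # poorly fitted samples (loss near 1) keep their weight.
    new_w = [w * beta ** (1.0 - l) for w, l in zip(weights, loss)]
    total = sum(new_w)
    return [w / total for w in new_w], avg_loss

# Toy example: four samples with equal weights, the last one badly predicted.
weights = [0.25, 0.25, 0.25, 0.25]
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
new_weights, avg_loss = adaboost_r2_update(weights, actual, predicted)
print(new_weights, avg_loss)
```

After the update, the badly predicted sample carries half of the total weight, so the next regressor fitted on the reweighted data concentrates on it.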

Methodology
There are 5 wells in the study area, namely Boreas-1, Kronos-1, Pharos-1, Poseidon-2, and Proteus-1, as shown in Fig. 1. All the wells were logged with Natural gamma ray (GR), Resistivity (RD), Neutron Porosity (TNPH), Bulk Density (RHOB), P-wave travel-time (DTP), and S-wave travel-time (DTS) logs. The wells have all intersected Gas and Water in the Early-Middle Jurassic clastic sequences, which comprise fluvio-deltaic to marine siliciclastics interspersed with syn-sedimentary volcanics. The log motifs of the 5 wells are shown in Fig. 2.
The S-wave travel-time log is available only partly in Proteus-1, as seen in Fig. 3a. A training dataset was made using the input curves Depth, GR, RD, TNPH, RHOB, and DTP from the other 4 wells, with the DTS log as output. A Machine Learning model relating input and output was built using the Random Forest algorithm. Table 1 lists the input parameters and the results obtained. A plot comparing the predicted DTS log and the partially recorded DTS log is shown in Fig. 3b. The predicted log matches the available recorded log with an accuracy of 97.04% (Fig. 3b).
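The data preparation for the shear-log prediction can be sketched as follows. This is an illustrative assumption about how the samples might be organized, not the authors' code: the row layout, the helper names `make_row` and `build_dts_sets`, the placeholder well names, and all numeric values are invented. Only the list of input curves and the role of the partially logged well follow the text.

```python
FEATURES = ["DEPTH", "GR", "RD", "TNPH", "RHOB", "DTP"]  # input curves named in the text

def make_row(depth, gr, rd, tnph, rhob, dtp, dts=None):
    """Build one depth sample; None marks a missing reading."""
    return {"DEPTH": depth, "GR": gr, "RD": rd, "TNPH": tnph,
            "RHOB": rhob, "DTP": dtp, "DTS": dts}

def build_dts_sets(wells, target_well):
    """Split samples into training, check, and to-be-predicted sets for DTS."""
    sets = {"X_train": [], "y_train": [], "X_check": [], "y_check": [], "X_missing": []}
    for name, rows in wells.items():
        for row in rows:
            if any(row.get(c) is None for c in FEATURES):
                continue  # skip samples with gaps in the input curves
            x = [row[c] for c in FEATURES]
            dts = row.get("DTS")
            if name != target_well:
                if dts is not None:  # other wells supply the training pairs
                    sets["X_train"].append(x)
                    sets["y_train"].append(dts)
            elif dts is not None:  # recorded part of the target well: comparison only
                sets["X_check"].append(x)
                sets["y_check"].append(dts)
            else:  # interval where DTS has to be predicted
                sets["X_missing"].append(x)
    return sets

# Invented values, purely to show the bookkeeping.
wells = {
    "WELL_A": [make_row(4800.0, 60.0, 12.0, 0.12, 2.45, 75.0, 130.0),
               make_row(4800.5, None, 11.0, 0.13, 2.44, 76.0, 131.0)],  # GR gap: dropped
    "WELL_B": [make_row(4900.0, 95.0, 5.0, 0.22, 2.55, 85.0, 150.0)],
    "TARGET": [make_row(5000.0, 55.0, 20.0, 0.10, 2.40, 72.0, 125.0),   # DTS recorded
               make_row(5000.5, 58.0, 18.0, 0.11, 2.42, 73.0)],          # DTS missing
}
sets = build_dts_sets(wells, "TARGET")
print({key: len(value) for key, value in sets.items()})
```

Keeping the recorded interval of the target well out of the training set allows it to serve as an independent check of the predicted log, in the spirit of the comparison in Fig. 3b.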
The Poseidon-2 well was subsequently used as a training well which was manually interpreted using the input log curves and reservoir properties such as Vshale, Porosity, and Water Saturation were deduced as shown in Fig. 4. The green shade shows the amount of shale present in the reservoir and the yellow shade shows the fraction of available reservoir.

The same well and its interpretation results were used to train the Machine Learning (ML) model using Random Forest. The relationships between the various input curves are shown in Fig. 5 for reservoir and non-reservoir facies, using litho-facies (Fac) interpreted from the well logs. Green (facies 0) denotes non-reservoir, consisting dominantly of shales and siltstones; orange (facies 1) denotes reservoir facies, which are sandstones; and blue (facies 2) denotes coal, which is only occasionally present. Three different ML models were created using Random Forest, to determine Vshale, Porosity, and Water Saturation respectively. During training, cross-validation and stratification (Ojala and Garriga 2010) were performed on the input dataset with different k-fold permutations (50-500) for better training of the algorithm. The results of training and testing on the dataset are shown in Table 2. The difference between the training and testing scores indicates that the model is overfitting the dataset, and the learning rate does not improve even if the number of trees is increased. The overall training and testing scores are the averages of the three cases.
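Stratified k-fold splitting can be sketched in a few lines. The text does not say which variable the stratification was based on, so the sketch assumes the litho-facies label (0, 1, 2) was used. The function name `stratified_kfold`, the fold count, and the synthetic facies proportions are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs whose folds keep similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    folds = [[] for _ in range(k)]
    for lab in sorted(by_class):
        idx = by_class[lab]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)  # deal the samples of each class round-robin
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx

# Synthetic facies column: mostly non-reservoir (0), some sandstone (1), rare coal (2).
facies = [0] * 60 + [1] * 30 + [2] * 10
splits = list(stratified_kfold(facies, k=5))
train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))          # 80 20
print(sum(facies[i] == 2 for i in test_idx))  # 2 coal samples in the held-out fold
```

Stratifying in this way ensures that a rare class such as coal appears in every held-out fold, which a purely random split does not guarantee.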
The adaptive boosting of the Random Forest algorithm was performed with the same parameters, and the results are tabulated in Table 3. Performance improved markedly, with the overall testing accuracy rising to 97.03%. The model was then applied to the other 4 wells, and manual petrophysical interpretation was subsequently performed on each of the 4 wells individually in order to compare the performance. The comparison between the ML model results and the manual interpretation is shown in Fig. 6a, b.
Fig. 2 A well log correlation of recorded logs for the wells selected for this study which were interpreted
Fig. 3 a The partially available Shear-Sonic log is depicted for Proteus-1. b The predicted Shear-Sonic log is compared with the partially recorded shear log which shows a very good prediction

Discussion and conclusion
The study successfully demonstrates that weak-learning Random Forest algorithms can be turned into strong-performing algorithms through Adaptive Boosting. It also demonstrates the utility of Random Forest algorithms for automatic petrophysical interpretation, which saves considerable time and effort. Typically, the manual interpretation of well logs for Vshale, Porosity, and Water Saturation requires a concerted effort to understand various data, which includes understanding the geological conditions of deposition, collating and studying well cuttings data, understanding reservoir and non-reservoir characteristics on recorded logs, grain densities from core data, electrical properties of reservoirs, and salinity information of the formation water. The manual interpretation of the wells, including Poseidon-2 (Fig. 4 and Fig. 6b), was done using information from core data from some of the wells, which indicated a reservoir matrix density of ~2.67 g/cm³. The densities of shale varied between ~2.55 and 2.70 g/cm³. The shale volume and porosity were estimated using Gamma and Neutron-Density logs. The Indonesian Equation (Poupon and Leveaux 1971) was used to estimate Water Saturation, with the electrical rock properties taken as a = 1, m = 2, and n = 2. The Archie parameters, or electrical rock properties (Archie 1942), namely 'a', 'm', and 'n', are called the tortuosity factor, cementation factor, and saturation exponent respectively. The tortuosity factor relates to the tortuous pathways of the pore spaces available in the rock. The cementation factor relates to the hindrance in the pathways available in the pore spaces of the rock for fluid migration. The saturation exponent relates to the presence of non-conductive fluid in relation to the conductive formation water in the pore spaces of the rock. A formation water salinity of ~12,000 to 14,000 ppm was used in the study, as per reports from the drilled wells.
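The deterministic relations mentioned above can be written out as a short sketch. It is a simplified illustration and not the interpretation workflow of the study: the linear gamma-ray index and the density-only porosity are assumed simplifications (the study used Gamma and Neutron-Density logs), and the fluid density, clean and shale GR readings, shale resistivity, and water resistivity in the example are hypothetical values. The matrix density of 2.67 g/cm³ and a = 1, m = 2, n = 2 are taken from the text. The water saturation follows the commonly quoted form of the Indonesian equation.

```python
import math

def vshale_from_gr(gr, gr_clean, gr_shale):
    """Linear gamma-ray index, clipped to [0, 1] (assumed simplification)."""
    igr = (gr - gr_clean) / (gr_shale - gr_clean)
    return min(1.0, max(0.0, igr))

def density_porosity(rhob, rho_matrix=2.67, rho_fluid=1.0):
    """Density porosity; 2.67 g/cm3 is from the text, rho_fluid is an assumption."""
    phi = (rho_matrix - rhob) / (rho_matrix - rho_fluid)
    return min(1.0, max(0.0, phi))

def indonesian_sw(rt, phi, vsh, rw, rsh, a=1.0, m=2.0, n=2.0):
    """Water saturation from the Indonesian (Poupon and Leveaux) equation.

    1/sqrt(Rt) = [Vsh^(1 - Vsh/2) / sqrt(Rsh) + phi^(m/2) / sqrt(a*Rw)] * Sw^(n/2)
    """
    shale_term = vsh ** (1.0 - 0.5 * vsh) / math.sqrt(rsh)
    sand_term = phi ** (m / 2.0) / math.sqrt(a * rw)
    denominator = shale_term + sand_term
    if denominator == 0.0:  # no porosity and no shale: treat as fully water-filled
        return 1.0
    sw = ((1.0 / math.sqrt(rt)) / denominator) ** (2.0 / n)
    return min(1.0, sw)

# Hypothetical single-depth readings, for illustration only.
vsh = vshale_from_gr(gr=75.0, gr_clean=30.0, gr_shale=120.0)
phi = density_porosity(rhob=2.50)
sw = indonesian_sw(rt=10.0, phi=phi, vsh=vsh, rw=0.1, rsh=5.0)
print(f"Vshale = {vsh:.2f}, porosity = {phi:.3f}, Sw = {sw:.2f}")
```

With Vshale set to zero the expression reduces to the Archie equation, which gives a quick consistency check on the parameters a, m, and n.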
The petrophysical interpretation of well logs using ML was based on the input logs, which are required to be continuous over the entire well or zone of interpretation. The log curves need to be corrected for inaccuracies that may arise during acquisition or due to borehole conditions. Fig. 5 shows the relationships between the input curves after interpretation of the Poseidon-2 well. The reservoir section can be differentiated from the non-reservoir shale and coal in the RHOB vs DTSM and RHOB vs DTCO plots. The GR, TNPH, and Resistivity logs are also good indicators for differentiating hydrocarbon-bearing reservoirs from non-reservoirs. The RD log in Fig. 5 is plotted on a log10 scale. Weak machine learning algorithms tend to overfit, especially during training, when limited sampling of the data pushes the model towards the more dominant relationships amongst the input curves. This becomes evident during testing, as points that were in the minority in the training dataset become more dominant in the testing dataset, which lowers the accuracy, as seen in Table 2. Adaptive boosting of the Random Forest ML algorithm becomes necessary to weigh all the points equally at first and then adjust their relative weights depending on how they fit the model, in order to prevent overfitting and make the algorithms strong performers, as evident from Table 3 and Fig. 6a, b. Random Forest ML algorithms are also good performers at predicting shear logs from the input dataset, as seen in Fig. 3 and Table 1; the predicted log gave good results and was subsequently used in the interpretation.