Introduction

The exploration and evaluation of petroleum resources from source rocks rely heavily on petroleum geochemistry, encompassing crucial parameters such as kerogen type, hydrogen index (HI), and oxygen index (OI). These parameters serve as pivotal proxies for characterizing the nature of generated hydrocarbons and understanding organic matter maturation, and thus for determining hydrocarbon generation potential (Durand 1980; Breyer 2012; Dembicki 2016; Chen et al. 2017; Lee 2020). However, the complexity of extracting unconventional petroleum resources presents significant challenges, underscoring the need for more efficient and less error-prone evaluation methods.

Traditional techniques like Rock-Eval pyrolysis have long been established but suffer from several limitations. These include labor-intensive procedures, the risk of sample contamination, inaccuracies in depth determination, high costs, and time-consuming processes (Lafargue et al. 1998; Behar et al. 2001). As the demand for unconventional petroleum resources escalates, these methods face criticism and calls for improved alternatives.

In recent years, machine learning techniques have emerged as potential problem solvers in petroleum geochemistry and related fields (Schmoker 1979, 1981; Fertle & Rieke 1980; Meyer & Nederlof 1984; Schmoker & Hester 1983; Fertle 1988; Herron 1988; Carpentier et al. 1989; Passey et al. 1990; Tariq et al. 2020; Rui et al. 2019; Wang et al. 2019; Khalil Khan et al. 2022). However, these solutions require more extensive validation and practical application, a task that this study aims to undertake.

This study focuses on the geochemical parameters of kerogen type, HI, and OI, hypothesizing that these can be successfully estimated through a novel application of machine learning techniques. The main objective is to establish a cost-effective and efficient methodology for evaluating source rocks using hybrid machine learning techniques, filling a pressing need within the field. The analysis will primarily concentrate on the Perth Basin, drawing insights from a comprehensive range of studies (Khoshnoodkia et al. 2011; Mahmoud et al. 2017; Alizadeh et al. 2018; Johnson et al. 2018; Rui et al. 2019; Shalaby et al. 2019; Lawal et al. 2019; Wang et al. 2019; Handhal et al. 2020; Tariq et al. 2020; Kang et al. 2021; Safaei-Farouji & Kadkhodaie 2022; Deaf et al. 2022; Nyakilla et al. 2022; Zhang et al. 2022; Khalil Khan et al. 2022; Maroufi & Zahmatkesh 2023).

While the forthcoming sections will explore the application of various machine learning techniques to predict OI and HI, such as Support Vector Machines (SVM), Group Method of Data Handling (GMDH), Multi-Layer Perceptron (MLP), Decision Tree (DT), Adaptive Neuro-Fuzzy Inference System (ANFIS), and Radial Basis Function (RBF), it is essential to acknowledge the limitations of the methodology. Some potential limitations include the availability and quality of well-log data, the representativeness of the selected dataset, and the generalization of the results to other geological settings. Additionally, the performance of machine learning models may be influenced by hyperparameter tuning and the choice of input features. It is vital to address these limitations to ensure the robustness and reliability of the study's findings.

Ultimately, the implications of this research extend beyond the academic sphere, providing practical solutions for petroleum geochemistry. By offering an innovative, efficient, and accurate methodology for assessing the production potential of Perth Basin source rocks, this study contributes significantly to the field. The findings herein stand not only to streamline the process of source rock evaluation but also to enhance our understanding of hydrocarbon production potential in unconventional petroleum resources. By acknowledging and addressing the limitations, the study aims to bolster the confidence and applicability of the proposed methodology in real-world scenarios.

Geological setting

The Perth Basin is a sizable sedimentary basin with a north-to-northwest trend that stretches for around 1,300 km along the western edge of the Australian continent. It formed during the separation of Australia from Greater India between the Early Permian and Early Cretaceous. The basin includes a substantial onshore component and extends offshore to the continent-ocean boundary, reaching water depths of approximately 4500 m (Rollet et al. 2013).

The Darling Fault defines the basin's eastern edge, while the Indian Ocean covers its offshore region to the west. The tectonic evolution of the basin was mainly controlled by the Darling Fault (Owad-Jones & Ellis 2000).

Two primary tectonic stages, both extensional, are associated with the basin's origin and evolution. The first development phase, in the late Permian, is linked to rifting. The second occurred between the Late Jurassic and Early Cretaceous and is associated with the splitting of the Australian plate from India. Most of the rocks in the basin are clastic and range in age from the Permian to recent times (Marshall et al. 1989).

The basin is divided into a complex graben system with several sub-basins by several normal faults with a north-south trend and younger northwest-southeast-trending shift faults (Crostella & Backhouse 2000). The thickness of individual sedimentary units varies markedly across the basin, reflecting relative differences in subsidence rates. Because of this differential subsidence and the related relative sea-level fluctuations, continental and marine sedimentary settings alternated throughout the basin's history (Delle Piane et al. 2013).

Sediments from the Late Permian through the Cretaceous were deposited in this basin in settings ranging from marine to terrestrial. Sandstone, siltstone, and mudstone are typically present, along with minor amounts of coal, conglomerate, and carbonate (Playford et al. 1976; Crostella 1995).

The offshore northern Perth Basin consists of three main depocenters: the Abrolhos, Houtman, and Zeewyck sub-basins. The Abrolhos Sub-basin is an elongated, north-south-oriented depocenter containing up to 6000 m of Cisuralian to Lower Cretaceous sedimentary rocks deposited during multiple rift events (Jones et al. 2011; Grosjean et al. 2017). Figure 1 shows the structural setting of the Perth Basin together with its faults and sub-basins.

Fig. 1
figure 1

Structural map of the Perth Basin showing major faults, depocenter ages, and sub-basins (modified from Bradshaw et al. 2003)

The Early Triassic Kockatea Shale is considered the most important source and cap rock in the basin (Playford et al. 1976; Karimian Torghabeh et al. 2014). The Dandaragan Trough, where sediment thicknesses of more than 1000 m accumulated, experienced active subsidence during deposition of the Kockatea Shale, which covered the northern Perth Basin (Iasky & Mory 1993). The formation comprises limestone beds, siltstone, minor sandstone, and black shale; in outcrop it appears as thin, red, purple, or brown ferruginous siltstone or fine-grained sandstone (Crostella 1995). Except for three wells in the Houtman Sub-basin, every well drilled in the offshore northern Perth Basin intersects the Lower to Middle Triassic Woodada Sequence. The thickness of the Woodada Sequence ranges from 45 to 183 m, except at Wittecarra 1, where it reaches 685 m (Jorgensen et al. 2011). The Woodada Formation consists of interbedded fine-grained sandstone and carbonaceous siltstone. On wireline logs, the unit has a transitional character between the overlying, coarser Lesueur Sandstone and the underlying, fine-grained Kockatea Shale (Mory & Iasky 1996). Figure 2 shows the stratigraphic chart for the offshore northern Perth Basin.

Fig. 2
figure 2

Generalized stratigraphic chart for the offshore northern Perth Basin, from Rollet et al. (2013). Sequences shown are based on the chronostratigraphic framework described by Jorgensen et al. (2011) and Jones et al. (2011, 2012). Geological timescale after Gradstein et al. (2012)

Materials and methods

Data

One hundred thirty-eight cutting samples from the Kockatea and Woodada formations were collected for this study to evaluate their geochemical properties. Rock-Eval 6 pyrolysis was performed following the workflows of Espitalie et al. (1977) and Lafargue et al. (1998) under standard test conditions, with a heating rate of 25 °C min-1 up to final temperatures of 800 °C in the pyrolysis oven and 850 °C in the oxidation oven. Iron filings from the drill bit and micas introduced by lost-circulation material were removed from each sample, which was then ground, pulverized, and weighed to 60–70 mg before analysis. Rock-Eval 6 pyrolysis provides a powerful method for assessing three crucial elements of a hydrocarbon source rock: the quantity, thermal maturity, and type of its organic matter. The system outputs the main acquired parameters S1: free hydrocarbons [mg HC/g rock], S2: hydrocarbons cracked from kerogen [mg HC/g rock], S3: CO2 released during pyrolysis [mg CO2/g rock], TOC: total organic carbon (wt.%), and Tmax: the temperature at which hydrocarbon generation in a sample peaks (°C), together with several parameters calculated from these results, such as PI (production index): S1/(S1 + S2), HI (hydrogen index): (S2/TOC) × 100 [mg HC/g TOC], and OI (oxygen index): (S3/TOC) × 100 [mg CO2/g TOC].
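
For concreteness, the calculated indices follow directly from the acquired parameters; a minimal Python sketch (the function and variable names are illustrative, not part of the Rock-Eval workflow):

```python
import numpy as np

def rock_eval_indices(s1, s2, s3, toc):
    """Compute PI, HI, and OI from Rock-Eval outputs.

    s1, s2 : free and kerogen-cracked hydrocarbons [mg HC/g rock]
    s3     : CO2 released during pyrolysis [mg CO2/g rock]
    toc    : total organic carbon [wt.%]
    """
    s1, s2, s3, toc = map(np.asarray, (s1, s2, s3, toc))
    pi = s1 / (s1 + s2)        # production index
    hi = (s2 / toc) * 100.0    # hydrogen index [mg HC/g TOC]
    oi = (s3 / toc) * 100.0    # oxygen index [mg CO2/g TOC]
    return pi, hi, oi

# Example with hypothetical measurements
pi, hi, oi = rock_eval_indices(0.4, 3.2, 0.6, 1.8)
```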

Numerous researchers have investigated the relationships between geochemical parameters and well-logging tool responses, including Schmoker (1979, 1981), Fertle and Rieke (1980), Meyer and Nederlof (1984), Schmoker and Hester (1983), Fertle (1988), Herron (1988), Carpentier et al. (1989), Passey et al. (1990), Bolandi et al. (2017), and Zhao et al. (2019). Therefore, the input parameters for estimating the HI and OI values of the rock samples in this study were taken from well-log data: sonic (DT), gamma-ray (GR), neutron (NPHI), and bulk density (RHOB) logs. Table 1 and Fig. 3 present the statistical description and the complete dataset used to train the ML models for the Kockatea and Woodada formations.

Table 1 Ranges of the data used for ML modeling in this study
Fig. 3
figure 3

Well logs and geochemical diagrams of the Kockatea and Woodada formations, Perth Basin, Australia

Figure 4 shows the relationship between conventional well-logs and HI and OI. The gamma-ray (GR) log quantifies the radioactivity of a formation. Because OM contains uranium, thorium, and potassium, GR readings rise where it is present. The cross-plots show that, as GR increases, HI increases while OI decreases (Fig. 4a, e). Cross-plotting GR versus HI (Fig. 4a) yielded the highest coefficient of determination (R2 = 0.0481).

Fig. 4
figure 4

The relationship between laboratory-derived HI and OI and sonic log, bulk density log, gamma-ray log, and neutron log

The sonic (DT) log records the transit time of an elastic wave through a formation, from which its velocity can be derived. Porosity, lithology, and pore fluids such as water, oil, gas, and kerogen all affect the sonic log. Consequently, an increase in OM concentration can cause its values to increase (Kamali & Mirshady 2004).

The density (RHOB) log measures the bulk density of a formation, which is controlled by its fluids and matrix components. Because of the low density of OM (~ 1 g/cm3), source rocks typically have low bulk densities. Consequently, higher OI and lower HI are associated with greater density. Plotting OI versus RHOB yielded an R2 of 0.0184 (Fig. 4g).

The neutron log monitors the concentration of hydrogen atoms in a formation. In organic-rich intervals with high HI, neutron porosity values rise because a formation's hydrogen content and porosity correlate strongly with its OM content. For the hydrogen index, the relationship between the NPHI log and HI is the weakest (R2 = 0.0246, the lowest value) (Fig. 4d). The opposite holds for the oxygen index (OI), where the OI versus NPHI plot (Fig. 4h) shows the most direct relationship (R2 = 0.0645).

Machine learning methods

Machine learning (ML) techniques have been widely used in various scientific and engineering disciplines, including petroleum engineering, since the early 1990s. They are simple, flexible tools that allow accurate prediction and require little modeling work (Lary et al. 2016; Mohaghegh 2017; Al-Fatlawi 2018).

Despite the abundance of studies based on soft-computing techniques, kerogen type has not previously been predicted from conventional well-log data. This paper is the first attempt to do so, employing ten distinct ML techniques: group method of data handling (GMDH), decision tree (DT), random forest (RF), support vector machine (SVM), multilayer perceptron (MLP), radial basis function neural network (RBF), adaptive neuro-fuzzy inference system (ANFIS), extreme gradient boosting (XGBoost), light gradient boosting machine (LGBM), and gradient boosting (GB). Schematics of these methods are presented in Figs. 5 and 6.

Fig. 5
figure 5

Schematic of applied prediction methods

Fig. 6
figure 6

Schematic of applied classification methods

Group method of data handling (GMDH)

A feed-forward neural network called the group method of data handling (GMDH) can address complicated nonlinear problems (Ebtehaj et al. 2015). The algorithm consists of a group of self-organizing neurons: a quadratic polynomial equation connects distinct pairs of neurons in each layer to create a neuron in the subsequent layer (Nariman-Zadeh et al. 2005; Kalantary et al. 2009). The GMDH algorithm automatically determines the optimal model structure, the impact of the input variables, and the number of layers and neurons. Its primary goal is to minimize the squared discrepancy between the actual output and the predicted values (Ebtehaj et al. 2015). Detailed information on GMDH and its flowchart are presented in Table 6 and Fig. 7, respectively.

Fig. 7
figure 7

Schematic of GMDH workflow
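
As an illustration of the GMDH building block described above, a single quadratic polynomial neuron can be sketched as follows; this is a simplified example of one neuron fit, not the study's implementation:

```python
import numpy as np

def gmdh_neuron(x1, x2, y):
    """Fit the GMDH quadratic polynomial for one pair of inputs:
    y ~ a0 + a1*x1 + a2*x2 + a3*x1*x2 + a4*x1**2 + a5*x2**2
    by ordinary least squares."""
    x1, x2, y = map(np.asarray, (x1, x2, y))
    A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = np.sum((y - A @ coeffs) ** 2)  # squared-error selection criterion
    return coeffs, sse
```

In a full GMDH run, such neurons would be fitted for every pair of candidate inputs, the best-scoring ones retained on a validation split, and their outputs passed to the next layer.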

Support vector machine (SVM)

A reliable ML technique, the support vector machine (SVM) has proven helpful in many scientific fields, including industrial and medical engineering (Wong & Hsu 2006; Adewumi et al. 2016). Compared with many neural networks, the SVM is less prone to under-fitting and over-fitting (Cortes & Vapnik 1995; Vapnik 1999). By constructing hyperplanes among the data, SVM effectively performs both linear and nonlinear regression and classification. A kernel function maps the inputs into a higher-dimensional feature space; this mapping, which allows the classes to be separated distinctly, is necessary to place the hyperplane as far as possible from the data borders (Scholkopf et al. 2002). Figure 8 displays a simple schematic of the SVM. Tables 7 and 8 list the optimized SVM parameters for prediction and the optimal parameter values for the SVM classifier after hyperparameter optimization.

Fig. 8
figure 8

Schematic of SVM workflow
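
A hedged sketch of how such an SVM regressor might be set up with scikit-learn; the synthetic data, kernel choice, and hyperparameters below are placeholders rather than the tuned values reported in Tables 7 and 8:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(138, 4))   # placeholder for DT, GR, NPHI, RHOB features
y = rng.normal(size=138)        # placeholder for HI (or OI) targets

# Scaling matters for kernel SVMs; C and epsilon are illustrative only.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
svr.fit(X, y)
y_pred = svr.predict(X)
```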

Multilayer perceptron (MLP)

The MLP is among the most widely used ML models for simulating various geochemical characteristics of source rocks. An MLP comprises several layers: input, hidden, and output. The input layer is the initial layer, and its number of neurons equals the number of input variables. The output layer, the final layer in the MLP's structure, typically has only one neuron in regression problems (Mohaghegh 2000; Mohagheghian et al. 2015). The hidden layers lie between the input and output layers. To develop a solid link between the model's inputs and outputs, the number of hidden layers and their associated neurons must be adjusted within this framework. Each neuron's value in the hidden and output layers is obtained by multiplying the value of each neuron in the preceding layer by a specific weight, summing these products, adding a bias term, and passing the result through a nonlinear activation function (Mohaghegh 2000; Mohagheghian et al. 2015). Figure 9 displays the MLP flowchart. Additionally, the properties of the MLP used here, including its two hidden layers and the optimized parameters of the MLP classifier, are shown in Tables 9 and 10.

Fig. 9
figure 9

Schematic of MLP workflow
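
As a sketch, an MLP with two hidden layers of the kind described above could be configured as follows; the layer sizes, activation, and other settings are illustrative assumptions, not the values of Tables 9 and 10:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = rng.normal(size=(138, 4)), rng.normal(size=138)  # placeholder log data

# Two hidden layers, echoing the architecture described in the text.
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16, 8), activation="tanh",
                 max_iter=2000, random_state=0),
)
mlp.fit(X, y)
```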

Radial basis function neural network (RBF)

The RBF is a feed-forward neural network used for classification and regression problems (Broomhead & Lowe 1988). It consists of three layers: input, hidden, and output (Hemmati-Sarapardeh et al. 2018), with a single hidden layer connecting the input and output layers (Zhao et al. 2015). The hidden layer maps the input into a space of higher dimension than the input space. Each hidden node is associated with a predetermined center and radius, and its activation is calculated from the distance between the input vector and that center. A radial basis (kernel) transfer function then transforms the computed distances from the hidden neurons to the output neurons, and the hidden and output layers are connected linearly through specific weights. Table 11 and Fig. 10 show the properties and flowchart of the RBF network used.

Fig. 10
figure 10

Flowchart of the RBF procedure
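
A minimal numpy sketch of this structure, assuming Gaussian kernels, fixed precomputed centers (e.g., from k-means or a subset of training points), and a linear least-squares fit of the output weights; the study's own center-selection procedure is not specified here:

```python
import numpy as np

def rbf_fit(X, y, centers, sigma):
    """Fit the linear output weights of a Gaussian RBF network."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    Phi = np.exp(-(d ** 2) / (2 * sigma ** 2))    # hidden-layer activations
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # linear output layer
    return w

def rbf_predict(X, centers, sigma, w):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(d ** 2) / (2 * sigma ** 2)) @ w
```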

Adaptive neuro-fuzzy inference system (ANFIS)

Another rigorous data-driven approach approved for numerous modeling problems is the adaptive neuro-fuzzy inference system. The ANFIS approach was introduced by Jang et al. (1997) as a five-layer system that combines a fuzzy system with a neural network. ANFIS models are commonly trained by combining the back-propagation method with a hybrid learning algorithm (Afshar et al. 2014). Detailed information on ANFIS and its flowchart are presented in Table 12 and Fig. 11, respectively.

Fig. 11
figure 11

Schematic of ANFIS workflow
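
A minimal sketch of the five-layer forward pass for a two-input, first-order Sugeno system of the kind Jang et al. (1997) describe; the Gaussian membership functions and all parameter names are assumptions for illustration, not the study's configuration:

```python
import numpy as np

def anfis_forward(x, centers, sigmas, p, q, r):
    """Forward pass of a two-input, first-order Sugeno ANFIS.

    x               : input vector, shape (2,)
    centers, sigmas : Gaussian membership parameters, shape (n_rules, 2)
    p, q, r         : linear consequent parameters, shape (n_rules,)
    """
    mu = np.exp(-((x[None, :] - centers) ** 2) / (2 * sigmas ** 2))  # layer 1
    w = mu.prod(axis=1)          # layer 2: rule firing strengths
    w_bar = w / w.sum()          # layer 3: normalization
    f = p * x[0] + q * x[1] + r  # layer 4: rule consequents
    return np.sum(w_bar * f)     # layer 5: weighted aggregation
```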

Decision tree (DT)

The decision tree is one of the top-tier supervised machine learning algorithms, applicable to both classification and regression problems. Historically, the first DT was Automatic Interaction Detection (AID), introduced by Morgan and Sonquist in 1963 (Morgan & Sonquist 1963). The algorithm has a hierarchical structure built from fundamental components: the root, internal nodes, branches, and leaves. It is important to note that the resulting DT structure depends strongly on the data used: even a tiny change in the data can significantly change the optimal DT structure. The decision tree's flowchart is shown in Fig. 12.

Fig. 12
figure 12

Schematic of DT workflow
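
A brief scikit-learn sketch of a regression tree on placeholder data; the depth and leaf-size limits below are illustrative guards against the structural instability noted above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = rng.normal(size=(138, 4)), rng.normal(size=138)  # placeholder data

# Limiting depth and leaf size restrains variance from small data changes.
tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=5, random_state=0)
tree.fit(X, y)
```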

Random forest (RF)

The RF is a nonparametric prediction structure that can be used for classification and regression (Breiman 2001). Owing to its adaptability to multiple types of inputs (categorical or continuous) and its ability to describe complicated relationships in the data, it has grown prominent in many geospatial applications. Random forest addresses the overfitting problem of classification and regression trees (CART). It uses sampling with replacement to generate multiple CARTs and bootstrap aggregating, also known as bagging, to create subsets of the training data (Breiman 1996). For a given data point x, each CART in the forest makes a prediction or vote, and the forest returns the majority vote for classification or the average prediction for regression. This voting mechanism makes it feasible to identify complicated relationships in the data that might not otherwise be detected. RF performs strongly despite the noise and overfitting issues that would affect a single decision tree. Additionally, RF is naturally suited to multi-class problems and can efficiently handle very large datasets (Robnik-Šikonja 2004). The optimal parameter values obtained by the RF classifier's hyperparameter optimization method are displayed in Table 13.
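
A hedged scikit-learn sketch of the bagged-CART scheme described above; the hyperparameters are placeholders, not the tuned values of Table 13:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(138, 4))         # placeholder well-log features
y = rng.integers(0, 4, size=138)      # placeholder kerogen classes 0-3

# Bagged CARTs with majority voting; feature subsampling per split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0)
rf.fit(X, y)
```

Gradient boosting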

Gradient boosting is a technique for creating prediction models, frequently used in regression and classification procedures (Hastie et al. 2009; Piryonesi & El-Diraby 2020). It produces a prediction model from an ensemble of weak models, typically decision trees, with successive trees produced throughout the learning process. The first model estimates the target and determines the loss, i.e., the discrepancy between the predicted and actual values; a second model is then created to predict that loss, and the procedure continues until a satisfactory outcome is obtained (Guelman 2012). The optimal parameter values for the gradient boosting classifier, as determined by the hyperparameter optimization procedure, are displayed in Table 14.
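
A minimal scikit-learn sketch of this sequential scheme on placeholder data; the settings are illustrative rather than the tuned values of Table 14:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X, y = rng.normal(size=(138, 4)), rng.integers(0, 4, size=138)  # placeholders

# Each new tree is fitted to the residual loss of the current ensemble.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0)
gb.fit(X, y)
```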

Extreme gradient boosting (XGBoost)

Recent years have seen the XGBoost algorithm rise to prominence as a data classification and prediction technique. It has demonstrated strong performance in many ML tasks and innovative scientific endeavors (Chen & Guestrin 2016). With shrinkage and regularization approaches, XGBoost, based on the gradient tree-boosting algorithm (Hastie et al. 2009), can handle sparsity and prevent overfitting (Chen & Guestrin 2016).

A boosting algorithm's main goal is to combine weak learners' outputs sequentially to improve performance (Hastie et al. 2009). XGBoost uses gradient boosting to merge several classification and regression trees (CART). Its three essential components are a regularized objective function for enhanced generalization, shrinkage and gradient tree boosting for additive training, and column subsampling to prevent overfitting (Chen & Guestrin 2016). Some of the crucial XGBoost classifier parameters are listed in Table 15.
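
A sketch using the xgboost package; the shrinkage (learning_rate), regularization (reg_lambda), and subsampling settings below illustrate the components mentioned above and are not the study's tuned values:

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X, y = rng.normal(size=(138, 4)), rng.integers(0, 4, size=138)  # placeholders

xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=4,
                    reg_lambda=1.0, subsample=0.8, colsample_bytree=0.8)
xgb.fit(X, y)
```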

Light gradient boosting machine (LGBM)

The light gradient-boosting machine (LGBM) is a gradient-boosting decision tree technique used in data mining tasks such as classification, regression, and ranking. It combines the predictions of multiple decision trees to produce a final prediction that generalizes well. Integrating several "weak" learners into one "strong" learner is the fundamental tenet of LGBM. Two factors motivate building ML algorithms on this concept: first, "weak" learners are simple to train; second, an ensemble of many learners typically generalizes better than a single learner (Ke et al. 2017). The optimal parameter values for the LGBM classifier, as determined by the hyperparameter optimization method, are displayed in Table 16.
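
A brief sketch with the lightgbm package on placeholder data; the hyperparameters are illustrative, not the tuned values of Table 16:

```python
import numpy as np
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X, y = rng.normal(size=(138, 4)), rng.integers(0, 4, size=138)  # placeholders

lgbm = LGBMClassifier(n_estimators=200, learning_rate=0.05, num_leaves=31,
                      random_state=0)
lgbm.fit(X, y)
```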

Results

OI and HI estimation

This research implemented a comprehensive methodology to predict the geochemical properties of source rocks, explicitly focusing on the Oxygen Index (OI) and Hydrogen Index (HI). The employed models for this process were the Group Method of Data Handling (GMDH), Support Vector Machine (SVM), Decision Trees (DT), Radial Basis Function (RBF), Multilayer Perceptron (MLP), and Adaptive Neuro-Fuzzy Inference System (ANFIS).

The available data were strategically divided into two subsets: 70% was used to train the models, while the remaining 30% served as the test set for model evaluation. This division suits the data volume and has proven effective in prior research.
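
In scikit-learn terms, the split can be sketched as follows; the placeholder arrays and the random seed are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(138, 4))   # placeholder for DT, GR, NPHI, RHOB features
y = rng.normal(size=138)        # placeholder for HI (or OI) targets

# 70/30 split as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=42)
```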

A rigorous assessment of the six models was conducted using a range of statistical parameters to gauge their efficiency and accuracy. Metrics such as the coefficient of determination (R2), average percent relative error (APRE), root mean square error (RMSE), average absolute percent relative error (AAPRE), and standard deviation (SD) were employed. The complete results of these measures are detailed in Tables 2 and 3.

Table 2 Statistical error analysis for the developed models for OI estimation
Table 3 Statistical error analysis for the developed models for HI estimation

Figures 13 and 14 visually encapsulate the comparative assessment of the models using these statistical parameters, presenting APRE (subfigure a), AAPRE (subfigure b), RMSE (subfigure c), SD (subfigure d), and R2 (subfigure e) for each model. These illustrations emphasize the prediction of HI and OI values, offering a clear depiction of the performance of each model.

Fig. 13
figure 13

Comparison of developed models on the basis of statistical parameters of APRE a, AAPRE b, RMSE c, SD d, and R2 e for OI prediction

Fig. 14
figure 14

Comparison of developed models on the basis of statistical parameters of APRE a, AAPRE b, RMSE c, SD d, and R2 e for HI prediction

Moreover, the analysis was expanded by investigating the percentage error of each model's predictions across three unique datasets, each consisting of 46 samples. This percentage error offers a robust measure of the models' performance and supports the data in Figs. 13 and 14.

Figures 15 and 16 provide further insights through cross-plots, visually comparing the predicted and actual values of HI and OI. These figures encompass all six models, covering the training, testing, and total datasets. In tandem, Fig. 17a and b graphically illustrate the percentage error in estimating HI and OI by each model across the three datasets.

Fig. 15
figure 15

Cross plot of prediction of OI by the models versus experimental

Fig. 16
figure 16

Cross plot of prediction of HI by the models versus experimental

Fig. 17
figure 17

Illustration of the percentage error across three unique datasets, each comprising 46 samples, for six predictive models. Subfigure a depicts the percentage error for the estimation of Hydrogen Index (HI), while subfigure b shows the same for the Oxygen Index (OI)

These analyses collectively demonstrated the superior performance of the SVM model. It consistently outperformed the others, exhibiting the closest fit to the experimental values, the lowest dispersion, and the smallest deviation. The superiority of SVM is further reflected in its R2 values of 0.993 and 0.989 for the testing set, 0.983 and 0.986 for the training set, and 0.987 for the total dataset, marking it as the best model for estimating OI and HI (Tables 2 and 3). Additionally, SVM achieved the lowest AAPRE, RMSE, and SD (Tables 2 and 3). Equations (1)–(5), used to compute these statistical metrics, are given below.

Finally, Fig. 18 provides a schematic depiction of the actual and predicted kerogen types across the total (a), training (b), and testing (c) datasets using the pseudo van Krevelen diagram. This figure further enriches the analysis by comparing the actual and estimated values, adding another layer of comprehension to the models' predictive capabilities and their application in real-world geochemical analysis.

$$R^{2} = 1 - \frac{\sum\limits_{i = 1}^{N} \left( x_{i,\exp } - x_{i,pred} \right)^{2} }{\sum\limits_{i = 1}^{N} \left( x_{i,\exp } - \overline{x_{\exp }} \right)^{2} }$$
(1)
$$APRE = \frac{100}{N}\sum\limits_{i = 1}^{N} {\left( {\frac{{x_{i,\exp } - x_{i,pred} }}{{x_{i,\exp } }}} \right)}$$
(2)
$$AAPRE = \frac{100}{N}\sum\limits_{i = 1}^{N} {\left| {\frac{{x_{i,\exp } - x_{i,pred} }}{{x_{i,\exp } }}} \right|}$$
(3)
$$RMSE = \sqrt {\frac{1}{N}\sum\limits_{i = 1}^{N} {\left( {x_{i,\exp } - x_{i,pred} } \right)^{2} } }$$
(4)
$$SD = \sqrt {\frac{1}{N - 1}\sum\limits_{i = 1}^{N} {\left( {\frac{{x_{i,\exp } - x_{i,pred} }}{{x_{i,\exp } }}} \right)^{2} } }$$
(5)
Fig. 18
figure 18

Cross plot depicting the real and estimated kerogen type for total a, training b, and testing c data using the pseudo van Krevelen diagram

In these formulas, xi,exp and xi,pred represent the experimental and model-predicted values of OI and HI, respectively, and N is the number of data points.
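
For reference, Eqs. (1)–(5) translate directly into code; a minimal numpy sketch:

```python
import numpy as np

def regression_metrics(x_exp, x_pred):
    """Compute R2, APRE, AAPRE, RMSE, and SD per Eqs. (1)-(5)."""
    x_exp = np.asarray(x_exp, dtype=float)
    x_pred = np.asarray(x_pred, dtype=float)
    n = x_exp.size
    rel = (x_exp - x_pred) / x_exp                 # relative errors
    r2 = 1 - (np.sum((x_exp - x_pred) ** 2)
              / np.sum((x_exp - x_exp.mean()) ** 2))
    apre = 100.0 / n * np.sum(rel)
    aapre = 100.0 / n * np.sum(np.abs(rel))
    rmse = np.sqrt(np.mean((x_exp - x_pred) ** 2))
    sd = np.sqrt(np.sum(rel ** 2) / (n - 1))       # relative-error SD, Eq. (5)
    return r2, apre, aapre, rmse, sd
```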

Kerogen type estimation

During the subsequent stage of the research, the prediction of kerogen types was addressed using a spectrum of machine learning classifiers. The classification was informed by the pseudo van Krevelen diagram, which segregates the kerogen into four distinct categories: type II, type III, mixed II & III, and type IV. This categorization is graphically exhibited in Fig. 19 and numerically reported in Table 4.

Fig. 19
figure 19

Kerogen types based on pseudo van Krevelen diagram

Table 4 Number of samples of each type of kerogen

A variety of machine learning algorithms, including the Light Gradient Boosting Classifier (LGBM), Extreme Gradient Boosting Classifier (XGBoost), Random Forest Classifier (RF), Multilayer Perceptron Classifier (MLP), Support Vector Classifier (SVM), and Gradient Boosting Classifier, were employed for the classification of these kerogen types. The parameters of these classifiers were optimized in a comprehensive process, the details of which are presented in Tables 8, 10, and 13 through 16.

The classifiers' performance was evaluated upon completing the classification stage, leveraging a testing dataset. The selected evaluation metrics, precision, recall, accuracy, Area Under Curve (AUC), and F1 score, facilitated a detailed appraisal of the efficacy of each classifier.

A comparative assessment of performance metrics for all models is demonstrated in Table 5 and Fig. 20, providing an objective evaluation of the efficiency of each machine learning classifier. The average precision scores for all classification models are graphically presented in Fig. 21.

Table 5 Comparative analysis using various performance metrics
Fig. 20
figure 20

Performance analysis of various classifiers

Fig. 21
figure 21

Average precision scores for all classification models

To address the occasional misleading outcomes of the ROC curve, especially in instances of class imbalance, the precision-recall curve was employed alongside the ROC curve. This practice offered a more comprehensive perspective on the performance of each classifier, consistent with the findings of Davis (2006). Figure 22a, b, c, d, e, f illustrate the ROC curves for each kerogen type across all models, emphasizing the superior performance of the Gradient Boosting classifier.
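
A hedged sketch of producing both curve families per kerogen class in a one-vs-rest fashion; y_test and y_score below are placeholders standing in for the true labels and the per-class probabilities (e.g., from predict_proba):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
y_test = rng.integers(0, 4, size=46)           # placeholder true classes
y_score = rng.dirichlet(np.ones(4), size=46)   # placeholder class probabilities

# One-vs-rest curves per kerogen class (0-3); pairing ROC with
# precision-recall guards against class-imbalance artifacts.
y_bin = label_binarize(y_test, classes=[0, 1, 2, 3])
for k in range(4):
    fpr, tpr, _ = roc_curve(y_bin[:, k], y_score[:, k])
    prec, rec, _ = precision_recall_curve(y_bin[:, k], y_score[:, k])
```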

Fig. 22
figure 22

ROC–AUC analysis for a Random Forest, b SVM, c MLP, d Gradient Boosting, e LGBM, and f XGBoost (class 0 = kerogen type II, class 1 = kerogen type II & III mixed, class 2 = kerogen type III, and class 3 = kerogen type IV)

The performance of the classifiers was then consolidated and evaluated using a confusion matrix, from which the core metrics (precision, accuracy, recall, and F1-score) were calculated. Furthermore, each classifier's capability to distinguish between different classes was gauged through the Area Under Curve (AUC) metric, as per Fawcett (2006). These performance metrics were calculated using Eqs. (6) through (9).

Notably, the Gradient Boosting classifier excelled, achieving the highest scores across all metrics. It registered an impressive accuracy of 93.54% and scores of 0.94, 0.93, 0.93, and 0.95 for precision, recall, F1-score, and AUC, respectively. Moreover, it demonstrated the least misclassifications, as illustrated by the confusion matrix in Fig. 23. As a result of this comprehensive evaluation, the Gradient Boosting classifier was confirmed as a highly effective method for identifying kerogen types.

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
(6)
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
(7)
$$F1\text{-}Score = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
(8)
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
(9)

where TP = True Positive, TN = True Negative, FP = False Positive, and FN = False Negative.
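
These four metrics follow directly from the confusion-matrix counts; a minimal sketch:

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1-score, and accuracy per Eqs. (6)-(9)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy
```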

Fig. 23
figure 23

Confusion matrix of Gradient Boosting classifier

Discussion

OI and HI estimation

The study's results underscore the effectiveness of the SVM model for predicting the Oxygen Index (OI) and Hydrogen Index (HI) in the source rock. Based on the calculated statistical parameters, SVM consistently outperformed the other five models—GMDH, DT, RBF, MLP, and ANFIS—regarding accuracy and reliability.

While comparing the performance of the models, a suite of statistical metrics was employed, namely R2, APRE, RMSE, SD, and AAPRE. These parameters provided a holistic evaluation of the models' performance, encompassing measures of fit efficiency, relative deviation, dispersion, and precision. The SVM model registered the highest R2 value, which signifies the strongest fit between the model predictions and experimental data. Lower values of AAPRE, RMSE, and SD in SVM confirm that it has a lesser degree of dispersion and deviation, thus underscoring its superior precision.

The models' performance was further evaluated through a visual examination using cross plots. These graphical analyses reinforced the statistical findings, revealing the SVM model's high accuracy in estimating OI and HI. The SVM model demonstrated a higher concentration of data points near the Y = X line, implying a closer match between experimental and estimated values.

These findings not only point towards the strength of SVM in estimating OI and HI but also draw attention to the versatility and applicability of different models in geosciences. While SVM showed a strong performance, the other models also delivered reasonably accurate results, indicating their potential utility in other tasks or data contexts.

Kerogen type estimation

In the kerogen type estimation, six machine learning algorithms were evaluated, and among these, the Gradient Boosting model emerged as the top performer. While the SVM model excelled in OI and HI prediction, it was less effective in kerogen-type classification, emphasizing that model selection should be guided by the specific task.

Performance metrics such as accuracy, precision, recall, AUC, and F1 score were used to assess the classifiers' performance. The Gradient Boosting model achieved the highest values across all these metrics, thereby establishing its superiority in classifying kerogen types. In addition to the numerical scores, visual assessments through precision-recall and ROC curves further supported the dominance of Gradient Boosting in this task.

While Gradient Boosting was the standout performer, the results also highlight the strong performance of other algorithms, including Random Forest, LightGBM, and XGBoost. This points to the potential of these models as alternatives for kerogen-type classification, given the appropriate tuning and optimization of hyperparameters.

These findings illustrate that model selection and performance largely depend on the specific task and data characteristics. This underscores the need for careful model selection and fine-tuning to cater to the data's specificities and the analysis's objectives. Future studies could extend these findings by exploring the application of these models to more extensive or different datasets or investigating other promising machine learning models and techniques.

Conclusions

This research provides compelling evidence of the transformative impact of machine learning techniques within geosciences, specifically in predicting organic richness indicators and kerogen types. Through creating and validating bespoke machine learning models across various algorithms, this study has demonstrated their formidable capability in accurately predicting kerogen type, hydrogen index, and oxygen index from well-log data—a significant accomplishment given the often constrained availability of geochemical data. Essential conclusions from this study are summarized as follows:

  1. The Support Vector Machine model has distinguished itself as an exceptional performer in accurately predicting source rocks' Oxygen Index and Hydrogen Index. Its superior performance amplifies the significance of machine learning applications in enhancing prediction accuracy within geosciences.

  2. The models' resilience and reliability are manifest in their robustness against overfitting, a prevalent concern in machine learning implementations. This underscores the meticulous design and optimization processes invested in developing these models, reinforcing their value for real-world applications.

  3. Among the classification models, the Gradient Boosting and Random Forest models have proven to be particularly efficient in kerogen-type classification, further affirming the transformative role machine learning can play in refining classification tasks in geosciences.

  4. The proposed machine learning approach illuminates a path toward greater economic efficiency within geosciences. Optimally harnessing readily available well-log data can considerably reduce the demand for costly geochemical laboratory analyses, thus fostering more cost-effective operational practices.

  5. The versatility of this research is evident in the continued effectiveness of the proposed machine learning models, even in scenarios where data is sparse or completely absent. This underscores the models' resilience, positioning them as robust solutions to prevalent data-related challenges in the field.

  6. The innovative application of the pseudo-van Krevelen diagram approach for kerogen-type classification adds to the uniqueness of this study. This novel approach promotes a more precise categorization of kerogen types, thereby enhancing the effectiveness of these analyses.

In summary, this research contributes a new perspective to the existing literature by introducing a machine learning-centric approach for predicting organic richness indicators and kerogen type using well-log data. As a significant step towards improved efficiency and cost-effectiveness in geosciences, this study fuels promising opportunities for future exploration and research. The conclusions drawn here underscore the profound potential of machine learning within geosciences, inspiring further innovative research and application and marking a significant contribution to the continuous evolution of this crucial scientific discipline.