Introduction

As hydrocarbon reservoirs are heterogeneous at both macroscopic and microscopic scales, an accurate description requires a thorough study of reservoir heterogeneity. One of the most widely used techniques for characterizing a reservoir and its heterogeneity is the determination of flow units. Determining reservoir rock types and the number of flow units is very important in drilling, production, and reservoir studies. Hydraulic flow units are portions of a reservoir with unique characteristics that play a significant role in fluid flow through the reservoir, and they may interact with other flow units. Together with the relationship between porosity and permeability, this feature determines reservoir quality and rock type. A more accurate estimation of porosity and permeability requires a correct classification of flow units (Ebanks Jr 1987; Amaefule et al. 1993; Guo et al. 2005; Tiab and Donaldson 2015; Elnaggar 2018; Sharifi-Yazdi et al. 2020).

Flow units are mainly used to describe hydrocarbon reservoirs, and their determination is necessary for accurate petrophysical modeling of a reservoir (Hosseini Bidgoli et al. 2014). Rock typing classifies a reservoir into separate units that were deposited under similar geological conditions and underwent similar diagenetic changes. To obtain a more accurate estimate of the relationship between permeability and porosity, and thus more realistic simulation results, the flow units must be determined correctly (Guo et al. 2005; Zargari et al. 2013).

In order to determine the hydraulic flow units (HFUs) of a reservoir, Amaefule et al. introduced a technique based on the Kozeny-Carman equation and the mean hydraulic radius. This technique produces a straight line of unit slope for each hydraulic unit on a log-log plot of the Reservoir Quality Index (RQI) versus the pore-to-matrix ratio (PMR, also known as the Normalized Porosity Index, NPI). The intercept of each line with PMR = 1 is called the Flow Zone Indicator (FZI) and identifies a unique HFU. The correlation between permeability and FZI can then be calculated by regression models (Amaefule et al. 1993). Using RQI and FZI, Gunter et al. showed that rock typing is extremely beneficial in modeling permeability and initial water saturation, which are used in geological modeling and reservoir simulation. They introduced practical graphical methods and used them to identify rock types and analyze flow units in carbonate and sandstone reservoirs (Gunter et al. 1997). However, although that method led to acceptable results, determining rock types in sandstone reservoirs still requires a more comprehensive approach.

Abnavi et al. identified the number of flow units of a gas reservoir in southern Iran using histogram analysis and normal probability plots; their results show that the normal probability plot is the more reliable method for detecting HFUs (Abnavi et al. 2018). Shalaby, using core data (porosity and permeability) from the Qasr field, applied various methods to analyze and describe the sandstones of the Khatatba Formation. In that study, the number of flow units was defined using the RQI, FZI, and NPI, and the Winland R35 equation was used to describe pore geometry and pore-throat diameter, which eventually led to the classification of the sandstones into three flow-unit classes and three different rock types (Shalaby 2021). El-Sayed et al. used the concepts of hydraulic flow units and electrical flow units to describe and evaluate the sandstone reservoir of the Nubia Formation at Gebel Abu Hassle; they also determined the number of flow units and the rock types using the RQI and Winland R35 methods (El-Sayed et al. 2021). Nayak et al. collected porosity and permeability data from 32 core samples of a calcareous field from four different Mumbai regions. The porosity of these samples ranged from 0.3% to 20.5%, the permeability from 0.002 to 1.484 millidarcy, and the depth from 1618.86 to 1634.14 m. Using these data, the FZI was calculated and the HFUs were determined, with the least squares regression (LSR) method used to identify the flow units (Nayak et al. 2021).

Meanwhile, many researchers have also applied machine learning in their studies. In 2020, Khadem et al. developed a system for detecting the rock types and HFUs of detrital reservoirs with uniform pores, implemented on an oil field in the Persian Gulf. First, rock physics models classify the field rocks into three types with different characteristics. Then, using core data, the number of flow units was calculated, and the results were extended throughout the reservoir using simultaneous inversion and the rock physics models (Khadem et al. 2020). Sengel et al. developed a dynamic model to predict the future performance of the Germik reservoir in southeastern Turkey. First, hydraulic flow units, rather than a reservoir facies model, were determined in cored wells; artificial neural networks (ANN) were then used to estimate the flow units in the other wells and throughout the model. The results were used to build a reservoir permeability model and show that the simulated model can be safely used for enhanced oil recovery (EOR) screening (Sengel and Turkarslan 2020). In 2021, Abnavi determined the number of flow units in a hydrocarbon field in southern Iran from core data using the FZI method. Then, using an adaptive neuro-fuzzy inference system (ANFIS), the permeability of the studied well was estimated with an error of 1.83%; the same algorithm estimates the permeability of non-cored wells with a 21.5% error (Abnavi 2021). A summary of some of the reviewed articles is given in Table 1.

Table 1 Summary of some of related articles studied in HFU determination

The literature and previous studies in this field establish the importance of studying flow units; such studies are particularly important for enhanced oil recovery. For further reading on this topic, see Wu et al. (2016) and Wu et al. (2018). In most studies, conventional methods have been used to determine rock types, and new machine learning (ML) methods in this field still need more research. Conventional methods for classifying flow units require direct core-test data such as porosity and permeability, whereas coring operations, due to their high cost and time-consuming nature, are carried out in only a limited number of field wells, so cores from different parts of a reservoir are not accessible across the oil field. Consequently, conventional flow-unit classification methods work with fewer input features and may not provide accurate and acceptable results. As an alternative, machine learning methods can be used for this purpose, since they use well log data in addition to core data (petrophysical log data are available in most wells and contain information along the well column).

In this study, HFUs have been classified using well log data and machine learning methods. While most previous studies have used core data for this purpose, core data do not exist for the entire length of a well, and acquiring them is very costly and time-consuming; replacing core data with petrophysical log data is therefore a suitable approach for classifying HFUs. To this end, well log and core data were collected, and the number of HFUs was determined using the conventional Winland R35, FZI, DRT, and k-means methods. Machine learning methods, including ANN, support vector machines (SVM), LogitBoost, logistic regression, and random forests (RF), were then studied and applied to classify HFUs, taking the classification obtained from the FZI method as the reference HFU classification of the reservoir. Finally, the performance of these methods was compared. Given the performance of machine learning in classifying flow units, the approach can be extended to the entire length of the well, so that flow units can be predicted at depths that lack core data. Among the innovations of this research are the use of several applied machine learning methods for HFU classification and the comparison of their performance in the Kazhdumi Formation, which has sandy shale facies (found only in some southwestern fields of Iran).

Case study and available data

The studied field is located in the coastal part of the Persian Gulf sedimentary basin, in an area called Khark-Romeyle. The Persian Gulf is an epicontinental, marginal sedimentary basin comprising multiple sedimentary environments (Siebold 1969). It is part of the Arabian plate, at the junction of the Arabian and Eurasian lithospheric plates. Its formation in the current configuration dates to the Late Miocene and the rise of the Zagros Mountains. The tectonics of this basin is similar to that of the foreland basin at the edge of the Zagros Mountains. The deepest part of the Persian Gulf basin, from the Middle Jurassic to the Lower Cretaceous, is located in its northwestern corner (Rabbani 2013). The Persian Gulf basin can be regarded as one of the richest hydrocarbon basins in the world, since more than 50% of the world's gas and oil reserves are located there (Rabbani 2007). The studied field is an anticlinal trap with an almost N-S trend, located in the NW of the Persian Gulf (Fig. 1). The reservoir formation of this field is the Kazhdumi, deposited during the Early to Middle Albian. Although this formation, with its shale lithology, is generally known as a source rock in the Zagros sedimentary basin, its middle parts in the NW of the Persian Gulf include sandy sequences that can act as high-potential reservoir rock (Motiei 1995). The Kazhdumi reservoir in the Khark-Romeyle area was deposited in a shallow marine and deltaic environment; its lithology consists of fine to coarse sandstone, and the reservoir is highly faulted (Nairn and Alsharhan 1997).

Fig. 1 Location of the southwestern Iran oil and gas fields (Zargar et al. 2020)

The stratigraphic column of the NW Persian Gulf basin is shown in Fig. 2. In the studied area, from surface to depth, the formations are Bakhtiari, Aghajari, Gachsaran, Asmari, Jahrom, Tarbour and Gurpi, Sarvak, Kazhdumi, Darian, Gadvan, Fahlian, and Hith Anhydrite, respectively.

Fig. 2 Stratigraphic column of the studied area (Mohebian et al. 2013)

In this study, in order to classify rock types, 212 core sample data points (porosity and permeability) and the petrophysical well logs of an oil field in southwest Iran have been used. The available logs were RT, DT, HCAL, NPHI, RHOZ, and PEFZ. The range of each input parameter, such as porosity, permeability, and depth, is reported in Table 2.

Table 2 Value range of available data

Methodology

Conventional methods

FZI method

As mentioned, reservoir rocks can be divided into several flow units, from a geological or engineering point of view, to describe how they behave during different production operations (Gomes et al. 2008). Since the FZI depends on the geological properties and geometry of the different rock types and is also a function of reservoir quality and the porosity ratio, it is a desirable parameter for determining hydraulic flow units (HFUs) (Abed 2014). Each HFU has a specific FZI value, which is determined from porosity and permeability data and calculated from the RQI and the normalized porosity (Yi et al. 2021):

$$RQI = 0.0314\sqrt {\frac{K}{{\varphi_{e} }}}$$
(1)

where \(K\) and \({\varphi }_{e}\) are the permeability and effective porosity of the rock, respectively. The normalized porosity, which is used in the FZI calculations, is obtained from Eq. 2 (Amaefule et al. 1993):

$$\varphi_{z} = \frac{{\varphi_{e} }}{{1 - \varphi_{e} }}$$
(2)

Finally, the FZI is obtained by Eq. 3:

$${\text{FZI}} = \frac{{{\text{RQI}}}}{{\varphi_{Z} }}$$
(3)

Taking the logarithm of both sides of Eq. 3 and rearranging yields Eq. 4:

$$\log \left( {{\text{RQI}}} \right) = \log \left( {\varphi_{Z} } \right) + \log \left( {{\text{FZI}}} \right)$$
(4)

In the \(\mathrm{log}\left(\mathrm{RQI}\right)\) vs. \(\mathrm{log}\left({\varphi }_{\mathrm{Z}}\right)\) diagram, all samples with similar FZI values fall on a straight line of unit slope, and points on the same line have similar pore properties (Fig. 6). The constant FZI value can be obtained from the intercept of the unit-slope line with \({\varphi }_{Z}=1\) (Amaefule et al. 1993). To identify all the distributions present in the original data, a histogram of \(\mathrm{log}(FZI)\) must be created. Since the overall FZI distribution is a superposition of log-normal distributions, the \(\mathrm{log}(FZI)\) histogram represents n normal distributions for n flow units. In situations where the clusters are distinctly separated, the histogram can clearly identify each HFU (Al-Ajmi and Holditch 2000; Abed 2014).

Fig. 3 Support vector machine hyper-plane

The normal probability plot is used to evaluate the agreement of a data set with a standard bell-shaped curve. To construct it, the \(\mathrm{log}(FZI)\) data are sorted, after which percentiles at uniform distances from the normal distribution can be determined (Nouri-Taleghani et al. 2015). Since mean FZI values cannot be read from the probability plot, the representative FZI value of each HFU is obtained by averaging all FZI values in the corresponding HFU range. It should be noted that overlap effects may shift or deform the straight lines in the probability plot (Al-Ajmi and Holditch 2000).
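As a minimal sketch of the FZI workflow above (Eqs. 1-3), the following Python snippet computes RQI, normalized porosity, and FZI from core porosity-permeability pairs; the sample values are hypothetical, not the field data of this study.

```python
import numpy as np

def compute_fzi(k_md, phi_e):
    """RQI, normalized porosity, and FZI from Eqs. 1-3.

    k_md  : permeability in millidarcy
    phi_e : effective porosity (fraction)
    """
    rqi = 0.0314 * np.sqrt(k_md / phi_e)  # Eq. 1
    phi_z = phi_e / (1.0 - phi_e)         # Eq. 2
    return rqi, phi_z, rqi / phi_z        # Eq. 3

# Hypothetical core samples
k = np.array([0.5, 3.2, 12.0, 44.0])      # mD
phi = np.array([0.08, 0.14, 0.21, 0.30])  # fraction

rqi, phi_z, fzi = compute_fzi(k, phi)
print(np.log10(fzi))  # a histogram of log(FZI) then reveals the HFU clusters
```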

Winland R35 method

Winland defined his equation using 300 samples from the Spindle field. In 1972, by examining different mercury saturations, he showed that a mercury saturation of 35% is the best value for calculating the pore-throat radius that marks the most effective path for fluid flow (Winland 1972). The method is based on capillary pressure curves (Soleymanzadeh et al. 2019).

Winland fitted the most appropriate curve at 35% mercury saturation by regression analysis to establish an equation between porosity, permeability, and pore-throat size, which led to Eq. 5 (Winland 1972; Kolodzie 1980):

$$\log \left( {R35} \right) = 0.735 + 0.588\log \left( k \right) - 0.864\log \left( \varphi \right)$$
(5)

Using this equation, the data can be categorized and the reservoir quality determined based on the size of the pore throats (Spearing et al. 2001).
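Equation 5 is straightforward to apply in code. The sketch below computes R35 for hypothetical porosity-permeability pairs; the cutoffs used to bin R35 values into rock types are field-specific and are not reproduced here.

```python
import numpy as np

def winland_r35(k_md, phi_pct):
    """Pore-throat radius (microns) at 35% mercury saturation, Eq. 5.

    k_md    : permeability in millidarcy
    phi_pct : porosity in percent
    """
    return 10 ** (0.735 + 0.588 * np.log10(k_md) - 0.864 * np.log10(phi_pct))

print(winland_r35(np.array([0.5, 12.0, 44.0]), np.array([8.0, 21.0, 30.0])))
```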

DRT method

In the discrete rock typing (DRT) method, the continuous FZI values are converted to discrete ones. Following this discretization, the core data are classified into separate categories using Eq. 6, which Chekani and Kharrat applied to carbonate reservoirs in 2012 (Chekani and Kharrat 2012):

$${\text{DRT}} = {\text{Round}}\left( {2\log \left( {{\text{FZI}}} \right) + 10.7} \right)$$
(6)

It should be noted that this equation is also used to predict permeability in static reservoir modeling: FZI values are determined in the reservoir grid blocks, and the DRT values obtained from Eq. 6 are propagated through the model. Based on the porosity-permeability relation within each DRT set, a permeability value is assigned to each reservoir grid block (Chekani and Kharrat 2012).
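A one-line implementation of Eq. 6, assuming base-10 logarithms to match the notation of the document's other equations (some DRT implementations use the natural logarithm instead):

```python
import numpy as np

def drt(fzi):
    """Discrete rock type from FZI, Eq. 6 (base-10 log assumed)."""
    return np.round(2.0 * np.log10(fzi) + 10.7).astype(int)

print(drt(np.array([0.4, 1.5, 4.8])))  # hypothetical FZI values
```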

K-means method

K-means is an unsupervised algorithm that can easily divide a data set into several separate subsets (MacQueen 1967). It can be regarded as a complement to other clustering methods, and it can reduce the size of a data set by building on previous classifications, although large data sets can also be clustered directly (Zahmatkesh et al. 2021). The k-means method is known for its relatively simple implementation and acceptable results; however, the direct algorithm requires, per iteration, time proportional to the product of the number of vectors and the number of clusters, which is significant for large data sets (Sidqi and Kakbra 2014).

K-means can be viewed as an optimization problem whose aim is to minimize the clustering error. The objective function of the k-means algorithm is the squared-error function (MacQueen 1967):

$$J = \mathop \sum \limits_{j = 1}^{k} \mathop \sum \limits_{i = 1}^{n} \left\| {X_{i}^{\left( j \right)} - C_{j} } \right\|^{2}$$
(7)

where \(J\), \(k\), \(n\), \(X\), and \(C\) are the objective function, the number of clusters, the number of data points, the data points, and the cluster centers, respectively. In this method, each data subset is identified by a center, and data points are assigned to clusters based on their similarity (Euclidean distance from the cluster centroid) (McCreery and Al-Mudhafar 2017).
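A minimal sketch of this clustering step using scikit-learn's KMeans on hypothetical data; the inertia reported by the fitted model is the objective J of Eq. 7, and tracking it against the cluster count gives the elbow analysis used later in this study.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((200, 2))  # hypothetical scaled (porosity, log-permeability) pairs

# Sum of squared errors (Eq. 7) for increasing cluster counts
sse = {}
for n in range(1, 9):
    km = KMeans(n_clusters=n, n_init=10, random_state=0).fit(X)
    sse[n] = km.inertia_

print(sse)  # the "elbow" where SSE stops dropping suggests the HFU count
```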

Machine learning methods

The use of neural networks in various branches of engineering is increasing, so knowing how they work and how to use them is essential for petroleum engineers. In the following, the structure and operation of some of the most important machine learning methods are introduced, and some of their applications in petroleum engineering are mentioned.

Several studies have presented the use of artificial intelligence in petroleum engineering (Dougherty 1972; Braswell 2013; Kuang et al. 2021). In the upstream oil industry, the use of machine learning and optimization methods can be divided into the following three categories:

a) Exploration

   i. Determination of petrophysical parameters (Kiran and Salehi 2020; Mohammadian et al. 2022)
   ii. Geophysical processing and interpretation (Wang et al. 2018)
   iii. Determination of geomechanical parameters (Ebrahimi et al. 2022; Syed et al. 2022)
   iv. Determination and interpretation of well survey charts and wireline logs (Bestagini et al. 2017; Akinnikawe et al. 2018)
   v. Constructing static and geological reservoir models (Bai and Tahmasebi 2020; Otchere et al. 2021)
   vi. etc.

b) Development and knowledge of reservoir and field

   i. Upscaling (Menke et al. 2021; Wang et al. 2022)
   ii. Preparation of (Sircar et al. 2021), and well operation (Junior et al. 2022)
   iii. Improving drilling operations (Bello et al. 2015; Noshi and Schubert 2018)
   iv. Improving reservoir simulation (Wang et al. 2020; Samnioti et al. 2022)
   v. History matching (Jo et al. 2021; Srinivasan et al. 2021)
   vi. Characterizing reservoir fluid (Ramirez et al. 2017; Onwuchekwa 2018)
   vii. Enhanced oil recovery screening (Cheraghi et al. 2021; Pirizadeh et al. 2021)
   viii. etc.

c) Field production

   i. Improvement of extraction operations (Teixeira and Secchi 2019; Pandey et al. 2021)
   ii. Maintenance of downhole pumps (Bangert and Sharaf 2019; Bangert 2021)
   iii. Improvement of artificial lift systems (Syed et al. 2022)
   iv. Improvement of injection operations (Artun 2020; He et al. 2021)
   v. Improvement of well stimulation operations (Wang and Chen 2019; Liu et al. 2022)
   vi. Improvement of hydraulic fracturing operations (Morozov et al. 2020)
   vii. Arranging the pipelines (Soomro et al. 2022)
   viii. Smart well completion operations and well patterns (Castiñeira et al. 2018)
   ix. etc.

Data-driven methods give engineers a path to assess well and field performance in a very short time. Machine learning models such as artificial neural networks are certainly not a substitute for conventional methods such as numerical simulation; rather, a hybrid approach that combines machine learning modeling with them can provide more reliable results.

SVM method

The SVM method, which originates from statistical learning theory, is a supervised learning method used for both classification and regression (Noble 2006). Its goal is to find the hyper-plane that has the greatest distance from the data of the two classes (Reynolds 2001), and this goal is achieved by training the SVM algorithm on a set of data (Vapnik et al. 1998).

Support vectors are essentially a set of points in n-dimensional space on which the boundary between the classes is determined; moving one of them may change the output of the classification (Üstün et al. 2005). If each data point is represented by \({x}_{i}\), has D attributes, and is labeled with a specific value \({y}_{i}\), the data set can be written as Eq. 8 (Boser et al. 1992):

$$D = \left\{ {\left( {x_{i} ,y_{i} } \right)\left| {\;x_{i} \in R^{d} ,\;y_{i} \in R,\;i = 1,2, \ldots ,n} \right.} \right\}$$
(8)

The purpose of this algorithm is to find the relationship between the input and output data, which is defined in Eq. 9:

$$f\left( x \right) = W^{T} \phi \left( x \right) + b$$
(9)

where W and b are the weight vector and the bias, respectively. The SVM performs a linear regression in a feature space whose dimension equals the number of data attributes. The algorithm tries to reduce the complexity of the model by minimizing \(\left\| W \right\|^{2}\). The objective function is defined by Eqs. 10 and 11:

$$\min \frac{1}{2}W^{T} W + C\mathop \sum \limits_{i = 1}^{N} (\varepsilon_{i} + \varepsilon_{i}^{*} )$$
(10)
$$\left\{ {\begin{array}{*{20}c} {y_{i} - \left( {W^{T} \phi \left( {x_{i} } \right) + b} \right) \le \varepsilon + \varepsilon_{i} } \\ {\left( {W^{T} \phi \left( {x_{i} } \right) + b} \right) - y_{i} \le \varepsilon + \varepsilon_{i}^{*} } \\ {\varepsilon_{i} , \varepsilon_{i}^{*} \ge 0 , i = 1,2, \ldots N} \\ \end{array} } \right.$$
(11)

where \({\varepsilon }_{i}\) and \({\varepsilon }_{i}^{*}\) are slack variables for target values that fall below and above the \(\varepsilon\)-margin, respectively. C is used to balance model complexity against training error (Alonso et al. 2013; Mehdizadeh et al. 2014). Figure 3 shows the SVM hyper-plane for a sample data set.
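As a hedged illustration of the classifier described above, the sketch below trains a linear-kernel SVC on synthetic four-class data standing in for (well log features, HFU labels); C is the same complexity/error trade-off parameter as in Eq. 10.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for six log features and four HFU classes
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)  # C balances margin vs. error
print(clf.score(X_te, y_te))
```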

LogitBoost method

Boosting algorithms were originally proposed to combine several weak classifiers in order to improve classification performance. LogitBoost is an additive logistic regression model. This algorithm, a member of the meta-learning family, is a modified version of AdaBoost: introduced by Friedman et al., it uses the misclassifications of previous models to build a new, more accurate classifier (Friedman et al. 2000; Peng and Chiang 2011; Fakhraei et al. 2014). AdaBoost minimizes an exponential loss, which makes it sensitive to noisy data; LogitBoost was designed to address this limitation by instead maximizing the binomial log-likelihood, which grows only linearly for badly misclassified points (Friedman et al. 2000). The main idea behind LogitBoost is thus to apply boosting to build a logistic model. AdaBoost remains more popular for classification, but LogitBoost performs better on data containing outliers. Readers are referred to the study of Friedman et al. for the classification steps of the LogitBoost algorithm.

LogitBoost builds on a designated "weak" or "base" learning algorithm. It iteratively re-weights the training examples so that the base learner generates a new weak prediction rule in each round; over many rounds, the boosting algorithm combines these weak rules into a single strong prediction rule that is usually much more accurate than any individual weak one (Friedman et al. 2000).
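scikit-learn ships no native LogitBoost implementation; as an illustrative stand-in under that caveat, the sketch below uses GradientBoostingClassifier, which likewise builds an additive model of shallow trees on the (multi)nomial log-likelihood rather than AdaBoost's exponential loss.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Each boosting round fits a shallow tree to the log-loss gradient, the
# same loss family LogitBoost optimizes (LogitBoost uses Newton steps)
clf = GradientBoostingClassifier(n_estimators=100, max_depth=2,
                                 random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```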

ANN method

The structure of an ANN is similar to the biological neural network of the human body, and such a network can, to some extent, imitate the function of the human brain (Shepherd 1990). The algorithm is made up of artificial neurons, which are the smallest units of data processing (Sengel and Turkarslan 2020).

ANNs are usually composed of several layers, known as the input, hidden, and output layers. The network encodes complex mathematical relationships through the connections between neurons and their weights, and it optimizes the network weights to achieve an optimal output. Each neuron processes its inputs to produce an output (Rezrazi et al. 2016).

In this algorithm, the weights are initialized randomly, and the output of each neuron is calculated as the weighted sum of its inputs:

$${\text{Output}}_{i} = \mathop \sum \limits_{j} W_{ij} X_{j}$$
(12)

where \({X}_{j}\) are the inputs and \({W}_{ij}\) the corresponding weights. The output of each neuron becomes an input to the corresponding neurons in the next layer. After the outputs of a layer are generated, an activation function is applied to all of them. There are different types of activation functions, but the most common in classification is the sigmoid, which is calculated by Eq. 13 (Okon et al. 2021):

$$F\left( {Output_{i} } \right) = \frac{1}{{1 + e^{{ - Output_{i} }} }}$$
(13)

where F(Outputi) is the value of the sigmoid function and Outputi is the neuron output. The result of this activation function lies between 0 and 1 and indicates how active each neuron is; thresholding these values yields the classification of the data. Figure 4 represents a multilayer neural network.

Fig. 4 Artificial neural network
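A toy forward pass implementing Eqs. 12 and 13 for one layer, with randomly initialized weights (training by backpropagation is omitted, and the feature and neuron counts are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    """Eq. 13: squashes each neuron's output into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(6)            # one sample with six features (hypothetical)
W = rng.normal(size=(4, 6))  # random initial weights: 6 inputs -> 4 neurons

output = W @ x               # weighted sums, Eq. 12
print(sigmoid(output))       # activations near 1 mean "active" neurons
```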

RF method

The RF algorithm was developed by Breiman in 2001 and has been used extensively for prediction and classification. It is an ensemble, tree-based classifier (Breiman 2001; Liu et al. 2012; Biau and Scornet 2016) consisting of a combination of tree predictors: each tree makes a single choice for the most desirable class, and the final result is obtained by combining these choices. RF fits many classification trees to a data set and then combines the predictions from all the trees. Owing to its high classification precision, the algorithm can also detect outlying well data and separate it from the main data. An RF consists of a set of tree-structured classifiers h(x, θk), where the {θk} are independent, identically distributed random vectors and each tree casts a single vote for the most desirable class at input x (Kumar et al. 2016). In the Breiman RF model, the kth tree is denoted θk, and each tree is grown independently from a set of training samples and a random variable. Thus, for an input vector x, after k rounds a sequence of classifiers h1(x), h2(x), ..., hk(x) is obtained, and the final result of this system is chosen by majority vote. The decision function is shown in Eq. 14 (Liu et al. 2012):

$$H\left( x \right) = \arg \mathop {\max }\limits_{Y} \mathop \sum \limits_{i = 1}^{k} I\left( {h_{i} \left( x \right) = Y} \right)$$
(14)

where \(H(x)\), \({h}_{i}\), Y, and \(I({h}_{i}(x) = Y)\) are the combined classification model, a single decision tree, the output variable, and the indicator function, respectively. In the RF algorithm, the best classification for a given input variable is selected by letting each tree vote for the most desirable outcome. Figure 5 shows the schematic structure of the RF method.

Fig. 5 Schematic of RF method structure
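A brief sketch of the majority-vote classification of Eq. 14, using scikit-learn's RandomForestClassifier on the same kind of synthetic stand-in data; each of the n_estimators trees casts one vote, and predict() returns the winning class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 bootstrap-trained trees; the predicted class is the majority vote (Eq. 14)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```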

Logistic regression method

A statistician named Galton first used regression in the nineteenth century to describe his observations. Karl Pearson developed its mathematical basis and used it to express the relationship between two quantities (Anderson et al. 2003). Logistic regression expresses the odds of an outcome in the presence of several explanatory variables. Multivariate logistic regression is a statistical technique used to estimate the probability of an outcome, for example the occurrence or non-occurrence of death (Sperandei 2014). The independent variables affect the probability of occurrence of the dependent variable (Anderson et al. 2003). The log-odds are modeled as shown in Eq. 15 (Sperandei 2014):

$$\log \left( {\frac{\pi }{1 - \pi }} \right) = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \cdots + \beta_{m} x_{m}$$
(15)

where \(\pi\) is the probability of the event, the \({\beta }_{i}\) are the regression coefficients associated with the explanatory variables \({x}_{i}\), and \({\beta }_{0}\) is the intercept, representing the reference level of each of the variables \({(x}_{1 \ldots m})\). The logistic regression equation can also be presented in another form, shown in Eq. 16 (Cramer 2002):

$$P\left( Z \right) = \frac{{\exp \left( Z \right)}}{{1 + \exp \left( Z \right)}}$$
(16)

where P behaves like a distribution function symmetric about zero, Z is the linear predictor of Eq. 15, and P takes values between 0 and 1.
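Relating Eqs. 15-16 to code: scikit-learn's LogisticRegression fits the β coefficients, and predict_proba evaluates P(Z) of Eq. 16 (generalized to the multinomial case for more than two classes); the data here are synthetic placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.intercept_)            # beta_0, one per class
print(clf.coef_.shape)           # beta_1..m, one row per class
print(clf.predict_proba(X[:3]))  # P(Z) of Eq. 16, per class
```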

Application of machine learning and optimization methods in petroleum engineering

Comparison of machine learning algorithms

As mentioned, the algorithms used in this study include SVM, LogitBoost, ANN, RF, and logistic regression. These algorithms are among the most common algorithms used in petroleum engineering. Table 3 summarizes the advantages and disadvantages of each method.

Table 3 Summary of the advantages and disadvantages of each of the methods used in this article (Haghighat et al. 2013; Mohamed 2017; Kour and Gondhi 2020)

Results and discussion

Petrophysical properties of the reservoir were collected using 212 core data points and several petrophysical well logs of an oil field in southwest Iran. In this field, reservoir porosity varies from 2.1% to 34.1% and permeability from 0.1 to 44.3 mD over a depth interval of 2157 to 2393.2 m. The field flow units were estimated using the conventional Winland R35, FZI, DRT, and k-means methods, which determine the number of flow units from core data (porosity and permeability).

Implementing the FZI method on these data shows that the studied field has four flow units. Figure 6 shows that FZI points lying on the same line have similar pore characteristics. Figure 7 shows the four data sets, represented by a multimodal distribution, for the case study data. By plotting the cumulative probability diagram, the number of flow units can be determined from breaks in the data trend; according to Fig. 8, four flow units were obtained from this diagram.

Fig. 6 RQI versus \({\varphi }_{Z}\) diagram for all sampled areas

Fig. 7 FZI logarithm histogram

Fig. 8 Cumulative probability diagrams to determine FZI boundaries

Based on this study, the Winland R35 method indicates the presence of four HFUs in the Kazhdumi reservoir (Fig. 9).

Fig. 9 Kazhdumi HFU classification results based on Winland R35 method

The results of the DRT method on the studied field data are shown in Fig. 10 which suggests four HFUs for the Kazhdumi reservoir.

Fig. 10 Kazhdumi HFU classification results based on DRT method

The results of applying k-means to the available data show that the sum of squared errors decreases as the number of clusters increases; however, increasing the number of clusters beyond four has no significant effect on reducing it. Figure 11 shows this reduction in the sum of squared errors. As a result, four HFUs were assigned to the studied reservoir by this method (Fig. 12).

Fig. 11 Reduction in the sum of squared errors with increasing number of clusters

Fig. 12 Kazhdumi HFU classification results based on k-means algorithm

Access to core data (porosity and permeability) is only possible through coring and laboratory measurements, and core data are not available for all wells or at all depths. To overcome this issue, petrophysical log data, which are available in most wells, including RT, DT, HCAL, NPHI, RHOZ, and PEFZ, were used in this study as input parameters to the machine learning methods. Using artificial intelligence algorithms, the number of flow units can then be determined.

For this purpose, the flow units calculated by the FZI method are taken as target values, and the log data corresponding to the depths used in the FZI method are preprocessed as input data. The SVM, LogitBoost, ANN, RF, and logistic regression algorithms are trained on 70% of the input data, and the remaining 30% is used for testing; the field flow units are then classified from the petrophysical log data. The accuracy of the algorithms is measured according to Eq. 17:

$${\text{accuracy}} = \frac{TP + TN}{{TP + FP + FN + TN}}$$
(17)

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. To select the best proportion of training data, the performance of the algorithms was measured for different train/test splits, and 70% was finally selected as the optimal training fraction. The performance of the algorithms is shown in Fig. 13.

Fig. 13 Accuracy of the algorithms with different percentages of training data
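The full evaluation protocol can be sketched as follows, assuming a feature matrix X (the six logs) and FZI-derived labels y are already loaded; random placeholders stand in for the real data, the 70/30 split and the five classifiers mirror the study, and the hyperparameters are illustrative assumptions rather than the tuned values behind the reported accuracies.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Placeholders: X = (n_samples, 6) readings of RT, DT, HCAL, NPHI, RHOZ, PEFZ;
# y = HFU labels from the FZI method
rng = np.random.default_rng(0)
X, y = rng.random((212, 6)), rng.integers(0, 4, 212)

X = MinMaxScaler().fit_transform(X)  # normalize, then split 70/30
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "SVM": SVC(kernel="linear"),
    "LogitBoost (GB stand-in)": GradientBoostingClassifier(random_state=0),
    "ANN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))  # Eq. 17
```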

Based on the available data, the data were first normalized and then divided into training and test sets with a 70/30 ratio. The SVM algorithm with a linear kernel function classified all the data (training and test) with 90.46% accuracy. The confusion matrix of this algorithm for the test data, with an accuracy of 85.94%, is shown in Fig. 14.

Fig. 14 Confusion matrix of support vector machine algorithm applied to data

Applying the LogitBoost method to the data with the same 70/30 train/test split gives 94.84% and 95.31% accuracy for the classification of all data and of the test data, respectively. Figure 15 illustrates the confusion matrix of this classification.

Fig. 15 Confusion matrix of LogitBoost algorithm applied to data

With the same 70/30 train/test split, the ANN algorithm shows 88.12% and 73.44% accuracy for the classification of all data and of the test data, respectively. Figure 16 shows the ANN confusion matrix.

Fig. 16 Confusion matrix of ANN algorithm applied to data

The RF algorithm shows 91.87% accuracy when 70% and 30% of the data are used for training and testing, respectively, and 90.63% accuracy on the test data alone. The confusion matrix of this algorithm is shown in Fig. 17.

Fig. 17 Confusion matrix of random forest algorithm applied to data

The logistic regression algorithm reached 91.56% accuracy with 70% training and 30% test data. Figure 18 shows its confusion matrix, based on which the algorithm achieves 95.31% accuracy on the test data.

Fig. 18 Confusion matrix of logistic regression algorithm applied to data

Finally, the machine learning methods are compared to select the best one. The best performance of each algorithm is shown in Fig. 19: the LogitBoost method outperformed the other algorithms, with a classification accuracy of 94.84%.

Fig. 19 Accuracy of different machine learning algorithms used in this study

Conclusions

In this article, a variety of conventional methods and machine learning algorithms for determining hydraulic flow units (HFUs) were investigated, and the performance of each method was evaluated. The main results, which also reflect the innovative aspects of the work, are as follows:

  • The k-means method, combined with the sum of squared errors (SSE) criterion, operates independently of the user and shows effective performance in determining the optimal number of HFUs; it is also fully consistent with the Flow Zone Indicator (FZI) method.

  • Conventional methods determine flow units only from the porosity and permeability obtained by core analysis, while these data may not be available along the entire length of the reservoir. Therefore, intelligent data-driven methods that use petrophysical log data can be a suitable alternative to conventional methods, especially in wells where core data are not available.

  • In this article, support vector machine (SVM), artificial neural network (ANN), random forest (RF), LogitBoost (LB), and logistic regression (LR) machine learning methods were used to determine flow units using petrophysical logs. The results showed that among the different machine learning algorithms used in this study, the LB method has the best performance in determining HFUs. After that, RF, LR, SVM and ANN methods have the best accuracies, respectively.