Introduction

The rate of penetration (ROP) is a critical drilling parameter that indicates how fast the drill bit penetrates the formations (Bourgoyne et al. 1991). Although increasing the ROP reduces the drilling time and, hence, the drilling cost, the ROP is limited by the cuttings-lifting capacity of the drilling fluid, which is required to keep the wellbore clean of cuttings (Mahmoud et al. 2020a, b, c, d). A high ROP could also lead to other problems during drilling, such as drillstring vibration, which in many situations is held responsible for wellbore instability or loss of the bottom hole assembly (Akgun 2002).

ROP is affected by many other drilling parameters, including the drillstring composition, drilling fluid properties, and well trajectory. Figure 1 shows the static and dynamic drilling parameters that greatly affect the ROP. Most of these parameters affect each other; in other words, changing one of them in many cases changes the other parameters contributing to the ROP. This interdependence complicates ROP prediction and makes it difficult to evaluate the impact of a single factor on the ROP (Mhossain and Al-Majed 2015; Osgouei 2007).

Fig. 1 The parameters affecting ROP optimization

Many ROP models and empirical correlations are now available. Each was developed under specific assumptions and uses specific input parameters to assess the ROP. Because of the variety of assumptions and inputs, these ROP models differ considerably in accuracy (Lyons and Plisga 2004; Mitchell and Miska 2011; Rabia 2001). Table 1 summarizes the available ROP correlations.

Table 1 The list of some of the available ROP correlations

The most commonly used empirical correlation in the last decade was the one developed by Bourgoyne and Young (1974), which defines the ROP in terms of eight functions, as indicated in Table 1. Each of these functions is based on specific drilling parameters. Although the relationship between the ROP and these eight functions is very complex and nonlinear, Bourgoyne and Young (1974) simply treated the ROP as the multiplication and addition of these functions, which limits the accuracy of ROP prediction.

Recently, different machine learning models have been successfully applied to different aspects of science and engineering (Najafzadeh 2019; Saberi-Movahed et al. 2020; Thanh et al. 2020; Elzain et al. 2021; Thanh et al. 2022a; Thanh and Lee 2022; Thanh et al. 2022b), and petroleum engineering is no exception (Barbosa et al. 2019; Elkatatny et al. 2019; Mahmoud et al. 2020b; Mahmoud et al. 2020d; Alsaihati et al. 2021; Siddig et al. 2021). Recent research has sought to enhance ROP prediction using machine learning. These studies considered different machine learning techniques, input parameters, and other technical aspects of drilling operations and well planning for ROP prediction while drilling carbonate formations (Mahmoud et al. 2020a; Osman et al. 2021), natural gas-bearing sandstone formations (Al-AbdulJabbar et al. 2022a), and complex lithology formations (Gamal et al. 2020).

Bahari et al. (2011) hierarchically employed the general regression neural network (GRNN) to capture the complex and nonlinear relationship between the ROP and the eight functions defined by Bourgoyne and Young (1974). The results indicated that the GRNN was able to define the relationship between the ROP and these eight functions and that it improved ROP prediction.

Anemangely et al. (2018) utilized two hybrid models for ROP prediction based on mud log data and petrophysical logs. The hybrid models are composed of a multilayer perceptron neural network (MPNN) coupled with the particle swarm optimization algorithm (PSA) in the first model and with a Cuckoo optimization algorithm (COA) in the second model. The authors found that the optimized models were superior in ROP estimation. The results also indicated that although increasing the number of inputs is important to enhance ROP prediction, the use of five or more inputs did not significantly improve ROP predictability using their optimized models.

Recently, Al-AbdulJabbar et al. (2022b) suggested an artificial neural network (ANN)-based empirical correlation for estimating the ROP during horizontal drilling of a carbonate formation. To generalize their correlation, the authors optimized the ANN model using data collected from five different wells. The inputs were drilling parameters available in real time: flow rate (Q), standpipe pressure (SPP), weight on bit (WOB), torque, and drillstring rotation speed (DSR). This enabled real-time prediction while drilling another three wells in the same reservoir, for which the empirical correlation accurately predicted the ROP.

In another study, Al-AbdulJabbar et al. (2022a) proposed estimating the ROP in a sandstone formation using an ANN model based on the same inputs considered by Al-AbdulJabbar et al. (2022b). The results also indicated the high precision of the ANN in evaluating the ROP for this sandstone formation. A summary of some available data-driven ROP models is presented in Table 2.

Table 2 Some of the machine learning-based ROP models

In this study, four machine learning models, namely ANN, fuzzy inference system with subtractive clustering (FIS-SC), support vector regression (SVR), and functional neural network (FNN), were optimized for ROP estimation while horizontally drilling sandstone formations. The four models were optimized to estimate the ROP from the DSR, standpipe pressure (SPP), WOB, and torque, as well as a new regression-based rate of penetration parameter, ROPregression, which is calculated from the DSR and WOB. A correlation for assessing the ROP was then derived from the trained ANN model.

The current study contributes over published work by presenting the full approach for developing four machine learning models that predict the penetration rate from the surface drilling parameters available while horizontally drilling sandstone formations. In addition, this study presents a new method for enhancing the capability of machine learning for ROP prediction by introducing a new calculated parameter, ROPregression, derived from the mechanical specific energy (MSE) equation to relate the MSE and the DSR to the ROP. The study also proposes a new empirical equation for ROP estimation based on the ANN model. These contributions are intended to improve ROP estimation and provide real-time guidance for drilling engineers in optimizing the controllable drilling parameters for the best penetration rate and lower drilling cost.

Methodology

Four different machine learning techniques were used to develop real-time ROP models from only the surface-measurable drilling parameters and the ROPregression parameter. Before training the models, the collected data were screened to eliminate unrealistic values and outliers. Based on linear regression, the expression for ROPregression was determined to account for the ROP as a function of the drilled hole area, the drillpipe diameter, WOB, and DSR.

Training data preparation and preprocessing

In this study, two wells were selected (Well-X and Well-Y). The two wells penetrated the same geological sequence and therefore drilled the same formations. The WOB, SPP, DSR, and torque data, recorded in real time by the surface sensors, were obtained from both wells while drilling sandstone formations. Both wells were drilled using a conventional bottom hole assembly and a top drive rotary system. Originally, 3082 datasets from Well-X were collected to train and test the machine learning models, while 4662 datasets from Well-Y were used to validate the trained models. Data quality control (QC) and quality assurance (QA) steps, such as removal of unrealistic values and outliers, were performed to ensure that only valid data were used to optimize the models.

During the QA/QC stage, the MSE term developed by Teale (1965) to describe the energy applied by the drill bit to penetrate the formation (Dupriest and Koederitz 2005) was used to identify unrealistic values. For efficient drilling, the MSE should have values similar to the unconfined compressive strength (UCS) of the rock (Teale 1965).

The sandstone formation from which the data were obtained has a UCS between 25,000 and 45,000 psi. Figure 2 shows the plot of MSE versus ROP for all Well-X data. As shown in Fig. 2, several MSE values are considerably lower or greater than the UCS range. Such large differences indicate that the data collected at these points are unrealistic and represent inefficient drilling, so the corresponding data must be removed. The data used in this work have MSE values between 15,000 and 75,000 psi, a range determined from the formation UCS plus or minus a margin; all points with MSE outside this range were removed from the training data.

Fig. 2 The MSE versus ROP for all Well-X data

Of the datasets gathered from Well-X, 1031 represent locations of inefficient drilling (Fig. 2) and were eliminated at this stage, leaving 2051 Well-X datasets that are considered realistic.
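The MSE screening described above can be sketched as a simple range filter. This is a minimal illustration, assuming the MSE has already been computed in psi for every record; the function name, the inclusive treatment of the bounds, and the sample values are hypothetical and not taken from the study.

```python
import numpy as np

# Bounds quoted in the text (psi): formation UCS range plus or minus a margin.
MSE_MIN_PSI = 15_000.0
MSE_MAX_PSI = 75_000.0


def efficient_drilling_mask(mse_psi, lower=MSE_MIN_PSI, upper=MSE_MAX_PSI):
    """Return a boolean mask that is True where MSE lies inside [lower, upper].

    Treating the bounds as inclusive is an assumption of this sketch.
    """
    mse_psi = np.asarray(mse_psi, dtype=float)
    # Non-finite values (NaN, inf) are rejected explicitly.
    return np.isfinite(mse_psi) & (mse_psi >= lower) & (mse_psi <= upper)


# Hypothetical MSE samples (psi), for illustration only.
mse_samples = np.array([9_500.0, 15_000.0, 42_000.0, 75_000.0, 120_000.0, np.nan])
mask = efficient_drilling_mask(mse_samples)
kept = mse_samples[mask]
```

The same mask would be applied to every input column so that the records stay aligned after filtering.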

Before the data were used to train the machine learning models, outliers in the inputs were also removed. Only input values within ±3.0 standard deviations of the mean were considered non-outliers; all other values were removed from the training data. After this process, 1748 datasets from Well-X remained to train and test the machine learning models.
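A minimal sketch of the ±3.0 standard deviation rule follows. It assumes the rule is applied column by column about the column mean and that a record is dropped if any of its inputs is an outlier; both points, the use of the sample standard deviation, and the synthetic data are assumptions of this illustration.

```python
import numpy as np


def within_three_sigma(inputs, n_sigma=3.0):
    """Mask rows whose every column lies within mean +/- n_sigma * std."""
    inputs = np.asarray(inputs, dtype=float)
    mean = inputs.mean(axis=0)
    std = inputs.std(axis=0, ddof=1)  # sample standard deviation (assumption)
    return np.all(np.abs(inputs - mean) <= n_sigma * std, axis=1)


# Synthetic two-column data set, not field data.
rng = np.random.default_rng(0)
data = rng.normal(loc=[20.0, 80.0], scale=[2.0, 5.0], size=(200, 2))
data[0, 0] = 100.0  # inject one obvious outlier in the first column

mask = within_three_sigma(data)
cleaned = data[mask]
```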

Developing an expression for the regression-based ROP

To improve ROP predictability with the machine learning models, a new input parameter, called the regression-based ROP (ROPregression), was developed using regression analysis. Its expression was derived from the Teale (1965) equation for the MSE, presented in (1), with the torque neglected in the later steps of the derivation.

$$ {\text{MSE}} = \frac{{{\text{WOB}}}}{A} + \frac{2\pi \times T }{{A \times p}} $$
(1)

where A is the drilled hole area and T is the torque required to remove a layer of rock of thickness p in a single revolution. The first step in the derivation is to relate the MSE and DSR to the ROP. Figure 3 plots the MSE against the ROP for the Well-X training data at three DSR levels (60, 80, and 100 rpm). As the figure indicates, the MSE and ROP values at each DSR fit a single power-function curve. The governing equations of the three curves in Fig. 3 have the same exponent of −1.0 and different constants (a) of 82,209, 108,525, and 136,380 for DSRs of 60, 80, and 100 rpm, respectively.

Fig. 3 MSE and ROP plots for training data corresponding to DSRs of 60, 80, and 100 rpm

The three functions relating the MSE to the ROP at the different DSRs, as indicated in Fig. 3, can be written generally as in (2).

$$ {\text{MSE}} = a\,\,{\text{ROP}}^{ - 1.0} = \frac{a}{{{\text{ROP}}}} $$
(2)
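A power function of the form in (2) can be fitted by linear least squares on log-log axes. The sketch below is illustrative only: the fitting routine is an assumption (the source does not state how the curves were fitted), and the points are synthetic, generated without noise from the 60 rpm constant reported in the text.

```python
import numpy as np


def fit_power_law(rop, mse):
    """Fit MSE = a * ROP**b by linear least squares on log-log axes."""
    slope, intercept = np.polyfit(np.log(rop), np.log(mse), 1)
    return np.exp(intercept), slope  # (a, b)


# Synthetic, noise-free points on the 60 rpm curve: MSE = 82,209 / ROP.
rop = np.linspace(2.0, 10.0, 9)  # hypothetical ROP values
mse = 82_209.0 / rop
a, b = fit_power_law(rop, mse)
```

With real, scattered data the recovered exponent would only approximate −1.0.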

If the three values of ‘a’ (extracted from Fig. 3) are plotted against their corresponding DSRs, as in Fig. 4, the relationship between ‘a’ and DSR is well represented by a straight line, given by (3).

$$ a = 1354\,\,{\text{DSR}} + 696 $$
(3)

Fig. 4 The plot of the constant ‘a’ values and the corresponding DSR
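The straight line in (3) can be checked against the three constants quoted in the text with an ordinary least-squares fit. This is a quick verification sketch; the use of `numpy.polyfit` is a choice made here, not necessarily the tool used in the study.

```python
import numpy as np

dsr = np.array([60.0, 80.0, 100.0])                    # rpm
a_values = np.array([82_209.0, 108_525.0, 136_380.0])  # constants read from Fig. 3

# Degree-1 fit returns (slope, intercept).
slope, intercept = np.polyfit(dsr, a_values, 1)
print(f"a = {slope:.0f} DSR + {intercept:.0f}")
```

The fitted slope and intercept round to the coefficients 1354 and 696 that appear in (3).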

Substituting the MSE from (2) into (1), and replacing T and p with DSR and ROP as suggested by Al-Abduljabbar et al. (2021), leads to the expression in (4).

$$ \frac{a}{{{\text{ROP}}}} = \frac{{1354\,\,{\text{DSR}} + 696}}{{{\text{ROP}}}} = \frac{{{\text{WOB}}}}{A} + \frac{{2\pi \times {\text{DSR}} }}{{A \times {\text{ROP}}}} $$
(4)

Rearranging (4) gives:

$$ {\text{ROP}} = \frac{{ - \left( {2\pi \frac{{{\text{DSR}}}}{A} - a} \right)A}}{{{\text{WOB}}}} = {\text{ROP}}_{regression} $$
(5)
$$ {\text{ROP}}_{regression} = \frac{{ - \left( {2\pi \frac{{{\text{DSR}}}}{A} - \left( {1354\,\,{\text{DSR}} + 696} \right)} \right)A}}{{{\text{WOB}}}} $$
(6)

where ROPregression, derived in (6), is the new expression for the ROP, which we call the regression-based ROP. Equation (6) defines ROPregression in terms of the DSR and WOB only, for a given hole area A. Figure 5 shows the cross-plot of the actual ROP versus ROPregression calculated from (6). The correlation coefficient of 0.835 between ROP and ROPregression is high enough to guide the machine learning models toward better ROP prediction. As discussed previously, ROPregression was used as an input for training the machine learning models, in addition to the four surface-measurable drilling parameters.
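Equation (6) is straightforward to evaluate in code. The sketch below implements it as written and checks the result against (4); the function name and the operating point are hypothetical, and no units are implied beyond those used in the source.

```python
import math


def rop_regression(dsr, wob, area):
    """Evaluate Eq. (6): ROP_regression from DSR, WOB and the hole area A."""
    a = 1354.0 * dsr + 696.0  # Eq. (3)
    return -(2.0 * math.pi * dsr / area - a) * area / wob


# Hypothetical operating point; placeholder values, not source data.
dsr, wob, area = 80.0, 25_000.0, 28.3
rop_reg = rop_regression(dsr, wob, area)

# Consistency check against Eq. (4): a / ROP = WOB / A + 2*pi*DSR / (A * ROP)
a = 1354.0 * dsr + 696.0
lhs = a / rop_reg
rhs = wob / area + 2.0 * math.pi * dsr / (area * rop_reg)
```

The two sides of (4) agree to floating-point precision, which confirms that (5) and (6) are an exact rearrangement of (4).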

Fig. 5 The cross-plot of the ROPregression and ROP for Well-X data (1748 data points)

Optimizing the machine learning models

The machine learning models were optimized to assess the ROP from five parameters: the four surface-measurable parameters (WOB, torque, DSR, and SPP) and ROPregression calculated using (6). The models were trained on 1224 data points, which represent 70% of the Well-X data, to estimate the actual ROP. Figure 6 shows the input parameters used for training. Table 3 lists the applicability range of the optimized models and the statistical properties of the training data.
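The 70/30 partition of the 1748 cleaned Well-X records can be sketched as follows. The source does not state how the records were assigned to each subset, so the random shuffle, the seed, and the function name are assumptions of this illustration.

```python
import numpy as np


def split_train_test(n_samples, train_fraction=0.7, seed=0):
    """Return shuffled index arrays for the training and testing subsets."""
    n_train = round(train_fraction * n_samples)
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    return order[:n_train], order[n_train:]


train_idx, test_idx = split_train_test(1748)
```

Rounding 70% of 1748 gives the 1224 training and 524 testing records quoted in the text.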

Fig. 6 The training input parameters

Table 3 The statistical characteristics of the training datasets

The design parameters of the four models considered in this work were optimized using sensitivity analysis. The first model is the ANN, for which the effects of the training function, the transfer function, the number of hidden layers, and the number of neurons were studied. For the FIS-SC, the cluster radius was varied between 0.2 and 0.9 and the number of iterations between 10² and 3 × 10³.

The third machine learning model considered is the functional neural network (FNN). For this technique, the effect of the training method and function type on estimating the ROP was evaluated. Different training methods such as the forward selection (FS), forward–backward selection, backward-forward selection, and backward elimination methods were evaluated. Different training function types were studied in this work such as the linear function without iteration term, nonlinear function without iteration term, and nonlinear function with iteration term (NLFIT).

The effect of different design parameters of the SVR model, such as the kernel, kernel option, lambda, epsilon, and C, was studied. Different kernels such as the Gaussian and multi-quadratic, kernel options from 1 to 10, lambda between 10⁻⁷ and 10⁻⁵, epsilon from 10⁻⁵ to 10⁻¹, and C between 10² and 3 × 10³ were evaluated. A summary of the optimum design parameters is provided in Table 4.

Table 4 The optimized parameters of the machine learning techniques

Extracting the new ROP correlation

The ROP empirical correlation developed in this work is based on the weights and biases of the neurons of the optimized ANN model. As summarized in Table 4, the optimized ANN model has one hidden layer with 15 neurons, trained with the Bayesian regularization backpropagation function, and one output layer; the transfer function is the tangential sigmoid (tansig). The model can be expressed as in (7).

$$ y = \left[ {\mathop \sum \limits_{j = 1}^{M} w_{j1} {\text{tansig}}\left( {\mathop \sum \limits_{i = 1}^{N} w_{ij} x_{i} + b_{j} } \right)} \right] + b_{2} $$
(7)

where y represents the output (the ROP in this case), w denotes the weights, x denotes the input parameters, b denotes the biases, M is the number of neurons, and N is the number of inputs. Table 5 lists the values of w and b.

Table 5 The extracted weights and biases of the optimized ANN model

For the ANN model optimized for ROP estimation, which has 15 neurons, 5 inputs, and b2 of 5.44, (7) can be written as in (8).

$$ {\text{ROP}} = \left[ {\mathop \sum \limits_{j = 1}^{15} w_{j1} {\text{tansig}}\left( {\mathop \sum \limits_{i = 1}^{5} w_{ij} x_{i} + b_{j} } \right)} \right] + 5.44 $$
(8)

Expanding (8), with the tansig function written explicitly as tansig(n) = 2/(1 + e^(−2n)) − 1, the ROP can be expressed by (9).

$$ {\text{ROP}} = \left[ {\mathop \sum \limits_{j = 1}^{15} w_{j1} \left( {\frac{2}{{1 + e^{{ - 2\left( {w_{{1_{j,1} }} \left( {{\text{DSR}}} \right) + w_{{1_{j,2} }} \left( {{\text{SPP}}} \right) + w_{{1_{j,3} }} \left( {{\text{Torque}}} \right) + w_{{1_{j,4} }} \left( {{\text{WOB}}} \right) + w_{{1_{j,5} }} \left( {{\text{ROP}}_{{{\text{regression}}}} } \right) + b_{{1_{j} }} } \right)}} }} - 1} \right)} \right] + 5.44 $$
(9)
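The structure of (8) can be illustrated with a short forward-pass sketch. It assumes that tansig denotes the hyperbolic tangent sigmoid, 2/(1 + e^(−2n)) − 1, and it uses random placeholder weights because the Table 5 values are not reproduced here; only the output bias of 5.44 comes from the text.

```python
import numpy as np


def tansig(n):
    """Hyperbolic tangent sigmoid, numerically equal to tanh(n)."""
    return 2.0 / (1.0 + np.exp(-2.0 * n)) - 1.0


def ann_rop_normalized(x, w_hidden, b_hidden, w_out, b_out):
    """Evaluate Eq. (8) for one record of 5 normalized inputs.

    w_hidden: (15, 5) hidden-layer weights, b_hidden: (15,) hidden biases,
    w_out: (15,) output weights, b_out: scalar output bias.
    """
    hidden = tansig(w_hidden @ x + b_hidden)
    return float(w_out @ hidden + b_out)


rng = np.random.default_rng(1)
w_hidden = rng.normal(size=(15, 5))  # placeholder weights, NOT the Table 5 values
b_hidden = rng.normal(size=15)       # placeholder biases, NOT the Table 5 values
w_out = rng.normal(size=15)          # placeholder weights, NOT the Table 5 values
b_out = 5.44                         # output bias b2 quoted in the text

# Normalized DSR, SPP, torque, WOB, ROPregression (hypothetical values).
x = np.array([0.1, -0.3, 0.5, 0.0, 0.2])
y_normalized = ann_rop_normalized(x, w_hidden, b_hidden, w_out, b_out)
```

The returned value is still in normalized form and has to be mapped back to physical units before it can be read as an ROP.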

The input parameters used to predict the ROP with (9) should be normalized between −1 and 1. The ROP calculated using this equation is in normalized form and should be denormalized to obtain the actual ROP value. More information about normalization can be found in our previous publication (Al-Abduljabbar et al. 2020).
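A common way to scale a variable to the range −1 to 1 is a linear min-max mapping, sketched below. Whether the study used exactly this mapping is an assumption; the cited publication should be consulted for the actual procedure. The function names and the example range are hypothetical.

```python
import numpy as np


def normalize(x, x_min, x_max):
    """Map x linearly from [x_min, x_max] to [-1, 1]."""
    return 2.0 * (np.asarray(x, dtype=float) - x_min) / (x_max - x_min) - 1.0


def denormalize(y, x_min, x_max):
    """Inverse of normalize: map y from [-1, 1] back to [x_min, x_max]."""
    return (np.asarray(y, dtype=float) + 1.0) * (x_max - x_min) / 2.0 + x_min


# Hypothetical training range for one input; real limits would come from Table 3.
x_min, x_max = 60.0, 100.0
values = np.array([60.0, 80.0, 100.0])
scaled = normalize(values, x_min, x_max)
restored = denormalize(scaled, x_min, x_max)
```

The minimum and maximum must be those of the training data, so that new records are scaled consistently with the fitted weights.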

Testing and validating the optimized machine learning models and the new ROP correlation

The developed empirical Eq. (9) and the optimized FIS-SC, SVR, and FNN models were tested using 524 data points (Well-X) and then validated using 2213 data points (Well-Y). The predictability of the developed equation and models for Well-Y was compared with that of the available correlations to quantify the improvement in ROP assessment achieved with (9) and the models developed in this study.

Results and discussion

Training the machine learning models

All machine learning models were first trained on 1224 datasets from Well-X (70% of the Well-X data). Figure 7 shows the actual versus the predicted ROP from the different models for the 1224 training datasets. Comparing the actual ROP (blue diamonds in Fig. 7) with the ROP estimated by the different models (continuous line in Fig. 7) shows close agreement between these values for all considered models. As shown in Fig. 7, the ROP for the training dataset was estimated with average absolute percentage errors (AAPEs) of only 0.4%, 2.3%, 2.6%, and 3.6% using the ANN, SVR, FIS-SC, and FNN models, respectively. The corresponding correlation coefficients (R) with the actual ROP were 0.999, 0.998, 0.998, and 0.997. These results indicate the high precision of the ANN, FIS-SC, FNN, and SVR models in assessing the ROP for the training data.

Fig. 7 The actual and evaluated ROP for the 1224 training datasets

Testing the developed equation and the optimized machine learning models

The optimized FIS-SC, SVR, and FNN models and (9), which is based on the ANN model, were tested on 524 unseen datasets from Well-X (30% of the Well-X data). Figure 8 displays the actual versus the predicted ROP from the different models for the 524 testing datasets. As indicated in Fig. 8, although all models predicted the ROP accurately, (9) was the most precise in estimating the ROP for this Well-X dataset. From Fig. 8, (9) and the SVR, FIS-SC, and FNN models estimated the ROP with very low AAPEs of 0.3%, 2.7%, 3.4%, and 3.6% and R values of 0.999, 0.998, 0.997, and 0.992, respectively. Visual comparison of the actual and predicted ROP in Fig. 8 supports the reliability of (9) and the optimized FIS-SC, SVR, and FNN models in estimating the ROP.

Fig. 8 The actual and evaluated ROP for the 524 testing datasets

Validating the developed equation and the optimized machine learning models

The predictability of (9) and the other optimized machine learning models was also evaluated on the 2213 data points of Well-Y. At this stage, (9) and the optimized machine learning models were also compared with three of the available ROP empirical correlations: the Bingham, Maurer, and Bourgoyne and Young correlations. Table 6 lists the calculated constants used with the different ROP empirical models.

Table 6 The constants associated with the ROP correlations

Figure 9 compares the ROP estimated for Well-Y (2213 data points) using Eq. (9) with that from the other models. As shown in Fig. 9, (9) and the FIS-SC and FNN models are the most accurate, in contrast with the SVR model and the earlier ROP correlations, as indicated by the close agreement between the actual and estimated ROP when (9), FIS-SC, and FNN are used. All previous empirical correlations estimated the ROP with a very low R of less than 0.24, and the R between the actual ROP and that estimated with the SVR was 0.50, whereas (9) and the FIS-SC and FNN models estimated the ROP with R values of 0.99, 0.99, and 0.97, respectively, as shown in Fig. 9.

Fig. 9 Comparison of the ROP estimation in Well-Y (2213 data points) with various available models

Figure 10 compares the ROP prediction accuracy of the various models for the Well-Y data. As indicated in Fig. 10, the SVR model and all previous correlations estimated the ROP with very high AAPE and high root-mean-square error (RMSE). The AAPE and RMSE for the ROP predicted using the SVR model were 26.5% and 1.1 ft/hr, respectively. The AAPEs for the ROP predicted with the Bingham, Maurer, and Bourgoyne and Young correlations were 51.0%, 47.6%, and 36.6%, and the RMSEs were 1.7, 2.0, and 1.3 ft/hr, respectively. On the other hand, the ROP was predicted with very low AAPEs of 1.0%, 3.4%, and 8.2% and RMSEs of only 0.1, 0.2, and 0.4 ft/hr using (9), FIS-SC, and FNN, respectively. These results confirm the high accuracy of (9) and of the FIS-SC and FNN models in evaluating the ROP while horizontally drilling sandstone formations.
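The three accuracy measures used in this section can be computed as sketched below. The formulas are the standard definitions of the average absolute percentage error, the root-mean-square error, and the Pearson correlation coefficient, which are assumed here to match those used in the study; the sample values are hypothetical.

```python
import numpy as np


def aape(actual, predicted):
    """Average absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)


def rmse(actual, predicted):
    """Root-mean-square error, in the units of the inputs."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def correlation_coefficient(actual, predicted):
    """Pearson correlation coefficient R between actual and predicted values."""
    return float(np.corrcoef(actual, predicted)[0, 1])


# Hypothetical ROP values (ft/hr), for illustration only.
actual_rop = np.array([10.0, 20.0, 40.0])
predicted_rop = np.array([11.0, 18.0, 40.0])

aape_value = aape(actual_rop, predicted_rop)
rmse_value = rmse(actual_rop, predicted_rop)
r_value = correlation_coefficient(actual_rop, predicted_rop)
```

Because the AAPE divides by the actual ROP, records with an actual ROP of zero would have to be excluded before it is computed.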

Fig. 10 ROP accuracy comparison for Well-Y data using various available models

Summary and conclusions

In this study, the predictability of oil well drillability while horizontally drilling sandstone formations was evaluated using four artificial intelligence tools. The following conclusions can be drawn:

  1. The machine learning models were trained to assess the rate of penetration from only the surface-measurable drilling parameters.

  2. All machine learning models were first trained using 1224 datasets from Well-X. The trained artificial neural network model was then used to develop a correlation for rate of penetration assessment.

  3. The developed empirical correlation and the optimized models were tested on 524 datasets from Well-X and validated on 2213 datasets from Well-Y.

  4. This research presented a highly accurate, machine learning-based correlation for ROP prediction that closely matched the actual ROP during the model training, testing, and validation phases.

  5. Based on the results obtained in this study, the new correlation and the optimized fuzzy inference system with subtractive clustering and functional neural network models predict the ROP more accurately than the other techniques.

Based on the research findings, the new correlation contributes to rock drillability prediction and optimization when drilling oil and gas wells. The machine learning models can help enhance the automation of drilling operations for cost savings and safer drilling.