Introduction

Drill-fluid losses cause problems for many subsurface operations. They are often costly and time-consuming to mitigate. Early warning of such events, awareness of the treatment strategies suited to specific formations, and having the necessary lost circulation materials on hand typically help drilling teams to react quickly and minimize their impacts. If not dealt with adequately, severe losses of drilling fluid from the wellbore risk the drill string becoming stuck and can potentially lead to blowouts, in addition to the high costs of replacing the lost drilling fluid volumes. Moreover, invasion of reservoir formations by drilling fluid and inappropriate loss-control materials can lead to formation damage and loss of long-term production (Van Der Zwaag 2006; Zhao et al. 2019; Alkinani et al. 2020a, b). The severity of drill-fluid losses in the subsurface varies, and it can be usefully classified based on the measured rate of loss.

There are various factors that influence the occurrence of drilling fluid losses from an uncased wellbore. Conditions in which the wellbore fluid pressure substantially exceeds a permeable formation's fluid pressure are likely to result in fluid losses from the wellbore to the formation. Also, highly porous and permeable formations, particularly those affected by fractures, faults and/or large vuggy pore spaces (a characteristic of some carbonate zones), are particularly prone to wellbore fluid losses (Savari et al. 2016; Su et al. 2019). Figure 1 illustrates diagrammatically the types of subsurface formation conditions most at risk of substantial wellbore fluid losses.

Fig. 1 Subsurface conditions commonly associated with drilling fluid losses

Applying excessive wellbore fluid densities while drilling often induces fractures in the more brittle formations penetrated, even where those formations have relatively sparse natural fractures. Such induced fractures are a common cause of wellbore fluid losses (Feng and Gray 2018; Alkinani et al. 2019a). Where wellbore fluid losses are minor, it is often possible for drilling to proceed to a planned depth, with the losses then being resolved by setting a casing string. However, more substantial fluid losses require immediate responses to safeguard well integrity. For some problematic zones it is possible to adopt underbalanced drilling techniques to reduce wellbore fluid losses (Abel and Gokhan 2004). However, in many situations the pressures of the surrounding formations prevent the use of underbalanced drilling. Also, in severely depleted fractured reservoirs, wellbore fluid losses may persist even when underbalanced drilling conditions are adopted.

Substantial wellbore fluid losses in non-reservoir formations can sometimes be cured by introducing a plug of cement or lost circulation material (LCM), as formation damage is typically not a concern in such zones. Substantial wellbore fluid losses into producing reservoir formations are, however, more difficult to resolve without risking damage to the reservoir. Various LCMs have been successfully applied in a range of drilling conditions to either cure or significantly reduce fluid losses while drilling, and LCMs tend to be more effective when they are tailored to specific subsurface conditions. However, in many drilling operations, when fluid losses occur, the type of LCM used is either chosen by trial and error or governed by what LCM is available at the location (Salehi and Nygaard 2011).

In field development circumstances, where drilling information from multiple wells is available, statistical assessment, modelling and data mining of that information, particularly with the assistance of artificial intelligence methods, can provide useful insight. Such approaches can help to better determine the type of fluid losses occurring and what should be anticipated as drilling progresses. They can also help to identify the type and formulation of LCM that is most likely to resolve the problem (Alkinani et al. 2018, 2019b; Mardanirad et al. 2021).

Loss of wellbore fluids in the subsurface is difficult to model and predict from drilling variables, because a substantial number of variables are involved, they are often poorly correlated with each other or with fluid-loss conditions, and those relationships tend to vary from one formation to another. Various attempts have been made to develop analytical models to predict wellbore fluid losses, including the application of Darcy's law to fluid filtration and formation permeability (Parn-Anurak and Engler 2005) and quantitative approaches to characterizing fractures in a range of porous and fractured media (Sanfillipo et al. 1997; Shahri and Mehrabi 2012; Lavrov 2013). Some numerical solutions have applied the Herschel–Bulkley model, a generalized power law for non-Newtonian fluids, to estimate wellbore fluid losses in fractured reservoirs (Majidi et al. 2010). Alhameedi et al. (2017) applied regression analysis to develop a statistical formula relating a range of drilling variables considered to influence wellbore fluid losses, and used it to suggest controls that could be applied to drilling conditions in order to minimize lost circulation. The diversity of subsurface conditions, both within the wellbore and in the formations penetrated, means that none of the proposed formulations can be relied upon, especially in vuggy fractured carbonate reservoirs.
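To make the rheological model mentioned above concrete, the Herschel–Bulkley relation gives the shear stress of a flowing fluid as its yield stress plus a consistency index multiplied by the shear rate raised to a flow-behaviour index. The Python sketch below simply evaluates that relation; the function name and the parameter values in the usage loop are illustrative assumptions, not values from the cited studies or the Azadegan field.

```python
def herschel_bulkley_stress(shear_rate, yield_stress, consistency_k, flow_index_n):
    """Shear stress tau = tau_y + K * gamma_dot**n for a flowing Herschel-Bulkley fluid.

    shear_rate    : shear rate gamma_dot (1/s), must be >= 0
    yield_stress  : yield stress tau_y (Pa)
    consistency_k : consistency index K (Pa.s^n)
    flow_index_n  : flow-behaviour index n (n < 1 shear-thinning, n = 1 Bingham plastic)
    """
    if shear_rate < 0:
        raise ValueError("shear_rate must be non-negative")
    return yield_stress + consistency_k * shear_rate ** flow_index_n


# Illustrative (made-up) drilling-mud parameters: tau_y = 5 Pa, K = 0.5, n = 0.8
for gamma in (0.0, 10.0, 100.0):
    print(gamma, round(herschel_bulkley_stress(gamma, 5.0, 0.5, 0.8), 3))
```

Setting the flow-behaviour index to 1 reduces the relation to a Bingham plastic, and setting the yield stress to zero reduces it to a simple power-law fluid.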

Numerous data-driven artificial intelligence models, both machine learning (ML) and deep learning (DL), have been proposed over the past decade or so to predict wellbore fluid loss from drilling variables. These have recently been described and summarized (Mardanirad et al. 2021). It is apparent that, for such models to be effective, data need to be gathered and assessed from multiple wells in a specific field area in order to establish confidence in the ML/DL predictions generated. Cristofaro et al. (2017) compared the performance of several classifiers, including customized Naïve Bayes, K-nearest neighbour (KNN), radial basis function (RBF) and multi-layer perceptron (MLP), to determine optimum LCM solutions to apply offshore Brazil. They used that analysis to derive an instance-based interface combining several ML models in an ensemble, i.e. combining the results of several models into a single score. Hou et al. (2020) applied an MLP model to successfully predict wellbore fluid losses while drilling a high-pressure high-temperature (HPHT) field offshore in the South China Sea.

Support vector machine/classifier (SVM/SVC) models are known to be potentially effective predictors of wellbore fluid loss. Behnoudfar and Hosseini (2016) applied an SVM model to predict fluid losses during underbalanced drilling in Egypt. Manshad et al. (2017) successfully applied an SVM model to a heavy oil field in Iran. Ahmed et al. (2019) applied an SVM model with an RBF kernel to predict lost circulation in two wells from Saudi Arabia with high accuracy. A least squares support vector machine (LSSVM) model optimized with a cuckoo search algorithm delivered better lost circulation prediction accuracy for the Marun oil field than optimized MLP models (Sabah et al. 2021). Jahanbakhshi and Keshavarzi (2015) considered geomechanical factors, such as in situ stresses and rock strengths, as well as wellbore drilling variables to predict lost circulation, and demonstrated that SVM models outperformed MLP models with their dataset.

Adaptive neuro-fuzzy inference systems (ANFIS) have been compared with optimized MLPs and decision tree (DT) models using data from the Marun field, Iran, showing that DT outperformed the other models for that dataset (Sabah et al. 2019). An ANFIS model was also applied to data from that field by Agin et al. (2020). Li et al. (2018) compared MLP, random forest (RF) and SVM models to predict lost circulation from drilling data from an Iraqi oil field, demonstrating that the RF model outperformed the other models. Magzoub et al. (2021) compared four models, RF, KNN, gradient boosting and AdaBoost, in assessing the relationship between drilling fluid components, rheology and lost circulation, and found that the gradient boosting model performed best. Optimized data-matching ML models, such as the transparent open box (TOB), which, like KNN, do not rely on variable correlation relationships, have also been applied to predict wellbore fluid loss from drilling parameters (Wood 2018).

In recent years, DL methods have been evaluated, with some success, as tools for predicting wellbore fluid losses from larger datasets. Wang et al. (2020) applied a one-dimensional convolutional neural network (CNN) to time-series data to distinguish normal from abnormal downhole working conditions. Mardanirad et al. (2021) compared the performance of CNN with gated recurrent unit (GRU) and long short-term memory (LSTM) networks applied to a large 20-well dataset, and established that their CNN model outperformed the other DL models in predicting classes of wellbore fluid losses. Aljubran et al. (2021) compared DL CNN and LSTM models with RF and MLP ML models for rapid detection of wellbore fluid losses using time-series analysis, and found that the CNN model yielded the most accurate predictions.

An issue that has yet to be addressed in lost circulation prediction studies is the “class imbalance” problem (Guo et al. 2008), i.e. where the distribution of data records among the classes to be predicted is strongly skewed, such that a large proportion of the records fall within certain classes and very few within others. It is a well-studied issue in classification ML modelling (Ling and Sheng 2011; Maheshwari et al. 2017), considered in many fields and addressed with respect to some specific oil and gas problems, such as gas pipelines (Choubineh et al. 2020) and enhanced oil recovery screening (Pirizadeh et al. 2021). When lost circulation from multi-well drilling data is classified into classes ranging from “no loss” to “complete loss”, the number of data records in the no/low-loss classes tends to be orders of magnitude greater than in the severe/complete-loss classes. Hence, large lost circulation datasets tend to be highly imbalanced, and that imbalance should be considered when assessing the prediction accuracy of ML/DL models. Most previously published work that has predicted and classified lost circulation from drilling variables using ML models has done so with relatively small datasets (i.e. fewer than a few thousand data records). More recently, the DL models mentioned have been applied to large datasets. However, the work to date has largely ignored the class imbalance issue, which becomes more significant in larger datasets.
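A minimal numerical illustration of why class imbalance matters for accuracy assessment: a trivial predictor that always returns the most common class achieves high overall accuracy while never detecting the rare, severe classes. The class counts below are hypothetical, chosen only to mimic a strongly skewed distribution, and are not the Azadegan counts.

```python
import numpy as np

# Hypothetical, strongly imbalanced label vector for fluid-loss classes 1..5
counts = {1: 9000, 2: 800, 3: 150, 4: 40, 5: 10}
y_true = np.concatenate([np.full(n, cls) for cls, n in counts.items()])

# A trivial "model" that always predicts the most common class (class 1)
y_pred = np.full_like(y_true, 1)

overall_accuracy = np.mean(y_pred == y_true)


def class_recall(y_true, y_pred, cls):
    """Fraction of records truly in `cls` that were predicted as `cls`."""
    in_class = y_true == cls
    return np.mean(y_pred[in_class] == cls)


print(f"overall accuracy: {overall_accuracy:.3f}")
for cls in (4, 5):
    print(f"recall for class {cls}: {class_recall(y_true, y_pred, cls):.3f}")
```

With these counts the overall accuracy is 0.9 even though recall for classes 4 and 5 is zero, which is why class-specific metrics are needed alongside overall scores.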

The novelties of this study are that it considers a large, 20-well dataset (65,376 data records), it evaluates and compares the wellbore fluid-loss class prediction accuracy of 11 distinct ML/DL methods (12 models evaluated), and it takes into account the class imbalance issue with the aid of an expanded confusion matrix presentation not previously employed.

This work is organized as follows: “Dataset description” section characterizes the dataset for the Azadegan oil field and the fluid-loss classes defined; “Prediction methods evaluated to predict loss circulation classes” section describes the ML/DL methods evaluated and their control parameter configurations; “Class prediction accuracy measures assessed” section defines the classification prediction accuracy metrics assessed; “Pre-processing and process workflow applied to the fluid-loss dataset” section explains the pre-processing steps applied to the dataset and the general workflow adopted for the ML/DL model evaluations; “Results” section presents the study’s results; “Discussion” section discusses the results in terms of the dataset’s class imbalance with the aid of expanded confusion matrices; and “Conclusions” section draws conclusions.

Dataset description

A large (65,376 data records from twenty wells), published drilling parameter dataset from the southern portion of the Azadegan field in southwest Iran has been compiled (Mardanirad et al. 2021) and is used here to evaluate and predict drilling fluid-loss conditions with multiple models. Azadegan is a giant heavy oil (< 20°API gravity) field that is only partially developed (Fig. 2). It produces oil mainly from the Upper Cretaceous Sarvak carbonate reservoirs located at about 2700 m below surface (Du et al. 2016). However, there are other oil-bearing reservoirs and permeable zones present in the field (Fig. 3) linked to a prolific petroleum system (Liu et al. 2013). The field consists of a large domal structure with two crestal points (north and south), with the oil–water contact sloping from south to north across the southern crestal portion of the field (Du et al. 2016). There are four oil-producing zones within the Sarvak reservoir, of which the S-3 zone is the most prolific producer and reserves holder. Most drilling fluid-loss events are recorded in the S-3 zone.

Fig. 2 Azadegan oil field location maps

Fig. 3 Stratigraphic section encountered in the Azadegan oil field. Modified after Du et al. (2016)

The compiled data records are arranged in order of increasing depth, with variables drawn from the available reports for each well. Minor (< 10 bph) and no fluid losses are reported in certain zones of almost all formations penetrated by the wells. More substantial fluid losses (> 10 bph) tend to be concentrated in five specific formations beneath the Asmari: the lower Tertiary Pabdeh-Jahrum and Pabdeh formations, and the Cretaceous Gurpi, Sarvak and Gadvan formations. The most severe fluid losses (> 100 bph) have been recorded in the carbonate zones of the Jahrum, Pabdeh-Jahrum and Sarvak formations. The fractured Jahrum dolomites are particularly problematic in relation to fluid losses, with many of the complete lost circulation events recorded in that formation.

The possible occurrence of fluid losses to varying degrees in all formations penetrated by drilling in the Azadegan field means that reliable systems that provide the rapid analysis of drilling variables to predict the type of fluid losses occurring are valuable. Armed with such analysis it is possible for drilling teams to deploy appropriate mitigation strategies and select the most suitable lost circulation materials to apply for the type of losses observed. Artificial intelligence models based on drilling variables have the potential to provide such reliable systems. This study is designed to evaluate and compare the performance of machine and deep learning methods in predicting different classes of fluid losses from recorded drilling variables.

Table 1 provides a statistical summary of the drilling parameters recorded in the 20-well dataset evaluated to predict lost circulation conditions. The statistical details reveal that for the most part, the variables are symmetrically distributed and cover a wide range of conditions.

Table 1 Statistical distributions of the eighteen variables considered in the dataset compiled from twenty wells in the Azadegan oil field

All dataset records are assigned to one of five lost circulation categories (Table 2). The categories (classes) are defined based on generally accepted ranges of loss severity (Nelson and Guillot 2006). These are: no-loss class 1 (0 bph; barrels per hour); seepage-loss class 2 (0 < bph < 10); partial-loss class 3 (10 ≤ bph ≤ 100); severe-loss class 4 (> 100 bph); and complete-loss class 5 (no drilling fluid returns to the surface). The dataset is highly imbalanced with respect to the number of data records associated with each fluid-loss class (Table 2). Indeed, more than 95% of the data records belong to classes 1 and 2, whereas < 0.5% belong to classes 4 and 5. Although such a distribution is to be expected from typical drilling operations, it poses substantial challenges to the prediction algorithms in terms of accurately distinguishing the relatively sparse data records belonging to fluid-loss classes 4 and 5.
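The class boundaries listed above can be restated as a simple rule. The sketch below assumes a loss rate in bph and a separate flag indicating whether drilling fluid returns to surface; the function and argument names are hypothetical, and the records in this study already carry class labels, so the function only illustrates the thresholds.

```python
def loss_class(loss_rate_bph, returns_to_surface=True):
    """Map a loss rate (barrels per hour) to fluid-loss classes 1-5.

    Hypothetical helper restating the class boundaries in the text:
    1 no loss (0 bph), 2 seepage (0 < bph < 10), 3 partial (10 <= bph <= 100),
    4 severe (> 100 bph), 5 complete (no returns to surface).
    """
    if not returns_to_surface:
        return 5            # complete loss, regardless of any measured rate
    if loss_rate_bph < 0:
        raise ValueError("loss rate cannot be negative")
    if loss_rate_bph == 0:
        return 1            # no loss
    if loss_rate_bph < 10:
        return 2            # seepage loss
    if loss_rate_bph <= 100:
        return 3            # partial loss (10 <= bph <= 100)
    return 4                # severe loss


for rate in (0, 4.5, 10, 100, 250):
    print(rate, loss_class(rate))
print("no returns", loss_class(0, returns_to_surface=False))
```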

Table 2 Distribution of the 65,376 data records among the fluid-loss classes

Figure 4 displays the correlation coefficient matrix among the dataset variables in the form of a heat map. It reveals that the fluid-loss class (the dependent variable) is not well correlated with any of the input variables. The strongest correlations are between loss class and MD (+ 0.28), GS10s (− 0.16), GS10m (− 0.15) and TOQ (+ 0.14). The general lack of moderate to high correlations between loss class and the input variables poses a challenge for those prediction algorithms that rely heavily on variable correlation relationships.
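A correlation screen of the kind summarized in Fig. 4 can be sketched with NumPy. The sketch assumes Pearson correlation coefficients (the coefficient type behind Fig. 4 is not stated here) and uses synthetic data with hypothetical column names rather than the Azadegan variables.

```python
import numpy as np


def correlations_with_target(X, y, names):
    """Pearson correlation of each input column of X with the target y."""
    return {name: float(np.corrcoef(X[:, j], y)[0, 1]) for j, name in enumerate(names)}


rng = np.random.default_rng(0)
n = 1000
y = rng.integers(1, 6, size=n).astype(float)        # synthetic loss classes 1..5
depth = 2000 + 50 * y + rng.normal(0, 200, size=n)   # weakly related to y by construction
noise = rng.normal(size=n)                           # unrelated to y

corrs = correlations_with_target(np.column_stack([depth, noise]), y, ["depth", "noise"])
# List inputs from strongest to weakest absolute correlation with the target
for name, r in sorted(corrs.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>6}: {r:+.2f}")
```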

Fig. 4 Correlation coefficient matrix for the variables considered in the Azadegan oil field lost circulation dataset

Prediction methods evaluated to predict loss circulation classes

The five classes of drilling fluid conditions, varying from “no loss” to “complete loss”, distinguished in the Azadegan field dataset are predicted using eight widely applied machine learning (ML) and three deep learning (DL) methods. These methods are summarized in Table 3.

Table 3 Description of the twelve ML/DL models applied to classify the Azadegan field loss circulation dataset

These methods are based on distinct mathematical algorithms, mainly involving regression techniques capable of handling complex and nonlinear relationships between the input variables and the dependent variable. The KNN algorithm is distinctive in that it employs data matching rather than correlation and/or regression to formulate its predictions. The algorithm codes for these ML/DL methods are executed in Python and are based on customized versions of published packages. The ML algorithms (ADA, DT, KNN, MLP, NBC, QDA, RF and SVR) are executed with Scikit-Learn algorithm functions (SciKit Learn 2021). The deep learning algorithms (CNN, GRU and LSTM) are executed with TensorFlow/Keras functions (TensorFlow 2021). Two versions of the CNN algorithm, with distinctive structures, are evaluated, resulting in twelve models being considered in total. This study focuses on customized applications of these established ML/DL algorithms. Therefore, the mathematical formulas involved in the execution of each algorithm are not presented here. Instead, readers are referred to published sources that provide the detailed methodology for each algorithm evaluated.
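As an illustration of how the eight scikit-learn methods might be set up side by side, the sketch below instantiates one estimator per method and counts training errors on synthetic data. Assumptions: classifier variants are used throughout, with `SVC` standing in for the SVR listed above; library default settings are used rather than the Table 4 control parameters; and the synthetic 17-input, five-class dataset is made up for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models(seed=0):
    """Default-parameter stand-ins for the eight ML methods (not the Table 4 settings)."""
    return {
        "ADA": AdaBoostClassifier(random_state=seed),
        "DT": DecisionTreeClassifier(random_state=seed),
        "KNN": KNeighborsClassifier(),
        "MLP": MLPClassifier(max_iter=500, random_state=seed),
        "NBC": GaussianNB(),
        "QDA": QuadraticDiscriminantAnalysis(),
        "RF": RandomForestClassifier(random_state=seed),
        "SVC": SVC(random_state=seed),   # classifier stand-in for the paper's SVR
    }


# Synthetic 5-class data with 17 inputs (no redundant columns, so QDA is well posed)
X, y = make_classification(n_samples=600, n_features=17, n_informative=8,
                           n_redundant=0, n_classes=5, random_state=0)
y = y + 1  # relabel as classes 1..5

errors = {}
for name, model in build_models().items():
    model.fit(X, y)
    errors[name] = int(np.sum(model.predict(X) != y))  # training-set errors only
print(errors)
```

The DL models (CNN, GRU and LSTM) would be built separately with TensorFlow/Keras and are not sketched here.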

Each ML/DL algorithm considered requires a specific configuration. This mainly involves tuning its control variables to modify its architecture and/or learning rate so that it works effectively with the compiled dataset. The most effective control parameter settings were obtained by multiple trial-and-error evaluations of each algorithm.

The ML/DL methods evaluated in this study (except QDA) have previously been applied, in various configurations, to predict lost circulation in several published studies, which are cited in Table 3. Those cited studies provide methodological details and mathematical descriptions for each method. Most of the studies cited apply either one method or compare a small selection of ML/DL methods. One novelty of this study is that it compares the performance of eleven different ML/DL methods in predicting fluid-loss classes for the same large field dataset. For this study, the twelve models considered are configured with the control parameter values described in Table 4.

Table 4 Control parameters applied to twelve ML/DL models configured to predict fluid-loss classes in the Azadegan field dataset

Class prediction accuracy measures assessed

To assess the loss classification prediction performance of the ML/DL models evaluated, five distinct statistical measures of accuracy commonly used for classification-type problems are applied in this study. These are:

  • The absolute number of prediction errors recorded

  • Accuracy (A) as expressed in Eq. (1):

    $$A = \left( {\text{TP + TN}} \right){/}\left( {\text{TP + TN + FP + FN}} \right)$$
    (1)

where, in terms of numbers of predictions, TP = true positive predictions, TN = true negative predictions, FP = false positive predictions and FN = false negative predictions.

  • Precision (P) as expressed in Eq. (2):

    $$P \, = {\text{TP/}}\left( {\text{TP + FP}} \right)$$
    (2)
  • Recall (R) as expressed in Eq. (3):

    $$R \, = {\text{TP/}}\left( {\text{TP + FN}} \right)$$
    (3)
  • Balanced F-score (F1), the harmonic mean of P and R, as expressed in Eq. (4):

    $$F_{1} = \left( {2 \, \times \, \left( {P \, \times \, R} \right)} \right)/\left( {P \, + R} \right)$$
    (4)

In addition to these five commonly used measures of classification accuracy, confusion matrices customized for this study offer a useful visual means of simultaneously viewing the prediction accuracy and errors associated with each of the five fluid-loss classes. The top-left to bottom-right diagonal of a confusion matrix displays the TP predictions for each class. For a perfect prediction performance, all predictions are located along that diagonal.
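Per-class values of Eqs. (1)–(4) follow directly from a confusion matrix by treating each class in turn as the positive class (one-vs-rest). The NumPy sketch below is one way to do this; the labels are a made-up example, and returning 0 when a denominator is zero is an assumption (similar in effect to scikit-learn's default `zero_division` handling).

```python
import numpy as np


def confusion_matrix(y_true, y_pred, classes):
    """Rows = actual class, columns = predicted class."""
    index = {c: k for k, c in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm


def per_class_metrics(cm):
    """One-vs-rest A, P, R and F1 for each class, following Eqs. (1)-(4)."""
    total = cm.sum()
    results = []
    for k in range(cm.shape[0]):
        tp = cm[k, k]
        fp = cm[:, k].sum() - tp      # predicted as class k but actually another class
        fn = cm[k, :].sum() - tp      # actually class k but predicted as another class
        tn = total - tp - fp - fn
        a = (tp + tn) / total
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        results.append({"A": a, "P": p, "R": r, "F1": f1})
    return results


y_true = [1, 1, 1, 2, 2, 3, 5, 5]
y_pred = [1, 1, 2, 2, 2, 3, 5, 1]
cm = confusion_matrix(y_true, y_pred, classes=[1, 2, 3, 4, 5])
for cls, m in zip([1, 2, 3, 4, 5], per_class_metrics(cm)):
    print(cls, {k: round(float(v), 3) for k, v in m.items()})
```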

Four other prediction performance accuracy metrics are also used to assess and compare the training and testing performance of each ML/DL model evaluated. These are:

  • Mean Square Error (MSE) as expressed in Eq. (5):

    $${\text{MSE}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {(X_{i} ) - (Y_{i} )} \right)^{2}$$
    (5)

where Xi = actual recorded loss class value of data record i, and Yi = predicted loss class value of data record i. MSE is used as the objective function in the training process of the ML/DL models.

  • Root Mean Square Error (RMSE) as expressed in Eq. (6):

    $${\text{RMSE = }}\sqrt {{\text{MSE}}}$$
    (6)
  • Correlation Coefficient (R) between actual Xi and predicted Yi values (on a scale between − 1 and + 1) as expressed in Eq. (7):

    $$R = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {X_{i} - X_{\text{mean}}} \right) \left( {Y_{i} - Y_{\text{mean}}} \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{n} \left( {X_{i} - X_{\text{mean}}} \right)^{2} \mathop \sum \nolimits_{i = 1}^{n} \left( {Y_{i} - Y_{\text{mean}}} \right)^{2} } }}$$
    (7)
  • Coefficient of Determination (R2) (on a scale between 0 and 1), calculated as the square of R from Eq. (7).
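The four regression-style measures of Eqs. (5)–(7) can be computed with a few lines of NumPy, as sketched below with made-up class values. Note that R2 here follows the text's definition (the square of the Pearson R from Eq. (7)), which can differ from the 1 − SS_res/SS_tot definition used by, for example, scikit-learn's `r2_score`.

```python
import numpy as np


def regression_style_metrics(actual, predicted):
    """MSE, RMSE, Pearson R and R^2 (= R squared) following Eqs. (5)-(7)."""
    x = np.asarray(actual, dtype=float)
    y = np.asarray(predicted, dtype=float)
    mse = np.mean((x - y) ** 2)                       # Eq. (5)
    rmse = np.sqrt(mse)                               # Eq. (6)
    dx, dy = x - x.mean(), y - y.mean()
    r = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))  # Eq. (7)
    return {"MSE": mse, "RMSE": rmse, "R": r, "R2": r ** 2}


actual = [1, 1, 2, 3, 4, 5]
predicted = [1, 2, 2, 3, 4, 4]
print(regression_style_metrics(actual, predicted))
```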

Pre-processing and process workflow applied to the fluid-loss dataset

The original dataset (Mardanirad et al. 2021) included 22 variables and several incomplete data records. The data records were filtered to remove incomplete records and to omit descriptive variables recorded as qualitative string (text) descriptions. Data records associated with mechanical equipment problems (e.g. mud-pump failures) were also removed. After filtering, the dataset was reduced to 65,376 data records (Tables 1, 2 and 3) involving 17 input variables with complete and valid values. The dependent variable “loss” was transformed from string descriptions of the five fluid-loss classes into numerical values (1–5) for each data record. That numerical transformation made it possible to use the mean squared error (MSE) loss function for algorithm optimization.
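The filtering and label-encoding steps described above could look like the following pure-Python sketch. The record layout, the field names (`loss`, `equipment_problem`, `depth`, `torque`) and the class label strings are hypothetical, since the original dataset's exact column names and label strings are not given here. Descriptive text variables are dropped implicitly by keeping only the named numeric inputs.

```python
# Hypothetical label strings; the original dataset's exact strings are not given here
LOSS_LABELS = {"no loss": 1, "seepage": 2, "partial": 3, "severe": 4, "complete": 5}


def clean_records(records, input_names):
    """Drop incomplete or equipment-problem records and encode the loss label as 1-5."""
    cleaned = []
    for rec in records:
        if rec.get("equipment_problem"):          # e.g. mud-pump failure
            continue
        values = [rec.get(name) for name in input_names]
        if any(v is None for v in values) or rec.get("loss") not in LOSS_LABELS:
            continue                              # incomplete record
        cleaned.append((values, LOSS_LABELS[rec["loss"]]))
    return cleaned


records = [
    {"depth": 2500.0, "torque": 12.0, "loss": "seepage"},
    {"depth": 2510.0, "torque": None, "loss": "no loss"},                # incomplete
    {"depth": 2520.0, "torque": 11.0, "loss": "severe", "equipment_problem": True},
    {"depth": 2530.0, "torque": 13.0, "loss": "complete"},
]
print(clean_records(records, ["depth", "torque"]))
```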

The filtered data record variables were then normalized to a range of −1 to +1 to prevent scaling biases affecting the algorithms. Normalization was achieved using Eq. (8).

$${\text{Norm}}x_{i}^{m} = 2*\left( {\frac{{x_{i}^{m} - x\min^{m} }}{{x\max^{m} - x\min^{m} }}} \right) - 1$$
(8)

where \(\mathrm{Norm}x_{i}^{m}\) is the normalized variable value, \(x_{i}^{m}\) is the actual recorded value of the mth variable of the ith data record, \(x\mathrm{min}^{m}\) is the minimum value of the mth variable considering all filtered data records and \(x\mathrm{max}^{m}\) is the maximum value of the mth variable considering all filtered data records.
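Eq. (8) corresponds to column-wise min–max scaling onto [−1, +1]. A NumPy sketch follows; the guard for constant columns (which would otherwise divide by zero) is an addition not discussed in the text. scikit-learn's `MinMaxScaler(feature_range=(-1, 1))` performs the same transformation.

```python
import numpy as np


def normalize_minus1_to_1(X):
    """Column-wise min-max scaling to [-1, +1] as in Eq. (8)."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)                 # xmin^m over all filtered records
    x_max = X.max(axis=0)                 # xmax^m over all filtered records
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard against constant columns
    return 2.0 * (X - x_min) / span - 1.0


print(normalize_minus1_to_1([[0, 10], [5, 20], [10, 30]]))
```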

Trial-and-error tests were also conducted with various random splits of the data records between training and testing subsets. These tests established that randomly assigning 75% of the data records to the training subset and 25% to the testing subset generated meaningful fluid-loss classifications. As the distribution of the fluid-loss classes in the dataset is highly imbalanced, with relatively few data records falling into the complete-loss class, the testing subset needs to be quite large in order to contain a reasonable number of data records in that class. The testing subset was held independently and was in no way involved in algorithm training. The trained models were then applied to predict the fluid-loss class of each data record in the testing subset. The dataset was randomly split into training and testing subsets ten times, each split was evaluated by each algorithm, and the average RMSE, MAE and R2 values of the ten random cases were taken as representative of each model’s prediction performance.
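The repeated random-split evaluation can be sketched with scikit-learn's `train_test_split`. The sketch assumes a random forest as the example model, synthetic data in place of the Azadegan records, and the squared Pearson R as R2 (as defined in Eq. (7)); MAE is omitted for brevity. Passing `stratify=y` to `train_test_split` would be one way to keep the rare classes proportionally represented in each testing subset, although the text describes plain random splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def repeated_split_scores(model_factory, X, y, n_repeats=10, test_size=0.25):
    """Average test RMSE, R^2 (squared Pearson R) and #errors over repeated random splits."""
    rmses, r2s, errors = [], [], []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model = model_factory(seed)
        model.fit(X_tr, y_tr)                     # the testing subset is never used in training
        pred = model.predict(X_te).astype(float)
        rmses.append(np.sqrt(np.mean((y_te - pred) ** 2)))
        r2s.append(np.corrcoef(y_te, pred)[0, 1] ** 2)
        errors.append(int(np.sum(pred != y_te)))
    return {"RMSE": np.mean(rmses), "R2": np.mean(r2s), "#Errors": np.mean(errors)}


X, y = make_classification(n_samples=800, n_features=17, n_informative=8,
                           n_redundant=0, n_classes=5, random_state=1)
y = (y + 1).astype(float)  # classes 1..5 as numbers, as in the study
scores = repeated_split_scores(lambda s: RandomForestClassifier(random_state=s), X, y)
print(scores)
```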

Figure 5 illustrates diagrammatically the workflow applied in executing and evaluating the ML/DL algorithms to predict fluid-loss classes for the Azadegan oil field dataset.

Fig. 5 Lost circulation workflow for ML/DL models

Results

Table 5 lists the loss class prediction accuracy achieved by the 12 ML/DL models evaluated for the training and testing subsets and the complete dataset in terms of RMSE, R2 and absolute number of prediction errors (#Errors). The model execution times are also listed in Table 5. It is clear that the models vary substantially in terms of both their prediction accuracy and execution speeds.

Table 5 Prediction accuracy comparison of the twelve ML/DL models evaluated to predict fluid-loss classes in the Azadegan field dataset

The tree-based (RF, DT and ADA) and data-matching (KNN) ML models outperform the correlation/regression-based models in terms of all measures of fluid-loss class prediction accuracy (Tables 5 and 6). SVR is the best-performing regression-based model, comfortably outperforming the DL models. The more complex CNN1 model provides the best prediction accuracy of the DL models, all of which achieve testing subset #Errors < 500 and R2 > 0.9. The three poorest-performing models (MLP, QDA and NBC) do not deliver good prediction outcomes for this dataset, with #Errors > 2000 and R2 < 0.5 for the testing subset.

Table 6 displays the prediction performance of the six most accurate models in terms of their A, P, R and F1 metrics for all data records and representative testing subsets.

The model execution times on the same laptop (1.8 GHz microprocessor with 4 MB cache and 2 cores) vary from 3.57 s (NBC) to 7798 s (CNN1). The complexity of the DL models is reflected in their execution times. The more complex architecture of the CNN1 model results in an execution time more than six times longer than that of the CNN2 model. Indeed, there is a difference of more than three orders of magnitude (7798/3.57 ≈ 2200) in execution time between the fastest and the slowest models. Whereas the simpler regression-based models (QDA and NBC) execute in 3–4 s, they do not deliver useful fluid-loss class prediction accuracy. On the other hand, two of the tree-based models (ADA and DT) also execute in 3–4 s, and those models do deliver high prediction accuracy (Table 5).

The RF model delivers the best prediction accuracy in terms of #Errors for the testing subset (just 35) and executes in 36.29 s (Table 5). The top-five models in terms of prediction accuracy all achieve perfect or near-perfect prediction accuracy for the training subset. This indicates that these models over-fit the data to some extent. Despite that evidence of over-fitting, these models consistently go on, across multiple random splits of the complete dataset, to predict the testing subsets (held independent of the training process) with substantially lower errors than the other models. Comparison of the training and testing subset results for the other models indicates that they have not over-fitted the training subset, yet they deliver poorer prediction accuracy for the testing subset than the top-five performing models.

Consideration of the prediction accuracies in terms of the A, P, R and F1 metrics (Table 6) reveals some key differences between the models' performances in predicting specific fluid-loss classes. The RF model clearly outperforms all other models in its prediction performance for fluid-loss classes 1, 2 and 3 (i.e. the classes containing most of the data records). However, in terms of the P and F1 metrics, the DT model achieves the best prediction performance for the complete fluid-loss class 5, and the ADA model outperforms the other models for the severe fluid-loss class 4. In statistical terms, because of the much smaller numbers of data records they contain, these two classes are less significant to the overall prediction performance of the models. In decision-making terms, however, classes 4 and 5 are the two categories of fluid losses that need to be predicted quickly and reliably, so that mitigation actions can be promptly undertaken to resolve them. These results suggest that, from an application perspective, there is potential value in executing the RF, DT and ADA models as an ensemble, comparing their predictions to provide the best prediction accuracy for all five fluid-loss classes.
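The text does not specify how such an ensemble would combine the three models' predictions. One possible rule, sketched below purely as a hypothetical illustration, keeps the RF prediction by default and defers to ADA when it predicts class 4 and to DT when it predicts class 5, mirroring the class-specific strengths noted above. Hard majority voting (for example scikit-learn's `VotingClassifier` with `voting="hard"`) would be a simpler alternative.

```python
import numpy as np


def combine_predictions(rf_pred, dt_pred, ada_pred):
    """Hypothetical class-routing ensemble: RF by default, overridden by the
    specialist models for the rare classes where they performed best."""
    rf_pred, dt_pred, ada_pred = map(np.asarray, (rf_pred, dt_pred, ada_pred))
    combined = rf_pred.copy()
    combined[ada_pred == 4] = 4    # trust ADA when it flags severe loss
    combined[dt_pred == 5] = 5     # trust DT when it flags complete loss (applied last)
    return combined


rf = [1, 2, 3, 3, 1]
dt = [1, 2, 3, 3, 5]
ada = [1, 2, 4, 3, 1]
print(combine_predictions(rf, dt, ada))   # [1 2 4 3 5]
```

When ADA and DT disagree (ADA predicts 4, DT predicts 5), this rule gives DT precedence because its override is applied last; that ordering is itself an assumption.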

Discussion

The results presented in Tables 5 and 6 clearly reveal that the tree-based ML models outperform the other ML/DL models evaluated for fluid-loss classification from the drilling variables of the large and imbalanced Azadegan oil field dataset. Although the basic prediction accuracy measures (RMSE, R2 and #Errors) are useful for distinguishing the models delivering high overall prediction accuracy (Table 5), it is essential to consider the more specific classification accuracy metrics (A, P, R and F1) to reveal the models that deliver the best prediction accuracy for each fluid-loss class (RF, DT and ADA; Table 6). However, to gain a better understanding of the detailed prediction performance of each model evaluated, confusion matrices need to be considered.

For this study, a novel, expanded confusion matrix format is developed (Figs. 6 and 7) that combines the basic confusion matrix with details of the error numbers and precision achieved for each fluid-loss class. This format, together with the class accuracy metric values (Table 6), makes it easier to visualize each model’s performance in more detail, whether for the full dataset or for the testing subset predictions.
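A text rendering of an expanded confusion matrix of this general kind can be sketched as follows. The layout (an extra `Errors` column giving misclassified records per actual class, and an extra `Precision` row per predicted class) is an assumption about the format, not a reproduction of Figs. 6 and 7, and the matrix values are made up.

```python
import numpy as np


def expanded_confusion_matrix(cm, classes):
    """Render a confusion matrix (rows = actual, columns = predicted) with an
    extra 'Errors' column (misclassified records per actual class) and an extra
    'Precision' row (TP / column total per predicted class)."""
    cm = np.asarray(cm)
    errors = cm.sum(axis=1) - np.diag(cm)
    col_totals = cm.sum(axis=0)
    precision = np.divide(np.diag(cm), col_totals,
                          out=np.zeros(len(classes)), where=col_totals > 0)
    header = "actual\\pred " + "".join(f"{c:>8}" for c in classes) + f"{'Errors':>8}"
    lines = [header]
    for c, row, err in zip(classes, cm, errors):
        lines.append(f"{c:>11} " + "".join(f"{v:>8d}" for v in row) + f"{err:>8d}")
    lines.append(f"{'Precision':>11} " + "".join(f"{p:>8.4f}" for p in precision))
    return "\n".join(lines), errors, precision


text, errors, precision = expanded_confusion_matrix(
    [[5, 1, 0], [0, 3, 1], [2, 0, 4]], classes=[1, 2, 3])
print(text)
```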

Figure 6 displays the expanded confusion matrices for the six best-performing models, as ranked in Table 5. They more clearly display the class imbalance of the dataset and the superior performance of the RF model, particularly for the most abundant fluid-loss classes 1, 2 and 3. For class 4, the RF and ADA models both achieve 100% precision (i.e. no false positives), but ADA yields one fewer false negative and therefore a slightly higher recall (and F1 score) than the RF model. For class 5, the DT model is the only one to achieve 100% precision (i.e. no false positives). However, the top-five performing models all correctly identify 32 of the 34 data records in class 5 as belonging to that class (i.e. they all achieve a recall of 0.9412, Table 6), and all misclassify the remaining two class 5 data records as class 1.

Fig. 6 Expanded confusion matrices for the six best performing models in classifying fluid-loss classes for the Azadegan oil field dataset

Fig. 7 Expanded confusion matrices for the four DL models in classifying fluid-loss classes for the Azadegan oil field dataset

Whereas the top-five performing models all achieve high prediction performance for fluid-loss classes 4 and 5 (267 or 268 correct predictions for class 4; 32 of 34 correct predictions for class 5), the CNN1 model performs less well for these two classes (261 correct predictions for class 4; 25 correct predictions for class 5). The CNN1 model also produces substantially more false positives for class 4, resulting in its lower precision score of 0.9094 for that class. Although the CNN1 model achieves a perfect precision score for class 5 (no false positives), its predictions involve nine false negatives for that class, resulting in much lower R (0.7353) and F1 (0.8475) scores than those of the other models shown in Fig. 6.

All the DL models show poorer prediction performance for class 4 and 5 data records, as the expanded confusion matrices in Fig. 7 reveal. Indeed, the GRU and LSTM models fail to predict any of the class 5 data records. The GRU model performs little better for class 4, with only 1 correct prediction out of 270 class 4 data records. The LSTM model performs somewhat better for class 4, with 99 correct predictions, but this is still far below the high-performing models. The two CNN models provide more accurate predictions than the GRU and LSTM models, making them the best-performing DL models considered. The simpler CNN2 model actually achieves better prediction results for fluid-loss class 5 than the CNN1 model (29 versus 25 correct predictions out of 34). However, the more complex CNN1 model outperforms the CNN2 model in its predictions for all other classes.

The DL models have complex network structures and the capacity to handle large datasets. Despite this, they clearly struggle to process this highly imbalanced classification task adequately for the fluid-loss dataset evaluated. They appear to concentrate on the most populated classes (classes 1, 2 and 3) at the expense of the sparsely populated classes (classes 4 and 5). Consequently, they must be considered unreliable for correctly predicting severe and complete fluid-loss incidents from the drilling variables available for the dataset evaluated. Their long execution times, particularly for the CNN1 model, also pose a problem. In real time while drilling, fluid-loss classes need to be identified rapidly so that prompt mitigation decisions can be made. This means within seconds (as achieved by the RF, DT and ADA models), not in an hour or more as required by all but one of the DL models.

The technique presented for the Azadegan oil field dataset, and the novel expanded confusion matrix developed, could be applied more generally to other partially developed oil and gas fields. There they could be used to predict fluid-loss classes from large datasets of drilling variables and to assess classification prediction performance in detail. Clearly, the models would require careful training and testing on such datasets. Several models should ideally be evaluated to identify the best-performing model for each fluid-loss class. The results of this study suggest that large datasets, particularly those with highly imbalanced class distributions, will not always be better predicted by DL methods than by ML methods. Hence, ML methods, particularly those that are tree-based or involve data matching, should not be ruled out on the grounds of dataset size alone. Moreover, when DL model performance is compared against ML performance for classification problems, the comparison should include tree-based and data-matching methods as well as neural networks and other regression-based models.
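Selecting the best-performing model for each fluid-loss class, as suggested above, involves two steps. First, a per-class score is computed for every candidate model. Second, the highest-scoring model is taken class by class. The sketch below uses per-class F1 scores. The model names and the twelve toy labels are hypothetical and do not come from the Azadegan dataset.

```python
def per_class_f1(y_true, y_pred, classes):
    """F1 score for each class, using F1 = 2TP / (2TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def best_model_per_class(y_true, predictions, classes):
    """predictions maps model name -> predicted labels; returns (best, score table)."""
    table = {name: per_class_f1(y_true, pred, classes) for name, pred in predictions.items()}
    # max() returns the first maximal key, so ties go to the model listed first.
    best = {c: max(table, key=lambda name: table[name][c]) for c in classes}
    return best, table


# Hypothetical labels and predictions from three hypothetical models.
y_true = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5]
predictions = {
    "model_a": [1, 1, 1, 1, 2, 2, 2, 3, 3, 1, 4, 1],
    "model_b": [1, 1, 1, 2, 2, 2, 1, 3, 3, 4, 4, 1],
    "model_c": [1, 1, 2, 1, 2, 2, 2, 3, 1, 4, 1, 5],
}
best, table = best_model_per_class(y_true, predictions, [1, 2, 3, 4, 5])
print(best)  # {1: 'model_a', 2: 'model_a', 3: 'model_a', 4: 'model_b', 5: 'model_c'}
```

Different models can win different classes, so a single overall error count would hide which model to trust for the rare classes.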

Conclusions

The prompt and accurate prediction of wellbore fluid losses from drilling variables is desirable. It can help to reduce drilling costs and time, and it can prevent fluid-loss events from developing into serious well-control problems. Drilling datasets relating to wellbore fluid-loss classes typically exhibit two significant characteristics: 1) individual drilling variables tend not to be highly correlated with the fluid-loss classes identified, and 2) the distribution of data records across fluid-loss classes tends to be highly imbalanced, with the majority in the no/seepage-loss classes and very few in the severe/complete-loss classes. The first characteristic limits the class prediction accuracy that many regression-based models can achieve. The second characteristic means that prediction accuracy assessed purely in terms of total prediction errors can be misleading. It may mask poor prediction accuracy for the more sparsely represented classes in a dataset.
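The masking effect described above is easy to demonstrate. In the sketch below, the class counts are hypothetical and were chosen only to mimic a strongly imbalanced distribution. The "model" simply labels every class 4 and class 5 record as class 1. Overall accuracy stays high, while recall for the two rare classes is zero.

```python
def accuracy_and_class_recall(y_true, y_pred, classes):
    """Overall accuracy plus recall for each class."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    recall = {}
    for c in classes:
        hits = [p == c for t, p in zip(y_true, y_pred) if t == c]
        recall[c] = sum(hits) / len(hits)
    return accuracy, recall


# Hypothetical class counts (illustrative only, not the Azadegan distribution).
counts = {1: 6000, 2: 2500, 3: 1200, 4: 250, 5: 50}
y_true = [c for c, n in counts.items() for _ in range(n)]
# A "model" that is perfect on classes 1-3 but never predicts classes 4 or 5.
y_pred = [t if t <= 3 else 1 for t in y_true]

accuracy, recall = accuracy_and_class_recall(y_true, y_pred, list(counts))
macro_recall = sum(recall.values()) / len(recall)
print(f"accuracy = {accuracy:.3f}, macro-averaged recall = {macro_recall:.3f}")
# accuracy = 0.970, macro-averaged recall = 0.600
print(recall)  # {1: 1.0, 2: 1.0, 3: 1.0, 4: 0.0, 5: 0.0}
```

Per-class recall, or a macro average that weights every class equally, exposes the failure that the 97% overall accuracy hides.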

This study uses the large 20-well dataset (65,376 data records) for the Azadegan oil field (Iran), with 17 drilling variables and 5 wellbore fluid-loss classes distinguished. With it, the study assesses the prediction performance of twelve models based on eight machine learning (ML) and three deep learning (DL) techniques. Of these models, the tree-based algorithms (random forest (RF), decision tree (DT) and AdaBoost (ADA)) and the K-nearest neighbour (KNN) data-matching algorithms deliver the best predictions. They make fewer than 100 errors across all data records, and the RF model makes only 35 errors. These models outperform the regression-based ML and DL models in prediction accuracy and require much shorter execution times than the DL models.

The class accuracy metrics (precision, recall and F1 score) reveal that the RF model is the most accurate for wellbore fluid-loss classes 1 to 3 (less significant losses and most data records). On the other hand, for wellbore fluid-loss classes 4 and 5 (more significant losses and fewest data records), the ADA and DT models, respectively, are the most accurate. These results imply that it would make sense in practice to run all three of these tree-based models as an ensemble to provide prediction comparisons for each fluid-loss class.
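The sketch below shows one possible way to run the three tree-based models side by side. It uses a hypothetical combination rule and is not the authors’ method. The rule takes a majority vote over the RF, ADA and DT predictions for a record. It breaks ties in favour of the most severe class, a conservative choice assumed here for illustration. It also flags any disagreement so that the record can be reviewed. Trained models are not included; the predictions are supplied as plain integers (1 = least severe, 5 = complete loss).

```python
from collections import Counter


def combine_tree_predictions(preds):
    """Combine one record's class predictions from several models.

    preds maps a model name (e.g. "RF", "ADA", "DT") to its predicted fluid-loss
    class, where larger numbers denote more severe losses (1 ... 5).
    Returns (combined_class, models_disagree).
    """
    votes = Counter(preds.values())
    top = max(votes.values())
    tied = [c for c, n in votes.items() if n == top]
    # ASSUMPTION (illustrative only): when several classes share the most votes,
    # take the most severe one, treating a missed severe loss as the costlier error.
    combined = max(tied)
    return combined, len(votes) > 1


records = [
    {"RF": 2, "ADA": 2, "DT": 2},  # unanimous
    {"RF": 1, "ADA": 4, "DT": 4},  # two-to-one majority for class 4
    {"RF": 1, "ADA": 4, "DT": 5},  # three-way split
]
for r in records:
    print(r, "->", combine_tree_predictions(r))
```

The disagreement flag preserves the "prediction comparison" aspect. A drilling team could inspect any record where the three models do not agree instead of relying on the vote alone.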

Model class prediction accuracy can be usefully displayed with an expanded confusion matrix format. This format integrates and visualizes the models’ accuracy and their ability to address the class imbalance characteristic of lost circulation datasets. These displays reveal that the ML models mentioned perform better than the DL models. In particular, the DL models (CNN, GRU and LSTM) are substantially less accurate for fluid-loss classes 4 and 5. The CNN model can be improved to some extent by adding layers. However, doing so increases its complexity and execution time, and the modified model still cannot match the performance of the high-performing ML methods on the Azadegan dataset.

The key findings of this study are summarized as follows:

  • An ensemble of tree-based ML models (RF, ADA and DT) provides the best prediction accuracy for lost circulation classes;

  • That ensemble overcomes the class imbalance issues typically posed by lost circulation datasets;

  • ML algorithms are more accurate and quicker to run than DL algorithms for the large dataset evaluated;

  • DL algorithms perform particularly poorly for those classes with the fewest data records; and

  • The novel expanded confusion matrix display provides excellent visualization of classification accuracy and class imbalance performance for lost circulation datasets.