Shale gas well productivity potential evaluation based on data-driven methods: case study in the WY block

Evaluating the productivity potential of shale gas well before fracturing reformation is imperative due to the complex fracturing mechanism and high operation investment. However, conventional single-factor analysis method has been unable to meet the demand of productivity potential evaluation due to the numerous and intricate influencing factors. In this paper, a data-driven-based approach is proposed based on the data of 282 shale gas wells in WY block. LightGBM is used to conduct feature ranking, K-means is utilized to classify wells and evaluate gas productivity according to geological features and fracturing operating parameters, and production optimization is realized through random forest. The experimental results show that shale gas productivity potential is basically determined by geological condition for the total influence weights of geologic properties take the proportion of 0.64 and that of engineering attributes is 0.36. The difference between each category of well is more obvious when the cluster number of well is four. Meanwhile, those low production wells with good geological conditions but unreasonable fracturing schemes have the greatest optimization space. The model constructed in this paper can classify shale gas wells according to their productivity differences, help providing suggestions for engineers on productivity evaluation and the design of fracturing operating parameters of shale gas well.


Corresponding empirical improvement in squared-error a(i)
Mean distance between i and all other data points in the same cluster d (i, j) Distance between data points i and j in the cluster C i b(i) Smallest mean distance of i to all points in any other cluster B k Between group dispersion matrix W k Within-cluster dispersion N Number of data points C q Point set in cluster q c q Center of cluster q c Center of matrix trace E n q Number of points in cluster q b Individual bootstrap samples B Sum of trees t *

Introduction
The research area in this paper focused on Weiyuan (WY) shale gas block. WY block, which is located in Sichuan basin, is a representative block with the achievement of industrial production in the Lower Cambrian shale gas exploration in China (Cao et al. 2020). The whole block has 1 3 a monoclinic structure inclined to the southeast, and faults are not well developed in this block. The reservoir properties are generally high, with total organic carbon (TOC) value of 2.1-8.1% and average of 3.2%, porosity of 2.1-9.6% and average of 5.9%, brittle mineral content of 60-82% and average of 74%, gas content of 3.3-8.5 m 3 /t and average of 5.5 m 3 /t. The shale gas play is characterized by small thickness (1-7 m and an average of 3.8 m) of high-quality reservoir, large horizontal stress difference (7.7-18.7 MPa) (Ma et al. 2020;Liu et al.2021).
Shale gas is an important unconventional natural gas resource with huge exploration potentials (Yao et al. 2019;Zhu et al. 2020). The success of the shale gas revolution in North America further highlights its resource potential and development prospects (Ma et al. 2018). Shale gas reservoirs are characterized by low porosity, low permeability and heterogeneity, and the flow law in the matrix of shale gas reservoirs no longer follows Darcy flow compared with conventional reservoirs (Yan et al. 2017). Due to the ultralow porosity and permeability, it requires the horizontal well drilling and the multi-stage hydraulic fracturing technology to produce the shale gas (Jiang et al. 2020). The productivity of shale gas wells is affected by the energy utilization rate of the fracturing fluid, the closure law of main fractures and induced fractures, etc. (Mohammadmoradi and Kantzas, 2018). The challenge in developing shale gas lies in the inherent problem of multi-parameter decision making, especially when substantial uncertainty may exist with underground conditions and major economic factors. Additionally, since the development cost of shale gas wells is generally higher than that of conventional oil and gas, especially the fracturing cost, it is important to develop wells with high production potential and design reasonable fracturing parameters to improve the economic benefits.
In petroleum geoscience, sweet spots are regarded as target regions within the best production potential (Hauge et al. 2017). Engineers have employed traditional approaches like core analysis, well logging and seismic test to identify the locations of sweet spots (Liu et al.2019;Zhao et al.2019). However, these methods are not persuasive, because there is no obvious equivalence between the measured parameters and the well's productivity, and the fact that there is no uniform standard for the identification of the sweet spot location from one field to another makes conventional methods less applicable. Some scholars have also adopted data-driven methods, such as neural network, fuzzy logic and support vector machine, to conduct productivity evaluation. The usual practice is to use logging data, seismic test data and production data to predict geological parameters such as TOC that related to the location of sweet spots or perform other analyses (Tahmasebi et al. 2017;Qian et al. 2018;Kurison et al. 2019). But the above techniques remain uncertain as they still do not fully explain the relationship between geological parameters and gas productivity. Therefore, the evaluation of shale gas production potential should be established on the determination of the controlling factors of productivity. Furthermore, well optimization will be more reliable when based on the important factors influencing gas recovery.
With the rapid development of AI, machine learning methods are widely applied in reservoir waterflooding (Ma et al.2019), heavy oil recovery (Lu et al.2021), performance evaluation of shale gas (Syed et al.2021) and other oil and gas fields. Compared with data-driven models, the mechanism-based reservoir simulation models for parameter sensitivity analysis are too time-consuming and computationally rigorous to conduct (Bettin et al. 2019;Mo et al. 2019). Selection of the principal parameters influencing gas production should consider both geologic and engineering parameters. Currently, the commonly used data-driven method to determine the main influencing factors of gas production is to establish a prediction model between the influencing factors and productivity parameters. He et al. (2019) established a prediction model of gas production in Marcellus shale gas area with 31 input parameters including well information, reservoir parameters, rock mechanics parameters, completion parameters, and fracturing parameters by using neural network algorithm, and optimized parameters such as perforation cluster number, fracturing stage and proppant volume through this model.  applied deep learning method to predict estimated ultimate recovery (EUR) based on 17 parameters of the geological, engineering and production factors involved with the EUR. Liang et al. (2021) built a multi-objective random forest method to predict the shale gas production performance throughout the entire shale gas production process. However, there were no specific conclusions about the controlling factors of gas productivity. Vikara et al. (2020) used gradient boosted regression tree (GBRT)-based predictive model to estimate ultimate recovery and concluded that net thickness and porosity were the vital influencing factors. However, whether geologic attribute is the first principal controlling factor remains unclear. Meanwhile, it is still not explicit to know how important the engineering factors such as fracturing parameters are to the productivity. Therefore, it is necessary to rank the importance of the factors that affect the productivity of shale gas wells and determine the category of parameter attributes with the largest influence weight. If the weight of geological attribute ranks first, it is of great significance for shale gas wells to develop different fracturing optimization schemes according to different geological conditions to increase the output of single well and reduce production cost.
In view of the current research status, this study seeks to achieve the well productivity evaluation, dominating factors analysis and production optimization through machine learning methods. The model constructed in this paper can give 1 3 engineers a more comprehensive understanding of the effect of various factors on productivity. Additionally, it can help provide suggestions for engineers on the design of fracturing parameters and resource allocation of new wells, which can also improve the economic benefits.

Data description and statistical analysis
The dataset obtained from the 282 fractured wells in the study area contains 12 different types of data items, including 7 geological features (vertical depth, TOC content, porosity, thickness of organic-rich gas reserves, gas saturation, formation pressure coefficient, brittle mineral content), 4 engineering features or fracturing operating features (average fracturing interval length, amount of fracturing fluid, amount of fracturing sand, average fracturing fluid displacement), and first-year daily gas production rate ( Q g ) is taken as the evaluation index of gas productivity. Due to the limited field conditions, some parameters such as permeability cannot be obtained temporarily, but the method proposed in this paper is still feasible and can be extended to more data. Table 1 is the descriptive statistics of all wells, serial number of geological features is 1-7, and that of fracturing operating features is 8-11, and No.12 is the label of the dataset. Among them, geological factors are difficult to be changed artificially, while fracturing factors are more controllable to field engineers, and can be adjusted manually, both of which are related to Q g . The productivity of these wells varies widely, with the maximum Q g value being more than 170 times of the minimum Q g value, and the variance of Q g is large, indicating that the productivity data is scattered. From a mechanical point of view, poor geological conditions or unreasonable fracturing parameters will lead to low fracturing productivity. Therefore, evaluating the productivity potential of shale gas wells prior to fracturing will greatly reduce the probability of low production caused by the failure of fracturing parameter design, thus improving the overall benefit. This paper will calculate the influence weight of each feature through feature ranking, find the principal factors affecting productivity, evaluate productivity potential through feature reconstruction and clustering, and finally establish a random forest predictive model to optimize the comprehensive index of fracturing operating parameters of low production wells. A preliminary data analysis was conducted before the above work. The relationship between each feature and the response variable was independently measured by the univariate feature selection method, and the linear correlation coefficient r between each parameter was calculated by Eq. (1). From Fig. 1, it can be seen that there is almost no strong linear relationship between geological and fracturing operation parameters with Q g , only vertical depth is positively related to formation pressure coefficient. The correlation coefficients of X1 and X2 with X12 are all 0.5, indicating thickness of organic-rich gas reserves and gas saturation are moderately correlated with Q g . Therefore, conventional single-factor analysis method has been unable to meet the demand of productivity potential evaluation due to the numerous and intricate influencing factors. Data-driven methods provide the possibility to solve the problem. And the scattered correlation matrix of these features is shown in Fig. 2. First-year daily gas production rate (Qg) 10 4 m 3 /d 5.49 9.50 0.19 33.30 1 3 r(X, Y) is the linear correlation coefficient between variables X and Y , Cov(X, Y) is the covariance between X and Y , (X) and (Y) are the variances. The value of r ranges from − 1 to 1. The greater the absolute value of r , the stronger the correlation between two parameters, and vice versa. When r is equal to 0, it can only indicate that there is no linear correlation between the two parameters, not irrelevant.

Methodology
There are wide variations in gas productivity of these wells, in order to improve shale gas production, it is of great significance to further explore the reasons for the productivity differences among wells and find out the controlling factors that affect the productivity.
Specifically, the data from 282 shale gas wells will be used to: (1) calculate the importance weight of each influencing factors and conduct feature ranking by LightGBM.
(2) Calculate comprehensive index of geological parameters and fracturing operating parameters through feature reconstruction, and combine new features with first-year daily gas production rate to form a new dataset. (3) Classify the new dataset so as to evaluate the gas productivity potential through K-means. (4) Using random forest predictive model to optimize wells with low Q g .

Feature ranking by LightGBM
The feature ranking in this paper is model-based, because some machine learning methods have a scoring mechanism for features. Therefore, LightGBM (Light gradient boosting machine learning) algorithm is selected to establish the prediction model between each feature and daily gas production, and its ranking mechanism is used to calculate the influence weight of geological features and fracturing parameters.
LightGBM is an improved variant of gradient boosting decision tree (GBDT) algorithm (Dj et al.2020); it combines the predictions of multiple decision trees to make the final prediction generalizes well and contains two novel techniques: Gradient-based One-Side Sampling and Exclusive Feature Bundling to deal with large number of data samples and large number of features, respectively (Ke et al. 2017). LightGBM algorithm divides the training set into multiple models, and in each iteration, the selection of the first k features is determined by local voting, and then the selection of the first 2k features is determined by global voting. The default training decision tree in Light-GBM algorithm uses histogram algorithm, which does not need to find splits on the classified feature values, but stores continuous feature values in discrete bins which are used to construct feature histograms during the training process. This reduces the number of data instances and features, thus shortening training time (Qin et al. 2021). LightGBM can be used for featuring ranking since greedy feature selection already happens in the algorithm when selecting splitting features, for example, splitting features and splitting points of regression trees are found to minimize the squared-error loss for any given partition of the data. Moreover, as a byproduct, a sorted list of relative importance of features is automatically generated for each boosted trees model (Feng et al.2009). Friedman (J.H, 2001) generalizes the global importance of feature x j is measured by the average value of the importance of feature x j in a single tree: M is the number of trees, and the relative importance of feature x j in each tree is calculated as: where the summation is over the internal nodes t of a L -terminal node tree T , v t is the splitting feature associated with node t , and Î 2 j is the corresponding empirical improvement in squared-error as a result of the split.
Through this method, the influence weight of each feature, the total influence weight of geological features and (2) Heat diagram of linear correlation coefficient of each parameter, X1 is vertical depth, X2 is TOC content, X3 is porosity, X4 is thickness of organic-rich gas reserves, X5 is gas saturation, X6 is formation pressure coefficient, X7 is brittle mineral content, X8 is average fracturing fluid displacement, X9 is average fracturing interval length, X10 is amount of fracturing fluid, X11 is amount of fracturing sand, X12 is Q g 1 3 fracturing operating features will be obtained. Thus, features have greater impact on gas production will be known.

Feature reconstruction
In order to facilitate the classification of shale gas wells, it is necessary to reduce the dimensionality of the original feature parameters, and feature reconstruction is used to obtain the comprehensive index of geologic features and fracturing operating features, respectively.
The commonly used method of feature dimension reduction is feature extraction. Feature extraction refers to the use of some specific feature extraction methods to transform the original data, such as PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), etc. (Martinez et al. 2001;Shlens, 2014). The dimension of the data is reduced after feature extraction. Although the new feature has statistical significance and retains the main information of the original data, the physical meaning is not clear. Feature construction refers to artificially constructing new features from original data, such as using mixed attributes or combined attributes to create new features, or decomposing and segmenting original features to create new features. Therefore, this paper will adopt feature construction, combined with domain knowledge to reconstruct new features to achieve feature reduction. The detailed calculation process is as follows.
A data set with N samples, M features and F attributes can be set as X N×M . Suppose m i is the number of features belonging to the i th attribute, thus . Therefore, c 1 , c 2 and Qg will form a new data matrix of 282 rows and 3 columns.

Productivity potential evaluation by K-means
After constructing the new data set, the wells will be classified by clustering algorithm, and the productivity potential will be evaluated, so as to select those low productivity wells with good geological conditions but unreasonable fracturing parameters design to facilitate subsequent productivity optimization. K-means algorithm is selected for well classification as it is one of the most commonly used and important clustering algorithms, the basic idea is to find the partition scheme of K clusters through iteration, and use the mean value of these K clusters to represent the corresponding sample types, so as to minimize the total error (Chu X et al. 2020). Suppose 1 , 2 , … , k are K cluster centers, r nk ∈ {0, 1} is used to indicate whether x n belongs to cluster K , the loss function is defined as: The cost function of above formula can only be minimized by iterative methods. The K-means algorithm clusters samples into k clusters, where k is given by the user. The solution process of the algorithm is described as follows: (1) Input N samples x 1 , x 2 , … , x N , x n ∈ R D , and input the proposed number of clustering k (2) Randomly initialize K vectors of D dimension, and select K different sample points as the initial clustering center (3) Repeat the following process until convergence For each object x n , calculate the class it should belong to, as in the following formula: Recalculate the centroid of the class: The iterative convergence definition is that the clustering center remains unchanged or the sum of distances from each sample to the corresponding clustering center no longer has a great change. Since the ground truth labels are not known, two internal evaluation methods, Silhouette coefficient (SC) and Calinski-Harabasz (CH) index, will be adopted as the evaluation indexes of clustering performance.
SC was first proposed by Peter (1999); it is based on the comparison of its tightness and separation, and can be used to evaluate the influence of different algorithms or different operation modes on clustering results based on the same original data. SC ranges from − 1 to 1, the closer the value is to 1, the more reasonable the clustering result is. If the value is close to − 1, it means that the sample should be classified into another cluster. And the closer the value is to − 1, the more reasonable the sample clustering should be classified into another cluster。. The SC for a set of samples is given as the mean of the SC for each sample.
The SC of object i is: a(i) is the mean distance between i and all other data points in the same cluster,d(i, j) is the distance between data points i and j in the cluster C i ,b(i) is the smallest mean distance of i to all points in any other cluster, of which i is not a member.
The CH index, also known as the variance ratio criterion-can be used to evaluate the clustering model, where 1 3 a higher CH score relates to a model with better defined clusters.
B k is between group dispersion matrix,W k is withincluster dispersion,N is the number of data points, C q is the point set in cluster q , c q is the center of cluster q , c is the center of matrix trace E , and n q is the number of points in cluster q.

Production optimization through random forest
Through the previous steps, wells with good geological conditions but low productivity will be selected for optimization, and random forest (RF) will be used to establish a regression model between new characteristic parameters and Q g . When optimizing the gas productivity of a single well, fix the geological index of the well, change the fracturing operating index and iterate through the trained predictive model to search for the corresponding fracturing operating index when the output is maximum, so as to achieve productivity optimization.
RF is an ensemble of CART (Classification and regression tree) and uses a bootstrap method based on the classification and regression tree analysis to predict a continuous response variable (Di Fina et al.2015;Ftpa B et al.2021). Regression in RF is generated as an ensemble estimate from a number of decision trees based on bagging; the final output is the average of outputs of individual trees (Prasad et al.2006;Adusumilli et al.2013) b is the individual bootstrap samples, B is the sum of trees, t * b is the individual learner, and the individual decision tree: z * bk (k=1,2,…,K ) is the k th training sample with pairs of values for the target variable ( y ) and covariates(x ):z * bi = (x k , y k ). (10)

Results and discussion
LightGBM algorithm is selected as the feature selection algorithm to determine the importance weight of each feature, and normalization processing was conducted to eliminate any differences in the order of magnitude among the various input parameters. In the training process, 80% of the data set is divided into training set and 20% is divided into test set, and the grid search method is used to adjust the hyperparameters of the LightGBM predictive model, and the hyperparameters are shown in Table 2. In this paper, Mean square error (MSE) is used as the evaluation index of the regression model, it is calculated by Eq. (15), where n represents the number of shale gas wells, ŷ i is true value, y i is predicted value, the smaller the value of MSE, the higher the accuracy of the prediction model. MSE value of training set is 0.006 and that of test set is 0.01, which is small enough to make the model feasible.
The influence weight of each feature is shown in Fig. 3. It can be seen that thickness of organic-rich gas reserves and amount of hydraulic fracturing sand are vital parameters affecting gas productivity. The total weight of geological features is 0.64 and that of fracturing operating features is 0.36, indicating that geological condition is more deterministic to gas productivity. The result also proves that gas production of well in poor reservoir condition is difficult to be improved by adjusting engineering parameters.
New sample matrix C 282×3 is constructed according to Eq. (4); the new data statistics is shown in Table 3. Through feature transformation, 11 features were compressed to 2-dimension according to 2 feature attributes, and a threedimensional dataset, including comprehensive index of geological features ( c 1 ), comprehensive index of fracturing operating features ( c 2 ) and Q g was formed for further analysis. Due to the large difference between the values of Q g , c 1 and c 2 , data preprocessing is required before classification. This article uses the method of MinMaxScaler to compress these values to between 0 and 1. Then, clustering will be performed by K-means after the normalization of new dataset. In order to obtain the best cluster number, sensitivity analysis of cluster number k is carried out by using two internal evaluation methods: SC and CH. From Fig. 4 and Fig. 5, it can be seen that when k is 2, the clustering effect is the best, while when k is 3 or 4, the clustering effect is also acceptable. In order to evaluate the productivity potential of the well more reasonably, the cluster diagrams when k is 2, 3, and 4 are drawn, as shown in Fig. 6. When the cluster number is 4, the difference between various types of wells is more obvious, which is more conducive to subsequent analysis. Therefore, this article divides the 282 shale gas wells in the study block into 4 categories, detailed feature description and data statistics of each cluster are shown in Table 4. The values of Q g and c 1 values in cluster 2 are very small, which demonstrate that this type of wells has the poorest geological conditions, low productivity potential, and low economic development benefits. The values of c 1 in cluster 1 and cluster 3 is relatively large, indicating that these two types of   wells have superior geological conditions, while the Q g value of cluster 1 is low, which proves that the fracturing operation parameter design of this type of wells is unreasonable and has large optimization space. If secondary fracturing is required, such kind of wells should be considered first. The fracturing parameter design of wells in cluster 3 is reasonable and has important significance. Table 5 summarizes the parameter range of wells in cluster 3, and the design of fracturing operating parameters can refer to the wells in cluster 3. Conclusively, the geological conditions of the well should be precisely evaluated to determine the productivity potential of the well before fracturing. For wells with good geological conditions, the investment can be increased as designing reasonable fracturing parameters for hydraulic fracturing will bring greater economic benefits, and for wells with poor geological conditions, the investment cost can be relatively reduced. By considering different reservoir characteristics, the fracturing scheme is customized to make it more specific and applicable. After completing the above work, six wells are selected from cluster 1 for productivity optimization, because these wells have large optimization space. Taking c 2 as the optimization variable, the objective function is to maximize Q g by changing the value of c 2 , and RF algorithm is used to establish a predictive model between c 1 , c 2 and Q g of 282 wells. The ratio of the training set to the test set of the predictive model is 7:3. When the accuracy of the predictive model reaches the standard, it starts to iterate. During the optimization process of each well, only the value of c 2 is adjusted while c 1 remained unchanged in every iteration, and the value of c 2 varies in the range of 0 and 1. The results are shown in Fig. 7, the red line in the figure represents the original Q g , and the blue dotted curve is the change trend of the optimized Q g with the number of iterations. It can be seen from the figure that each well has achieved the largest Q g value within the range of 70 iterations. The statistics of the parameters before and after optimization of the 6 wells are shown in Table 6. The experimental result shows the gas production rate of these six wells has been greatly improved, indicating that the previous design of fracturing operating parameters such as the amount of hydraulic fracturing fluid and the amount of hydraulic fracturing sand is not reasonable. These wells with greater productivity potential should be taken seriously as their gas production has not been fully exploited.
In addition, the data obtained in this study is limited, with only 7 geological parameters and 4 engineering parameters. If more features can be obtained in the future, such as permeability, stimulated reservoir volume, drilling and completion parameters, they will also be added to the model to improve its generalization ability.

Conclusions
This paper utilized machine learning methods to evaluate the productivity potential of 282 fractured shale gas wells in WY block, and the specific content of this study includes: featuring ranking, feature reconstruction, well classification, productivity potential evaluation and optimization. Several conclusions from this research are presented below: 1. Thickness of organic-rich gas reserves and amount of hydraulic fracturing sand are vital parameters affecting gas productivity, and geological condition is more deterministic to gas productivity. 2. The model established in this paper can accurately divide shale gas wells into four categories according to their geological conditions and productivity potential, mainly including wells with good geological condition, Fig.7 Change curve of optimized Q g with iterations from six wells in cluster 1 1 3 poor fracturing measures and low gas production; wells with poor geological condition and the productivity is low regardless of the fracturing measures; wells with superior geological condition, good fracturing measures and high production; wells with medium geological condition and moderate gas production. 3. The model constructed in this paper can classify shale gas wells according to their productivity difference, it can select wells with good fracturing measures and good geological conditions as sample wells. Meanwhile, wells with high productivity potential but low production due to poor fracturing measures can also be selected as wells that can be refractured. 4. Six low production wells with good geological conditions but unreasonable design of fracturing operating parameters have been optimized successfully, and the gas production rate of those wells increased by an average of five times after optimization. Besides, this kind of wells should be highly valued as their productivity potential has not been fully exploited.
In the future, the number of samples and features will be increased to improve the generalization ability of the model.

Conflict of interest
The author(s) declare(s) that there is no conflict of interest regarding the publication of this paper.
Ethical approval On behalf of all the co-authors, the corresponding author states that there are no ethical statements contained in the manuscripts.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.