A new score system using data-driven approach to rank carbonate gas reservoirs in Sichuan Basin

In the early stages of exploration, with only a limited amount of data available, it is difficult to evaluate a reservoir and optimize the sequence of the development plan. The score system is often used to rank the reservoir based on multidisciplinary factors that combine geology, production, and economics. However, current methods that are widely employed to classify the reservoir, such as analogy or single parameter, are qualitative or inaccurate, especially for carbonate gas reservoirs with complex geological conditions. In this study, we developed a score system using a data-driven approach to rank carbonate gas reservoirs in the Sichuan Basin. We developed two approaches, expert scoring and the random forest, to rank the quality of the reservoir, which agreed well with the field development plan. The expert scoring approach, which is highly dependent on the experience of experts in this area, is more suitable for reservoirs with limited data available, especially in the early exploration stage. The random forest model, which is more robust and able to reduce uncertainty from experience, is more suitable for developed areas with sufficient data. The developed score system can help rank new resource recovery and optimize the development plan in the Sichuan Basin.


Introduction
According to AAPG statistics, 95 large carbonate gas fields had been discovered worldwide by the end of 2002. Their cumulative recoverable reserves were 73.8 × 10¹² m³, which accounted for 45% of total global recoverable gas reserves. These carbonate gas fields are mainly distributed in the Middle East, southwest Russia, mainland Europe, North America, and Southeast Asia (Sun et al. 2017; Akbar et al. 2000; Kargarpour 2020; Moore and Wade 2013; Van Golf-Racht 1996; Esrafili-Dizaji and Rahimpour-Bonab 2019; Ahr 2008). The natural gas resources of the Sichuan Basin are 14.33 × 10¹² m³, and carbonate gas reservoirs account for 75% of its natural gas production. The target areas in the Sichuan Basin include the Cambrian Longwangmiao and Sinian Dengying formations in the Anyue Gasfield; the Triassic Feixianguan and Permian Changxing formations in the Longgang Gasfield; and the Permian Qixia formation in the Jianmen Gasfield. All of these are ultra-deep carbonate reservoirs, with depths ranging from 4500 to 7300 m. The reservoir characteristics are listed in Table 1.
Carbonate gas reservoirs are geologically complex due to the hydrocarbon accumulation process. A large number of influencing factors, such as facies, lithology, reservoir types, drive mechanisms, and fracture types, make it difficult to evaluate and develop carbonate gas reservoirs (He et al. 2014; Kabir et al. 2015; Kang et al. 2017). Uncertainties such as water invasion and heterogeneity, and the development technologies used, vary greatly, making it difficult to predict and evaluate recovery efficiency and economics, especially in the early stages of exploration with limited data available (Sun 2002; Feng et al. 2019; Yan et al. 2012).
Much research has focused on methods for evaluating gas reservoirs worldwide, including analogy and single parameter evaluation (Muther et al. 2022; Abuamarah and Nabawy 2021; Malekmohammadi et al. 2011; Wang et al. 2022). The analogy method classifies gas reservoirs based on a database of many oil and gas fields that have similar reservoir types, drive mechanisms, lithology, and reservoir properties (Martin et al. 2013; Sun et al. 2017; Ehrenberg 2004). Single parameter evaluation uses one parameter, such as recovery factor or production, to rank gas reservoirs (Clark 2009; Denney 2011). These methods have also been used in the Sichuan Basin, but because they are qualitative or inaccurate, it is difficult to evaluate such reservoirs comprehensively with them (Clark 2009; Denney 2011; Gherabati et al. 2018; Jin et al. 2002; Dzurman et al. 2013; Harris 2014).
In this study, we developed a new score system using a data-driven approach to rank carbonate gas reservoirs in the Sichuan Basin, using data from 25 carbonate gas reservoirs. The method ranks gas reservoir scores based on three factors: recovery factor (RF), daily production per kilometer depth (DPKD), and internal rate of return (IRR). We used the Phi_K and random forest methods to reduce the dimension and determine the importance of the corresponding factors. Then, we used two approaches, expert scoring and the random forest, to solve the multi-objective evaluation problem and rank reservoir scores (Biau et al. 2008; Goldstein et al. 2010; Dursun et al. 2014; Bhattacharya 2013).
The first approach evaluates gas reservoirs by scoring influencing parameters (inputs), and the second approach scores predicted outputs to directly evaluate gas reservoirs.

Methodology
The procedures for the two approaches to rank the gas reservoirs are shown in Fig. 1. The detailed procedures are as follows:
(1) First, we collect all available data from geological, reserve, and production reports, among others, in order to capture the characteristics of carbonate gas reservoirs in the Sichuan Basin. The data are divided into two groups for analysis: the influential input parameters and the evaluation index (output).
(2) Second, data cleaning is performed, including handling missing values and converting unstructured data to structured data.
(3) Third, we use correlation analysis and feature importance to reduce the number of input parameters and avoid overfitting. We apply the Phi_K method for correlation analysis to remove highly dependent input parameters, and the random forest for feature importance to remove input parameters that have little impact on the evaluation index.
(4) Fourth, we use the expert scoring approach to rank gas reservoirs. We take the range and corresponding score for each input parameter from the development committee in the Sichuan Basin, and we take the weight of each input factor from the random forest model in step 3. Then, we obtain the score of the index output by summing the scores of the input factors multiplied by their corresponding weights.
(5) Finally, we use the random forest approach to rank gas reservoirs. We quantitatively calculate the evaluation index (output) from the input parameters using the random forest model, then score the calculated index using expert ratings for ranking.

Data selection
The data are collected from reserve and development reports of the Cambrian Longwangmiao formation and Sinian Dengying formation located in the Anyue Gasfield; the Triassic Feixianguan formation and Permian Changxing formation located in the Longgang Gasfield; and the Permian Qixia formation located in the Jianmen Gasfield. The data include reserves, well production, economic data, geology, engineering, and other information for 25 gas reservoirs. The quality of the reservoirs can be described by RF, DPKD, and IRR (Clark 2009; Denney 2011; Jin et al. 2002). RF directly describes geological information and the difficulty of reserves exploitation, DPKD represents the productivity of a reservoir, and IRR reflects the economics of development. RF, DPKD, and IRR are therefore chosen as the evaluation indices for gas reservoirs in the Sichuan Basin. According to the characteristics of the reservoirs, we selected influential factors as inputs for the scoring system (Table 2).

Data cleaning
As the dataset is limited, deleting samples with missing values can greatly impact model results. Therefore, in this study we fill in missing values based on other observations. For example, there are four missing values for the permeability of the Feixianguan formation, as shown in Table 3.
Based on the similarity with the LG 1 gas reservoir, we use 22.5 mD for the missing values. We map textual data to integer codes; for trap types, for example, a lithologic trap is mapped to 0, a tectonic trap is mapped to 1, and a lithotectonic trap is mapped to 2. The final complete dataset used for data mining has 22 input parameters for 25 carbonate gas reservoirs.
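The two cleaning steps above can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the reservoir names "Reservoir A" and "Reservoir B", the trap type assigned to each record, and the helper names are hypothetical, and LG 1 is assumed to carry the 22.5 mD value itself. Only the 22.5 mD analog value and the trap-type codes come from the text.

```python
# Trap-type codes as described in the text.
TRAP_CODES = {"lithologic": 0, "tectonic": 1, "lithotectonic": 2}


def impute_from_analog(records, field, analog_name):
    """Fill missing values of `field` with the value of an analog reservoir."""
    analog_value = next(r[field] for r in records if r["name"] == analog_name)
    return [
        {**r, field: analog_value} if r[field] is None else dict(r)
        for r in records
    ]


def encode_trap(records):
    """Replace the textual trap type with its integer code."""
    return [{**r, "trap": TRAP_CODES[r["trap"]]} for r in records]


# Hypothetical records; None marks a missing permeability value.
reservoirs = [
    {"name": "LG 1", "permeability_mD": 22.5, "trap": "lithologic"},
    {"name": "Reservoir A", "permeability_mD": None, "trap": "tectonic"},
    {"name": "Reservoir B", "permeability_mD": None, "trap": "lithotectonic"},
]

cleaned = encode_trap(impute_from_analog(reservoirs, "permeability_mD", "LG 1"))
for row in cleaned:
    print(row)
```

Both helpers return new records instead of modifying the input, so the raw data stay available for checking which values were imputed.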

Correlation analysis and feature importance
The goal of correlation analysis is to quantify the relationship between two variables and to reduce dimensions to avoid overfitting. Common correlation coefficients include Spearman, Pearson, Kendall, and Phi_K, among others. We use Phi_K in this study because of several advantages. First, Phi_K treats interval, ordinal, and categorical variables uniformly. Second, it captures nonlinear dependencies. Third, it is similar to Pearson's correlation coefficient in the case of a bivariate normal input distribution. Phi_K is a relatively new and practical coefficient based on several refinements to Pearson's hypothesis test of independence of two variables. Its value ranges from 0 to 1 and represents the degree of correlation; a larger value indicates a stronger correlation between the two variables. We calculate the correlation coefficient between each pair of parameters and interpret it in light of field experience. If there is a physical or causal relationship between strongly correlated parameters, only one of them is retained, and the remaining input parameters form the training dataset. In this way, we remove highly dependent input variables to avoid redundancy.
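A minimal sketch of this pruning step, assuming the pairwise Phi_K values have already been computed. The coefficient values, the 0.8 cutoff, and the function name prune_correlated are illustrative assumptions rather than values from the study. The order of the feature list stands in for the field-experience judgment about which parameter of a strongly correlated pair to keep.

```python
def prune_correlated(features, corr, threshold=0.8):
    """Keep a feature only if its correlation with every kept feature is below threshold.

    `features` is ordered by preference: earlier entries win a conflict.
    `corr` maps an unordered pair of feature names to a coefficient in [0, 1];
    pairs that are absent are treated as uncorrelated.
    """
    kept = []
    for name in features:
        if all(corr.get(frozenset((name, k)), 0.0) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical Phi_K coefficients (not taken from the paper's figures).
corr = {
    frozenset(("porosity", "gas saturation")): 0.35,
    frozenset(("porosity", "reserve abundance")): 0.55,
    frozenset(("formation pressure", "p_i/z_i")): 0.92,
}
features = [
    "p_i/z_i",
    "formation pressure",
    "porosity",
    "gas saturation",
    "reserve abundance",
]

selected = prune_correlated(features, corr)
print(selected)
```

In this made-up example, only "formation pressure" is dropped, because it exceeds the cutoff against the already kept "p_i/z_i".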
Feature importance indicates the relative importance, or weight, of each input parameter to the output parameter. It can also be used to remove irrelevant input parameters and avoid model overfitting. XGBoost, principal component analysis (PCA), and random forest (Liaw and Wiener 2002) are commonly used for feature importance analysis. Random forest offers several advantages over XGBoost and PCA, such as greater accuracy with a higher R-squared, greater efficiency, and relative ease of use. Additionally, random forest mitigates the tendency of individual decision trees to overfit the training set (Amit and Geman 1997; Amaratunga et al. 2008). Random forest is a technique used in predictive modeling and behavior analysis and is built on decision trees. It contains many decision trees, each representing a distinct instance of the classification of the data input into the random forest. The technique considers these instances individually and takes the one with the majority of votes as the selected prediction. Feature importance analysis determines how much each feature contributes to each tree in the random forest, averages these contributions, and then compares the averaged contributions between features. Top-ranking parameters are more important for the model. Therefore, we use the random forest method in this study to remove variables that have little impact on the predicted parameter.
Figure 2 shows the correlation analysis for RF in the Sichuan Basin. The color represents the strength of the correlation between two variables: the darker the color, the stronger the correlation. RF is negatively correlated with p_a/z_a and water saturation and positively correlated with p_i/z_i and permeability, which is consistent with physical fundamentals and observations. No strongly correlated input parameters are observed through the Phi_K method.
Figure 3 shows the weight of each parameter for RF, where the most important factor is p_a/z_a, followed by water saturation and p_i/z_i. The other two parameters, gas driving mechanisms and permeability, have little impact on RF and are therefore removed from the input. After dimensionality reduction and validation, we choose p_a/z_a, water saturation, and p_i/z_i as the key input parameters to evaluate RF. Figure 4 shows the histogram distribution of the input parameters in the Sichuan Basin.
Figure 5 shows that DPKD is weakly correlated with trap type, permeability, gas saturation, and well spacing, with correlation coefficients less than 0.2. DPKD is positively correlated with reserve abundance, with a correlation coefficient of 0.61. No strongly correlated input parameters are observed.

Key parameters for DPKD
Combining this with the feature importance analysis in Fig. 6, we choose five key parameters with a strong influence on DPKD: reservoir type, porosity, formation temperature, formation pressure coefficient, and reserve abundance. The weights of the other parameters for DPKD are less than 0.05, so we remove them from the input. Figure 7 shows the histogram distributions for the five key factors. The reservoir type is unstructured (textual) data, so we map it to numbers, where 0 represents porous type, 1 represents fractured-vuggy type, 2 represents fractured-porous type, and 3 represents fractured-vuggy porous type.
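The selection rule used here (average each feature's contribution over the trees, then drop features whose weight is below 0.05) can be sketched as below. The per-tree importance values and the function names are hypothetical, and renormalizing the surviving weights so that they sum to 1 is an added assumption, not a step stated in the text.

```python
def average_importance(per_tree):
    """Average each feature's importance over all trees in the forest."""
    n_trees = len(per_tree)
    return {
        name: sum(tree[name] for tree in per_tree) / n_trees
        for name in per_tree[0]
    }


def select_weights(importance, min_weight=0.05):
    """Drop features below `min_weight`, then renormalize and rank the rest."""
    kept = {k: v for k, v in importance.items() if v >= min_weight}
    total = sum(kept.values())
    ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    return {name: value / total for name, value in ranked}


# Hypothetical per-tree importances for DPKD (two trees, four features).
trees = [
    {"reserve abundance": 0.50, "porosity": 0.30,
     "formation temperature": 0.16, "well spacing": 0.04},
    {"reserve abundance": 0.40, "porosity": 0.40,
     "formation temperature": 0.18, "well spacing": 0.02},
]

weights = select_weights(average_importance(trees))
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

With these made-up numbers, well spacing averages 0.03 and is removed, while the three remaining weights are rescaled for later use in a weighted score.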

Key parameters for IRR
Similarly, as economic data are limited, we combine the correlation analysis (Fig. 8) with the feature importance (Fig. 9) to choose five key parameters due to their high impact on IRR: remaining economic recoverable gas reserves, economic production life, gas price, drilling and completion costs per meter, and fixed cost per well.

Ranking method with experts scoring
After data cleaning and analysis, the experts scoring system is used to evaluate the input parameters and obtain the comprehensive score of each gas reservoir index. Experts determine the key factor ranges and corresponding scores based on their understanding of the reservoirs. Table 4 summarizes the range and corresponding score for each input parameter, determined by the development committee in the Sichuan Basin. The weights for each key input factor in Table 4 are calculated from the random forest model. Then, the score of the index output for RF, DPKD, and IRR is obtained by summing the scores of the key factors multiplied by their corresponding weights. The overall gas reservoir score is determined from the index outputs as follows (Guo 2007):

Gas reservoir score = 0.2 × RF + 0.4 × DPKD + 0.4 × IRR

The score indicates the quality of the reservoir in the Sichuan Basin and reflects the two critical indices that drive E&P decisions: production and economics. National oil companies tend to care more about production, while international oil companies tend to care more about economics. Therefore, E&P companies can vary the weights in the above equation depending on their priorities, and the calculated score can guide the development plan in this area.
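The expert scoring procedure can be sketched in a few lines. Everything in SCORE_TABLE and WEIGHTS below is a hypothetical placeholder (Table 4 and the fitted random forest weights are not reproduced here), as are the units and function names. Only the structure (range lookup, weighted sum per index, and the 0.2/0.4/0.4 combination of the RF, DPKD, and IRR scores) follows the paper.

```python
from bisect import bisect_right

# Hypothetical score table in the spirit of Table 4: ascending bin edges
# and one score per bin. Placeholder numbers, not the committee's values.
SCORE_TABLE = {
    "p_a/z_a (MPa)": ([5.0, 10.0], [100, 80, 60]),          # lower is better
    "water saturation (%)": ([20.0, 40.0], [100, 80, 60]),  # lower is better
    "p_i/z_i (MPa)": ([40.0, 60.0], [60, 80, 100]),         # higher is better
}

# Hypothetical random forest importances used as factor weights.
WEIGHTS = {"p_a/z_a (MPa)": 0.5, "water saturation (%)": 0.3, "p_i/z_i (MPa)": 0.2}


def factor_score(name, value):
    """Look up the expert score of the bin that contains `value`."""
    edges, scores = SCORE_TABLE[name]
    return scores[bisect_right(edges, value)]


def index_score(values, weights=WEIGHTS):
    """Weighted sum of factor scores, normalized by the total weight."""
    total = sum(weights.values())
    return sum(w * factor_score(k, values[k]) for k, w in weights.items()) / total


def reservoir_score(rf, dpkd, irr, w_rf=0.2, w_dpkd=0.4, w_irr=0.4):
    """Overall score from the three index scores (default weights from the paper)."""
    return w_rf * rf + w_dpkd * dpkd + w_irr * irr


rf_score = index_score(
    {"p_a/z_a (MPa)": 4.0, "water saturation (%)": 25.0, "p_i/z_i (MPa)": 65.0}
)
overall = reservoir_score(rf_score, dpkd=90.0, irr=85.0)
print(round(rf_score, 2), round(overall, 2))
```

Exposing the three index weights as keyword arguments mirrors the point made above: an operator that prioritizes economics over production can pass a different weighting without touching the factor-level scoring.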

Ranking method with random forest prediction
The key factors obtained above are used to predict RF, DPKD, and IRR with the random forest method (Zhong et al. 2015). Figure 10 shows that the predicted RF agrees well with the actual values, with an R² of 0.96. Figure 11 shows that the predicted DPKD agrees well with the actual values, with an R² of 0.89. Figure 12 shows that the predicted IRR agrees well with the actual values, with an R² of 0.85.
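For reference, the goodness-of-fit measure quoted above can be computed as in the sketch below. It assumes that R² denotes the coefficient of determination, 1 - SS_res/SS_tot, which the text does not state explicitly. The actual and predicted values are made up for illustration and are not data from the study.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical recovery factors (%) for four reservoirs.
actual_rf = [50.0, 60.0, 70.0, 80.0]
predicted_rf = [52.0, 58.0, 71.0, 79.0]

r2 = r_squared(actual_rf, predicted_rf)
print(round(r2, 3))
```

With only 25 reservoirs, an R² computed on the training samples can be optimistic, so evaluating it on held-out reservoirs (or by cross-validation) gives a more conservative picture.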
Then, the index is scored by experts for RF, DPKD, and IRR as shown in Table 5. The overall gas reservoir score can be calculated as follows:

Gas reservoir score = 0.2 × RF + 0.4 × DPKD + 0.4 × IRR

Results and discussion
Ranking method with experts scoring
Figure 13 shows the results of reservoir scoring and classification based on the experts scoring approach. Area I has a score greater than 90, area II has a score between 85 and 90, area III has a score between 80 and 85, and area IV has a score less than 80. The boundaries are determined based on the development performance of gas reservoirs in the Sichuan Basin. The performance of gas reservoirs with a score above 90 (such as the LG Feixianguan formation, developed in 2012) is significantly better than that of reservoirs with a score between 85 and 90 (such as the GM Dengying formation, developed in 2015). The score system can be used to rank the development sequence for gas reservoirs in the Sichuan Basin. Table 6 shows that LG 27 Changxing in area I scores 93.47; it has a lower initial water saturation, a higher reserve abundance and porosity, a longer economic production life, and a higher gas price, all of which make it better than the others. This is consistent with the feature importance analysis, mainly because its reserve abundance and porosity are better, and so is its DPKD. Reserve abundance and porosity are the most important inputs for the DPKD, which is predicted by random forest. LHC 002-X3 Changxing in area III scores 80.34; it has a lower reserve abundance and porosity, lower remaining economic recoverable gas reserves, a shorter economic production life, and a lower gas price. Its DPKD is also worse.

Ranking method with random forest prediction
Figure 14 shows that area I has a score greater than 90, area II has a score between 85 and 90, area III has a score between 80 and 85, and area IV has a score less than 80. The boundaries are determined from the predicted performance of gas reservoirs in the Sichuan Basin, based on expert judgment. As Table 7 shows, MX 8 Longwangmiao in area I scores 100, with a higher RF, DPKD, and IRR. Among the inputs to the random forest prediction for RF, p_i/z_i is higher while p_a/z_a is lower. Among the inputs for DPKD, reserve abundance, porosity, and formation pressure coefficient are higher. Among the inputs for IRR, the remaining economic recoverable gas reserves are higher and the economic production life is longer. This is consistent with the feature importance analysis.
MX 22 Dengying in area IV scores 72, with a lower RF, DPKD, and IRR. Among the inputs to the random forest prediction for RF, p_a/z_a and initial water saturation are higher. Among the inputs for DPKD, reserve abundance and porosity are lower.
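The area boundaries used in both approaches amount to a simple threshold rule, sketched below with the four scores quoted in the text. The function name classify_area is hypothetical, and how a score that falls exactly on a boundary is assigned (here 90 goes to area II, 85 to area II, and 80 to area III) is an assumption, since the text only gives the ranges.

```python
def classify_area(score):
    """Map an overall gas reservoir score to a development area."""
    if score > 90:
        return "I"
    if score >= 85:   # assumption: 85 and 90 both fall in area II
        return "II"
    if score >= 80:   # assumption: 80 falls in area III
        return "III"
    return "IV"


# Scores quoted in the text for the two approaches.
quoted_scores = {
    "LG 27 Changxing (expert scoring)": 93.47,
    "LHC 002-X3 Changxing (expert scoring)": 80.34,
    "MX 8 Longwangmiao (random forest)": 100,
    "MX 22 Dengying (random forest)": 72,
}

areas = {name: classify_area(score) for name, score in quoted_scores.items()}
for name, area in areas.items():
    print(f"{name}: area {area}")
```

The resulting areas (I, III, I, and IV) match the classification stated for these four reservoirs.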

Result verification
The results of the two methods are similar, as shown in Fig. 15. According to experts who have worked in the area, both methods accurately represent the actual reserve grade, production capacity, and economics of the carbonate gas reservoirs in the Sichuan Basin. The score system and classification based on the two methods can be used to optimize the development plan for carbonate gas reservoirs in the Sichuan Basin. The top-ranked reservoir in Fig. 15 is MX 8 Longwangmiao in area I, which has a random forest score of 100 and an expert score of 90.86. Both methods are based on mathematical statistics for determining the range of input parameters. The expert scoring method relies more on the experience of experts, as it evaluates the factors that influence the indices, and it therefore needs to evaluate more parameters. The random forest prediction method depends on the quality of the sample and evaluates the gas reservoir indices directly; with a small sample size, the model can easily settle into a local optimum. Therefore, the results of the two methods differ somewhat.
GS 1 Dengying has a random forest score of 90 and an expert score of 89.44. The DPKD of GS 1 is 7.92 × 10⁴ m³/(d·km), and the IRR is 16.71%. Both indices are better than those of many other gas reservoirs. The reserve abundance is 5.28 × 10⁸ m³/km², and the porosity is 3.9%, which makes the DPKD better. The remaining economic recoverable gas reserves are 974.14 × 10⁸ m³, the economic production life is 33 years, and the gas price is 1487 yuan, which makes the IRR better. The two methods match well, and the GS 1 Dengying reservoir is a premium block with priority in the field development plan.

Conclusion
This paper presents a scoring system using a data-driven approach to rank carbonate gas reservoirs in the Sichuan Basin. The system uses two approaches, expert scoring and random forest, to evaluate the quality of the reservoirs. The data-driven techniques employed provide weights and relative importance of key factors, allowing for dimensionality reduction and avoiding overfitting. The model can assist exploration and production companies in making decisions with limited data at the early exploration stage.
The results of the study show that the proposed scoring system can effectively rank the quality of carbonate gas reservoirs in the Sichuan Basin. Based on the results above, we conclude the following:
1) For ultra-deep carbonate gas reservoirs in the Sichuan Basin, the key factors for recovery factor are p_a/z_a, water saturation, and p_i/z_i; the key factors for daily production per kilometer depth are reservoir type, porosity, formation temperature, formation pressure coefficient, and reserve abundance; the key factors for internal rate of return are remaining economic recoverable gas reserves, economic production life, gas price, drilling and completion costs per meter, and fixed cost per well.
2) The random forest method is well suited for determining variable importance and predicting the index of gas reservoirs compared to other methods, due to its higher accuracy, efficiency, and ease of use.
3) The two methods used in this study rank carbonate gas reservoirs quantitatively rather than qualitatively, and the ranking results agree well with the actual development plan in the Sichuan Basin compared with other evaluation methods. While both approaches agree well with field data, the experts scoring approach is more suitable for areas with limited data available, especially in the early exploration stage, and the random forest model is more suitable for developed areas with sufficient data. The experts scoring method, however, relies heavily on the expertise and experience of the evaluator in this area, while the random forest method is more robust and can reduce uncertainty from experience. Therefore, the experts scoring system is recommended for early exploration stages; as the data sample size increases, the random forest method can be applied and its accuracy improved.
4) Data from twenty-five reservoirs were used in this study, and more data are necessary to further validate and update the model. Also, this paper focuses only on carbonate gas reservoirs; we will next apply the method to unconventional shale gas in the Sichuan Basin for broader interest and applications.

Conflict of interest
On behalf of all the co-authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.