1 Introduction

Hydrocarbon production from unconventional resources has become the focus of attention in the oil and gas industry in the last decades. Due to the exponential growth of energy demand, advances in horizontal drilling technologies, and multi-stage hydraulic fracturing operations, developing unconventional reservoirs has become highly attractive (Taherdangkoo et al. 2019). Shale gas and tight gas reservoirs as the main unconventional resources have extremely low matrix permeability and even the existence of natural fracture networks could not provide flow paths from formation to the wells (King 2012; Kissinger et al. 2013). Reservoir stimulation techniques such as hydraulic fracturing could effectively increase the ability of natural gas recovery from such reservoirs. Hydraulic fracturing improves access to the larger part of the reservoir by creating artificial fracture networks which increase the reservoir permeability and the contact areas over which fluids flow from the matrix to the fractures (Tatomir et al. 2018; Rice et al. 2018).

Shale gas is typically dry gas composed primarily of methane with traces of ethane and propane (King 2012). The gas is stored in three ways: absorbed in the limited pore spaces of these rocks, adsorbed on the surface of organic material, or confined in the natural fractures and fissures. Therefore, gas production from shale gas reservoirs is more complex than from conventional gas reservoirs (King 2012; Siegel et al. 2015). Several environmental concerns have emerged surrounding shale gas development, such as groundwater contamination, induced seismicity, and the climate impact of stray gas leaking into the atmosphere. The main risk difference in comparison with other technologies in the subsurface is that hydraulic fracturing is remunerative, thus it is necessary to distinguish between economic and environmental issues (Cahill et al. 2017; Rice et al. 2018; Taherdangkoo et al. 2020b).

The occurrence of light hydrocarbons in groundwater in the vicinity of oil and gas operations is most commonly associated with leakage from hydrocarbon wells (Jackson et al. 2013; Nowamooz et al. 2015; Taherdangkoo et al. 2020a). Natural and anthropogenic permeable pathways such as leaky oil and gas abandoned wells, fault zones and extensive fracture systems could facilitate stray gas migration and an early gas manifestation in groundwater wells (Cahill et al. 2017; Rice et al. 2018; Tatomir et al. 2018). The presence of low-permeability layers leads to gas migration along higher permeability sediments, either up-dip or in the direction of groundwater flow, delaying the breakthrough to the shallow aquifer system (Taherdangkoo et al. 2020a).

Numerical modelling of subsurface flow and transport of stray gas plays an essential role in better evaluating the relationship between groundwater quality and hydrocarbon development. Modelling of stray gas migration requires phase equilibrium calculations to obtain the concentration of each component in liquid and gas phases. The calculation of gas solubility is usually performed by solving thermodynamic equations, which contributes a significant portion of the overall computational cost because compositions of liquid and gas phases must be calculated at each iteration. Therefore, due to the high computational time, an efficient application of complex thermodynamic models in numerical simulation of coupled multi-physics processes in the subsurface is laborious (Grunwald et al. 2020).

As an alternative to complex thermodynamic models, machine learning (ML) can be employed to describe the phase behaviour of gas-water-salt systems (Mohammadi et al. 2022; Taherdangkoo et al. 2021a). ML models can handle complex nonlinear relationships between inputs and outputs and can perform high precision interpolations (Qiao et al. 2020). Optimization algorithms can also be employed to tune hyper-parameters of ML algorithms and thus improve their overall performance (Taherdangkoo et al. 2023). Equations of state and predictive ML models estimate maximum gas solubility assuming perfect mixing and mass transfer between gas and aqueous phases. The stray gas migrating in the subsurface typically does not result in maximum solubility manifesting as the mass transfer from gas to the aqueous phase is limited in porous media (Cahill et al. 2018).

The primary goal of this study is to develop a robust ML model able to determine the solubility of light hydrocarbons (C1–C3) in aqueous solutions for further application in higher level numerical modelling. We used regression tree and boosted regression tree tuned with a Bayesian optimization algorithm to perform the regression task. The ML models were developed using a dataset containing 2129 experimental data of methane, ethane, and propane solubilities in pure water and aqueous and electrolyte systems. A comparison analysis was designed to evaluate the performance of the most accurate ML model with experimental data as well as Spivey et al. (2004), Duan and Mao (2006), Chapoy et al. (2004), and Pereda et al. (2009) thermodynamic models.

2 Data

We reviewed publicly available experimental data of light hydrocarbons (C1–C3) solubility in aqueous solutions to build a dataset to develop a robust machine learning model able to determine the gas solubility under a wide range of field conditions. The compiled dataset includes experimental gas solubility data measured between 1855 and 2007. There are instances in which experimental solubility data are incorrect or inconsistent with other gas solubility measurements. Following Duan et al. (1992), the inconsistent solubility data, e.g., methane solubility in pure water reported by Michels et al. (1936), were not considered in the dataset. The gas solubility has been reported in different units. Most of the experimental data in our dataset were originally reported in mole fraction, i.e., the mole fraction of the gas component in the liquid phase, and thus the remaining solubility values were converted to mole fraction using appropriate conversion functions. Where feasible, to avoid a conversion, the solubility data were taken from Clever and Young (1987), Hayduk (1982, 1986).
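The unit conversion mentioned above can be illustrated with a minimal sketch for the most common case, molality to mole fraction. The function names and the optional `m_ions` argument are hypothetical, and counting dissolved ions in the total mole number is an assumption about the brine convention, not a statement of the procedure used to build the dataset.

```python
# Molar mass of water in g/mol; 1000 / M gives moles of water per kg of solvent.
WATER_MOLAR_MASS = 18.01528


def molality_to_mole_fraction(m_gas, m_ions=0.0):
    """Convert gas molality [mol per kg water] to liquid-phase mole fraction.

    m_ions is the total molality of dissolved ions (hypothetical argument,
    zero for pure water). Including ions in the total is an assumption.
    """
    n_water = 1000.0 / WATER_MOLAR_MASS  # about 55.51 mol of water per kg
    return m_gas / (m_gas + n_water + m_ions)


def mole_fraction_to_molality(x_gas):
    """Inverse conversion for a binary gas + pure water system."""
    n_water = 1000.0 / WATER_MOLAR_MASS
    return x_gas * n_water / (1.0 - x_gas)
```

For the dilute solutions considered here the mole fraction is close to the molality divided by 55.51, so the conversion is nearly linear over the whole dataset.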

The solubility of methane in pure water and brine has been reported over a broad range of pressure, temperature, and NaCl concentrations (Clever and Young 1987), but limited measurements are available for saline solutions containing other salts. The dataset contains a total of 1912 methane solubility experimental data points (Amirijafari et al. 1972; Ben-Naim et al. 1973; Blanco et al. 1978; Blount 1982; Bunsen 1855; Byrne and Stoessell 1982; Chapoy et al. 2004; Claussen and Polglase 1952; Cosgrove and Walkley 1981; Cramer 1984; Crovetto et al. 1982; Culberson et al. 1951; Dhima et al. 1998; Duffy et al. 1961; Eucken and Hertzberg 1950; Kiepe et al. 2003; Krader and Franck 1987; Lannung et al. 1960; Lekvam and Bishnoi 1997; Michels et al. 1936; Mishnina et al. 1962; Morrison and Billett 1952; Moudgil et al. 1974; Muccitelli and Wen 1980; Namiot 1961; O'Sullivan and Smith 1970; Rettich et al. 1981; Stoessell and Byrne 1982; Wang et al. 2003; Wen and Hung 1970; Wetlaufer et al. 1964; Winkler 1901; Yamamoto et al. 1976; Yano et al. 1974; Yarym-Agaev et al. 1985). Temperatures range from 273.15 to 799 K, and pressures from 1 to 2630 bar. The methane solubility in the aqueous phase is a function of pressure, temperature, and the concentrations of dissociated ions of NaCl, KCl, CaCl2, MgCl2, K2SO4, MgSO4, Na2SO4, Na2CO3, and K2CO3 in aqueous solutions. A statistical summary of the parameter values is given in Table 1, and their distributions are shown as histograms in Fig. 1.

Table 1 Range of parameter values in methane solubility dataset
Fig. 1 Distribution of parameter values in methane solubility dataset

The solubility of ethane and propane in aqueous systems has not been widely examined, and the majority of experiments were conducted before 1980 (Hayduk 1982, 1986). Additionally, only some authors measured the gas solubility under intermediate to high pressure and temperature conditions. The compiled dataset contains 235 ethane solubility data points in pure water (Anthony and McKetta 1967; Chapoy et al. 2004; Claussen and Polglase 1952; Culberson et al. 1950; Mohammadi et al. 2004; Morrison and Billett 1952; Rettich et al. 1981; Wang et al. 2003; Wen and Hung 1970; Wetlaufer et al. 1964; Winkler 1901; Ben-Naim et al. 1973). Temperatures range from 273.51 to 444.26 K, and pressures from 1 to 685 bar. The statistical analysis of ethane solubility data is presented in Table 2 and Fig. 2.

Table 2 Range of parameter values in ethane solubility dataset
Fig. 2 Distribution of parameter values in ethane solubility dataset

The compiled dataset contains 259 propane solubility data points in pure water (Azarnoosh and McKetta 1958; Chapoy et al. 2004; Claussen and Polglase 1952; Gaudette and Servio 2007; Kobayashi and Katz 1953; Kresheck et al. 1965; Mokraoui et al. 2007; Morrison and Billett 1952; Umano and Nakano 1958; Wehe and McKetta 1961; Wen and Hung 1970; Wetlaufer et al. 1964; Wishnia 1963). Temperatures range from 273.2 to 422 K, and pressures from 0.103 to 42.69 bar. The statistical analysis of propane solubility data is provided in Table 3 and Fig. 3.

Table 3 Range of parameter values in propane solubility dataset
Fig. 3 Distribution of parameter values in propane solubility dataset

3 Methodology

3.1 Machine learning

A brief description of the regression tree and boosted regression tree algorithms is presented below. Detailed explanations of the mathematical background and computational procedures can be found in the cited literature. The models were developed using MATLAB R2021b.

3.1.1 Regression tree

Regression tree (RT) is a supervised technique that uses one or more input (predictor) variables to predict a single output (response) variable. An RT is built through a binary recursive partitioning process, in which the data are split into partitions or branches, and each branch is then split further (Leblanc 2006; Loh 2011). All the data in the training set are initially in the same group (root), and are then allocated into two branches (child nodes) using a splitting rule that maximizes homogeneity in the child nodes. The splitting process continues until each node reaches a user-specified minimum node size and becomes a terminal node. The splitting rules are in the internal nodes and the responses are in the leaf nodes (Cichosz 2015; Saha et al. 2015).

The advantage of tree-based models is that they are scalable to large problems and can handle smaller datasets than neural networks (Cichosz 2015; Shehab et al. 2024). RT is flexible and able to adjust over time, handles outliers easily, is straightforward to implement on different types of data structures, and is computationally cheap. As with any model, the regression tree has its own weaknesses: a single tree tends to be unstable, which can negatively influence the accuracy of the predicted response (Breiman 1996; Loh 2011).
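The splitting rule described above can be sketched for a single predictor as follows. This is a minimal illustration, assuming that homogeneity is measured by the summed squared error of the two child nodes; the function names are hypothetical, and the code is not the MATLAB implementation used in this study.

```python
def sse(values):
    """Sum of squared deviations from the mean (node impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y, min_leaf=1):
    """Return (threshold, impurity) of the split on predictor x that
    minimizes the summed squared error of the two child nodes.

    threshold is None if no admissible split improves on the parent node.
    """
    pairs = sorted(zip(x, y))
    best = (None, sse(list(y)))
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical predictor values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        impurity = sse(left) + sse(right)
        if impurity < best[1]:
            threshold = 0.5 * (pairs[i - 1][0] + pairs[i][0])
            best = (threshold, impurity)
    return best
```

A full tree applies `best_split` recursively to each child node, over all predictors, until the minimum node size (`min_leaf` here) is reached.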

3.1.2 Boosted regression tree

Boosted regression tree (BRT) combines boosting with regression tree algorithms. In boosting, multiple trees are fitted to the training data and then combined sequentially to improve on the predictive performance of a single tree (Elith et al. 2008; Taherdangkoo et al. 2022). Boosting emphasizes observations that are poorly modeled by the existing trees, i.e. observations with high deviations from the mean, to produce a strong prediction and improve the model accuracy. Boosting has been shown to improve the predictive performance of RTs (Buhlmann and Hothorn 2007; Saha et al. 2015).

BRT uses a stepwise forward procedure, which means that the existing trees remain unchanged. A new tree is trained at each iteration using the original features and is added to the current tree sequence. Then, residuals of each observation are updated to represent the contribution of the new tree. Once the process is complete, the final predictions are determined by the weighted sum of the predictions of individual trees (Elith et al. 2008; Saha et al. 2015). To minimize the loss function, Friedman (2001) introduced gradient boosting by applying the steepest descent method to the stepwise forward estimate. Later, the gradient boosting method was modified by using a random subsampling of the training data to improve the predictive performance and reduce over-fitting potential and the computation time (Friedman 2002).
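The stepwise forward procedure can be sketched with one-split trees (stumps) and a squared-error loss. This is an illustrative sketch only: the single predictor, the fixed `learning_rate` used as tree weight, and all function names are assumptions, and the random subsampling of Friedman (2002) is not included.

```python
def fit_stump(x, r):
    """Fit a one-split tree (stump) to residuals r on a single predictor x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    mean_r = sum(r) / len(r)
    # Fallback when no split helps: a constant prediction on both sides.
    best = (sum((ri - mean_r) ** 2 for ri in r), x[order[-1]], mean_r, mean_r)
    for k in range(1, len(order)):
        lo, hi = order[:k], order[k:]
        if x[lo[-1]] == x[hi[0]]:
            continue  # identical predictor values cannot be separated
        left = sum(r[i] for i in lo) / len(lo)
        right = sum(r[i] for i in hi) / len(hi)
        err = (sum((r[i] - left) ** 2 for i in lo)
               + sum((r[i] - right) ** 2 for i in hi))
        if err < best[0]:
            best = (err, 0.5 * (x[lo[-1]] + x[hi[0]]), left, right)
    return best[1:]  # (threshold, left value, right value)


def boost(x, y, n_trees=100, learning_rate=0.1):
    """Stagewise boosting for squared loss: each stump fits current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, left, right = fit_stump(x, residuals)
        stumps.append((t, left, right))  # earlier stumps are never modified
        pred = [pi + learning_rate * (left if xi <= t else right)
                for xi, pi in zip(x, pred)]
    return base, learning_rate, stumps


def predict(model, xi):
    """Final prediction: base value plus the weighted sum of all stumps."""
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (left if xi <= t else right)
                      for t, left, right in stumps)


def training_mse(model, x, y):
    """Mean squared error of the boosted model on the training data."""
    return sum((predict(model, xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)
```

For squared loss the residuals are the negative gradient of the loss, so fitting each new stump to the residuals is the steepest descent step of gradient boosting.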

3.1.3 Bayesian optimization

The performance of a tree model depends on the choice of its hyper-parameter values (Bergstra et al. 2011). We employed a Bayesian optimization algorithm to tune the hyper-parameters of the RT and BRT models. Bayesian optimization is suitable for optimizing computationally expensive objective functions and tolerates stochastic noise in function evaluations. This method is characterized by two features: (i) a surrogate model of the objective function, and (ii) an acquisition function computed from the surrogate model, which is used to define the next evaluation point. We employed the expected improvement acquisition function, which constructs a utility function from the model posterior to direct sampling to areas where improvement over the current optimum can be expected (Bergstra et al. 2011; Hutter et al. 2019).
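As an illustration of the acquisition step, the sketch below evaluates the closed-form expected improvement for a Gaussian surrogate posterior. It assumes a minimization problem (e.g., a cross-validation loss) and that the surrogate returns a posterior mean `mu` and standard deviation `sigma` at each candidate; the function names and the candidate tuple layout are hypothetical.

```python
import math


def expected_improvement(mu, sigma, f_best):
    """Expected improvement over the best observed value f_best (minimization).

    mu and sigma are the surrogate posterior mean and standard deviation
    at a candidate hyper-parameter setting.
    """
    if sigma <= 0.0:
        return max(f_best - mu, 0.0)  # no uncertainty: plain improvement
    z = (f_best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (f_best - mu) * cdf + sigma * pdf


def next_candidate(candidates, f_best):
    """Pick the candidate with the largest expected improvement.

    candidates is a list of (params, mu, sigma) tuples.
    """
    return max(candidates,
               key=lambda c: expected_improvement(c[1], c[2], f_best))[0]
```

The second term rewards posterior uncertainty, so a candidate with a slightly worse predicted mean but a large standard deviation can still be selected for evaluation.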

3.1.4 Accuracy assessment

We partitioned the data randomly into k groups (or folds) of roughly equal size using k-fold cross-validation. Models were trained using k-1 groups of the dataset and validated on the remaining group. The average error over all k groups was used to quantify the overall performance of the ML models. Herein, a fivefold cross-validation (k = 5) was used.
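The partitioning step can be sketched as follows. The shuffling seed and function names are assumptions made for illustration; they do not reproduce the partition used in this study.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_validation_splits(n_samples, k=5, seed=0):
    """Yield (training, validation) index lists, one pair per fold."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, validation in enumerate(folds):
        training = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield training, validation
```

Each sample appears in exactly one validation fold, so averaging the k validation errors uses every data point once for testing.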

There are various statistical approaches to assess the accuracy of ML models. We used the coefficient of determination (R2), the mean squared error (MSE), and the median absolute deviation (MAD) metrics. We also used the absolute residual distribution plot to evaluate the model’s accuracy and check possible residual trends. Additionally, we employed a leverage statistical approach and sketched the Williams plot (Narmandakh et al. 2023) to detect outliers.
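The three metrics can be written compactly as below. Note the assumption: MAD is interpreted here as the median of the absolute residuals between experimental and predicted values, which may differ from the exact definition used to produce the reported numbers.

```python
from statistics import median


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mad(y_true, y_pred):
    """Median of absolute residuals (assumed definition of MAD)."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))
```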

3.2 Thermodynamic models

The performance of the most accurate ML model was compared with four thermodynamic models as they can effectively describe the phase behavior of the system when conditions fall within their applicability domain. We used Spivey and McCain (2004) to calculate methane solubility, Mao et al. (2005) to calculate ethane solubility, and Chapoy et al. (2004) and Pereda et al. (2009) to calculate propane solubility. Herein, we present each thermodynamic model in its original form.

(1) Spivey model

Spivey and McCain (2004) developed an empirical correlation to calculate methane (CH4) solubility in pure water, and used a modification of the Duan et al. (1992) method to account for salinity. The Spivey and McCain (2004) model, simply referred to as the “Spivey model”, is valid for temperatures from 293.15 to 633.15 K, pressures from 9 to 2000 bar, and NaCl concentrations of up to 6 m. The solubility of methane in pure water and NaCl solutions can be calculated as follows:

$$C_{mCH_{4},H_{2}O}=\exp\left(A\left(T\right)\left[\ln\left(P-P_{v}\right)\right]^{2}+B\left(T\right)\ln\left(P-P_{v}\right)+C\left(T\right)\right)$$
(1)
$$C_{mCH_{4},brine}=C_{mCH_{4},H_{2}O}\exp\left[-2\lambda_{CH_{4},Na}\left(T,P\right)C_{mNaCl}-\xi_{CH_{4},NaCl}\left(T,P\right)C_{mNaCl}^{2}\right]$$
(2)

where \(C_{mCH_{4},H_{2}O}\) and \(C_{mCH_{4},brine}\) [mol kg−1] are the solubility of methane in pure water and brine solutions, respectively. T [K] is temperature, P [MPa] is pressure, and \(P_{v}\) [MPa] is vapor pressure of pure water. A(T), B(T), and C(T) are temperature-dependent functions. \(C_{mNaCl}\) [mol kg−1] is the NaCl concentration, and \(\lambda_{CH_{4},Na}\left(T,P\right)\) and \(\xi_{CH_{4},NaCl}\left(T,P\right)\) are coefficients.
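The structure of Eqs. (1) and (2) can be sketched as two small functions. The temperature-dependent functions A(T), B(T), C(T) and the coefficients λ and ξ are not reproduced here; they are passed in as plain numbers, and the argument names are hypothetical. Values must be taken from Spivey and McCain (2004) before the functions return meaningful solubilities.

```python
import math


def methane_solubility_water(a_t, b_t, c_t, p_mpa, pv_mpa):
    """Eq. (1): methane molality in pure water [mol/kg].

    a_t, b_t, c_t are the values of A(T), B(T), C(T) at the temperature
    of interest, supplied by the caller.
    """
    ln_p = math.log(p_mpa - pv_mpa)
    return math.exp(a_t * ln_p ** 2 + b_t * ln_p + c_t)


def methane_solubility_brine(c_water, lam, xi, c_nacl):
    """Eq. (2): salting-out correction applied to the pure-water solubility.

    lam and xi are the coefficient values at (T, P), supplied by the caller.
    """
    return c_water * math.exp(-2.0 * lam * c_nacl - xi * c_nacl ** 2)
```

Setting `c_nacl` to zero recovers the pure-water value, which mirrors how the salt concentrations are set to zero for pure-water data in the ML models.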

(2) Mao model

Mao et al. (2005) developed a thermodynamic model based on an equation of state and the theory of Pitzer (Pitzer 1973) to calculate ethane (C2H6) solubility in pure water and aqueous NaCl solutions. The Mao model is valid for temperatures from 273 to 444 K and pressures from 0 to 1000 bar. The solubility of ethane in pure water can be calculated using the following equation:

$$\ln\frac{y_{C_{2}H_{6}}P}{m_{C_{2}H_{6}}}=\frac{\mu_{C_{2}H_{6}}^{l\left(0\right)}(T,P)-\mu_{C_{2}H_{6}}^{v\left(0\right)}(T)}{RT}-\ln\varphi_{C_{2}H_{6}}(T,P,y)+\ln\gamma_{C_{2}H_{6}}(T,P,m)$$
(3)

where \(m_{C_{2}H_{6}}\) [mol kg−1] is the solubility of C2H6 in pure water, \(y_{C_{2}H_{6}}\) is the mole fraction of C2H6 in the gas phase, P [bar] is pressure, T [K] is temperature, R [bar cm3 mol−1 K−1] equal to 83.14467 is the universal gas constant, \(\mu_{C_{2}H_{6}}^{l\left(0\right)}\) and \(\mu_{C_{2}H_{6}}^{v\left(0\right)}\) are the chemical potentials of C2H6 in the liquid and vapor phases, respectively, \(\varphi_{C_{2}H_{6}}\) is the fugacity coefficient, and \(\gamma_{C_{2}H_{6}}\) is the activity coefficient.

(3) Chapoy model

Chapoy et al. (2004) developed a thermodynamic model based on uniformity of the fugacity of each component in all phases to calculate propane (C3H8) solubility in the liquid phase. The Valderrama modification of the Patel–Teja equation of state (VPT-EoS) with non-density dependent (NDD) mixing rules was used to calculate fugacities in the fluid phases. They acquired experimental propane solubility data to adjust the binary interaction parameters between propane and water. Henry’s constants for propane in water can be calculated as:

$$\ln\left(H_{iw}\right)=552.64799+0.078453T-\frac{21334.4}{T}-85.89736\ln T$$
(4)

where \(H_{iw}\) [kPa] is Henry’s constant and T [K] is temperature.
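Equation (4) can be evaluated directly, as in the sketch below. The second function is an added illustration, not part of the Chapoy model: it assumes Henry’s law with an ideal gas phase, so it only approximates the solubility at low pressure, and its name and arguments are hypothetical.

```python
import math


def henry_constant_propane_kpa(t_kelvin):
    """Henry's constant of propane in water [kPa] from Eq. (4)."""
    ln_h = (552.64799 + 0.078453 * t_kelvin
            - 21334.4 / t_kelvin - 85.89736 * math.log(t_kelvin))
    return math.exp(ln_h)


def dilute_mole_fraction(p_propane_kpa, t_kelvin):
    """Low-pressure estimate x = p / H (assumption: ideal gas phase)."""
    return p_propane_kpa / henry_constant_propane_kpa(t_kelvin)
```

At 298.15 K the correlation gives a Henry’s constant of a few million kPa, which corresponds to mole fractions of the order of 10⁻⁵ at atmospheric pressure, the same order as the propane solubilities in the compiled dataset.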

(4) Pereda model

Pereda et al. (2009) used a group contribution plus association equation of state (GCA-EoS) to describe the phase behavior of water + hydrocarbon (C2 to n-C6, cy–C6, i–C4 and i–C8) systems. They acquired experimental data on the solubility of n-hexane, cyclo-hexane and iso-octane in pure water to adjust the parameters of the GCA-EoS. The following equation was presented for the hard sphere diameter of water to take into account the temperature dependency of the hydrocarbon solubility in water and the vapor pressure of water:

$$d_{W}=d_{CW}\left\{0.554\left[\exp\left(\frac{-2T_{CW}}{3T}\right)\right]^{2}-0.543\exp\left(\frac{-2T_{CW}}{3T}\right)+1.097\right\}$$
(5)

where \({d}_{W}\) [cm mol−1] is the hard sphere diameter of water, \({d}_{CW}\) [cm mol−1] is the hard sphere diameter of water at the critical temperature, T [K] is temperature, and \({T}_{CW}\) [K] is the critical temperature of water.

4 Results

4.1 Model performance evaluation

We employed a Bayesian optimization algorithm with an expected improvement acquisition function to optimize the hyper-parameters of the RT and BRT algorithms. The iteration number for running each algorithm was set to 300. Table 4 summarizes the optimum hyper-parameter values of the models obtained after the optimization process.

Table 4 Optimal hyper-parameter values of the ML models and their range of variations

The input parameters of the RT-BO and BRT-BO models are pressure [bar], temperature [K], and concentration [m] of NaCl, KCl, CaCl2, MgCl2, K2SO4, MgSO4, Na2SO4, Na2CO3, and K2CO3, and the corresponding gas solubility (mole fraction) is the output parameter. In the case of gas solubility in pure water, the salt concentrations are set to zero. The regression plots (Fig. 4) of predicted gas solubility values from the RT-BO and BRT-BO models versus experimental ones show an accumulation of data points close to the 45-degree reference line. The deviations from the reference line are more evident for the RT-BO model, indicating its lower accuracy.

Fig. 4 Regression plots of the RT-BO and BRT-BO models, showing predicted versus experimental solubility values of methane, ethane, and propane gases. \({x}_{gas}\) is the gas solubility in mole fraction

The statistical indices (R2, MSE, and MAD) indicate the superior performance of the BRT-BO model (Table 5). In this case, the MSE equals \(9.97 \times 10^{-8}\) and the MAD is \(1.72 \times 10^{-4}\). The RT-BO model also has a relatively high predictive capability; MSE and MAD equal \(2.15 \times 10^{-7}\) and \(2.33 \times 10^{-4}\), respectively.

Table 5 Summary of statistical indices for the ML model’s performance

The predictive capabilities of the models are further illustrated in Fig. 5. The experimental and predicted solubility values of the BRT-BO model largely overlap, indicating its good performance. The results show that the model is slightly less accurate in determining high gas solubility values, especially where the solubility in the aqueous phase is higher than 0.01 mole fraction, because most of the data points have lower solubility values (see Sect. 2). The maximum and mean values for methane solubility are 0.0185 and 0.0024, respectively. These values are 0.0041 and 0.00052 for ethane, and 0.00366 and 0.00012 for propane. The statistical analysis shows that the BRT-BO model’s deviations from the experimental data are minor, except for some data points.

Fig. 5 Comparing the RT-BO and BRT-BO calculated light hydrocarbon solubility values with experimental values versus corresponding data index

We conducted further analysis to evaluate the performance of the BRT-BO model since it is more accurate than the RT-BO. The empirical cumulative distribution function (eCDF) of the absolute residuals of the BRT-BO (Fig. 6) indicates the high accuracy of the model because the curve is close to the Y-axis, i.e., the residuals are mostly distributed near zero. Furthermore, 70 % of the predicted gas solubility values have an absolute error lower than \(1.5 \times 10^{-4}\), and 90 % of the predicted values have an error lower than \(3.9 \times 10^{-4}\).
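Statements of this kind can be read off the eCDF of the absolute residuals. A minimal sketch, with hypothetical function names, is:

```python
import math


def ecdf_at(abs_residuals, threshold):
    """Fraction of absolute residuals that are <= threshold."""
    return sum(1 for r in abs_residuals if r <= threshold) / len(abs_residuals)


def error_quantile(abs_residuals, fraction):
    """Smallest error e such that at least `fraction` of the points
    have an absolute residual <= e."""
    ordered = sorted(abs_residuals)
    k = max(1, math.ceil(fraction * len(ordered)))
    return ordered[k - 1]
```

Calling `error_quantile` with fractions of 0.7 and 0.9 on the model residuals yields the two error levels of the type quoted above.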

Fig. 6 Cumulative frequency of absolute residuals obtained from the BRT-BO model

We calculated the partial dependence between the predictor variables and gas solubility in the aqueous phase using the BRT-BO model. Figure 7 displays the two-variable partial dependence of gas solubility on joint values of pressure and temperature. The solubility of light hydrocarbons has a strong partial dependence on pressure; gas solubility increases with increasing pressure. A strong partial dependence of the gas solubility on temperature is also evident. These outcomes indicate that the developed model is reliable, as it reproduces the gas solubility behavior observed in laboratory testing.
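Partial dependence averages the model prediction over the dataset while one predictor is held at a fixed value. The sketch below shows the one-variable case for brevity (the two-variable case fixes two columns instead of one); `model` stands for any callable predictor and is a hypothetical placeholder for the trained BRT-BO model.

```python
def partial_dependence(model, rows, feature_index, grid):
    """Average prediction over all rows with one feature fixed at each
    grid value. Returns one averaged prediction per grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)            # copy so the data stay unchanged
            modified[feature_index] = value
            total += model(modified)
        curve.append(total / len(rows))
    return curve
```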

Fig. 7 Partial dependence of the gas solubility in aqueous phase obtained from the BRT-BO on pressure and temperature

We sketched the Williams plot (Taherdangkoo et al. 2021b) on the basis of the standardized residuals and Hat values (diagonal elements of the Hat matrix) for the BRT-BO model to detect suspected experimental data and high leverage points, i.e., outliers falling outside the applicability domain of the model. The analysis (Fig. 8) shows that the bulk of the gas solubility data falls in the valid domain, 0 ≤ Hat ≤ 0.0019 and −3 ≤ standardized residuals ≤ 3, indicating the reliability of the compiled dataset and the statistical validity of the model.

Fig. 8 Williams plot of gas solubility dataset for the BRT-BO model

The quantitative analysis shows that the standardized residuals of 36 data points (1.69% of the compiled data) are outside the range of −3 to 3; these points are considered suspected data. Additionally, 112 data points (5.26%) have high leverage (Hat value ≥ 0.0019). However, the BRT-BO model’s performance analysis (Fig. 8) shows that the predicted gas solubility residuals are in an acceptable range.
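The classification behind these counts can be sketched as follows, given standardized residuals and Hat values that have already been computed. The function names are hypothetical; the leverage limit and residual limit are passed in explicitly and take the values quoted above in the usage example.

```python
def classify_williams(std_residuals, hat_values, h_star, limit=3.0):
    """Split data indices into suspected, high-leverage, and valid points."""
    suspected = [i for i, r in enumerate(std_residuals) if abs(r) > limit]
    high_leverage = [i for i, h in enumerate(hat_values) if h >= h_star]
    valid = [i for i, (r, h) in enumerate(zip(std_residuals, hat_values))
             if abs(r) <= limit and h < h_star]
    return suspected, high_leverage, valid


def percentage(count, total):
    """Share of the dataset, in percent."""
    return 100.0 * count / total


# Usage with three synthetic points (illustrative values only).
groups = classify_williams([0.5, -3.2, 1.0], [0.001, 0.001, 0.005],
                           h_star=0.0019)
```

With 2129 data points, `percentage(36, 2129)` and `percentage(112, 2129)` round to the 1.69% and 5.26% reported above.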

4.2 Comparison analysis

In Fig. 9, the methane solubility in pure water calculated by the BRT-BO model is compared with experimental data and the Spivey model for temperatures between 298.2 and 344.3 K and pressures between 22.8 and 680 bar. The BRT-BO and Spivey models capture the solubility trend observed in the experimental data, i.e., the methane solubility in the liquid phase increases with increasing pressure. The BRT-BO model can accurately determine methane solubility values at low pressure and temperature conditions. Furthermore, the comparison with the data of O’Sullivan and Smith (1970) and Culberson et al. (1951) shows the efficiency of the model in determining the solubility values at high pressures. The modeling deviations, i.e. the differences between experimental and predicted values, are minor, showing the potential of the BRT-BO for future applications.

Fig. 9 Comparison of experimentally determined solubility of methane in H2O at 298.2 K (Culberson et al. 1951), 303.2 K (Wang et al. 2003), 324.65 K (O’Sullivan and Smith 1970), and 344.3 K (Culberson et al. 1951) with the calculated values from the BRT-BO and Spivey models

Methane solubility in NaCl solutions was compared at temperatures ranging from 324 to 378 K and NaCl concentrations between 0.88 and 2.5 m. The gas solubility in the liquid phase decreases with increasing salinity, which was effectively modeled (Fig. 10). The BRT-BO accurately reproduces the experimental data over a wide pressure range (41.8–1339 bar), as demonstrated by the comparison analysis. The overall performance of the BRT-BO model is slightly better than that of the Spivey model. The comparison analysis shows that the methane solubility predictions are accurate in the CH4–H2O–NaCl and CH4–H2O systems. The analysis shows that the BRT-BO can be employed for modeling of two-phase flow and transport of methane in the shallow and deep subsurface, e.g. freshwater and saline water aquifers, with the accuracy needed for hydrogeological applications.

Fig. 10 Comparison of experimentally determined solubility of methane in the system H2O-NaCl at 324.65 K (O’Sullivan and Smith 1970), 298.15 K (Michels et al. 1936), 373 K (Blount et al. 1982), and 348.15 K (Michels et al. 1936) with the calculated values from the BRT-BO model and Spivey model

The BRT-BO model was compared with the Mao model for calculating ethane solubility in pure water. The predictive ability of both models is satisfactory, showing only minor deviations from the experimental data in some conditions (Fig. 11). For instance, the Mao model is slightly more accurate than the BRT-BO in reproducing the experimental ethane solubility values of Mohammadi et al. (2004) at a temperature of 298.3 K, while the BRT-BO model performs better at 313 K. The BRT-BO model provides better predictions for the Wang et al. (2003) dataset. The BRT-BO model covers the experimental data points well and can be applied for ethane solubility prediction in the aqueous phase at pressures ranging from 4.39 to 573.6 bar.

Fig. 11 Comparison of experimentally determined solubility of ethane in H2O at 298.3 K (Mohammadi et al. 2004), 303.2 K (Wang et al. 2003), 313.19 K (Mohammadi et al. 2004), and 444.26 K (Culberson et al. 1951) with the calculated values from the BRT-BO model and Mao model

The propane solubilities in pure water calculated by the BRT-BO and Chapoy models are close to the experimental values taken from Chapoy et al. (2004) (Fig. 12). Similar to previous cases, the BRT-BO predictions follow the gas solubility behavior observed in experimental data. The Chapoy model is the most accurate in predicting propane solubility, followed by the BRT-BO and Pereda models. The BRT-BO is slightly more accurate than the Pereda model under the conditions studied.

Fig. 12 Comparison of experimentally determined solubility of propane in H2O at 298.12, 338.15, and 353.18 K (Chapoy et al. 2004), and various temperatures (Mokraoui et al. 2007) with the calculated values from the BRT-BO, Chapoy, and Pereda models

5 Discussion

The results showed that the BRT-BO model is able to determine the solubility of methane, ethane, and propane in pure water and electrolyte solutions with sufficient accuracy, highlighting its potential for wide-ranging geological and hydrogeological applications. In general, the model provides accurate outcomes compared to the established thermodynamic models of Spivey, Mao, Chapoy, and Pereda. The BRT-BO model serves as a viable alternative for predicting light hydrocarbon solubility in aqueous systems under diverse conditions.

One of the main advantages of the BRT-BO is its applicability to calculate the solubility of C1–C3 gases in a wide range of conditions. Although thermodynamic approaches are highly efficient, they are usually complex and have a limited application domain. For example, the Mao model, which requires many empirical parameters, is limited to the solubility calculations of ethane in aqueous systems. Numerical modeling of the transport of light hydrocarbons between deep gas reservoirs and shallow groundwater aquifers is complex as it involves multi-phase, multi-component flow through different media such as fault zones, fracture networks, and low permeability layers. The BRT-BO model can serve as an alternative to thermodynamic models needed to calculate the phase behavior of various gas components during the transport. This would reduce the complexity of numerical models making groundwater contamination models more efficient. Therefore, the BRT-BO model can be further implemented in numerical modeling frameworks to address issues in the field of science and engineering.

Future studies might explore incorporating the impact of ionic strength and specific ion effects on gas solubility directly within the machine learning algorithm, aiming to further refine its predictive accuracy. Additionally, the application of alternative optimization algorithms could offer improvements in the performance of machine learning models. While the dataset used for model development was adequately extensive, expanding this dataset could further improve the model’s accuracy and its ability to generalize across a broader range of conditions.

6 Conclusions

We employed regression tree (RT) and boosted regression tree (BRT) algorithms tuned with a Bayesian optimization algorithm to build a model able to calculate the solubility of methane, ethane, and propane in aqueous systems over a wide range of pressure, temperature, and salt concentrations. Both the RT-BO and BRT-BO models are able to determine the solubility of light hydrocarbons in aqueous systems. The BRT-BO is more accurate, as evidenced by an MSE close to zero (MSE = \(9.97 \times 10^{-8}\)) and an R2 value of 0.99. The predictions of the BRT-BO are in good agreement with the experimental hydrocarbon solubility dataset, which contains 2129 experimental data points of methane, ethane, and propane solubility in pure water and various electrolyte solutions. The comparison of the BRT-BO model’s predictions with four well-established thermodynamic models confirms the high prediction capability of the ML model. The application of the leverage approach showed that the majority of data points fall in the valid domain (1.69% are suspected data and 5.26% have high leverage), supporting the statistical validity of the model. We conclude that the BRT-BO model is a well-suited and robust tool which can be regarded as an alternative to more classical approaches for light hydrocarbon solubility calculations needed for various scientific and engineering applications such as numerical modeling of stray gas migration in the subsurface environment, development of effective environmental risk management strategies, optimization of gas extraction and processing operations, and development of strategies aimed at mitigating atmospheric emissions of methane and other light hydrocarbons.