Introduction

One of the fundamental properties for designing gas absorption and stripping columns in the chemical industry is the solubility of gases in liquids1. While the basic principles of solubility thermodynamics are well known, accurately predicting solubility for important industrial systems with molecular thermodynamics alone remains challenging. Hydrogen is now a prominent substance in industry. It plays a substantial role in industrial processes, so its solubility in various hydrocarbon solutions such as fuels is very important for the design and optimal operation of these processes2. Hydrogen is a useful compound in the chemical and petroleum industries. The quality of heavy petroleum fractions can be upgraded through hydrovisbreaking or hydrocracking, in which hydrogen is added to increase the hydrogen-to-carbon ratio (H/C). The production of low-sulfur fuels in the oil refining industry likewise consumes large amounts of hydrogen in desulfurization plants3,4,5. The design and operation of processes such as hydrogenation and hydrocracking, along with the corresponding kinetic models, require hydrogen solubility data6. Pressure, temperature, and solvent composition remarkably affect hydrogen solubility as a thermodynamic quantity. Increasing pressure and temperature both increase gas solubility, and, in terms of mole fraction, hydrogen solubility increases with the carbon number of the hydrocarbon, as demonstrated by experimental tests2,7,8,9. It is well known that traditional equations of state (EOSs) are limited in accurately predicting hydrogen solubility for the modeling of hydrogenation processes. Overuse of hydrogen in hydrogenation wastes energy and can even pose a safety hazard. Therefore, solubility data are very significant for predicting the optimal amount of hydrogen in this process and can improve plant safety. Performing experiments on heavy hydrocarbons is particularly difficult because of their complexity, and the risks associated with high-pressure and/or high-temperature conditions in industrial processes make extensive testing unattractive. Hence, modeling based on experimental data is a good alternative.

The methods for predicting hydrogen solubility in solvents such as hydrocarbons or petroleum mixtures are mostly based on empirical and semi-empirical models such as EOSs and are similar to those applied to the solubility of other gases such as methane and CO210,11,12,13,14,15. Shaw16 proposed a correlation, based on corresponding-states theory, for estimating the solubility of hydrogen in heterocyclic, aromatic, and alicyclic hydrocarbon solvents16. Yuan et al.17 used molecular dynamics simulations to estimate hydrogen solubility in heavy hydrocarbons over a range of pressures and temperatures. They concluded that combining EOSs with molecular dynamics simulations can yield more accurate and practical predictions of hydrogen solubility at high pressures and temperatures17. Riazi and Roomi5 proposed a method for predicting hydrogen solubility in hydrocarbons and their mixtures based on regular solution theory. Their procedure calculated the hydrogen solubility parameter from the solvent type or its molecular weight. Its advantage was that, unlike EOSs or other models, the critical properties of the solvent were not needed to calculate hydrogen solubility; however, the additional calculations the method requires can still be considered a drawback5. Torres et al.18 applied the augmented Grayson–Streed method19 to better model the solubility of hydrogen in heavy oil cuts, although they noted that homogeneous EOS models could provide better results. The solubility of hydrogen in n-alcohols has been measured and modeled by d’Angelo and Francesconi20, who also used individual correlations in the form of pseudo-Henry’s constants to better estimate hydrogen solubility20. Luo et al.21 experimentally investigated hydrogen solubility in coal liquid and several hydrocarbons and proposed a mathematical model based on Henry’s law and the Pierotti method21. Yin and Tan22 obtained hydrogen solubility data in toluene in the presence of CO2 (i.e., the ternary system H2 + toluene + CO2) and modeled the vapor–liquid equilibrium (VLE) data with the Peng–Robinson EOS combined with the van der Waals mixing rule22. Qian et al.23 used the Peng–Robinson EOS to model a large dataset of various hydrogen-containing binary systems, implementing a group-contribution method to calculate temperature-dependent binary interaction parameters23; this method had previously been proposed by Jaubert and Mutelet to predict the VLE of binary hydrocarbon mixtures24. The solubility of hydrogen in several heavy normal alkanes has been measured and modeled by Florusse et al.2, who applied the statistical associating fluid theory (SAFT) approach to model the data. However, this approach is complex because of the adjustable parameters required for each potential function2. The Perturbed-Chain SAFT (PC-SAFT) EOS25 is another method that can be used to estimate the solubility of hydrogen in hydrocarbons and has been utilized in several models for predicting hydrogen solubility in hydrocarbons and heavy oils6,26,27,28. Classical EOSs, activity models, and similar approaches require adjustable parameters, proper mixing rules, and iterative calculations; traditional EOSs are only reliable within specific temperature and pressure ranges and offer limited flexibility with respect to the substances covered.

Complex calculations in the chemical and petroleum sciences have been facilitated by artificial intelligence (AI) methods in recent years. Regarding the use of AI for hydrogen solubility, Safamirzaei et al.29 considered hydrogen solubility in primary n-alcohols and then applied artificial neural networks (ANNs) to overcome the constraints of EOSs and simple correlations and achieve the best modeling29. Nasery et al.30 implemented an Adaptive Neuro-Fuzzy Inference System (ANFIS) to estimate the solubility of hydrogen in heavy oil fractions30. Safamirzaei and Modarress31 modeled hydrogen solubility in heavy n-alkanes (C46H94, C36H74, C28H58, C16H34, and C10H22) with ANNs31. As the literature shows, modeling hydrogen solubility in different solvents, especially hydrocarbons, has long been a focus of researchers. Moreover, according to the classification scheme of van Konynenburg and Scott32 and the updated version by Privat and Jaubert33, hydrogen-containing systems systematically show type III phase behavior, and such systems are acknowledged to be particularly difficult to correlate. Hence, there is room to develop a more general and more precise model for estimating hydrogen solubility in hydrocarbons using AI methods, one that accounts for more influential variables. Owing to the nature of data-driven soft computing techniques, such a comprehensive model can be developed by combining more data points and various operating conditions.

In the current work, we apply a total of 919 experimental hydrogen solubility data points for 26 different hydrocarbons collected at different operating conditions1,2,8,11,14,21,34,35,36,37,38,39,40,41,42,43,44. Advanced machine learning methods, namely extreme gradient boosting (XGBoost), adaptive boosting support vector regression (AdaBoost-SVR), gradient boosting with categorical features support (CatBoost), light gradient boosting machine (LightGBM), and a multi-layer perceptron (MLP) optimized by the Levenberg–Marquardt (LM) algorithm, are utilized to develop models for estimating the hydrogen solubility in hydrocarbons. Moreover, the validity of the proposed models is checked by applying statistical parameters and graphical error analyses. In addition, several hydrogen solubility systems are estimated both by the models developed in this work and by five EOSs, Soave–Redlich–Kwong (SRK), Peng–Robinson (PR), Redlich–Kwong (RK), Zudkevitch–Joffe (ZJ), and perturbed-chain statistical associating fluid theory (PC-SAFT), to compare the models and the EOSs.

Data gathering

To accurately model hydrogen solubility in hydrocarbons, 919 experimental hydrogen solubility data points were gathered from the literature1,2,8,11,14,21,34,35,36,37,38,39,40,41,42,43,44. Table 1 lists the sources of the experimental hydrogen solubility data used in this work along with the pressure range, temperature range, and uncertainty values for each system. Since the type of hydrocarbon dictates hydrogen solubility, a broad range of hydrocarbons was selected, with their properties given in Table S1. The hydrocarbon families used in this study include alkanes, alkenes, cycloalkanes, aromatics, polycyclic aromatics, and terpenes.

Table 1 Hydrogen solubility database used for modeling in this work.

To model hydrogen solubility in hydrocarbons, thermodynamic properties were considered for model development. In this work, the molecular weight, critical pressure, and critical temperature of the solvents, together with pressure and temperature, were selected as the input parameters; the hydrogen solubility (in terms of mole fraction) at the corresponding pressure and temperature is the model output. A short statistical description of the input and target parameters of the data bank applied for modeling is listed in Table 2. Using the uncertainty values of the experimental data in data-driven modeling could make the model more reliable; however, because uncertainty values (for test conditions and solubility results) were not reported, or not fully reported, in some papers, it was not possible to use them in the modeling.
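
The following is a minimal sketch of how the input/target layout described above could be organized in Python; the column names and the two example rows are illustrative placeholders, not records from the actual databank.

```python
import pandas as pd

# Hypothetical layout of the modeling databank: five inputs and one target per
# record.  The two example rows are illustrative only; the actual records come
# from the experimental sources listed in Table 1.
records = [
    # solvent      MW (g/mol)  Tc (K)  Pc (MPa)  T (K)   P (MPa)  x_H2 (placeholder)
    ("n-decane",   142.28,     617.7,  2.11,     344.3,   5.0,    0.03),
    ("benzene",     78.11,     562.0,  4.90,     433.2,  10.0,    0.06),
]
df = pd.DataFrame(records, columns=["solvent", "MW", "Tc", "Pc", "T", "P", "x_H2"])

X = df[["MW", "Tc", "Pc", "T", "P"]]   # model inputs
y = df["x_H2"]                          # target: hydrogen mole fraction
```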

Table 2 Statistical information about the collected databank in this paper.

It is very important to apply different systems to achieve a comprehensive model for predicting hydrogen solubility in hydrocarbons. The characterization data for the 26 hydrocarbons from 6 hydrocarbon families utilized for modeling are presented in Table S1. A databank of 919 data points was gathered from 48 different systems in the literature1,2,8,11,14,21,34,35,36,37,38,39,40,41,42,43,44, and its statistical information is reported in Table 2. The carbon number of the hydrocarbons ranges from 4 to 46, corresponding to a molecular weight range of 58.12–647.2 g/mol. The experimental hydrogen solubility data also cover a broad range of operating temperatures (213–623 K) and pressures (0.1–25.5 MPa). According to the statistics reported in Table 2, the variation range and distribution of the model input parameters are wide enough to support a general model for estimating hydrogen solubility in hydrocarbons.

Models implementation

Extreme gradient boosting (XGBoost)

The main idea behind a tree-based ensemble technique is to utilize an ensemble of classification and regression trees (CARTs) such that the training data are fitted by minimizing a regularized objective function. XGBoost is one of these tree-based models under the gradient boosting decision tree (GBDT) framework. Structurally, every CART consists of (I) a root node, (II) internal nodes, and (III) leaf nodes, as shown in Fig. 1. Following the binary decision practice, the root node, which embodies the whole data set, is split into internal nodes, while the leaf nodes represent the ultimate classes. To build a robust ensemble in gradient boosting, a series of base CARTs are constructed consecutively, and the weight of every individual CART is tuned through the training process45.

Figure 1
figure 1

Level-wise tree growth in XGboost.

To model the output y for a given dataset with m features and n examples, an ensemble of N trees needs to be trained:

$$\begin{gathered} \hat{y}_{i} = \sum\limits_{{k = 1}}^{N} {f_{k} \left( {X_{i} } \right)} , \quad f_{k} \in f \hfill \\ With\; f = \left\{ {f(X) = \omega _{{q(x)}} } \right\},(q:\mathbb{R}^{m} \to T,\omega \;\in \mathbb{R}^{T} ) \hfill \\ \end{gathered}$$
(1)

where the example x is mapped by the decision rule q(x) to the corresponding leaf index. In Eqs. (1) and (2), f represents the space of regression trees, fk is the kth independent tree, T denotes the number of leaves on the tree, and ω is the leaf weight.

The determination of the ensemble of trees is performed by the minimization of regularized objective function L:

$$\begin{gathered} L = \sum _{i}^{n} l(\hat{y}_{{_{i} }} ,y_{{_{i} }} ) + \sum _{k}^{N} \Omega (f_{k} ) \hfill \\ With\;\Omega (f) = \gamma T + \frac{1}{2}\lambda \left\| \omega \right\|^{2} \hfill \\ \end{gathered}$$
(2)

where Ω is the regularization term that limits model intricacy and helps reduce overfitting; l denotes a differentiable convex loss function; γ stands for the minimum loss reduction needed to split a new leaf, and λ is the regularization coefficient. It should be noted that γ and λ in these equations help lower the model variance and decrease overfitting46.

In the gradient boosting approach, the objective function is minimized for every individual leaf, through which more branches are added iteratively.

$$L^{{(t)}} = \sum\limits_{{i = 1}}^{n} {\left\{ {l(y_{i} ,\hat{y}_{i} ^{{(t - 1)}} ) + f_{t} (X_{i} )} \right\}} + \Omega (f_{t} )$$
(3)

where t represents the t-th iteration of the aforementioned training process. To notably ameliorate the ensemble model, XGBoost greedily adds the regression tree that most improves the objective, a strategy usually referred to as the “greedy algorithm”. The model output is therefore iteratively updated through the minimization of the objective function:

$$\hat{y}_{i}^{{(t)}} = \hat{y}_{i}^{{(t - 1)}} + f_{t} (X_{i} )$$
(4)

XGBoost also benefits from a shrinkage strategy in which the newly added weights are scaled after every boosting step by a learning-rate factor. This diminishes the influence of each individual tree and leaves room for future trees, thereby reducing the risk of overfitting47.
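
As a concrete illustration of the ingredients discussed above (shrinkage, the split penalty γ, and the leaf-weight regularization λ), the sketch below configures an XGBoost regressor with the xgboost Python package; the hyperparameter values are placeholders, not the tuned values reported later in Table 3, and X_train/y_train are assumed to hold the inputs and hydrogen mole fractions described in the data section.

```python
from xgboost import XGBRegressor

# Illustrative XGBoost regressor; values are placeholders, not the tuned ones.
model = XGBRegressor(
    n_estimators=500,      # number of boosted CARTs added sequentially
    learning_rate=0.05,    # shrinkage factor scaling each new tree's weights
    max_depth=6,           # depth limit of each CART
    min_child_weight=1,    # minimum sum of instance weight (hessian) in a child
    gamma=0.0,             # minimum loss reduction to split a leaf (γ in Eq. 2)
    reg_lambda=1.0,        # L2 penalty on leaf weights (λ in Eq. 2)
    base_score=0.5,        # initial prediction score of all instances
    objective="reg:squarederror",
)
model.fit(X_train, y_train)        # X_train: [MW, Tc, Pc, T, P]; y_train: x_H2
x_h2_pred = model.predict(X_test)
```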

Light gradient boosting machine (LightGBM)

Another gradient learning framework built on the decision-tree idea is LightGBM48. The salient features that distinguish LightGBM from XGBoost are lower memory consumption, a leaf-wise growth approach with depth restrictions, and a histogram-based algorithm that expedites training49. Using this histogram algorithm, LightGBM discretizes continuous floating-point eigenvalues into k bins, thereby building a k-width histogram. In addition, the histogram algorithm does not require extra storage of pre-sorted results, and values can be stored as 8-bit integers after feature discretization, which reduces memory consumption to 1/8. Nevertheless, this rough partitioning can decrease model accuracy. LightGBM also uses a leaf-wise growth approach, which is more effective than the traditional level-wise strategy. The inefficiency of the level-wise strategy arises because all leaves in the same layer are treated at each step, leading to unnecessary memory allocation. Instead, in the leaf-wise approach the leaf with the highest branching gain is found at every step, after which the algorithm continues the branching cycle. Thus, errors are diminished and higher precision is achieved with the same number of splits compared with the level-wise (horizontal) strategy. Figure 2 depicts the leaf-wise tree growth strategy. The downside of leaf-wise growth is that it produces deeper decision trees, which can result in overfitting. However, LightGBM precludes this overfitting, while maintaining high efficiency, by imposing a maximum depth limit on the leaf-wise growth48,49.

Figure 2
figure 2

Leaf-wise tree growth in LightGBM.

In the following, the main LightGBM calculations are outlined50:

For a given training dataset \(X = \left\{ {(x_{i} ,y_{i} )} \right\}_{{_{i = 1} }}^{m}\), LightGBM searches for an approximation \(\widehat{f}(x)\) to the function f*(x) that minimizes the expected value of a specified loss function L(y, f(x)):

$$\hat{f}\left( x \right) = \arg \mathop {\min }\limits_{f} \; E_{y,x} \, L(y,f(x))$$
(5)

LightGBM ensembles T regression trees \(\sum_{t=1}^{T}{f}_{t }(x)\) to approximate the model. The regression trees are defined as wq(x), \(q \in \left\{ {1, \, 2,...,N} \right\}\), where w is a vector of the sample weights of the leaf nodes, N stands for the number of tree leaves, and q represents the decision rule of the trees. The model is trained in additive form at step t:

$$G_{t} \cong \sum\limits_{i = 1}^{N} {L(y_{i} ,F_{t - 1} (x_{i} ) + f_{t} (x_{i} ))}$$
(6)

Newton's approach is used to approximate the objective function.
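
To make the LightGBM-specific settings above concrete (histogram bins, leaf-wise growth with a depth cap, and feature/bagging fractions), the sketch below uses the lightgbm Python package; the parameter values are placeholders rather than the tuned values from Table 3, and X_train/y_train are the same assumed arrays as before.

```python
import lightgbm as lgb

# Illustrative LightGBM regressor; values are placeholders, not the tuned ones.
params = {
    "objective": "regression",
    "metric": "rmse",
    "boosting_type": "gbdt",
    "learning_rate": 0.05,
    "num_leaves": 31,          # leaf-wise growth: leaves, not levels, are split
    "max_depth": 8,            # depth cap that limits leaf-wise overfitting
    "max_bin": 255,            # k bins of the histogram-based split finding
    "feature_fraction": 0.9,   # fraction of features sampled per iteration
    "bagging_fraction": 0.8,   # fraction of data sampled per iteration
}
train_set = lgb.Dataset(X_train, label=y_train)
booster = lgb.train(params, train_set, num_boost_round=500)
x_h2_pred = booster.predict(X_test)
```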

Gradient boosting with categorical features support (CatBoost)

CatBoost (categorical boosting) handles categorical columns using permutation-driven techniques such as one_hot_max_size (OHMS) and target-based statistics. In this technique, a greedy method is used for each new split of the current tree, which enables CatBoost to exploit the exponential growth of feature combinations51. CatBoost applies the following steps to every feature that has more categories than OHMS:

  1. Random subset formation of the records.

  2. Label conversion to integers.

  3. Categorical feature transformation to numeric values, as follows:

    $$avgTarget = \frac{countInClass + prior}{totalCount + 1}$$
    (7)

where countInClass is the number of preceding objects whose target equals one for the given categorical feature value, and totalCount is the number of preceding objects with that value (the starting parameters determine the prior used when counting the objects)52,53.
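
The sketch below is a minimal Python illustration of the ordered target statistic in Eq. (7), computed from a random permutation of the records as in steps 1–3 above; it assumes binary targets and is purely didactic, since the inputs used in this work are all numeric and this mechanism would only be triggered by categorical columns.

```python
import numpy as np

def ordered_target_statistic(categories, targets, prior=0.5, seed=0):
    """Illustrative CatBoost-style encoding of Eq. (7):
    avgTarget = (countInClass + prior) / (totalCount + 1),
    computed only from objects preceding each record in a random permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))      # step 1: random record order
    encoded = np.empty(len(categories), dtype=float)
    count_in_class, total_count = {}, {}
    for idx in order:
        cat = categories[idx]
        cic = count_in_class.get(cat, 0)          # previous targets equal to one
        tot = total_count.get(cat, 0)             # previous objects in this category
        encoded[idx] = (cic + prior) / (tot + 1)  # Eq. (7)
        count_in_class[cat] = cic + targets[idx]
        total_count[cat] = tot + 1
    return encoded
```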

Adaptive boosting (AdaBoost)

For supervised classification, Freund and Schapire54 suggested the AdaBoost algorithm. In this model, reweighted data are sequentially passed to weak learners, with the weights chosen according to the reliability, that is, the consistency of the learners' outputs. This trick forces each new weak learner to concentrate on the hard cases55. The key steps of the AdaBoost technique are as follows:

  • Define the initial weights: \({w}_{j}=\frac{1}{n}, \; j=1,2,\dots ,n\);

  • For each i, fit a weak learner \({Wl}_{i}(x)\) to the training data using the weights and obtain the weighted error

    $${Err}_{i}=\frac{{\sum }_{j=1}^{n}{w}_{j}\,I\!\left({t}_{j}\ne {Wl}_{i}\left({x}_{j}\right)\right)}{{\sum }_{j=1}^{n}{w}_{j}}, \quad I\left(x\right)=\begin{cases}0 & \text{if } x \text{ is false} \\ 1 & \text{if } x \text{ is true}\end{cases}$$
  • For each i, determine weights for predictors as: \({\beta }_{i}=log\left(\frac{(1-{Err}_{i})}{{Err}_{i}}\right)\)

  • Update the data weights for each i up to N (N denotes the number of learners);

  • Apply the fitted weak learners to the test data (x) to obtain the output.

In this study, support vector regressors (SVRs) were applied as the weak learners in the AdaBoost scheme.

Support vector regression (SVR)

The support vector machine (SVM) is a family of supervised machine learning algorithms that can be applied to both classification and regression tasks56. SVR is a systematic soft-computing technique with a well-established mathematical formulation. Because it has proven very stable for modeling multiple complex structures, this approach has gained significant interest. The fundamental concept behind SVR is commonly presented in the literature57, so only a short description is given here for brevity. SVR attempts to obtain a regression function f(x) for a given dataset \([\left({x}_{1}, {y}_{1}\right),\dots ,\left({x}_{n}, {y}_{n}\right)]\), with \(x\in {R}^{d}\) as the d-dimensional input space and \(y\in R\) as the output vector dependent on the input data, to estimate the output as follows:

$$f(x) = w.\phi (x_{i} ) + b$$
(8)

where \(b\) denotes the bias, \(w\) the weight vector, and \(\phi \left(x\right)\) the kernel-induced mapping function. The following minimization problem, proposed by Vapnik, should be solved to obtain the proper values of the weight and bias58:

$$\begin{gathered} {\text{minimize}}\quad \frac{1}{2}w^{T} w + C\sum\limits_{i = 1}^{n} {(\zeta_{i}^{ - } + \zeta_{i}^{ + } )} \hfill \\ {\text{subject to}}\quad \left\{ \begin{gathered} (w.\phi (x_{i} ) + b) - y_{i} \le \varepsilon + \zeta_{i}^{ - } \hfill \\ y_{i} - (w.\phi (x_{i} ) + b) \le \varepsilon + \zeta_{i}^{ + } \hfill \\ \zeta_{i}^{ + } ,\zeta_{i}^{ - } \ge 0,\quad i = 1,2,...,n \hfill \\ \end{gathered} \right. \hfill \\ \end{gathered}$$
(9)

where \(T\) represents the transpose operator, \(\varepsilon\) is the error tolerance, C is a positive regularization parameter that penalizes deviations larger than \(\varepsilon\), and \({\zeta }_{i}^{+}\) and \({\zeta }_{i}^{-}\) are non-negative slack variables capturing the excess deviations above and below the \(\varepsilon\)-tube, respectively.

By means of Lagrange multipliers, the constrained optimization problem above is converted into its dual form, which leads to the final solution:

$$f(x) = \sum\limits_{k = 1}^{n} {(a_{k} - a_{k}^{*} )\,K(x_{k} ,x)} + b$$
(10)

where \(K({x}_{k},x)\) represents the kernel function; \({a}_{k}\) and \({a}_{k}^{*}\) are the Lagrange multipliers, which satisfy the constraints \(0 \le a_{k}, a_{k}^{*} \le C\).
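
A minimal sketch of the AdaBoost-SVR combination described in the two subsections above is given below, using scikit-learn's AdaBoostRegressor with an epsilon-SVR weak learner; the parameter values are placeholders and do not correspond to the tuned hyperparameters of this work.

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.svm import SVR

# Illustrative AdaBoost-SVR: an epsilon-SVR weak learner boosted by AdaBoost.
weak_learner = SVR(kernel="rbf", C=10.0, epsilon=0.01, gamma="scale")
model = AdaBoostRegressor(
    estimator=weak_learner,   # called base_estimator in older scikit-learn releases
    n_estimators=50,
    learning_rate=0.5,
    loss="linear",            # loss used to reweight the hard cases each round
)
model.fit(X_train, y_train)
x_h2_pred = model.predict(X_test)
```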

Multilayer perceptron (MLP) neural network

MLP is a class of feedforward ANNs consisting of several layers. The first layer, which receives the input data, is the input layer; the last layer, which corresponds to the model output, is the output layer; and the middle layers, which process the information, are the hidden layers59. In the hidden layers, each neuron connects to every neuron in the next and previous layers. The value of every neuron in the output or hidden layers is calculated as follows: the value of every neuron in the previous layer is multiplied by its corresponding weight, the products are summed, and a bias term is added; the resulting value then passes through an activation function60 (a minimal sketch of this forward pass is given after Fig. 3). Table S2 summarizes different activation functions along with their corresponding mathematical equations. The number of hidden layers and the number of neurons in each hidden layer should be optimized to obtain a highly efficient and accurate model, usually by an empirical method. The performance of the MLP model also depends on the optimization algorithm, such as Levenberg–Marquardt (LM)61, applied to train it. In this work, the MLP model developed on the basis of the LM optimization algorithm is dubbed MLP-LM. Figure 3 represents a schematic of the MLP developed in this work.

Figure 3
figure 3

A schematic of the developed MLP neural network.
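
The following numpy sketch illustrates only the forward pass described above (weighted sum plus bias, followed by an activation); it is not the model used in this work, and the Levenberg–Marquardt training step is not shown. The tanh activation is an assumption for the hidden layers.

```python
import numpy as np

def forward_pass(x, weights, biases):
    """Minimal sketch of the MLP forward pass: each layer multiplies the
    previous layer's outputs by its weights, adds a bias, and applies an
    activation (tanh assumed for hidden layers, identity for the output)."""
    a = np.asarray(x, dtype=float)        # input layer: [MW, Tc, Pc, T, P]
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(W @ a + b)            # hidden layers
    W_out, b_out = weights[-1], biases[-1]
    return W_out @ a + b_out              # output layer: predicted x_H2
```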

The procedure of model development

To develop each model while guarding against overfitting, we used a grid search to optimize the hyperparameters of the models. The hyperparameters included in the grid search differed between models; their importance was judged on both theoretical and practical grounds. The following hyperparameters were used for each model (a minimal grid-search sketch follows this list):

  • For XGBoost: max_depth, n_estimators, learning_rate, min_child_weight, base_score.

  • For LightGBM: boosting_type, objective, metric, learning_rate, feature_fraction, bagging_fraction.

  • For AdaBoost-SVR: learning_rate, loss, epsilon, n_estimators, γ, C.

  • For CatBoost: n_estimators, max_depth, learning_rate.

  • For MLP-LM: learning_rate, epochs.

The empirical method is also applied to determine the optimal number of hidden layers and the number of neurons in each hidden layer for the MLP neural network.
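
The sketch below shows a grid search in the spirit described above, using scikit-learn's GridSearchCV around an XGBoost regressor; the candidate values are placeholders and do not reproduce the actual search space of this work.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Illustrative grid search over the XGBoost hyperparameters listed above.
param_grid = {
    "max_depth": [4, 6, 8],
    "n_estimators": [200, 500, 1000],
    "learning_rate": [0.01, 0.05, 0.1],
    "min_child_weight": [1, 3, 5],
    "base_score": [0.25, 0.5],
}
search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=6,                       # 6-fold cross-validation, as described next
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
```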

In this work, we used k-fold cross-validation on the training dataset because it ensures that every observation has the chance to appear in both the training and validation folds. For all models we used k = 6 (k should be neither too small nor too large, and the choice depends on the data size), a value picked based on our data. This means the training data are split randomly into 6 folds; the model is then fitted on k − 1 folds (5 folds) and validated on the remaining fold.
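
A minimal sketch of this 6-fold scheme is given below; make_model is a hypothetical factory returning a fresh, configured regressor, and the per-fold score shown (AAPRE) is only one possible choice of validation metric.

```python
import numpy as np
from sklearn.model_selection import KFold

# Sketch of 6-fold cross-validation: fit on 5 folds, validate on the sixth.
kf = KFold(n_splits=6, shuffle=True, random_state=42)
fold_errors = []
for train_idx, valid_idx in kf.split(X_train):
    model = make_model()                                   # hypothetical factory
    model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    pred = model.predict(X_train.iloc[valid_idx])
    true = y_train.iloc[valid_idx].to_numpy()
    fold_errors.append(100.0 * np.mean(np.abs((true - pred) / true)))  # fold AAPRE
print("mean 6-fold AAPRE:", np.mean(fold_errors))
```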

Equations of state (EOSs)

The analytical relationship between the volume, temperature, and pressure of a substance can be expressed by an EOS. The vapor–liquid equilibria (VLE), volumetric behavior, and thermal properties of mixtures and pure substances can be described by such an expression, and the phase behavior of petroleum fluids is widely predicted with EOSs. As already mentioned, traditional EOSs offer poor predictions of gas solubility in solvents, especially under complex operating conditions. In this study, four cubic EOSs, SRK, PR, RK, and ZJ, along with PC-SAFT as a representative of SAFT-type EOSs, are implemented to calculate hydrogen solubility in hydrocarbons, and their precision is compared with that of the proposed machine learning models. Conventional van der Waals one-fluid mixing rules are utilized in the cubic EOSs. Table S3 shows the PVT relationships of the cubic EOSs and the PC-SAFT equation in terms of the residual Helmholtz free energy. Furthermore, the parameters and mixing rules for the EOSs are presented in Table S4, and the pure-component PC-SAFT parameters for the substances used in this work are reported in Table S5. The binary interaction parameter (kij) in the van der Waals mixing rules, which characterizes the molecular interactions between the molecules of two components, can be a key parameter in estimating the solubility of a solute in a solvent with cubic EOSs. A similar kij parameter is introduced by applying the van der Waals one-fluid mixing rules to the perturbation terms in the PC-SAFT EOS, correcting the segment–segment interactions of unlike chains. The optimized kij values for all EOSs in the different hydrogen solubility systems are reported in Table S6.
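
As a brief illustration of where kij enters, the sketch below evaluates the classical van der Waals one-fluid mixing rules for the cubic-EOS mixture parameters; the pure-component a and b values themselves come from the EOS-specific relations in Table S4 and are assumed to be given.

```python
import numpy as np

def vdw_one_fluid_mixing(x, a, b, kij):
    """Classical van der Waals one-fluid mixing rules:
    a_mix = sum_i sum_j x_i x_j sqrt(a_i a_j) (1 - k_ij),  b_mix = sum_i x_i b_i."""
    x, a, b = map(np.asarray, (x, a, b))
    kij = np.asarray(kij)
    a_cross = np.sqrt(np.outer(a, a)) * (1.0 - kij)   # a_ij with interaction k_ij
    a_mix = x @ a_cross @ x
    b_mix = x @ b
    return a_mix, b_mix
```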

Model assessment

Statistical error analysis

The following definitions have been implemented for the statistical parameters of standard deviation (SD), average absolute percent relative error (AAPRE), root mean square error (RMSE), coefficient of determination (R2), and average percent relative error (APRE) to assess the validation and accuracy of the models:

$$SD = \sqrt {\frac{1}{N - 1}\sum\limits_{i = 1}^{N} {\left( {\frac{{HS_{i,e} - HS_{i,p} }}{{HS_{{i,{\text{e}} }} }}} \right)}^{2} }$$
(11)
$$RMSE = \sqrt {\frac{1}{N}\sum\limits_{i = 1}^{N} {\left( {HS_{{i,{\text{e}} }} - HS_{i,p} } \right)}^{2} }$$
(12)
$$AAPRE = \frac{100}{N}\sum\limits_{i = 1}^{N} {\left| {\frac{{HS_{{i,{\text{e}} }} - HS_{i,p} }}{{HS_{{i,{\text{e}} }} }}} \right|}$$
(13)
$$APRE = \frac{100}{N}\sum\limits_{i = 1}^{N} {\left( {\frac{{HS_{{i,{\text{e}} }} - HS_{i,p} }}{{HS_{{i,{\text{e}} }} }}} \right)}$$
(14)
$$R^{2} = 1 - \frac{{\sum\limits_{i = 1}^{N} {(HS_{{i,{\text{e}} }} - HS_{i,p} )^{2} } }}{{\sum\limits_{i = 1}^{N} {(HS_{i,p} - \overline{{HS_{{i,{\text{e}} }} }} )^{2} } }}$$
(15)

In these formulas, HSi,e, HSi,p, and N represent the experimental hydrogen solubility data, the values of hydrogen solubility in hydrocarbons predicted by the developed models, and the number of data points, respectively. The coefficient of determination, denoted almost everywhere by R2, is one of the best-known criteria for the goodness of fit of a model: it shows how well the model output corresponds to the experimental data and how valid the model is, and the closer R2 is to 1, the better the fit of the model response to the experimental values. RMSE assesses the scattering of the data around zero deviation. APRE and AAPRE measure the relative deviation and the absolute relative deviation from the target data, respectively. SD measures the degree of scattering; a lower value indicates less dispersion.
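
For reference, a direct numpy transcription of Eqs. (11)–(15), as they are written above, is sketched below for arrays of experimental and predicted solubilities.

```python
import numpy as np

def error_metrics(hs_exp, hs_pred):
    """Eqs. (11)-(15) for experimental (hs_exp) and predicted (hs_pred) solubility."""
    hs_exp, hs_pred = np.asarray(hs_exp, float), np.asarray(hs_pred, float)
    n = hs_exp.size
    rel = (hs_exp - hs_pred) / hs_exp
    sd = np.sqrt(np.sum(rel ** 2) / (n - 1))                  # Eq. (11)
    rmse = np.sqrt(np.mean((hs_exp - hs_pred) ** 2))          # Eq. (12)
    aapre = 100.0 * np.mean(np.abs(rel))                      # Eq. (13)
    apre = 100.0 * np.mean(rel)                               # Eq. (14)
    r2 = 1.0 - (np.sum((hs_exp - hs_pred) ** 2)
                / np.sum((hs_pred - hs_exp.mean()) ** 2))     # Eq. (15) as written
    return {"SD": sd, "RMSE": rmse, "AAPRE": aapre, "APRE": apre, "R2": r2}
```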

Graphical error analysis

Besides the statistical error analysis that has already been mentioned, visual graphical analysis can also help to understand the validity of the models developed in this work. The significant items are classified as follows:

Crossplot: in this graph, the values estimated by a model are plotted against the experimental values. If the best-fit line of the model estimates shows no deviation from the 45° line and the computed data are concentrated close to the unit-slope line (Y = X), the performance of the model is excellent.

Error distribution plot: the presence or absence of an error trend is checked by examining the error scattering around the zero-error line. Here, the relative error (Ei) is calculated through Eq. (16):

$${\text{E}}_{{\text{i}}} = \left[ {\frac{{HS_{{i,{\text{e}} }} - HS_{i,p} }}{{HS_{{i,{\text{e}} }} }}} \right] \times 100, \quad i = 1,2,3,...,n$$
(16)

Cumulative frequency graph: the cumulative frequency of the data is plotted versus the absolute relative error (Ea). A higher cumulative frequency curve reveals that most of the estimations fall within the usual error range; in other words, the closer the curve is to the vertical axis, the smaller the model error for a high percentage of the data. In this work, Ea is calculated through Eq. (17):

$${\text{E}}_{a} = \left| {\frac{{HS_{{i,{\text{e}} }} - HS_{i,p} }}{{HS_{{i,{\text{e}} }} }}} \right| \times 100, \quad i = 1,2,3,...,n$$
(17)

Group error diagram: the data are divided into several ranges, and the error within each range is calculated and plotted.

Trend plot: in this diagram, both the target data and the values estimated by the proposed model are plotted against the data point index, and their coverage and trend are tracked.

Results and discussion

Description of model development

The optimal values of the important hyperparameters along with the search interval of the hyperparameters tuned for the machine learning models implemented in this work are presented in Table 3.

Table 3 Optimal features for implemented models.

In Table 3, n_estimators is the number of trees; subsample is the subsample ratio of the training instances; C denotes the degree of importance given to misclassifications; max_depth represents the maximum depth of a tree; min_child_weight is the minimum sum of instance weight (hessian) needed in a child; bagging_fraction is the fraction of data used in each iteration; feature_fraction is the fraction of features randomly selected in each iteration for building trees; learning_rate controls the impact of each tree on the final outcome; base_score represents the initial prediction score of all instances; epsilon is a parameter that affects the number of support vectors used to construct the regression function; γ is the kernel coefficient; and epochs is the number of times the learning algorithm passes through the full training dataset.

Statistical assessment of the developed models

To identify the most accurate model, the developed models are compared using statistical factors including R2, AAPRE (%), SD, APRE (%), and RMSE. The calculated values of these parameters are reported in Table 4. The results reveal that among all developed models, XGBoost provides the most accurate predictions, followed by AdaBoost-SVR, LightGBM, CatBoost, and MLP-LM, respectively. Based on Table 4, AAPRE values of 2.14% for the testing set, 1.71% for the training set, and 1.81% for the total dataset indicate that the XGBoost model gives the most accurate estimation of hydrogen solubility in hydrocarbons, although Table 4 shows that the other models also display good accuracy.

Table 4 Statistical error analysis for the models developed in this work.

For a comparative evaluation of the models developed in this work against five EOSs, 30 hydrogen solubility data points from three different systems, covering hydrocarbons of low, medium, and high molecular weight, were collected from the literature8,11,39 and estimated by these models. The model predictions, along with the results calculated by the EOSs, are presented in Table 5. The AAPRE values reported in Table 5 are much higher for the EOSs than for the machine learning models. The ZJ EOS, with an AAPRE of 15.78%, gives the best hydrogen solubility calculations among the cubic EOSs. PC-SAFT, as a modern type of EOS, also shows good estimates, with an AAPRE of 9.56%, and outperforms the traditional cubic EOSs. All machine learning models give good predictions and show a significant advantage over the EOSs; the XGBoost model has the best performance among all models and EOSs, with an AAPRE of 1.92%. It is noteworthy that the uncertainty values differ between systems. According to our analysis, the AAPRE values reported in Table 5 can vary by about 5–10% owing to these uncertainties, but it is preferable to rely on the experimental values reported in the literature.

Table 5 Comparison of proposed models’ performance in this work with EOSs.

To further evaluate the validity and reliability of the XGBoost model, an external validation dataset containing 413 hydrogen solubility data points in 18 different hydrocarbons, including 6 hydrocarbons not present in the modeling database (i.e. ethane, propane, ethene, 1-hexene, 1-heptene, and diphenylmethane), over a wide range of operating temperatures (98–701 K) and pressures (1.03–78.45 MPa), was collected from the literature. The properties of all hydrocarbons used in this work are presented in Table S1, and Table 6 describes this validation dataset. Because this dataset lies completely outside the training and testing sets used for modeling in this paper, it allows the model's performance to be evaluated beyond the modeling data. AAPRE values for each system were calculated from the experimental data and the XGBoost predictions; the values reported in Table 6 show that the XGBoost model predicts all systems well, even for the hydrocarbons not used in modeling. The overall AAPRE of 1.78% for this validation dataset demonstrates the high validity of the XGBoost model in predicting hydrogen solubility in hydrocarbons.

Table 6 Validation dataset for evaluation of XGBoost model.

Visual error analysis

For a more detailed assessment of the accuracy of the proposed models, a visual analysis using the crossplot of predicted hydrogen solubility against the corresponding experimental values is depicted in Fig. 4, and Fig. 5 presents the error distribution diagrams for the testing and training sets of all models. Figure 4 demonstrates the high concentration of data points around the 45° line for all models; however, the XGBoost model performs much better than the others, indicating its high reliability for predicting hydrogen solubility in hydrocarbons. The relative errors between the experimental hydrogen solubility and the values estimated by the proposed models, plotted versus the experimental data for the test and training sets, are illustrated in Fig. 5. This figure demonstrates that the relative errors of the XGBoost and AdaBoost-SVR models lie very close to the zero-error line, whereas the errors of the CatBoost, LightGBM, and MLP-LM predictions are not as low. The maximum percent relative error between the estimated hydrogen solubility values and the experimental data for the XGBoost model is 19%. Figures 4 and 5 reflect the significant agreement between the experimental hydrogen solubility data and the XGBoost model predictions.

Figure 4
figure 4

Crossplot of prediction of hydrogen solubility in hydrocarbons by the models versus experimental data.

Figure 5
figure 5

Error distribution graphs of the proposed models for test and training sets.

Figure S1 represents the trend plot of the predicted values of hydrogen solubility in hydrocarbons for all proposed models and the experimental hydrogen solubility data versus the index of data points. As demonstrated in Fig. S1 in the Supplementary file, all models show good overlap between the estimated hydrogen solubility data and the experimental values, but the degree of overlap is excellent for the XGBoost model.

Figure S2 depicts the cumulative frequency of the data versus Ea for all developed models. Based on this figure, more than 70% of the hydrogen solubility values estimated by the XGBoost model have an absolute relative error < 1.3%, and more than 90% have an absolute relative error < 3.6%. By contrast, for the AdaBoost-SVR, LightGBM, CatBoost, and MLP-LM models, only 81%, 79%, 73%, and 48% of the predicted hydrogen solubility data, respectively, have an absolute relative error < 3.6%, which indicates the higher validity of the XGBoost model.

Operating pressure and temperature greatly affect the solubility of hydrogen in hydrocarbons. As mentioned earlier, predicting hydrogen solubility under high-pressure and/or high-temperature conditions is very important in various industries, and the safety and efficiency of industrial processes depend on it. Figure 6 presents the validity of the models over selected pressure and temperature ranges using group error plots. The group error analysis is performed by splitting all data into various ranges of pressure (0–5, 5–10, 10–15, 15–20, and 20–25 MPa) and temperature (210–294, 294–378, 378–462, 462–546, and 546–630 K) to investigate the validity of the proposed models across these important parameters (a short sketch of this binning procedure is given after Fig. 6). The AAPRE was calculated for each interval and plotted in Fig. 6a for pressure and Fig. 6b for temperature. As can be seen in Fig. 6, the LightGBM and MLP-LM models have relatively higher errors at low and high pressures and temperatures, and the CatBoost and AdaBoost-SVR models have relatively higher errors at low pressures and temperatures. The XGBoost model has the lowest error among all models across the different temperature and pressure conditions, which confirms the earlier findings on the good performance of this model.

Figure 6
figure 6

Graph of Group error for all models for various ranges of (a) pressure and (b) temperature.
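
The group-error calculation behind Fig. 6 can be sketched as below; the bin edges follow the ranges quoted above, while the binning itself (here via pandas) is an assumed implementation detail.

```python
import numpy as np
import pandas as pd

def grouped_aapre(variable, hs_exp, hs_pred, edges):
    """Bin the data by an operating variable (pressure or temperature) and
    compute the AAPRE of a model within each bin, as in Fig. 6."""
    hs_exp, hs_pred = np.asarray(hs_exp, float), np.asarray(hs_pred, float)
    aapre = 100.0 * np.abs((hs_exp - hs_pred) / hs_exp)
    bins = pd.cut(np.asarray(variable, float), bins=edges)
    return pd.Series(aapre).groupby(bins, observed=False).mean()

# Bin edges matching the ranges used in Fig. 6a (MPa) and Fig. 6b (K)
pressure_edges = [0, 5, 10, 15, 20, 25]
temperature_edges = [210, 294, 378, 462, 546, 630]
```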

Trend analysis

At the next stage, several analyses were performed to assess the performance of the XGBoost model for different hydrogen-hydrocarbon systems. First, the impact of pressure on hydrogen solubility in n-Decane at a high temperature of 432 K2 is evaluated in Fig. 7, which shows the hydrogen solubility values predicted by the XGBoost model for this system along with the values calculated by the EOSs. As indicated in this figure, at high-temperature conditions the deviation between the traditional RK EOS calculations and the experimental data is high, but the other EOSs and the XGBoost model predict the experimental data excellently. As expected, the solubility of hydrogen in n-Decane increases with increasing pressure. However, the cubic EOSs slightly over- or underestimate the increase in solubility with pressure at high temperatures, while the XGBoost model follows the trend very well. The PC-SAFT EOS also gives good predictions, with low deviation from the experimental data, and outperforms the traditional cubic EOSs.

Figure 7
figure 7

Estimated hydrogen solubility in n-Decane at a high-temperature of 432 K.

Next, the hydrogen solubility data in diphenylmethane76, a hydrocarbon with a molecular weight of 168.23 g/mol and a carbon number of 13, are predicted by the XGBoost model at high-temperature and high-pressure conditions (Fig. 8). Again, as depicted in Fig. 8, the XGBoost model correctly captures the data trends and provides excellent forecasts; the effect of increasing temperature, together with increasing pressure, on hydrogen solubility is correctly predicted by the XGBoost model.

Figure 8
figure 8

Experimental data with XGBoost model predictions of hydrogen solubility in diphenylmethane under different operating conditions.

As mentioned earlier, the solubility of hydrogen increases with increasing carbon number of the hydrocarbon2,7,8,9. Therefore, the predictions of the XGBoost model for the solubility of hydrogen in several hydrocarbons with different carbon numbers (decane, eicosane, octacosane, and hexatriacontane) at 373 K, which have been studied experimentally in the literature8, are presented in Fig. 9. Here too, the XGBoost estimates agree well with the reported experimental hydrogen solubility data for all these hydrocarbons.

Figure 9
figure 9

The solubility of hydrogen in several hydrocarbons with different carbon numbers for the XGBoost model with experimental data.

Conclusions

In this work, five robust machine learning models were introduced for estimating hydrogen solubility in hydrocarbons as a function of the critical pressure, critical temperature, and molecular weight of the solvents along with the operating pressure and temperature. A databank of 919 data points gathered from 48 different systems covering 26 hydrocarbons was applied to model hydrogen solubility. Implementing the XGBoost, CatBoost, LightGBM, AdaBoost-SVR, and MLP-LM techniques showed that the estimations of hydrogen solubility in hydrocarbons reached overall AAPRE values of 1.81%, 3.40%, 3.52%, 4.70%, and 6.01% for XGBoost, AdaBoost-SVR, LightGBM, CatBoost, and MLP-LM, respectively. Based on the graphical and statistical error analyses, XGBoost is introduced as the best model proposed in this work. Evaluation of the XGBoost model with an external validation dataset containing 413 hydrogen solubility data points in 18 different hydrocarbons, over a wide range of operating temperatures (98–701 K) and pressures (1.03–78.45 MPa), also proved its validity and reliability in predicting hydrogen solubility in hydrocarbons. Furthermore, the calculation of hydrogen solubility in hydrocarbons for several different systems with EOSs showed that PC-SAFT gives the best hydrogen solubility predictions among the EOSs considered, while the ZJ EOS outperformed the other cubic EOSs.