Introduction

Dew point pressure is needed to characterize wet gas and gas condensate reservoir fluids. Industry practitioners often measure the dew point (among other phase behavior properties) in laboratory experiments. In the absence of laboratory data, estimation models (correlations) are used to estimate the dew point with varying accuracy (discussed below). These models are based either on knowledge of the fluid composition or on surface fluid property data (e.g., GCR, API gravity of the stock tank oil, and reservoir temperature). For many years, the oil and gas industry has been actively developing tools to measure fluid composition and properties in situ and in real time. To the best of our knowledge, no simple dew point estimation model based on down-hole fluid composition measurements has been published other than the one we present here. Such a model allows rapid evaluation of the dew point pressure before it is measured in the laboratory with common techniques, and it has several applications.

Numerous models for predicting gas condensate dew point pressure have been derived from large databases in the literature. There are essentially two types of published models for estimating dew point pressure in gas condensate reservoirs. One type uses detailed compositional analysis that requires laboratory PVT experiments, while the other uses easily measured parameters from production tests and down-hole temperature as inputs. The models of Nemeth and Kennedy (1966), Elsharkawy (2001, 2011), Shokir (2008), Olds et al. (1944), and Godwin (2012) require detailed compositional analysis, whereas that of Marruffo et al. (2002) does not. Marruffo et al. (2002) used nonlinear regression to fit and build their model, applying statistical tools such as residual analyses and cross-plots. The developed model required information from production tests but did not require knowledge of the gas condensate composition. The original PVT data sample size they used was 148; after selection and validation, the database was reduced to 114 data points. Nemeth and Kennedy (1966) used 579 data points from 480 different hydrocarbon systems to develop a model that predicts dew point pressure with an average deviation of 7.4%. The model inputs are variables measured in the laboratory: composition in mole fraction for CH4 through C7H16, N2, CO2, and H2S, and the molecular weight and specific gravity of the heptane plus fraction. Their work is regularly cited by many of the more recent models. Elsharkawy (2001) developed a physically sound empirical method for predicting dew point pressure based on routinely measured gas analysis and reservoir temperature. In total, 340 measurements of dew point pressure were used, resulting in a model with an absolute average error of 7.68%. The model included the effect of variables such as temperature, gas condensate composition, and properties of the plus fraction on dew point pressure.
Elsharkawy (2001, 2011) discussed the two types of dew points applicable to hydrocarbon mixtures. The first dew point occurs when dry gas is compressed to the point that liquid starts to form. The second, called the retrograde or condensate dew point, occurs when a gas mixture containing heavy hydrocarbons in solution is depressurized until liquid forms. He presented models for dew point prediction based on gas composition and reservoir temperature; his database included 340 data points. Shokir (2008) used genetic programming to develop a model for dew point pressure prediction from 245 gas condensate systems. The developed model uses the full composition of the gas (CH4 through C7H16+, N2, CO2, and H2S mole fractions) in addition to the molecular weight of the heptanes plus fraction and reservoir temperature. He also tested his model against other published models. Olds et al. (1944) studied the behavior of six systematically chosen mixtures from the Paloma field over a range of conditions (temperatures from 100 to 250 °F and pressures up to 5000 psia). They studied the influence of pressure and temperature on the composition and the retrograde gas dew point, and used graphical analysis to obtain a chart correlating volumetric and phase behavior with the composition of the system and temperature. Godwin (2012) used data from the literature and developed a dew point estimation model based on gas composition, reservoir temperature, and properties of the heptanes plus fraction. A total of 259 out of 273 data points were selected to build the new model, and 14 data points were used for testing.

On the other hand, a literature review of the models that take easily measured field data as input finds the following: Humoud and Al-Marhoun (2001), Ovalle et al. (2005), and Al-Dhamen and Al-Marhoun (2011). Humoud and Al-Marhoun (2001) developed a model based on available field data from 74 PVT reports. They correlated the dew point pressure of a gas condensate fluid directly with its reservoir temperature, pseudo-reduced pressure and temperature, primary separator gas–oil ratio, primary separator pressure and temperature, and the relative densities of the separator gas and the heptanes plus fraction. The average error for this model was 4.33%. Ovalle et al. (2005) used readily available field data to calculate the dew point pressure. Their database contained 615 points. Their model is based on the initial producing gas condensate ratio from the first-stage separator, initial API gravity of the stock tank liquid, specific gravity of the initial reservoir gas, and reservoir temperature. Nonparametric approaches for estimating optimal transformations of the data were used to obtain the maximum correlation between observed variables. Al-Dhamen and Al-Marhoun (2011) developed a model to predict dew point pressure for gas condensate reservoirs using nonparametric approaches and artificial neural networks. Their results were based on 113 data samples obtained from constant mass expansion experiments on fluids from fields in the Middle East.

Down-hole fluid analysis

The process of obtaining real-time analysis of down-hole fluid characteristics has passed through many stages of development, culminating in today's down-hole fluid analysis (DFA). Fingerprinting in fluid characterization became an important topic, receiving wide attention for its direct application to improving the quality of fluid samples. Many tools (e.g., OFA, LFA, CFA, and IFA, introduced in 1991, 2001, 2003, and 2007, respectively) capable of detecting in situ variation of different fluids were developed over the years (Mullins et al. 2009; Elshahawi et al. 2007; Xian et al. 2006). These tools were developed to address several production engineering needs (e.g., sizing of facilities, well placement, completion equipment, and production prediction). According to Betancourt et al. (2004, 2007), the Composition Fluid Analyzer (CFA) is a tool with a sensor for performing fluorescence spectroscopy by measuring light emission in the green and red ranges of the spectrum after excitation with blue light. It was originally introduced to track phase transitions in gas condensate sampling. The In Situ Fluid Analyzer (IFA), based on optical absorption methods, can provide the mass fractions of three hydrocarbon molecular groups, CH4, C2H6–C5H12, and C6H14+, as well as CO2, in real time at down-hole conditions. It can also detect the dew point of a gas condensate (by temporarily dropping the sampling pressure below the saturation pressure of the fluid and observing the change in the fluorescence signal that occurs with dew formation at the dew point pressure).

With the development of down-hole optical fluid analyzer (DFA), more capabilities were added to down-hole fluid analysis. DFA has become an increasingly utilized technology in wireline logging as it enables fluid characterization by creating a down-hole fluid log (versus depth along the hydrocarbon column). In multi-well applications, DFA can help in addressing fluid distribution and variation inside the reservoir, and in identification of reservoir compartments. The basic outputs from DFA measurements are weight percentages of CO2, CH4, C2H6, C3H8, C5H12, and C6H14+, in addition to live fluid density.

Mullins et al. (2009) showed that in the case of large fluid compositional variations and compartmentalization, DFA can be used to help delineate these variations in a cost-effective manner. They introduced the example shown in Fig. 1 as an identifying fingerprint among different fluids. Analysis of the oil peak at a wavelength of 1700 nm gives the dissolved methane content; it can therefore be used to track density variations and discontinuities in fluid properties.

Fig. 1

Visible near-infrared spectra of oilfield fluids, after Mullins et al. 2009

In the form of an optimized wireline logging tool, DFA is used in the Gulf of Mexico and other areas of the world to detect hydrocarbon variations and reduce uncertainty in cases of varied compositions. Compartmentalization can also be detected by these tools (Betancourt et al. 2007).

With the increased application of these down-hole fluid analysis tools, valuable compositional information (for hydrocarbon groups in weight percent) becomes available to reservoir and production engineers. In this paper, we present a new dew point estimation model that is different from the other models available in the literature, as it is based on down-hole fluid analyzer data. The correlation is thus capable of predicting the dew point pressure for a wide range of wet gases and gas condensate fluids without the need for full laboratory compositional analysis, production data, or production test information.

Methodology

Fluids database

McCain (1994) characterized different fluid properties and introduced widely accepted criteria to differentiate among the five reservoir fluid types. According to McCain's criteria, we collected fluid data (covering a wide range of gas properties) from reservoirs located in different regions of the world (with around 17% of the data from the Middle East). Part of the database came from the literature, especially the data presented by Nemeth and Kennedy (1966), which have been used extensively in developing most of the available dew point pressure prediction models.

The database contained 667 complete (no missing values) laboratory gas condensate samples. We converted the data into the format of the output of down-hole fluid analyzer tools. We divided the data into two groups. The first group included the data for which complete laboratory analyses were performed (Table 1) and consisted of 99 complete samples. The second group consisted of the remaining 568 samples, which included compositional data and some basic information (Table 2). The full database included gas condensate samples with reservoir gas gravity ranging from 0.558 to 1.86, primary separator gas gravity of 0.56–1.42, stock tank liquid gravity of 37.0–67.7 °API, condensate gas ratio of 0.63–232 STB/MMscf, separator temperature from 19.9 to 176 °F, separator pressure of 33.20–2581.7 psia, C7+ specific gravity from 0.69 to 0.85, C7+ mole percent from 0 to 24.23, reservoir temperature of 143.8–347 °F, and dew point pressure of 1429–11,656 psia.

Table 1 Data ranges for the complete PVT experiments gas samples (99 samples)
Table 2 Data ranges for the composition available PVT gas samples (568 samples)

Development of a new empirical model

In developing the model, we considered only the 667 complete gas condensate samples in our database (no missing variable values). In the model to predict dew point pressure (the output or dependent variable), the following pool of independent (or input) variables was considered: temperature, CO2, CH4, C2H6, C3H8, C4H10, C5H12, and C6H14+ mole%. The model building procedure entailed the following steps.

  1. building a database for gas samples;

  2. making quality checks on the data samples;

  3. filtering the samples;

  4. converting mole% to weight% for all samples, based on the molecular weight of each component, to match the output of the down-hole fluid analyzer data;

  5. lumping the PVT data compositions back to emulate the down-hole fluid analyzer output compositions; and

  6. checking interrelationships among the variables and removing poor predictor variables (e.g., C2H6).
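Step 4 above can be illustrated with a short sketch. This is not the authors' code: the component list is illustrative, the molecular weights are standard values, and the C6H14+ value of 110 g/mol is a hypothetical placeholder (the plus-fraction molecular weight would come from the PVT report).

```python
# Molecular weights in g/mol; the C6H14+ value is a hypothetical placeholder.
MW = {"CO2": 44.01, "CH4": 16.04, "C2H6": 30.07,
      "C3H8": 44.10, "C4H10": 58.12, "C5H12": 72.15, "C6H14+": 110.0}

def mole_to_weight_pct(mole_pct):
    """Convert mole% to weight% via w_i = n_i * MW_i / sum_j(n_j * MW_j)."""
    mass = {c: mole_pct[c] * MW[c] for c in mole_pct}
    total = sum(mass.values())
    return {c: 100.0 * m / total for c, m in mass.items()}
```

Because the heavier components carry more mass per mole, the methane weight fraction comes out lower than its mole fraction, which is why the conversion matters before matching the analyzer output.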

Some exploratory analyses were carried out to find the best scale, transformations, and importance of the predictor variables (step 6 above). A moderate amount of correlation was found among most of the variables. These preliminary analyses led us to consider a multiple linear regression model with the following variables, to which we have assigned symbols in order to discuss the results more easily.

$$\begin{aligned} & y = \log \,({\text{Dew Point Pressure}}) \\ & x_{1} = {\text{temperature}} \\ & x_{2} = \log \,({\text{CO}}_{2} + 0.1) \\ & x_{3} = \log \,({\text{CH}}_{4} ) \\ & x_{4} = \log \,({\text{CH}}_{4} )^{2} \\ & x_{5} = \log \,({\text{C}}_{3} {\text{H}}_{8} - {\text{C}}_{5} {\text{H}}_{{12}} ) \\ & x_{6} = \log \,({\text{C}}_{6} {\text{H}}_{{14}} +) \\ \end{aligned}$$

The units of CO2, CH4, the C3H8–C5H12 group, and C6H14+ are weight percent, temperature is in degrees Fahrenheit, and pressure is in psia. C3H8–C5H12 denotes the weight percent of the group C3H8 through C5H12 (not a difference). All logarithm (log) values denote natural log (base e).
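The transformations above can be collected in a small helper. This is a sketch, not the authors' code; the function name and argument order are our own convention.

```python
import math

def predictors(temp_F, co2, ch4, c3_c5, c6plus):
    """Build the transformed predictors x1..x6 from temperature (deg F)
    and weight-percent compositions; all logs are natural (base e)."""
    x1 = temp_F
    x2 = math.log(co2 + 0.1)   # the 0.1 offset guards against CO2 = 0
    x3 = math.log(ch4)
    x4 = x3 ** 2               # x4 = log(CH4)^2
    x5 = math.log(c3_c5)       # grouped C3H8 through C5H12 weight percent
    x6 = math.log(c6plus)      # C6H14+ weight percent
    return x1, x2, x3, x4, x5, x6
```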

The scatter plots below the diagonal in the composite matrix plot in Fig. 2 give an idea of the pairwise relationships among the variables. The red line is a local linear smoother through the cloud of points. The blue values in the upper part of the matrix plot are the corresponding correlation coefficients between each pair of variables, with font size proportional to the absolute value of the correlation. Thus, the largest correlation (0.98) occurs between x3 and x4 (not surprising, since x4 = x3^2), and the smallest (−0.0018, too small to be visible) between x2 and x5. Note that x1 (temperature) is the most important predictor of pressure, since the two have a correlation coefficient of 0.50, while x3 is the least important (the correlation between y and x3 is 0.004).

Fig. 2

Scatter plot of variables used in the regression model

A multiple linear regression model was fitted via standard statistical methods to the database of n = 667 well samples, resulting in the model listed in Table 3 and shown in Eqs. (3) and (4). Three information criteria were used to select an appropriate model (AIC, AICc, BIC); however, the "best" model identified by each of these criteria is usually not the same (Burnham and Anderson 2002). The search involved considering models of the form:

$$y = \beta_{0} + \beta_{1} x_{1} + \cdots + \beta_{6} x_{6} + \beta_{12} x_{12} + \cdots + \beta_{56} x_{56} + \varepsilon$$
(1)

where ε is the usual residual noise term in a regression model. The βi are model parameters to be estimated from our data. This was done by searching across all possible combinations of variables and their pairwise interactions. For example, the variable x34 = x3·x4 denotes the product of x3 and x4, and is called the interaction between x3 and x4. This resulted in a pool of 21 potential predictors: the 6 single variables {x1,…,x6} plus 15 interaction terms {x12,…,x56}. With all combinations of 21 variables, the number of possible candidate models is 2^21 ≈ 2.1 million. (This can be understood by realizing that we have the option of whether or not to include each of the 21 predictors in the model.) The model space to be searched is thus extremely large. This search was made feasible by sophisticated statistical software, namely the R package glmulti (R Core Team 2016), which implements a genetic algorithm search over large model spaces.
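The size of this search space is easy to verify. The snippet below is illustrative only; the string labels for the predictors are our own convention.

```python
from itertools import combinations

# The pool: 6 main effects x1..x6 plus all pairwise interactions.
main_effects = [f"x{i}" for i in range(1, 7)]
interactions = [a + b[1] for a, b in combinations(main_effects, 2)]  # e.g. "x34"

n_predictors = len(main_effects) + len(interactions)  # 6 + 15 = 21
n_models = 2 ** n_predictors                          # each predictor in or out
```

With 21 in-or-out choices, `n_models` is 2,097,152, i.e., roughly the 2.1 million candidate models quoted above.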

Table 3 Coefficients of the chosen model

Basic measures of model goodness of fit are R2, the mean squared error (MSE), which is the estimate of the noise variance σ2, and the mean absolute relative error (MARE), defined as follows. If yi and ŷi are, respectively, the observed and the model-predicted values for the ith value of y, i = 1,…,n, then

$${\text{MARE}} = \frac{1}{n}\sum_{i=1}^{n} \left| {\frac{{y_{i} - \hat{y}_{i} }}{{y_{i} }}} \right|$$
(2)

Note that n is the sample size; in our case n = 667. It is well known that R2 (the proportion of variability in y explained by the model) will increase, and both MSE and MARE will decrease, as more variables are included in the model, regardless of whether those predictor variables are actually important. Thus, an over-parameterized model (too many predictors) will score very well on these measures in sample, but will do poorly out of sample. The use of model selection tools based on information criteria such as AIC, AICc, and BIC tends to avoid this over-fitting problem (Burnham and Anderson 2002).
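The MARE of Eq. (2) translates directly into code. A minimal sketch (the function name is our own):

```python
def mare(y, y_hat):
    """Mean absolute relative error, Eq. (2):
    (1/n) * sum over i of |y_i - yhat_i| / |y_i|."""
    n = len(y)
    return sum(abs(yi - yhi) / abs(yi) for yi, yhi in zip(y, y_hat)) / n
```

For example, predictions of 90 and 220 against observations of 100 and 200 each miss by 10% in relative terms, giving a MARE of 0.1.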

Performance of the chosen model

The coefficient estimates for the best model according to the AICc criterion are displayed in Table 3. For example, \(\beta_{0}\) = 19.11 and \(\beta_{1}\) = −0.0679. The standard error (Std. Error) column is an estimate of the variability of the estimate and can be used to assess the associated uncertainty via a formal hypothesis test. The P value in Table 3 is the result of testing whether the corresponding parameter equals 0. For example, for \(\beta_{0}\) the P value of 0.000 means that the estimate of \(\beta_{0}\) is significantly different from 0, whereas the P value of 0.908 means that the estimate of \(\beta_{2}\) is not significantly different from 0. The other commonly used criteria (AIC, BIC) arrived at models very similar to this one.

$$P_{d} = e^{x}$$
(3)
$$\begin{aligned} x = 19.1109840 - 0.067916 \times {\text{temperature}} - 0.0162705 \times \log \left( {{\text{CO}}_{2} + 0.1} \right) - 6.6190184 \hfill \\ \times \log \left( {{\text{CH}}_{4} } \right) + 0.5104139 \times \log \left( {{\text{CH}}_{4} } \right)^{2} + 1.1398989\log \left( {{\text{C}}_{3} {\text{H}}_{8} - {\text{C}}_{5} {\text{H}}_{12} } \right) \hfill \\ + 0.6263451 \times \log \left( {{\text{C}}_{6} {\text{H}}_{14} + } \right) + 0.0371260 \times {\text{temperature}} \times \log \left( {{\text{CH}}_{4} } \right) - 0.0048367 \hfill \\ \times {\text{temperature}} \times \log \left( {{\text{CH}}_{4} } \right)^{2} + 0.0573708 \times \log \left( {{\text{CO}}_{2} + 0.1} \right) \times \log \left( {{\text{CH}}_{4} } \right) \hfill \\ - 0.0565329 \times \log \left( {{\text{CO}}_{2} + 0.1} \right) \times \log \left( {{\text{C}}_{6} {\text{H}}_{14} + } \right) + 0.0794272 \times \log \left( {{\text{CH}}_{4} } \right) \hfill \\ \times \log \left( {{\text{CH}}_{4} } \right)^{2} - 0.1985207 \times \log \left( {{\text{CH}}_{4} } \right) \times \log \left( {{\text{C}}_{3} {\text{H}}_{8} - {\text{C}}_{5} {\text{H}}_{12} } \right) - 0.1334765 \hfill \\ \times \log \left( {{\text{C}}_{3} {\text{H}}_{8} - {\text{C}}_{5} {\text{H}}_{12} } \right) \times \log \left( {{\text{C}}_{6} {\text{H}}_{14} + } \right) \hfill \\ \end{aligned}$$
(4)
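Equations (3) and (4) can be implemented directly as a short function. The coefficients below transcribe Eq. (4); the function name and interface (temperature in °F, compositions in weight percent, as defined earlier) are our own.

```python
import math

def dew_point_pressure(temp_F, co2, ch4, c3_c5, c6plus):
    """Estimate dew point pressure (psia) per Eqs. (3)-(4) from temperature
    (deg F) and weight-percent compositions from a down-hole fluid analyzer."""
    l2 = math.log(co2 + 0.1)   # log(CO2 + 0.1)
    l3 = math.log(ch4)         # log(CH4)
    l4 = l3 ** 2               # log(CH4)^2
    l5 = math.log(c3_c5)       # log of grouped C3H8-C5H12 weight percent
    l6 = math.log(c6plus)      # log(C6H14+)
    x = (19.1109840
         - 0.067916 * temp_F
         - 0.0162705 * l2
         - 6.6190184 * l3
         + 0.5104139 * l4
         + 1.1398989 * l5
         + 0.6263451 * l6
         + 0.0371260 * temp_F * l3
         - 0.0048367 * temp_F * l4
         + 0.0573708 * l2 * l3
         - 0.0565329 * l2 * l6
         + 0.0794272 * l3 * l4
         - 0.1985207 * l3 * l5
         - 0.1334765 * l5 * l6)
    return math.exp(x)
```

For a hypothetical fluid at 200 °F with 1% CO2, 70% CH4, 15% C3H8–C5H12, and 10% C6H14+ (all weight percent), the function returns a dew point pressure within the database range quoted above.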

This model has R2 = 0.54, MSE = 0.23, and MARE = 0.0209 (approximately 2%) on the transformed log(pressure) scale; measured on the original pressure scale, however, the MARE increases to about 17%. Standard diagnostic analysis shows that this model fits well and that the normality assumption on ε is reasonable (see Fig. 3). Figure 3a plots the model-predicted versus observed values of pressure and shows generally close agreement. Figure 3b shows essentially the same information, but along a horizontal line, where the vertical axis is now the difference between observed and predicted values (the residuals). Figure 3c shows a standard diagnostic used to identify data points that are not well fit by the model; it does not indicate the presence of any overly problematic points in this case. Figure 3d shows a graphical summary (boxplot) of all 667 absolute relative errors (AREs). The vertical line inside the box, located at around 0.13, signifies that the median ARE is 13%. Overall, Fig. 3 suggests that the model provides a good fit to the data.

Fig. 3

Diagnostic analysis of the chosen model

Validation of the chosen model

The kind of modeling problem at hand is termed supervised learning in machine learning terminology, a field that has seen an explosion of activity in the last two decades. The most successful and theoretically sound approaches to this problem have been compiled by Hastie et al. (2009). They span the gamut of statistical methods from high bias/low variance, e.g., linear regression, principal components (PCA), partial least squares (PLS), and the least absolute shrinkage and selection operator (LASSO), to low bias/high variance, e.g., splines, local smoothing, and neural networks. Roughly in the middle of this bias/variance trade-off, one finds regression tree-based models and their extensions (bagging, boosting, random forests) to be some of the best predictive methods on a variety of problems.

For the data set at hand, sparsity-seeking and shrinkage-inducing methods such as PCA, PLS, and LASSO are not really appropriate, given the small number of predictors involved (only 6). More important is capturing complex nonlinear relationships with the output variable (dew point pressure) and interactions among the predictors. Thus, to ensure we considered all the best possible models, we compared a variety of methods, restricting our attention to the following 4 classes: (1) linear regression with up to two-way interactions and all-subsets search using a consistent information criterion such as BIC; (2) regression trees and the computationally intensive resampling-based extensions such as bagging, boosting, and random forests; (3) generalized additive models with individual predictor functions estimated via splines and local smoothers; and (4) feed-forward neural networks with a single hidden layer. Details of these methods can be found in Hastie et al. (2009). (Note that method 1 was the strategy used to arrive at the chosen model in Table 3.)

To determine which of these methods should actually be employed, we used the tried and tested paradigm of K-fold cross-validation, with the best general recommendation at present being K = 5 or K = 10 (Hastie et al. 2009). With fivefold cross-validation, we randomly split the data into five equal portions (folds), use four folds to train the model, and use the remaining fold to test. The absolute relative error (ARE) measure described above was used to evaluate the predictive ability of a given model. Thus, for any given training/test combination, approximately 530 data points are used to fit the model and predict the remaining 130 points. Any decisions and selection of tuning parameters pertaining to a given candidate model were made on a case-by-case basis for each of the five folds, using the training/test set paradigm. The absolute difference between the observed and predicted values of pressure at the test points is then divided by the observed value, yielding one ARE value per test point. This exercise was repeated for each of the five folds, so that each method yields 667 ARE values. Only the best-performing model in each of the 4 classes described above was considered. Figure 4 displays a statistical summary (boxplot) of log base 10 of the 667 ARE values pertaining to each of these 4 optimal models. (A boxplot extends approximately from the minimum value to the maximum value, with a box around the middle 50% of the data.)
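The cross-validation loop just described can be sketched for the simplest of the 4 classes, an ordinary least squares model. This is an illustration on synthetic data, not the authors' code; it assumes NumPy and returns one ARE value per data point, as in the text.

```python
import numpy as np

def kfold_are(X, y, k=5, seed=0):
    """K-fold cross-validation of an ordinary least squares model with
    intercept, returning one absolute relative error (ARE) per data point."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    are = np.empty(len(y))
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        Xtr = np.column_stack([np.ones(len(train)), X[train]])
        Xte = np.column_stack([np.ones(len(test)), X[test]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        pred = Xte @ beta
        are[test] = np.abs(y[test] - pred) / np.abs(y[test])
    return are
```

Each point is held out exactly once, so the function yields as many ARE values as there are samples, matching the 667 values summarized in Fig. 4.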

Fig. 4

Boxplots of log base 10 ARE values for each of the 4 classes of models used to validate the chosen model

We see that all methods perform similarly, with the random forest model (an ensemble of regression trees) having a slight edge and linear regression a close second. However, whereas it is straightforward to write down an equation for the linear regression model, this is infeasible for the random forest model, since it is a combination of thousands of trees, each tree being a sequence of yes/no questions about the predictors that must be answered sequentially to arrive at the predicted value. For this reason, and because the difference in predictive ability between the two models is small, we have chosen to report only the linear regression model in this paper. We do recommend, however, that any future work carefully consider regression trees.

Results and discussion

As already mentioned, there are two types of published correlations for dew point pressure prediction: some use detailed compositional data, while others use relatively easily measured parameters from production tests and fluid data as inputs. A review of the literature on dew point prediction models revealed the nine main published correlations listed in Table 4.

Table 4 Published dew point pressure correlations

Table 5 lists the main input parameters and the number of PVT data points used in developing each of the published dew point pressure correlations. All the published models use fluid data and production parameters based on surface values, while the new model presented here uses down-hole data. The model was used to predict the 99 samples in Table 1 that were left out of the model building procedure. Recall that our model uses down-hole data, whereas the other available models cannot accept these types of input variables. This makes fast dew point pressure estimation possible in the field, while sampling the fluid and before going to the laboratory.

Table 5 Published correlations and their required input parameters

The value of the new model proposed here lies in its simplicity and relative accuracy for the database used in this work. More importantly, it is based on down-hole fluid data that are becoming more available in today's applications of fluid sampling and fluid characterization. Unlike many of the available dew point models (correlations), the new model does not require information obtainable only from laboratory or production test data. Therefore, the dew point can be estimated before a fluid sample is even taken from the reservoir. Also, since C2H6 is absent from the model as a predictor variable, the output of earlier versions of down-hole analyzer tools (which do not estimate the amount of C2H6) can be used to predict the dew point pressure.

It is envisioned that the dew point calculated from this model could be used in several applications. First, it can be employed as a form of quality control to ensure that the sampling procedure takes fluid samples with down-hole pressure above the dew point pressure (for more accurate sampling). This can serve as a confirmation of the current operational procedure of establishing the dew point (by pumping out until liquid appears and is sensed by the tool sensors), and is particularly useful for low condensate gas ratio fluids. Second, it provides a quick estimate of dew point pressure that can help in any further estimation of the phase behavior of gas wells for reservoir and production engineering applications. Third, in cases with extensive down-hole data for multiple wells in the same reservoir, the calculated dew point pressure can be used to quality control the down-hole data: the trend of the calculated dew point pressure is checked to see whether it follows the expected increase with depth. This estimation can also be used to confirm reservoir compartmentalization.

Conclusions and recommendations

The objective of this paper was to introduce a dew point pressure correlation based on down-hole fluid analysis data. We used the outputs of an existing industry tool to guide the selection of the model inputs. Put simply, we introduce a quick model that the industry can use to estimate dew point pressure from simply measured data. Our study proceeded as follows.

  • Conducting an extensive literature review to identify all dew point pressure estimation models, classifying them into two groups, comparing the performance of each, and suggesting which performs better based on our extensive database.

  • Building a model based on a small group of informative predictor variables that are measured by down-hole fluid analyzer tools, after deleting non-effective variables such as C2H6 from the pool of independent variables.

  • Testing and validating the model based on randomly selected data sets from our database.

A single best linear regression model that includes pairwise interactions was arrived at for the well data by using a sophisticated statistical model selection criterion (AICc). We believe the proposed model is the best of its kind in the industry today. A comparison of our proposed model against published ones (although the published models are based on surface data while our new model is based on down-hole data) shows similar accuracy in predicting dew point pressure values. As a final recommendation, more refined models could be proposed in future work, taking into account the collection of more data.