Saybolt color prediction for condensates and light crude oils

Saybolt color determination is one of the techniques used to evaluate the quality of petroleum products as an indicator of the degree of refinement. As color is a property readily observed by operators, conventional procedures require operators to determine Saybolt color either through direct visual observation or through Saybolt chromometers. These methods are subjective due to the variability in perception of colors across different observers and may be influenced by external factors such as the level of illuminance. Digital oil color analyzers, on the other hand, cost almost four times as much as Saybolt chromometers. An alternative approach to color measurement is to develop a correlation model between Saybolt color with the physical and chemical properties of condensates and light crude oils from Malaysian oil and gas fields. This work applies several multiple linear regression techniques (such as stepwise regression) performed both manually and using the R software (version 3.6.1) to obtain statistically significant results. The step, regsubsets and glmulti functions from R are explored to develop the correlation model which predicts Saybolt color using only identified key properties, overcoming the possible drawbacks associated with conventional laboratory analysis. The models developed through these different techniques are analyzed and compared based on criteria indicated through the coefficient of multiple determination, R2 and F-tests to infer on suitable regression approaches. Results obtained from these regression methods for models with and without interaction terms report deviations of less than 5% for 75% of the samples used for validation.


Introduction
Color observations of petroleum products are standardized through two international standards developed by the American Society for Testing and Materials (ATSM), namely ASTM D 156 and ASTM D 1500. The two standards cover different ranges of color. Highly refined petroleum products use the ASTM D 156 scale, also known as the Saybolt color scale which ranges from −16 (darkest) to +30 (lightest) (ASTM International 2003). For colors darker than −16 of the ASTM D 156 scale, the ASTM D 1500 scale is used, ranging from 0.5 (lightest) to 8 (darkest) (ASTM International 2008). Petroleum products for which colors fall outside of the established range are deemed contaminated. Conventional ways of measuring color are through direct and indirect visual observation. Direct observation involves comparing the color of oil samples directly with color standards, whereas indirect observation utilizes chromometers (Khor et al. 2020).
The measurement of Saybolt color using a Saybolt chromometer is carried out in the presence of a constant light source. The oil level in the sample tube is adjusted in a way so that the short-wavelength (violet) portion of the light energy reaching the eye is equal to that passing through the standard disk and the empty tube. Since surface tension, refractive index and specific dispersion of oil determine the angle at which light hits the wall from the oil surface, these attributes directly affect Saybolt color (Diller et al. 1943).
The American Petroleum Institute (API) gravity expresses the density of petroleum liquids in comparison to water where high API gravity represents low density. While condensates and light oils have low viscosity and high API gravity, degraded oils are heavier and more viscous. As crude oils get heavier, API gravity decreases, the absorption spectra moves to the red region, and fluorescence emissions become weaker (Hagemann and Hollerbach 1986). Furthermore, different types of hydrocarbons behave differently: Aromatics absorb visible or near-infrared light, while aliphatic compounds are only excited by high ultraviolet light. Hence, light hydrocarbons are colorless as they do not absorb light in the visible spectrum. Heavier or degraded crude oils with high concentrations of complex aromatic molecules are distinctively darker since they absorb light effectively in the visible light region (Steffens et al. 2011).
Regression modeling has been applied in the petroleum industry to develop correlations and pose models to predict physical properties; see, for example, Tomren and Barth (2014) and Douglas et al. (2018). The former work (Tomren and Barth 2014) involves formulating partial least squares calibration models to estimate properties such as viscosity, acid number and asphaltene content of crude oils and condensates based on information from gas chromatography (GC) and infrared spectroscopy (IR). However, the applicability range of the models might be limited and not readily extended to a wide range of petroleum sources. The latter work (Douglas et al. 2018) aims to predict hydrocarbon concentrations in contaminated soil in which different regression techniques are compared. Due to nonlinearity of soil spectral responses, higher prediction accuracy is observed using the random forest machine learning technique compared to partial least squares regression.
A main contribution of this work is to develop a Saybolt color correlation model for devising a fast and potentially cost-efficient method of estimating the color compared to laboratory-based measurements of the same. To the best of our knowledge, there is no correlation model developed for an automated determination of Saybolt color since the practice remains dependent on laboratory analysis. Arguably, the novelty of the paper lies in its attempt to develop such an empirical correlation model for Saybolt color to support measurement of this physical property as a standard quality indicator in the oil and gas industry. This color property has become more prominent in recent years due to increased interest in petroleum condensates as refinery feedstock resulting from shale gas extraction activities (IHS Markit 2018). With previous studies reporting the correlation between color with petroleum product properties, this work aims to demonstrate that regression modeling and analysis can be used to develop such a correlation model for predictive purpose.

Problem statement
It is reported that direct visual observations used for color determination of petroleum are highly subjective due to the variability in color perception across different observers (Rodriguez et al. 2017). On the other hand, measurements using Saybolt chromometers are affected by environmental factors: Varying illuminance levels can be obtained from different light sources such as fluorescent lamps and halogen lamps. Moreover, compounds such as olefins in crude oils and condensates are prone to oxidation, thus resulting in darkening and aging of samples which affect Saybolt color analysis (Rodriguez et al. 2017;Speight 2015). Digital oil color analyzers, on the other hand, cost almost four times as much as Saybolt chromometers (Clarkson Laboratory and Supply Inc 2019; IndiaMart 2019). Hence, an alternative approach is to rely on mathematical models for determining Saybolt color.

Best subset regression methodology
The aim of this paper is to develop best regression models for Saybolt color based on four properties of oil samples, namely (1) refractive index (R), (2) density (D), (3) kinematic viscosity at 75 °C (V 1 ) and (4) kinematic viscosity at 100 °C (V 2 ). To achieve this, we utilize the methodology of best subset regression where multiple statistical hypotheses tests are performed either to add or to remove regressor terms from a full model. The full model consists of all possible regressor terms constructed from considering all possible powers and interactions (i.e., main-and higher-order interactions) among the original four attributes {R, D, V 1 , V 2 }. The general mathematical formulation of a full model based on a response variable y and m = 4 explanatory variables, x 1 , x 2 , ⋯, x m , is given by: where y is the observed Saybolt color; x p i is the regressor i (i ∈ {R, D, V 1 , V 2 }) raised to the power p with 1 ≤ p ≤ M where M is the highest power considered; x q j is similar to x p i with j ∈ {R, D, V 1 , V 2 } and q such that 1 ≤ q ≤ M; β 0 is the constant intercept term; β ip and β ijpq are, respectively, the regression coefficient corresponding to x p i and x p i x q j ; and ε is the random error assumed to arise from a normal distribution with mean zero and constant, but unknown variance σ 2 . Best subset regression analysis is performed using a dataset of oil samples of size n, We develop two main full models in this paper, both with M = 2, without and with pairwise interactions between the explanatory variables.
To arrive at the best subset regression model, we implement the stepwise regression technique which adds or removes regressors from the current model one at a time (Montgomery et al. 2012) and tests for the significance of (1) the added/removed term. The technique can be classified into forward selection, backward elimination and bidirectional elimination methods (Rawlings et al. 2006). In forward selection, the initial model starts with zero regressors. Subsequently, regressor terms from the full model in Eq.
(1) are fitted into the current model, and the regressor with the best correlation with Saybolt color is selected for inclusion in the current regression model. The forward selection procedural flowchart is shown in Fig. 1. Backward elimination works in the opposite direction where a regressor is removed from the full model (1) if the corresponding test of significance for this regressor falls below a pre-specified threshold as shown in Fig. 2. A combination of the forward and backward methods constitutes the bidirectional elimination method. To perform statistical tests on the added/ removed terms, two different F-tests, namely the partial and overall F-tests, are used to evaluate the significance (Draper and Smith 1998). The flowcharts of procedures for both bidirectional elimination methods are presented in Figs. 3 and 4.
We perform the forward, backward and bidirectional elimination methods by manually creating columns in Excel that represent all the regressor terms in Eq. (1). The second approach that we implement is more automated and similar hypotheses tests are carried out via functions accessed   (Lumley 2014). Another set of functions (or classes) used in this work is from the glmulti package which automatically considers all possible generalized linear models arising from all possible subsets of a given collection of regressors from the full model. As an exhaustive screening tool, glmulti ranks the subset regression models according to a specific information criterion: It gives three choices, namely AIC, Bayesian information criterion (BIC) and corrected AIC (AICc). The first-ranked (i.e., highest ranked) model has the lowest value of such information criterion (Calcagno and de Mazancourt 2010).

Regression modeling based on physical properties
The four properties of refractive index (R), density (D), kinematic viscosity at 75 °C (V 1 ) and kinematic viscosity at 100 °C (V 2 ), as well as Saybolt color measurements, are obtained from assay reports for the whole (i.e., bulk) and product fractions (i.e., cuts) of condensates and light crude oils from Malaysian oil and gas fields located mainly in offshore Sabah and Sarawak (e.g., of the types named Kimanis, Marjoram, Bintulu and Kawasari). The dataset consists of n = 15 samples. The scatterplots obtained for analyzing pairwise variable relationships, as shown in Fig. 7   (3) parameters has zero residual degrees of freedom and hence cannot be tested for statistical significance. We consider the full model in (3) only as a strict upper bound for all pairwise interaction terms to be considered in the subset regression models. We find that the best regression model typically   includes only one or two of the pairwise interaction terms this model. The stepwise regression results are summarized in Table 1 for both types of full models (2) and (3). Note that in the case of the latter model, a zero degree of freedom does not permit backward elimination to be applied (Hu 2016). For both types of full models (2) and (3), the overall F-statistic and adjusted R 2 values obtained using the forward selection procedure are higher than those obtained using the other techniques. This outcome is supported by Berk (1978) and Dempster et al. (1977): Forward selection produces model subsets with smaller residual variances compared to backward and bidirectional eliminations because only regressors which improve the model significantly are added. In the case where sample size is small, backward elimination which starts by fitting all candidate regressors into a model can result in overfitting with huge rounding errors (Draper and Smith 1998).
Best subset regression results using the R functions step and regsubsets are summarized in Table 2. Backward elimination using step for the full model (2) gives a better subset regression model (see Table 2) compared to manual backward elimination (see corresponding entry in Table 1) as indicated by a higher adjusted R 2 and lower p value. The same result is obtained using regsubsets. In this case, the initial model produced by regsubsets gives an adjusted R 2 value of 0.8177, but after the removal of insignificant regressors through partial F-tests, the adjusted R 2 value drops to 0.6139. This poses a problem when using regsubsets: There is no guarantee of obtaining a model with all statistically significant terms despite a high adjusted R 2 value (Kassambara 2018). As for the full model (3), regsubsets displays the same problem where the adjusted R 2 value drops from 0.9303 to 0.6050 after removing statistically insignificant terms.
Comparing all entries in Tables 1 and 2, the best subset regression model based on the full model (2) is obtained via forward selection either performed manually or using the step function. The explicit model fit is given by: This best subset regression model has an adjusted R 2 value of 0.7135 with an F-statistic value of 18.43 and a corresponding p value of 0.00022.
As an extension of the R functions utilized earlier, the glmulti package enables outputs in the form of multi-model results in which models are ranked according to specific information criteria such as AIC, BIC and AICc. This work considers only applying the glmulti function to fitting regressor terms from the full models (2) and (3), i.e., without and with interaction terms, respectively. Models without pairwise interaction are obtained using the first-level iterations where we report the top-five models ranked according to    Tables 3,  4 and 5, respectively. We find that the best subset regression model obtained previously using forward selection in terms of regressors R 2 and V 2 2 (as shown in Eq. (4)) is returned as the top-two model via the AICc criterion (refer to Table 5). AICc implements a correction factor on AIC to prevent overfitting for small sample sizes (Burnham and Anderson 2002;Cavanaugh 1997); hence, it is a more appropriate measure compared to AIC in this case. Comparing all three criteria, the best model developed is found as the first-ranked model via AIC and the second-ranked model via BIC. These two models are similar with the highest adjusted R 2 value of 0.8177 and corresponding p value of 0.00148, and they are obtained as: We then proceed with the second-level glmulti iterations using the main effects obtained from the best-ranked model determined using AIC, BIC and AICc to check whether further model improvement can be made. Since the secondlevel iterations include pairwise interaction terms, they do not converge for the best-ranked model obtained via AIC (which is a model with six main effects, namely R, R 2 , D 2 , V 2 1 , V 2 and V 2 2 ) as additional degrees of freedom do not exist for hypothesis testing once interaction effects are included. Results for the second-level iterations based on BIC and AICc are presented in Table 6 and Table 7 in which both (5) S = −2.203 × 10 4 + 2.994 × 10 4 R − 1.022 × 10 4 R 2 + 1.030 × 10 −4 D 2 − 3.298V 2 1 + 1.504 × 10 2 V 2 − 7.825 × 10V 2 2 .
criteria develop models with the same main effects, namely R, R 2 and V 2 2 , as shown in Tables 4 and 5, respectively. From these results, the best-ranked model has a highest adjusted R 2 of 0.8160 with main effects of R andV 2 2 and interaction effects ofV 2 2 R and V 2 2 R 2 . Despite the higher adjusted R 2 as compared to the first-level iterations, the glmulti function fits interaction terms into the model without guaranteeing the inclusion of their corresponding main effects, thus violating the model hierarchy or heredity that prescribes including interaction terms in a model only if all corresponding main effect terms are present (Coulton and Chow   1993;Wang et al. 2010;Bien et al. 2013). In this case, no R 2 main effect is included in the model. Disregarding this model, the only model with inclusion of interaction effects with their corresponding main effects for both criteria is the sixth-ranked model (not displayed here), which is the same model as the best-ranked model from the first-level iterations that we initialized the second-level iterations with. This means that including pairwise interaction into the model does not improve the strength of its correlation if we abide by the hierarchy of regression models in which main effect terms must be present or are added in tandem with interaction terms.
Reassessing the best-ranked model obtained from the first-level iterations as presented in Eq. (5), the same problem of substantially decreased adjusted R 2 value (to 0.6139) occurs (as seen using regsubsets when partial F-test is performed to remove insignificant regressors). To ensure that we select only models with statistically significant regressors, we eliminate insignificant regressors based on partial F-test using the ANOVA function on all the top-five models developed based on each of the three ICs. The results are summarized in Tables 8, 9 and 10.
The removal of insignificant terms from the initial models developed using glmulti results in several repeated models. For instance, based on the AIC, only two models are obtained from the initial top-five models. After the removal of insignificant terms, the first-ranked model developed via AIC with an adjusted R 2 of 0.6139 is similar to the model obtained from step and regsubsets for the second-order model without pairwise interaction (see Table 2). This outcome shows the consistency of results obtained from glmulti with the other functions available in R such as step and regsubsets when the same underlying principle (i.e., AIC) is used.
Comparing all models with insignificant terms removed, the model with the highest adjusted R 2 of 0.7710 is obtained from the fourth-ranked model based on BIC and third-ranked model based on AICc with a lowest p value of 0.00021. This model has the following correlation: Based on the coefficient magnitudes, the most influential regressor is determined to be the square of kinematic viscosity at 100 °C (i.e., V 2 2 term) for all significance levels greater than 0.00021 (which include typically reported levels such as 1%, 5% and 10%). This model developed via glmulti has the highest adjusted R 2 compared to all other regression techniques. Note that this model is justified in comparison with model (4) although the R term is not included since it was mentioned earlier that R and D exhibit a high degree of collinearity. (Hence, only one of them needs to be incorporated in a model.)

Regression modeling based on physical and chemical properties
Similar steps are repeated for the development of model using physical and chemical properties where two additional chemical properties of condensates and light crude oils, namely sulfur content, Su, and total acid number, T, are included. The preliminary step of plotting scatterplots as shown in Fig. 7 to represent the relationship between Saybolt color with all six physical and chemical properties of condensates and light crude oils again shows an absence of linear relationship between these properties, hence higherorder powers and interaction terms ought to be considered.
(7) S = 0 + 1 R + 2 D + 3 V 1 + 4 V 2 + 5 Su + 6 T + 7 R 2 + 8 D 2 + 9 V 2 1 + 10 V 2 2 + 11 Su 2 + 12 T 2 + , while the full model with all pairwise interaction terms has the following explicit form: Comparing full models (8) and (3), the former has 28 unknown parameters, a significant increase from the 15 parameters in the latter. Due to a less than zero residual (8) 1 + 10 V 2 2 + 11 Su 2 + 12 T 2 + 13 RD + 14 RV 1 + 15 RV 2 + 16 RSu + 17 RT + 18 DV 1 + 19 DV 2 + 20 DSu degree of freedom, performing backward elimination is not possible, but other techniques are still available provided that the model developed has a maximum of 15 terms including the intercept. We first perform stepwise regression manually and utilize R functions to obtain the best model for both variants as summarized in Tables 11 and  12, respectively.
For models developed using stepwise regression, similar trends are observed where models fitted using forward selection have higher adjusted R 2 and lower p values than those developed using bidirectional elimination. When R functions are utilized, regsubsets performs better for models without and with pairwise interaction where the adjusted R 2 for the latter goes as high as 0.9968, significantly higher than all the other models developed using only physical properties.
This model with an F-statistic of 549.6 and corresponding p value of 0.0000 is explicitly expressed as: However, one drawback from this model is that the interaction terms are fitted without their main effects. Hence, if the hierarchy of a model is to be abided, the best subset regression model would be the model developed by regsubsets based on full model (7) without pairwise interaction with an adjusted R 2 of 0.8490, overall F-statistic of 27.25 and corresponding p value of 0.00002 as given by the following: Similarly, we reproduce results using the glmulti package with the inclusion of chemical properties and find that model (10) is returned as the top-fourth model via AICc as shown in Table 13, further supporting the suitability of AICc for small sample size. Regressors for the top-five models developed using AIC and BIC are shown in Tables 14 and  15, respectively. Comparing all three criteria, the best model developed is found as the topmost model via the AIC and BIC with the highest adjusted R 2 of 0.9290, overall F-statistic of 21.36 and corresponding p value of 0.00179 expressed in the following correlation: We then proceed with the second-level glmulti iterations using the main effects obtained from the best-ranked model via AIC, BIC and AICc to observe for any improvement in models. Since the second-level iterations include pairwise interaction terms, they do not converge for the topmost first-level model obtained via AIC and BIC with nine main effects, namely R, D 2 , V 1 ,V 2 1 , V 2 2 , Su, Su 2 , T and T 2 (see (9) lS = 4.766 × 10 2 − 4.977 × 10V 2 − 4.659Su − 6.175× 10 −4 D 2 − 9.009 × 10 2 RV 1 + 5.978RSu + 1.612DV 1 − 4.898 × 10 −3 DSu − 2.557 × 10 2 V 1 T.
From Table 16, the topmost model has a highest adjusted R 2 of 0.8558 with main effect of R 2 as well as interaction effect ofTV 2 2 . However, due to the absence of main effect T and V 2 2 which violates model hierarchy, we reassess the best model obtained from first-level iterations as presented in Eq. (11). Performing partial F-test to remove insignificant regressors from the model results in a significant drop in adjusted R 2 value from 0.9290 to 0.8354. Similar partial F-test is conducted on all models developed using first-level iterations to ensure that we select only models with statistically significant regressors. Note that all first-level models developed on AICc criterion have regressors that are already statistically significant; thus, they do not require subsequent F-tests. The results are summarized in Tables 17 and 18 for AIC and BIC, respectively.
Comparing all models with insignificant terms removed, the model with the highest adjusted R 2 of 0.8510 (see Table 16 Regressors for models developed based on AICc using second-level iteration of glmulti for physical and chemical properties   Table 13) is obtained as the topmost model based on AICc with the following correlation: Based on the coefficient magnitudes, the most influential regressor is determined to be the total acid number (i.e., T term) for all significance levels greater than 0.00002 (which include typically reported levels such as 1%, 5% and 10%).
This model developed via glmulti has the highest adjusted R 2 compared to all other regression techniques and is slightly better than model (10) in which both the main effects of R and V 2 are raised to a second order. This implies that the inclusion of chemical properties into Saybolt color correlation improves the fit of the model with an increase in adjusted R 2 from 0.7710 to 0.8510.  Obtained using first-level iteration of glmulti based on the BIC, this model has the lowest BIC compared to all other models. Although the adjusted R 2 is the highest, the overall F-statistic still falls behind models developed using forward selection Lower-order model with physical and chemical properties S = 199.8 − 73.89R 2 − 9.212V 2 2 − 268.8T (Eq. 10) Obtained using first-level iteration of glmulti based on AICc, this model has the highest overall F-statistic and adjusted R 2 compared to all other first-level iteration models developed based the three ICs as well as models developed manually and using R functions

Model validation
Model validation is performed on the developed correlation model by comparing its predicted Saybolt color values against those measured using conventional method (i.e., laboratory analysis). The latter (data from conventional measurements) are obtained from other assay reports for which the data are not used in developing (i.e., training) the model. We conduct the validation for 20 selected samples and present result on deviations between the actual (from assay reports) and predicted values by our proposed regression model given by Eq. (12). As shown in Fig. 8, the deviations are largely less than 5% except for five samples (but which yet do not exceed 10%) with a mean of 2.9%. The validation indicates a prediction error of less than 5% for 75% of the samples, which is acceptable (Simpson et al. 2004).

Concluding remarks
Based on the models developed using various regression approaches for this class of machine learning problems arising in prediction of petroleum properties, the best models representing different model structures are summarized in Table 19. As proposed by Draper and Smith (1998), no regression technique is the best, especially when there are constraints in terms of the patterns of the data as well as the practicality of the problem. Choosing the right regression approach would depend on the type of model that we aim to develop. If the significance of regressors in a model is an important factor, forward selection ensures that only significant regressors are added to the model, but this method may result in lower adjusted R 2 values. Hence, if we want to develop highly correlated models with high adjusted R 2 values, we can opt for the functions regsubsets (in leaps package) and glmulti (in glmulti package) which have been shown to give consistent results with glmulti offering the capability of providing multi-model inferencing. Since both functions are susceptible to overfitting in which redundant regressors may be present in the model, partial F-tests can be performed to remove insignificant terms. But it is important to note that the adjusted R 2 values typically drop with elimination of insignificant terms.