Introduction

Color observations of petroleum products are standardized through two international standards developed by the American Society for Testing and Materials (ATSM), namely ASTM D 156 and ASTM D 1500. The two standards cover different ranges of color. Highly refined petroleum products use the ASTM D 156 scale, also known as the Saybolt color scale which ranges from −16 (darkest) to +30 (lightest) (ASTM International 2003). For colors darker than −16 of the ASTM D 156 scale, the ASTM D 1500 scale is used, ranging from 0.5 (lightest) to 8 (darkest) (ASTM International 2008). Petroleum products for which colors fall outside of the established range are deemed contaminated. Conventional ways of measuring color are through direct and indirect visual observation. Direct observation involves comparing the color of oil samples directly with color standards, whereas indirect observation utilizes chromometers (Khor et al. 2020).

The measurement of Saybolt color using a Saybolt chromometer is carried out in the presence of a constant light source. The oil level in the sample tube is adjusted in a way so that the short-wavelength (violet) portion of the light energy reaching the eye is equal to that passing through the standard disk and the empty tube. Since surface tension, refractive index and specific dispersion of oil determine the angle at which light hits the wall from the oil surface, these attributes directly affect Saybolt color (Diller et al. 1943).

The American Petroleum Institute (API) gravity expresses the density of petroleum liquids in comparison to water where high API gravity represents low density. While condensates and light oils have low viscosity and high API gravity, degraded oils are heavier and more viscous. As crude oils get heavier, API gravity decreases, the absorption spectra moves to the red region, and fluorescence emissions become weaker (Hagemann and Hollerbach 1986). Furthermore, different types of hydrocarbons behave differently: Aromatics absorb visible or near-infrared light, while aliphatic compounds are only excited by high ultraviolet light. Hence, light hydrocarbons are colorless as they do not absorb light in the visible spectrum. Heavier or degraded crude oils with high concentrations of complex aromatic molecules are distinctively darker since they absorb light effectively in the visible light region (Steffens et al. 2011).

Regression modeling has been applied in the petroleum industry to develop correlations and pose models to predict physical properties; see, for example, Tomren and Barth (2014) and Douglas et al. (2018). The former work (Tomren and Barth 2014) involves formulating partial least squares calibration models to estimate properties such as viscosity, acid number and asphaltene content of crude oils and condensates based on information from gas chromatography (GC) and infrared spectroscopy (IR). However, the applicability range of the models might be limited and not readily extended to a wide range of petroleum sources. The latter work (Douglas et al. 2018) aims to predict hydrocarbon concentrations in contaminated soil in which different regression techniques are compared. Due to nonlinearity of soil spectral responses, higher prediction accuracy is observed using the random forest machine learning technique compared to partial least squares regression.

A main contribution of this work is to develop a Saybolt color correlation model for devising a fast and potentially cost-efficient method of estimating the color compared to laboratory-based measurements of the same. To the best of our knowledge, there is no correlation model developed for an automated determination of Saybolt color since the practice remains dependent on laboratory analysis. Arguably, the novelty of the paper lies in its attempt to develop such an empirical correlation model for Saybolt color to support measurement of this physical property as a standard quality indicator in the oil and gas industry. This color property has become more prominent in recent years due to increased interest in petroleum condensates as refinery feedstock resulting from shale gas extraction activities (IHS Markit 2018). With previous studies reporting the correlation between color with petroleum product properties, this work aims to demonstrate that regression modeling and analysis can be used to develop such a correlation model for predictive purpose.

Problem statement

It is reported that direct visual observations used for color determination of petroleum are highly subjective due to the variability in color perception across different observers (Rodriguez et al. 2017). On the other hand, measurements using Saybolt chromometers are affected by environmental factors: Varying illuminance levels can be obtained from different light sources such as fluorescent lamps and halogen lamps. Moreover, compounds such as olefins in crude oils and condensates are prone to oxidation, thus resulting in darkening and aging of samples which affect Saybolt color analysis (Rodriguez et al. 2017; Speight 2015). Digital oil color analyzers, on the other hand, cost almost four times as much as Saybolt chromometers (Clarkson Laboratory and Supply Inc 2019; IndiaMart 2019). Hence, an alternative approach is to rely on mathematical models for determining Saybolt color.

Best subset regression methodology

The aim of this paper is to develop best regression models for Saybolt color based on four properties of oil samples, namely (1) refractive index (R), (2) density (D), (3) kinematic viscosity at 75 °C (V1) and (4) kinematic viscosity at 100 °C (V2). To achieve this, we utilize the methodology of best subset regression where multiple statistical hypotheses tests are performed either to add or to remove regressor terms from a full model. The full model consists of all possible regressor terms constructed from considering all possible powers and interactions (i.e., main- and higher-order interactions) among the original four attributes {R, D, V1, V2}. The general mathematical formulation of a full model based on a response variable y and m = 4 explanatory variables, x1, x2, ⋯, xm, is given by:

$$y={\beta}_0+\sum_{p=1}^M\sum_{i=1}^m{\beta}_{ip}{x}_i^p+\sum_{p,q=1,p\ne q}^M\sum_{i,j=1,i<j}^m{\beta}_{ijpq}{x}_i^p{x}_j^q+\kern0.5em \varepsilon,$$
(1)

where y is the observed Saybolt color; \({x}_i^p\)is the regressor i (i ∈ {R, D, V1, V2}) raised to the power p with 1 ≤ p ≤ M where M is the highest power considered; \({x}_j^q\) is similar to \({x}_i^p\) with j ∈ {R, D, V1, V2} and q such that 1 ≤ q ≤ M; β0 is the constant intercept term; βip and βijpq are, respectively, the regression coefficient corresponding to \({x}_i^p\) and \({x}_i^p{x}_j^q\); and ε is the random error assumed to arise from a normal distribution with mean zero and constant, but unknown variance σ2. Best subset regression analysis is performed using a dataset of oil samples of size \(n,\left({x}_1^{(k)},{x}_2^{(k)},{x}_3^{(k)},{x}_4^{(k)},{y}^{(k)}\right),k=1,2,\dots, n.\)We develop two main full models in this paper, both with M = 2, without and with pairwise interactions between the explanatory variables.

To arrive at the best subset regression model, we implement the stepwise regression technique which adds or removes regressors from the current model one at a time (Montgomery et al. 2012) and tests for the significance of the added/removed term. The technique can be classified into forward selection, backward elimination and bidirectional elimination methods (Rawlings et al. 2006). In forward selection, the initial model starts with zero regressors. Subsequently, regressor terms from the full model in Eq. (1) are fitted into the current model, and the regressor with the best correlation with Saybolt color is selected for inclusion in the current regression model. The forward selection procedural flowchart is shown in Fig. 1. Backward elimination works in the opposite direction where a regressor is removed from the full model (1) if the corresponding test of significance for this regressor falls below a pre-specified threshold as shown in Fig. 2. A combination of the forward and backward methods constitutes the bidirectional elimination method. To perform statistical tests on the added/removed terms, two different F-tests, namely the partial and overall F-tests, are used to evaluate the significance (Draper and Smith 1998). The flowcharts of procedures for both bidirectional elimination methods are presented in Figs. 3 and 4.

Fig. 1
figure 1

Flowchart of forward selection procedure

Fig. 2
figure 2

Flowchart of backward elimination procedure

Fig. 3
figure 3

Flowchart of bidirectional elimination procedure using partial F-test

Fig. 4
figure 4

Flowchart of bidirectional elimination procedure using overall F-test

We perform the forward, backward and bidirectional elimination methods by manually creating columns in Excel that represent all the regressor terms in Eq. (1). The second approach that we implement is more automated and similar hypotheses tests are carried out via functions accessed through statistical packages in R (version 3.6.1) (R Core Team 2013). The R function step from the stats package uses the Akaike information criteria (AIC) to estimate the relative quality of model fit to the dataset for a given set of regressors. Procedures for forward selection and backward elimination using step are shown in Figs. 5 and 6, respectively. On the other hand, the regsubsets function from the leaps package evaluates all possible models for a given set of regressors and returns the model with the highest adjusted R2 (Lumley 2014). Another set of functions (or classes) used in this work is from the glmulti package which automatically considers all possible generalized linear models arising from all possible subsets of a given collection of regressors from the full model. As an exhaustive screening tool, glmulti ranks the subset regression models according to a specific information criterion: It gives three choices, namely AIC, Bayesian information criterion (BIC) and corrected AIC (AICc). The first-ranked (i.e., highest ranked) model has the lowest value of such information criterion (Calcagno and de Mazancourt 2010).

Fig. 5
figure 5

Flowchart of forward selection procedure performed using step function in R

Fig. 6
figure 6

Flowchart of backward elimination procedure performed using step function in R

Results and analysis

Regression modeling based on physical properties

The four properties of refractive index (R), density (D), kinematic viscosity at 75 °C (V1) and kinematic viscosity at 100 °C (V2), as well as Saybolt color measurements, are obtained from assay reports for the whole (i.e., bulk) and product fractions (i.e., cuts) of condensates and light crude oils from Malaysian oil and gas fields located mainly in offshore Sabah and Sarawak (e.g., of the types named Kimanis, Marjoram, Bintulu and Kawasari). The dataset consists of n = 15 samples. The scatterplots obtained for analyzing pairwise variable relationships, as shown in Fig. 7, indicate the absence of linear trends between Saybolt color and the four variables. These scatterplots suggest that there is a high degree of nonlinear relationships between Saybolt color and the potential regressors. Henceforth, we consider higher-order powers and interaction terms for the four variables. The catterplots of R vs. D and V1 vs. V2 show strong linear relationships (high collinearity). Thus, we expect the best subset regression model in our experiments to select either R or D, but not both, and similarly for V1 and V2.

Fig. 7
figure 7

Scatterplots of Saybolt color versus refractive index (R), density (D), kinematic viscosity at 75 °C (V1), kinematic viscosity at 100 °C (V2), sulfur content (Su) and total acid number (T)

The two full models considered with M = 2 have the following explicit forms:

$$S={\beta}_0+{\beta}_1R+{\beta}_2D+{\beta}_3{V}_1+{\beta}_4{V}_2+{\beta}_5{R}^2+{\beta}_6{D}^2+{\beta}_7{V}_1^2+{\beta}_8{V}_2^2+\varepsilon$$
(2)

as the full model without interaction terms and

$${\displaystyle \begin{array}{l}S={\beta}_0+{\beta}_1R+{\beta}_2D+{\beta}_3{V}_1+{\beta}_4{V}_2+{\beta}_5{R}^2+{\beta}_6{D}^2+{\beta}_7{V}_1^2+{\beta}_8{V}_2^2+{\beta}_9 RD+{\beta}_{10}{RV}_1+\\ {}{\beta}_{11}{RV}_2+{\beta}_{12}{DV}_1+{\beta}_{13}{DV}_2+{\beta}_{14}{V}_1{V}_2+\varepsilon \end{array}}$$
(3)

as the model with all pairwise interaction terms. Note that higher powers of explanatory variables are deliberately missing when considering the interaction terms in Eq. (3). This is because of the small sample size (n = 15) of the dataset used in this study. The full model (3) with 15 unknown parameters has zero residual degrees of freedom and hence cannot be tested for statistical significance. We consider the full model in (3) only as a strict upper bound for all pairwise interaction terms to be considered in the subset regression models. We find that the best regression model typically includes only one or two of the pairwise interaction terms this model.

The stepwise regression results are summarized in Table 1 for both types of full models (2) and (3). Note that in the case of the latter model, a zero degree of freedom does not permit backward elimination to be applied (Hu 2016). For both types of full models (2) and (3), the overall F-statistic and adjusted R2 values obtained using the forward selection procedure are higher than those obtained using the other techniques. This outcome is supported by Berk (1978) and Dempster et al. (1977): Forward selection produces model subsets with smaller residual variances compared to backward and bidirectional eliminations because only regressors which improve the model significantly are added. In the case where sample size is small, backward elimination which starts by fitting all candidate regressors into a model can result in overfitting with huge rounding errors (Draper and Smith 1998).

Table 1 Evaluation of best models with and without pairwise interaction developed using stepwise regression

Best subset regression results using the R functions step and regsubsets are summarized in Table 2. Backward elimination using step for the full model (2) gives a better subset regression model (see Table 2) compared to manual backward elimination (see corresponding entry in Table 1) as indicated by a higher adjusted R2 and lower p value. The same result is obtained using regsubsets. In this case, the initial model produced by regsubsets gives an adjusted R2 value of 0.8177, but after the removal of insignificant regressors through partial F-tests, the adjusted R2 value drops to 0.6139. This poses a problem when using regsubsets: There is no guarantee of obtaining a model with all statistically significant terms despite a high adjusted R2 value (Kassambara 2018). As for the full model (3), regsubsets displays the same problem where the adjusted R2 value drops from 0.9303 to 0.6050 after removing statistically insignificant terms.

Table 2 Evaluation of best models with and without pairwise interaction developed using R functions

Comparing all entries in Tables 1 and 2, the best subset regression model based on the full model (2) is obtained via forward selection either performed manually or using the step function. The explicit model fit is given by:

$$S=280.687-113.875{R}^2-14.102{V}_2^2.$$
(4)

This best subset regression model has an adjusted R2 value of 0.7135 with an F-statistic value of 18.43 and a corresponding p value of 0.00022.

As an extension of the R functions utilized earlier, the glmulti package enables outputs in the form of multi-model results in which models are ranked according to specific information criteria such as AIC, BIC and AICc. This work considers only applying the glmulti function to fitting regressor terms from the full models (2) and (3), i.e., without and with interaction terms, respectively. Models without pairwise interaction are obtained using the first-level iterations where we report the top-five models ranked according to increasing AIC, BIC and AICc values as given in Tables 3, 4 and 5, respectively.

Table 3 Regressors for models developed based on AIC using the first-level iterations of glmulti for physical properties
Table 4 Regressors for models developed based on BIC using the first-level iterations of glmulti for physical properties
Table 5 Regressors for models developed based on AICc using the first-level iterations of glmulti for physical properties

We find that the best subset regression model obtained previously using forward selection in terms of regressors R2 and \({V}_2^2\) (as shown in Eq. (4)) is returned as the top-two model via the AICc criterion (refer to Table 5). AICc implements a correction factor on AIC to prevent overfitting for small sample sizes (Burnham and Anderson 2002; Cavanaugh 1997); hence, it is a more appropriate measure compared to AIC in this case. Comparing all three criteria, the best model developed is found as the first-ranked model via AIC and the second-ranked model via BIC. These two models are similar with the highest adjusted R2 value of 0.8177 and corresponding p value of 0.00148, and they are obtained as:

$${\displaystyle \begin{array}{l}S=-2.203\times {10}^4+2.994\times {10}^4R-1.022\times {10}^4{R}^2+1.030\times {10}^{-4}{D}^2-3.298{V}_1^2+\\ {}1.504\times {10}^2{V}_2-7.825\times 10{V}_2^2.\end{array}}$$
(5)

We then proceed with the second-level glmulti iterations using the main effects obtained from the best-ranked model determined using AIC, BIC and AICc to check whether further model improvement can be made. Since the second-level iterations include pairwise interaction terms, they do not converge for the best-ranked model obtained via AIC (which is a model with six main effects, namely R, R2, D2, \({V}_1^2\), V2 and \({V}_2^2\)) as additional degrees of freedom do not exist for hypothesis testing once interaction effects are included. Results for the second-level iterations based on BIC and AICc are presented in Table 6 and Table 7 in which both criteria develop models with the same main effects, namely R, R2 and \({V}_2^2,\) as shown in Tables 4 and 5, respectively.

Table 6 Regressors for models developed based on BIC using the second-level iterations of glmulti for physical properties
Table 7 Regressors for models developed based on AICc using the second-level iterations of glmulti for physical properties

From these results, the best-ranked model has a highest adjusted R2 of 0.8160 with main effects of R and\({V}_2^2\) and interaction effects of\({V}_2^2R\) and \({V}_2^2{R}^2\). Despite the higher adjusted R2 as compared to the first-level iterations, the glmulti function fits interaction terms into the model without guaranteeing the inclusion of their corresponding main effects, thus violating the model hierarchy or heredity that prescribes including interaction terms in a model only if all corresponding main effect terms are present (Coulton and Chow 1993; Wang et al. 2010; Bien et al. 2013). In this case, no R2 main effect is included in the model. Disregarding this model, the only model with inclusion of interaction effects with their corresponding main effects for both criteria is the sixth-ranked model (not displayed here), which is the same model as the best-ranked model from the first-level iterations that we initialized the second-level iterations with. This means that including pairwise interaction into the model does not improve the strength of its correlation if we abide by the hierarchy of regression models in which main effect terms must be present or are added in tandem with interaction terms.

Reassessing the best-ranked model obtained from the first-level iterations as presented in Eq. (5), the same problem of substantially decreased adjusted R2 value (to 0.6139) occurs (as seen using regsubsets when partial F-test is performed to remove insignificant regressors). To ensure that we select only models with statistically significant regressors, we eliminate insignificant regressors based on partial F-test using the ANOVA function on all the top-five models developed based on each of the three ICs. The results are summarized in Tables 8, 9 and 10.

Table 8 Regressors for models developed based on AIC using the first-level iterations of glmulti for physical properties after the removal of insignificant regressors using ANOVA
Table 9 Regressors for models developed based on BIC using the first-level iterations of glmulti for physical properties after the removal of insignificant regressors using ANOVA
Table 10 Regressors for models developed based on AICc using the first-level iterations of glmulti for physical properties after the removal of insignificant regressors using ANOVA

The removal of insignificant terms from the initial models developed using glmulti results in several repeated models. For instance, based on the AIC, only two models are obtained from the initial top-five models. After the removal of insignificant terms, the first-ranked model developed via AIC with an adjusted R2 of 0.6139 is similar to the model obtained from step and regsubsets for the second-order model without pairwise interaction (see Table 2). This outcome shows the consistency of results obtained from glmulti with the other functions available in R such as step and regsubsets when the same underlying principle (i.e., AIC) is used.

Comparing all models with insignificant terms removed, the model with the highest adjusted R2 of 0.7710 is obtained from the fourth-ranked model based on BIC and third-ranked model based on AICc with a lowest p value of 0.00021. This model has the following correlation:

$$S=-2.807x{10}^3+6.738D-3.993x{10}^{-3}{D}^2-1.517x10{V}_2^2.$$
(6)

Based on the coefficient magnitudes, the most influential regressor is determined to be the square of kinematic viscosity at 100 °C (i.e., \({V}_2^2\) term) for all significance levels greater than 0.00021 (which include typically reported levels such as 1%, 5% and 10%). This model developed via glmulti has the highest adjusted R2 compared to all other regression techniques. Note that this model is justified in comparison with model (4) although the R term is not included since it was mentioned earlier that R and D exhibit a high degree of collinearity. (Hence, only one of them needs to be incorporated in a model.)

Regression modeling based on physical and chemical properties

Similar steps are repeated for the development of model using physical and chemical properties where two additional chemical properties of condensates and light crude oils, namely sulfur content, Su, and total acid number, T, are included. The preliminary step of plotting scatterplots as shown in Fig. 7 to represent the relationship between Saybolt color with all six physical and chemical properties of condensates and light crude oils again shows an absence of linear relationship between these properties, hence higher-order powers and interaction terms ought to be considered.

Considering M = 2, the full model without interaction terms is given by:

$${\displaystyle \begin{array}{l}S={\beta}_0+{\beta}_1R+{\beta}_2D+{\beta}_3{V}_1+{\beta}_4{V}_2+{\beta}_5 Su+{\beta}_6T+{\beta}_7{R}^2+{\beta}_8{D}^2+{\beta}_9{V}_1^2+{\beta}_{10}{V}_2^2+\\ {}{\beta}_{11}{Su}^2+{\beta}_{12}{T}^2+\varepsilon, \end{array}}$$
(7)

while the full model with all pairwise interaction terms has the following explicit form:

$${\displaystyle \begin{array}{l}S={\beta}_0+{\beta}_1R+{\beta}_2D+{\beta}_3{V}_1+{\beta}_4{V}_2+{\beta}_5 Su+{\beta}_6T+{\beta}_7{R}^2+{\beta}_8{D}^2+{\beta}_9{V}_1^2+{\beta}_{10}{V}_2^2+\\ {}{\beta}_{11}{Su}^2+{\beta}_{12}{T}^2+{\beta}_{13} RD+{\beta}_{14}{RV}_1+{\beta}_{15}{RV}_2+{\beta}_{16} RSu+{\beta}_{17} RT+{\beta}_{18}{DV}_1+{\beta}_{19}{DV}_2+\\ {}{\beta}_{20} DSu+{\beta}_{21} DT+{\beta}_{22}{V}_1{V}_2+{\beta}_{23}{V}_1 Su+{\beta}_{24}{V}_1T+{\beta}_{25}{V}_2 Su+{\beta}_{26}{V}_2T+{\beta}_{27} SuT+\varepsilon. \end{array}}$$
(8)

Comparing full models (8) and (3), the former has 28 unknown parameters, a significant increase from the 15 parameters in the latter. Due to a less than zero residual degree of freedom, performing backward elimination is not possible, but other techniques are still available provided that the model developed has a maximum of 15 terms including the intercept. We first perform stepwise regression manually and utilize R functions to obtain the best model for both variants as summarized in Tables 11 and 12, respectively.

Table 11 Regressors for best models with and without pairwise interaction developed using stepwise regression for physical and chemical properties
Table 12 Regressors for best models with and without pairwise interaction developed using R functions for physical and chemical properties

For models developed using stepwise regression, similar trends are observed where models fitted using forward selection have higher adjusted R2 and lower p values than those developed using bidirectional elimination. When R functions are utilized, regsubsets performs better for models without and with pairwise interaction where the adjusted R2 for the latter goes as high as 0.9968, significantly higher than all the other models developed using only physical properties. This model with an F-statistic of 549.6 and corresponding p value of 0.0000 is explicitly expressed as:

$${\displaystyle \begin{array}{l}S=4.766\times {10}^2-4.977\times 10{V}_2-4.659 Su-6.175\times {10}^{-4}{D}^2-9.009\times {10}^2{RV}_1+\\ {}5.978 RSu+1.612{DV}_1-4.898\times {10}^{-3} DSu-2.557\times {10}^2{V}_1T.\end{array}}$$
(9)

However, one drawback from this model is that the interaction terms are fitted without their main effects. Hence, if the hierarchy of a model is to be abided, the best subset regression model would be the model developed by regsubsets based on full model (7) without pairwise interaction with an adjusted R2 of 0.8490, overall F-statistic of 27.25 and corresponding p value of 0.00002 as given by the following:

$$S=361.3-211.5R-19.77{V}_2-278.2T.$$
(10)

Similarly, we reproduce results using the glmulti package with the inclusion of chemical properties and find that model (10) is returned as the top-fourth model via AICc as shown in Table 13, further supporting the suitability of AICc for small sample size. Regressors for the top-five models developed using AIC and BIC are shown in Tables 14 and 15, respectively.

Table 13 Regressors for models developed based on AICc using first-level iteration of glmulti for physical and chemical properties
Table 14 Regressors for models developed based on AIC using first-level iteration of glmulti for physical and chemical properties
Table 15 Regressors for models developed based on BIC using first-level iteration of glmulti for physical and chemical properties

Comparing all three criteria, the best model developed is found as the topmost model via the AIC and BIC with the highest adjusted R2 of 0.9290, overall F-statistic of 21.36 and corresponding p value of 0.00179 expressed in the following correlation:

$${\displaystyle \begin{array}{l}S=-5.844\times {10}^2+6.306\times {10}^2R-2.367\times {10}^{-4}{D}^2-1.396\times {10}^2{V}_1+7.198\times \\ {}10{V}_1^2-1.136\times {10}^2{V}_2^2+1.942\times {10}^{-1} Su-1.156\times {10}^{-4}{Su}^2-3.140\times {10}^3T+\\ {}4.420\times {10}^4{T}^2.\end{array}}$$
(11)

We then proceed with the second-level glmulti iterations using the main effects obtained from the best-ranked model via AIC, BIC and AICc to observe for any improvement in models. Since the second-level iterations include pairwise interaction terms, they do not converge for the topmost first-level model obtained via AIC and BIC with nine main effects, namely R, D2, V1,\({V}_1^2\), \({V}_2^2\), Su, Su2, T and T2 (see Table 14 and Table 15), as there is no degree of freedom once interaction effects are included. The top-five models for second-level iterations based on AICc criterion are tabulated in Table 16 by considering the topmost first-level model obtained via AICc with three main effects, namely \({R}^2,{V}_2^2\) and T (see Table 13).

Table 16 Regressors for models developed based on AICc using second-level iteration of glmulti for physical and chemical properties

From Table 16, the topmost model has a highest adjusted R2 of 0.8558 with main effect of R2 as well as interaction effect of\({TV}_2^2\). However, due to the absence of main effect T and \({V}_2^2\) which violates model hierarchy, we reassess the best model obtained from first-level iterations as presented in Eq. (11). Performing partial F-test to remove insignificant regressors from the model results in a significant drop in adjusted R2 value from 0.9290 to 0.8354. Similar partial F-test is conducted on all models developed using first-level iterations to ensure that we select only models with statistically significant regressors. Note that all first-level models developed on AICc criterion have regressors that are already statistically significant; thus, they do not require subsequent F-tests. The results are summarized in Tables 17 and 18 for AIC and BIC, respectively.

Table 17 Regressors for models developed based on AIC using first-level iteration of glmulti for physical and chemical properties after the removal of insignificant regressors using ANOVA
Table 18 Regressors for models developed based on BIC using first-level iteration of glmulti for physical and chemical properties after the removal of insignificant regressors using ANOVA

Comparing all models with insignificant terms removed, the model with the highest adjusted R2 of 0.8510 (see Table 13) is obtained as the topmost model based on AICc with the following correlation:

$$S=199.8-73.89{R}^2-9.212{V}_2^2-268.8T.$$
(12)

Based on the coefficient magnitudes, the most influential regressor is determined to be the total acid number (i.e., T term) for all significance levels greater than 0.00002 (which include typically reported levels such as 1%, 5% and 10%). This model developed via glmulti has the highest adjusted R2 compared to all other regression techniques and is slightly better than model (10) in which both the main effects of R and V2 are raised to a second order. This implies that the inclusion of chemical properties into Saybolt color correlation improves the fit of the model with an increase in adjusted R2 from 0.7710 to 0.8510.

Model validation

Model validation is performed on the developed correlation model by comparing its predicted Saybolt color values against those measured using conventional method (i.e., laboratory analysis). The latter (data from conventional measurements) are obtained from other assay reports for which the data are not used in developing (i.e., training) the model. We conduct the validation for 20 selected samples and present result on deviations between the actual (from assay reports) and predicted values by our proposed regression model given by Eq. (12). As shown in Fig. 8, the deviations are largely less than 5% except for five samples (but which yet do not exceed 10%) with a mean of 2.9%. The validation indicates a prediction error of less than 5% for 75% of the samples, which is acceptable (Simpson et al. 2004).

Fig. 8
figure 8

Validation results as based on deviations between actual and predicted Saybolt color values

Concluding remarks

Based on the models developed using various regression approaches for this class of machine learning problems arising in prediction of petroleum properties, the best models representing different model structures are summarized in Table 19. As proposed by Draper and Smith (1998), no regression technique is the best, especially when there are constraints in terms of the patterns of the data as well as the practicality of the problem. Choosing the right regression approach would depend on the type of model that we aim to develop. If the significance of regressors in a model is an important factor, forward selection ensures that only significant regressors are added to the model, but this method may result in lower adjusted R2 values. Hence, if we want to develop highly correlated models with high adjusted R2 values, we can opt for the functions regsubsets (in leaps package) and glmulti (in glmulti package) which have been shown to give consistent results with glmulti offering the capability of providing multi-model inferencing. Since both functions are susceptible to overfitting in which redundant regressors may be present in the model, partial F-tests can be performed to remove insignificant terms. But it is important to note that the adjusted R2 values typically drop with elimination of insignificant terms.

Table 19 Selection of best models obtained in this work