Refractive index (RI) is one of foremost representative properties of crude oils. This optical parameter is a fundamental physical property pertaining to the speed of light in crude oil (Chamkalani et al. 2014). Recently, this parameter has achieved growing attention in petroleum industry, mainly because of its application in various calculations regarding crude oil specific compositions (El Ghandoor et al. 2003). Touba et al. (1997) investigated the applicability of crude oil RI for defining various reservoir fluid characterizing properties. Their results demonstrated that crude oil RI can be used as a suitable property for predicting other crude oil properties including those controlling PVT behavior of hydrocarbon and surface tension. Taylor et al. demonstrated that the onset of asphaltene precipitation could be detected from RI measurement (Taylor et al. 2001). Fan et al. (2002) examined the relation between crude oil RI and asphaltene stability in crude oil. They suggested that the difference between crude oil RI and crude oil RI at the onset of asphaltene precipitation could be employed as a decisive factor for the assessment of asphaltene stability in crude oils (Fan et al. 2002). Buckley and Wang (2002) showed that the wettability altering tendency is related to crude oil RI. Riazi and Al-Otaibi (2001) attempted to shed a new light on the relation between the viscosity and RI. They proposed a simple mathematical-based approach which can estimate the viscosity of crude oils from their RI data.

Considering the importance of crude oil RI, introduction of a reliable technique for accurate determination of this parameter is valuable. Conventional refractometers have long been employed as the leading method of direct measurement for determination of crude oil RI. However, the conventional refractometer is incapable of measuring the RI of heavy crude oils, mainly due to its opacity (Taylor et al. 2001). Recently, a novel RI measurement technique, the so-called capillary tube interferometer, has been introduced by El Ghandoor et al. (2003) for laboratory determination of heavy crude oil RI. This method can determine the RI of heavy crude oil, as well as light crude oil, with high precision. Although aforementioned experimental technique measures the RI of crude oil with high accuracy, utilizing this method for quantifying crude oil RI is associated with some difficulties as experimental implementation is costly, time consuming, and also labor intensive. Due to drawbacks encountered in various methods of experimental measurement, development of a crude oil RI model could facilitate exact calculation of crude oil RI from readily available experimental data. Several researchers have tried to develop a predictive model for determining the RI (Chamkalani et al. 2014; Fan et al. 2002; Buckley and Wang 2002; Chamkalani et al. 2012; Chamkalani 2012; Gholami et al. 2014a, b; Vargas and Chapman 2010; Zargar et al. 2015; Tatar et al. 2015). The published models available in the literature for estimating the value of crude oil RI can be divided into two main categories. In the first group of models, RI of crude oils is estimated using their density. Buckley and Wang (2002) developed a correlation relating crude oil RI to its density through linear regression. Chamkalani (2012) also presented a similar correlation to that of Buckley and Wang (2002). Vargas and Chapman (2010) evaluated the applicability of the One-Third rule in hydrocarbon and crude oil systems. They introduced a novel model, so-called Lorentz–Lorenz expansion, for building a correlation between RI and density of crude oils. In the second group of models, the RI of crude oil is calculated using SARA (saturate, aromatic, resin, and asphaltene) fraction data. The concept of the mutual dependence of crude oil RI and SARA fraction data was introduced by Fan et al. (2002). They developed an empirical correlation between the SARA fraction data and RI of crude oil. Chamkalani (2012) established another empirical correlation to estimate the value of crude oils RI. Although aforementioned empirical correlations are valuable, they are faced with some shortcomings, the main one being unsatisfactory accuracy.

Recently, intelligence based models have emerged as powerful techniques of solving complex problems encountered in petroleum industry. These models have achieved encouraging results in modeling phenomena and parameters such as asphaltene precipitation, minimum miscible pressure, viscosity, porosity, permeability, bubble point pressure (Asoodeh et al. 2014a, b; Bagheripour et al. 2015; Gholami et al. 2014c; Rasuli Nokandeh et al. 2012; Hemmati-Sarapardeh et al. 2013; Afshar et al. 2014; Gholami et al. 2015; Naseri et al. 2014; Shokooh Saljooghi and Hezarkhani 2014, 2015; Ashoori et al. 2010). Moreover, intelligent models have been used for developing a quantitative correlation between the RI and SARA fraction data. Chamkalani et al. (2012) proposed a robust mathematical approach based on the least square support vector machine for prediction of crude oil RI from SARA fraction data. Later, Chamkalani et al. (2014) improved their approach by combining two optimization tools, viz. coupled simulated annealing (CSA) and Nelder and mead simplex method (NM). Gholami et al. (2014a) presented a new mathematical approach based on fuzzy logic to predict the value of crude oil RI from SARA fraction data. In another report, Gholami et al. (2014b) attempted to create a correlation between these two parameters. They published a smart model based on neural network optimized by a genetic algorithm pattern that searches for estimation of crude oil RI as a function of SARA fraction data. Zargar et al. (2015) included particle swarm optimization in fuzzy formulation and employed construction model for estimation of refractive index of crude oil. Crude oil RI obtained from above empirical correlations and intelligent models are employed as a criterion for diagnosis of asphaltene stability in crude oil. Although aforesaid intelligence models can quantify the crude oil RI with reasonable accuracy, there exists a quest for introduction of more reliable algorithms.

Current work is aimed to offer an innovative algorithm based on optimized SVR for creating a quantitative formulation between the RI of crude oil and SARA fraction data. Optimization approaches used in this study for optimizing SVR are HGP, GA, and ICA. Optimization techniques were embedded in SVR structure for computing the optimal value of parameters found in SVR formulation. Also, to identify which optimization approach has the best performance, the comparison study based on statistical criteria is implemented. It is observed that coupling ICA with SVR achieves better performance in estimating crude oil RI.

Model description

Support vector regression

Support vector regression (SVR) is a powerful approximation technique based on statistical learning theory proposed by Vapnik (1995). In this technique, the input data, x, are mapped from original space into a multidimensional feature space using a kernel function and then, a linear fitting function in this space is computed. The general SVR estimating function is a linear function as (Vapnik 1998):

$$ f\left( x \right) = \sum\limits_{i = 1}^{l} {\left( {a_{i} - a_{i}^{*} } \right)K(x_{i} ,x} ) + b $$

where b is the offset regression estimation, \( \alpha_{i} {\text{ and }}\alpha_{i}^{ *} \) are Lagrange multipliers and \( K(x_{i} ,x) \) is the kernel function. The Lagrange multipliers are obtained by minimizing a regularized risk function as follows (Steinwart and Christmann 2008; Vapnik 1982, 1998):

$$ R_{\text{reg}} [f] = \frac{1}{2}\left\| \omega \right\|^{2} + C\sum\limits_{i = 0}^{n} {L(f(x_{i} ) - y_{i} )} $$

where \( \left\| \omega \right\|^{2} \) is characterized as the model complexity. The term C is a constant that controls the tradeoff between training error minimization and model complexity. On the right hand side of Eq. (2), n refers to the size of the training data, y is the target data and \( L(f(x_{i} ) - y_{i} ) \) is the \( \varepsilon - \) insensitive loss function defined by:

$$ L(f(x) \, - \, y) = \left\{ {\begin{array}{*{20}l} 0 \hfill & {{\text{if}}\;\left| {y - f(x)} \right| \le \varepsilon } \hfill \\ {\left| {y - f(x)} \right| - \varepsilon } \hfill & {\text{Otherwise}} \hfill \\ \end{array} } \right. $$

The \( \varepsilon - \) insensitive loss function can be visualized as a tube that surrounds the training data with a radius equal to \( \varepsilon \). The parameters C and \( \varepsilon \) can be selected by the optimization algorithm. Another key parameter in SVR is the kernel parameter. In this study, a Gaussian radial basis function (RBF) is used as the kernel function and \( \gamma \) is its main parameter. The feature space of the RBF kernel is Hilbert space or infinite dimensional space. The RBF kernel is defined as (Vert et al. 2004):

$$ K(x_{i} ,x) = {\text{exp(}} - \gamma \left\| {x_{i} - x} \right\|^{2} ) $$

A more detailed account of SVR method can be found in other papers and reviews (Vapnik 1982, 1995, 1998; Al-Anazi and Gates 2010). Performance of SVR methodology is heavily dependent on the determination of parameters included in its formulation (Al-Anazi and Gates 2010; Ansari and Gholami 2015a, b). The following sections introduce the searching tools employed in this study for achieving the optimal values of SVR parameters.

Hybrid of grid and pattern search

Hybrid of grid and pattern search (HGP) is a conventional optimization method employed for attaining accurate values of SVR parameters. This methodology is the combination of grid search and pattern search techniques. The grid search is implemented using values of parameters across a specified given search range by geometric steps. The pattern search commences at the mid-point of the given search range and takes trial steps in each direction for each parameter. If the performance of the model is improved, the search mid-point shifts to the new point and the process is iterated. If no improvement in the model fit is achieved, the step size is reduced and the search is tried again. The pattern search terminates when the search step size is less than a pre-specified tolerance. Main shortcoming of the grid search method is its computationally expensiveness, due to the need for assessing the model at many points within the grid for each parameter. A pattern search, on the other hand, generally requires far fewer evaluations of the model than a grid search. However, it may converge to a local in lieu of global optimal point. Therefore, for obtaining an efficient optimization tool, both methods should be combined in a systematic way. In this study, optimization starts with grid search trying to find a region close to the global optimum point. Subsequently, a pattern search is conducted over the narrowed search range surrounding the best point found by the grid search. This hybrid method improves the searching methodology for determining optimal values of SVR parameters with high accuracy (Al-Anazi and Gates 2012).

Genetic algorithm

Genetic algorithm (GA), first introduced by Holland, is a global optimization strategy based on mathematical simulation of biological evolution (Holland 1975). In the GA terminology, potential solutions are called chromosomes. Optimization is started with initial population of random chromosomes. Evolution of each chromosome is achieved through a fitness function introduced in the problem. In each generation (algorithm iteration), the new solution is obtained through three main operations: selection, crossover, and mutation. The selection operator chooses those parents that contribute to the population in the next generation. The crossover produces children from parents, and mutation alters some gene values in a chromosome, randomly. The aforementioned process is repeated until the algorithm converges to global optima. The flowchart of GA is illustrated in Fig. 1. More details about GA can be found in Mohaghegh (2000), Khishvand and Khamehchi (2012) and Assareh et al. (2010).

Fig. 1
figure 1

General flowchart of GA

Imperialist competitive algorithm

Imperialist competitive algorithm (ICA) can be considered as a social counterpart of the GA. The ICA was introduced first by Atashpaz-Gargari and Lucas based on computer simulation of human social evolution in order to solve optimization problems (Atashpaz-Gargari and Lucas 2007). This optimization strategy has revealed great performance in the achievement of global optima (Atashpaz-Gargari et al. 2008; Alikhani-koupaei and Abdechiri 2010; Ansari 2014; Khabbazi et al. 2009; Rajabioun et al. 2008). The ICA utilizes a special terminology, i.e., the counterpart of chromosomes in the GA is called countries in the ICA; the function expected to be optimized is called cost function; each iteration of algorithm is called a decade. In this method, based on cost values, some of the best (least values) initial countries are selected as imperialists and other countries are called colonies. Each imperialist takes control of some colonies and they constitute an empire.

Similar to the GA, ICA uses three main operations during the optimization process: assimilation, revolution, and competition. Assimilation is the process of moving the colonies toward the imperialist along different socio-political axis such as culture, language, religion, etc. (Fig. 2). In the Fig. 2, d is the distance between the colony and the imperialist state and x is a random variable with uniform (or any proper) distribution which can be obtained as (Atashpaz-Gargari and Lucas 2007):

Fig. 2
figure 2

Movement of colonies toward their relevant imperialist along the language and culture axis (Atashpaz-Gargari and Lucas 2007)

$$ x\sim U\left( {0,\beta \times d} \right) $$

where \( \beta \) is a constant number greater than one. Revolution causes the position of some countries change along social axis in a relatively short period time. The revolution increases the ability of the solution search in the algorithm and prevents early convergence of the algorithm to local minimums. During assimilation and revolution, a colony might reach to a position with lower cost than the imperialist and thus, the positions of the imperialist and the colony will be changed in this case. Furthermore, all empires try to take the possession of colonies in other empires and control them. In this case an imperialist competition will happen. This process is done based on total cost (T.C) of the empires which can be obtained by:

$$ {\text{TC}}_{n} {\text{ = Cost(imprialist}}_{n} ) \,{ + }\,\xi {\text{ mean}}\{ {\text{Cost(colonies of empire}}_{n} )\} $$

where \( \xi \) is a coefficient which controls the percentage of contribution of the colonies costs in the total cost. The normalized total cost (NTC) is obtained by:

$$ NTC_{n} = TC_{n} - \hbox{max} \{ TC_{i} \} $$

Then, the possession probability (pp) of each empire is given by:

$$ {\text{p}}p_{n} = \left| {\frac{{NTC_{n} }}{{\sum\limits_{i = 1}^{{N_{\text{imp}} }} {NTC_{i} } }}} \right| $$

where N imp is the number of imperialists. To select an empire, a process similar to the roulette wheel process, which is used in selecting parents in the GA, is needed. The method of selection in the ICA is faster than the conventional roulette wheel because it is not necessary to compute the cumulative distribution function and the selection is based only on the values of possession probabilities. For this purpose, a vector of probabilities is generated as:

$$ {\mathbf{P}} = [pp_{1} ,pp_{2} , \ldots ,pp_{{N_{{\text{imp}}} }} ] $$

Then the vector R is created with uniformly distributed random numbers in range of [0 1]:

$$ {\mathbf{R}} = [r_{1} ,r_{2} , \ldots ,r_{{N_{\text{imp}} }} ] $$

Finally, vector D is formed as:

$$ {\mathbf{D}} = {\mathbf{P}} - {\mathbf{R}} = [pp_{1} - r_{1} , \ldots ,pp_{{N_{\text{imp}} }} - r_{{N_{\text{imp}} }} ] $$

Referring to vector D the mentioned colonies are handed to an empire whose relevant index in D is maximized (Atashpaz-Gargari and Lucas 2007). Algorithm continues in several decades in this manner with the main operations (assimilation, revolution, competition) until a stop condition is satisfied. To better understand this method, the main steps of ICA are summarized in Fig. 3.

Fig. 3
figure 3

General flowchart of the ICA

Data input/output space

Evaluation of the constructed model is achieved based on the dataset found in the open source literature (Buckley and Morrow 2003). The developed model creates a quantitative formulation between the input data (SARA fraction data) and the target value (crude oil RI). SARA method is a common separation technique that divides crude oil into four distinct groups namely saturate, aromatic, resin and asphaltene based on the differences in polarity and solubility. In the available dataset, fractionation is implemented through HPLC technique. In addition, crude oil RI is measured using automated index instruments GPR 11-37 refractometer. This instruments use the sodium-D line (589 nm) as a standard wavelength. The GPR 11-37 uses a critical angle technique that permits measurement of RI for opaque liquid samples. The statistical description of SARA faction and crude oil RI data are shown in Table 1. In this study, 104 data points are used in the modeling, 30 % of which are used as validation data and the rest as training data.

Table 1 Statistics of data set used in this study

Results and discussion

In this study, an \( \varepsilon \)-SVR with RBF kernel is employed for estimating the crude oil RI values from the SARA fraction data. Before training the model, all input data are normalized in range of [0 1]. The normalization task reduces confusion and increases performance of the model. The SVR parameter is optimized through three different optimization strategies. First, a common SVR optimization survey (HGP technique) is used to obtain the SVR parameters. In this strategy, a region close to the global optimal point is found with the grid search and then, the pattern search is performed over this region to find a global optimum. Second, GA and third, ICA is used to optimize the SVR model. The fitness function in the GA or its counterpart in the ICA, cost function, is introduced in the mean square error (MSE) state as follows:

$$ {\text{MSE}}_{\text{SVR}} = \frac{1}{k}\sum\limits_{{i{ = 1}}}^{k} { ( {\text{output}}_{i} - {\text{target}}_{i} )^{2} } $$

where the output is the SVR results and the target is the crude oil RI data. All regulated parameters of the GA are presented in Table 2. Also, Table 3 shows all regulations before running the ICA. The results of running the GA and the ICA for 70 iterations are shown in Fig. 4. According to this figure, the final optimum value of the ICA is 1.4941e−5 which is lower than its counterpart in the GA (1.5244e−5). Also in the ICA, it takes only 17 iterations for mean and best values of all imperialists to be nearly equal but in the GA, algorithm converges after 36 iterations. The corresponding values of parameters to optimize SVR model using three optimization strategies are presented in Table 4. Graphical comparison between predicted and measured crude oil RI values using HGP-SVR, GA-SVR and ICA-SVR for the validation data are shown in Fig. 5. The performance of all models is evaluated using concepts of the square of regression coefficient (R2), MSE and symmetric mean absolute percentage error (SMAPE), shown in Table 5. SMAPE is a dimensionless error measurement which expresses error as a percentage of the actual data and provides an easy way of assessing the errors. It is defined as (Flores 1986):

Table 2 Regulations which were used in the GA
Table 3 Regulations done before run of the ICA
Fig. 4
figure 4

Results of running ICA including best and mean costs values versus decade and Minimum and best fitness values of the GA results versus generation

Table 4 Optimized SVR parameters of the HGP–SVR, GA–SVR and the ICA–SVR
Fig. 5
figure 5

Graphical correlation between measured and predicted crude oil RI for validation data using a HGP–SVR, b GA–SVR and c ICA–SVR

Table 5 A comparisons among all methods of this study, versus R 2, MSE and SMAPE
$$ {\text{SMAPE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {\frac{{\left| {{\text{Predicted}}_{i} {\text{ - Target}}_{i} } \right|}}{{{\text{Predicted}}_{i} {\text{ + Target}}_{i} }}} \times 100 $$

According to the results presented in Table 5, the evolution algorithms including GA and ICA improve the results of SVR model. Among the presented methods, ICA-SVR decreases error up to 2.6832e−05 which corresponds to a correlation coefficient of 0.9284. Figure 6 presents error in terms of cumulative distribution function (CDF). The error in this figure is the difference between predicted and actual measured data. Also, details of this figure are presented in Table 6 using concept of the probability distributions of the errors. For better understanding of the performance of the models, the error distributions of all models are presented in histograms of Fig. 7. According to this figure, 59.38 % of the error of the SVR prediction fall within one standard deviation of its mean ([−0.0056 0.0081]). The same analysis for GA-SVR and ICA–SVR reveals that 78.13 % of data have an error in the range of [−0.0051 0.006] and [−0.0049 0.0056], respectively. A deep insight into Figs. 6 and 7 and Table 6 suggests the superiority of the ICA–SVR when the error is decreased to zero.

Fig. 6
figure 6

CDF plots of error distribution for SVR, GA–SVR and ICA–SVR

Table 6 Probability distributions of modeling errors
Fig. 7
figure 7

Error distribution of prediction from SVR, GA–SVR and ICA–SVR. A judgment based on mean and standard deviation error reveals superiority of ICA–SVR


Refractive index is an optical parameter of crude oil that provides valuable information about crude oil specific compositions. Hence it is crucial to develop a computer-based approach for calculation of this property. SVR is an intelligent model used in this study for modeling of crude oil RI. Three optimization approaches namely HGP, GA, and ICA were used to find the optimal value of SVR parameters. The correlation coefficient between results of SVR optimization by HGP, GA, and ICA and the corresponding experimental values are 0.87537, 0.91915, and 0.92838, respectively. It is obvious that SVR optimized with ICA results in better prediction accuracy and outperforms the competing models (GA–SVR and HGP–SVR). This advantage is the result of ICA–SVR great performance in both rate of convergence and exploration of global optimum. The current study concluded that ICA–SVR is a promising methodology for modeling crude oil RI.