1 Introduction

Polychlorinated biphenyls (PCBs) are a class of discrete organic compounds with one to ten chlorine atoms attached to a biphenyl nucleus and a general chemical formula of C12H10−n Cl n , where n = 1–10 [1]. A general chemical structure of polychlorinated biphenyls is shown in Fig. 1. There are 209 theoretically possible congeners subdivided into ten homologue groups with 1–46 congeners in each. PCBs of a given homolog with different chlorine substitution position are called “isomers”. The degree of chlorination varies depending on the reaction conditions, and ranges from 19 to 69% (w/w). The composition of PCBs is summarized in Table 1 [2]. All congeners have been assigned a systematic number from 1 to 209 corresponding to a specific substitution pattern. The initial scheme was proposed by Ballschmiter and Zell [3] and revised by Guitart et al. [4].

Fig. 1
figure 1

Generic polychlorinated biphenyl (PCB) chemical structure

Table 1 Composition of PCBs by homologs

PCBs are hydrophobic compounds with low volatility, and the highly chlorinated ones have poor water solubility. Moreover, they are resistant to acids, bases, and (generally) environmental degradation processes. They are, therefore, highly persistent in the environment. They have good electrical insulation properties with high thermal conductivity, low flammability and high resistance to thermal degradation [2], and have therefore been widely used as heat transfer fluids and dielectric fluids in electric transformers and large capacitors. They have also been used as organic diluents, plasticizers, additives in pesticides, carbonless copy paper, paint additives, hydraulic fluids, lubricants and many other applications [1, 58].

Since PCBs occur as complex mixtures of up to 209 distinct congeners, their properties are dependent on the composition of the mixture. The properties of individual PCB congeners also vary according to the degree of chlorination and location of the chlorine atoms. Generally, the water solubility and vapor pressure decrease as the degree of substitution increases, and the lipid solubility increases with increasing chlorine substitution. PCBs in the environment may be expected to associate with the organic components of soils, sediments, and biological tissues, or with dissolved organic carbon in aquatic systems, rather than being in solution in water [911].

The toxicity of a PCB is dependent not only on the number of chlorine atoms present in the biphenyl structures, but also on the positions of the chlorine atoms. PCBs with many chlorines in ortho-position (nonco-planar) induce phenobarbital-type responses, while PCBs that lack chlorine atoms in the ortho-position but have chlorine atoms in both para-positions (4 and 4′) and at least one in the meta-position (3, 5, 3′, 5′) (co-planar), can rotate freely around the phenyl–phenyl (1,1′ bond). This means that they can exhibit structural resemblance to the dioxins, i.e. a relatively planar structure (coplanar PCBs), and may hence induce methylcholanthrene-type responses and dioxin-like toxicity [1214]. Most PCB congeners do not exhibit dioxin-like toxicity but have a different toxicological profile [13, 15]. Co-planar PCBs, like dioxins and furans, bind to the AL-receptor and may exert, thus, dioxin-like effects, in addition to AL-receptor independent effects, which they share with non-coplanar PCBs (e.g. tumor promoter) [13]. Association between elevated exposure to PCB mixtures and alterations in liver enzymes, hepatomegaly, and dermatological effects such as rashes and acne has been reported [13].

Due to PCBs’ complex composition, many researchers have also placed emphasis on the identification of the individual PCB congeners [16, 17]. Presently, all 209 PCB congeners have been commercially synthesized and are available for use as standards, and because of advances in high resolution gas chromatography (HRGC), it is possible to determine most of the individual PCB congeners in the environmental samples. However, separation and characterization of all 209 PCB congeners is still an extremely difficult (if not impossible) task that attracts on-going research focus in this field [18].

One of the most successful approaches to the prediction of physicochemical and biological properties of organic molecules, starting only from molecular structure information, is quantitative structure–property/activity relationships modeling (QSPRs/QSARs) [19]. QSPRs/QSARs are mathematical models that attempt to correlate the molecular structure of compounds and their biological, chemical, and physical properties. Among the most extensively studied properties are the chromatographic ones. It is considered that the same basic intermolecular interactions determine the behavior of chemical compounds in both biological and chromatographic environments [20]. Retention in chromatography is the result of a competitive distribution process of the solute between mobile and stationary phases, in which, the partitioning of the solute between these phases is largely determined by the molecular structure [21]. Predicting chromatographic behavior from molecular structure of solutes resulted in the quantitative structure-retention relationships (QSRR) methodology. QSRR are statistically derived relationships between the chromatographic parameters determined for a representative series of analytes in given separation systems and the molecular descriptors accounting for the structural differences among the investigated analytes [22]. Such relationships may provide insight into the molecular mechanism of separation in a given chromatographic system, generate knowledge about the various interactions taking place between the solute and the stationary phase, evaluate physicochemical properties of analytes and identify the most informative structural descriptors. Due to the need to control the PCBs level in the environment, the analytical methods for their analysis are currently based on their separation by GC [23] using on capillary columns with different polarities [24, 25] and specific detectors such as the flame ionization detector (FID) [26], the electron-capture detector (ECD) [27, 28] and mass spectrometry (MS) [29, 30]. The HRGC-RRT on a capillary column with detection by ECD is a unique characteristic of the PCBs and can be used for the identification purpose. Despite the broad range of GC stationary phases available, none can separate all PCBs from each other. Various techniques have been used based on the combination of commercially available GC columns to improve the separation efficiency of PCBs [3134]. At present, a database of RRTs and co-elutions for all 209 congeners on 20 different stationary phases with MS or ECD detection has been reported [17, 35].

In the past, several attempts have been made to build QSRR models on the prediction of RRT for PCBs on different stationary phases [3643]. Hasan and Jurs [37] used the five-variable regression equation for prediction of GC-RRT of 209 PCBs with R 2 = 0.997 and standard deviation of 0.017. Liu et al. [38] used the five-variable regression equation with R 2 = 0.9928 and the root mean square errors of 0.0152 based on molecular electronegativity distance vector (MEDV) descriptors to correlate with the GC-RRTs of 209 PCB congeners on the SE-54 stationary phase. Ren et al. [41] using CODESSA software package and principal component analyzed (PCA) presented a QSRR study for the GC–GC–TOFMS (time-flight mass spectrometry) chromatographic relative retention time of 209 PCB congeners. PCA was used to recognize groups of samples with similar behavior and assist the separation of the data into training and test sets. Jäntschi et al. [42] reported the use of a molecular descriptors family (MDF) in QSRR modeling to predict the chromatographic relative retention times of 209 PCBs on a capillary column of SE-54.

In the preceding paper [43], several estimation models derived from the HRGC-RRT values of all 209 PCB congeners on the 18 stationary phases by a GA-based best multiple linear regression analysis with four optimal descriptors were proposed. The present work is focused on the successful application of a semi-empirical topological method for the prediction of the HRGC-RRTs 209 PCBs congener’s values on the same stationary phases. Selecting some semi-empirical chemical descriptors such as molecular volume (V M) and molecular surface area (S M) as well as the number of substituted chlorine atoms as descriptors, the QSRR models correlating to the RRTs of 209 PCB congeners are developed using the elimination selection stepwise best multiple linear regression (BMLR) analysis. It has been found that the QSRR models have not only high estimation qualities and high stabilities but also good predictive potentials.

2 Experimental

2.1 Experimental Database Set

The observed HRGC-RRTs of 209 PCBs on 18 different stationary phases, 30m DB1, 30m SPB-Octyl, 60m SPB-Octyl, 100m CP-Sil5-C18, 30m DB5-MS, 60m RTX-5, 50m CP-Sil-13, 30m SPB-20, 30m HP-35, 60m RTX-35, 30m DB-17, 60m HP-1301, 30m DP-XLB, 30m DB-35-MS, 50m HT-8, 30m Apiezon L, 30m CNBP#2, 48m 007-23, reported by Frame [17, 35] served as experimental data in this study and the HRGC-RRTs designed as dependent variables. The structures of the PCBs together with their relative retention time values are listed in Table S1 (see Supplementary Material).

2.2 Computer Hardware and Software

All calculations were run on a Pentium IV personal computer (CPU at 2.6 MB) under Windows XP operating system. The ISIS/Draw version 2.3 software was used for drawing the molecular structures [44]. Molecular modeling and geometry optimization were employed by Hyperchem (version 7.1, HyperCube) [45]. Dragon software [46] was employed for calculation of theoretical molecular descriptors. SPSS software (version 13.0, SPSS) (http://www.spss.com/) was used for stepwise MLR analysis and other calculations were performed in the MATLAB (version 7.0, Math Works) environment.

2.3 Descriptor Generation

To obtain QSRR models, PCB congeners must be represented using molecular descriptors. Descriptors are generated solely from the molecular structures and aimed to numerically encode meaningful features of each molecule. The calculation process of the molecular descriptors is described as below: all the two-dimensional structures of the molecules were drawn using ISIS/Draw 2.3 program [44]. Then the 3D geometry structures of the molecules were pre-optimized using MM+molecular mechanics force filed and precisely optimized with semi-empirical AM1 method implemented in HyperChem software package (Hypercube, version 7.0) [45]. All calculations were carried out at restricted Hartree–Fock level without configuration interaction. The molecular structures were optimized using the Polak–Ribiere algorithm until the root mean square gradient was 0.01 kcal Å mol−1 (200 K, gas phase). In order to prevent the structures locating at local minima, geometry optimization was run many times with different stating points for each molecule.

During this investigation, 22 molecular descriptors were studied to characterize all 209 PCB congeners. Among these, six quantum chemical descriptors, like the dipole moment of the molecules at X, Y and Z directions (μ), the standard heat of formation (ΔH ºf ), the energy of the highest occupied molecular orbital (E HOMO), the energy of the lowest unoccupied molecular orbital (E LUMO), E 2HOMO , E 2LUMO ; 11 theoretical descriptors, molar mass (M w ), the natural logarithm of molar mass (lnM w ), molecular volume (V M), molecular surface area (S M), molecular polarizability (α), hydration energy (H e), total energy of the molecule (E tot), bending energy (E ben), electronic energy (Eele), nuclear energy (E nuc), molar refractivity (R f); and the other five descriptors, the number of ortho-substituted chlorines (o-Cl), meta-substituted chlorines (m-Cl) and para-substituted chlorines (p-Cl), the root square of the number of chlorine substituents (Cl1/2) and the square number of chlorine substituents (Cl2), belonged to topological-type ones; were calculated by the HyperChem and Dragon programs. The number of these descriptors presented the electronic structure-related information. For instance, E HOMO imply the electron-donating ability, E LUMO denote electron-acceptance ability, μ refers to molecular dipolar property, ΔH ºf is the heat released or absorbed (enthalpy change) during the formation of a pure substance from its elements at constant pressure. Some other descriptors revealed the molecular size and topological information such as M w , o-Cl, m-Cl, and p-Cl. In addition, the logarithm transferation and square along with root square operations are to calculate a number of nonlinear components in linear model.

3 Results and Discussions

The calculated semi-empirical topological descriptors were collected in a data matrix (D) whose number of rows and columns were the number of molecules and descriptors. At the beginning, in order to minimize the information overlapping in descriptors and to reduce the number of descriptors required in regression equation, the concept of non-redundant descriptors (NRD) [47] was used in our study. That is, when two descriptors are correlated by a linear correlation coefficient value greater than 0.85, both descriptors are correlated with the dependent variables and the better correlation is used for the actual analysis, leaving out the descriptors showing a lower correlation. This objective-based feature selection left reduced and predictive descriptors for the studied compounds. By using these criteria, 17 out of 22 original descriptors were eliminated. These descriptors can give some information on the affecting degree for RRTs of different descriptors and well understanding the correlation between the experimental and calculated values. Therefore, o-Cl, m-Cl, p-Cl, and one of S M or V M set of descriptors has been used in the QSRR models for RRTs prediction on all capillary stationary phases.

3.1 Linear Regression of RRTs with S M and V M

To reduce redundancy in the descriptor data matrix, correlation of descriptors with each other and with the RRT values of the PCB congeners was examined to detect collinear descriptors (i.e. r > 0.9). The correlation coefficient matrix for the descriptors used in this study, is listed in Table 2. Among these descriptors, molecular volume (V M) and molecular surface area (S M) are presenting the molecular size. They are highly correlated with each other. The correlation coefficient amounts to as high as 0.9891 for the present set of 209 PCBs congeners. Consequently, at the start, the correlation of the semi-empirical topological descriptors S M and V M along with the RRT of all PCB congeners of each stationary phase has been studied. The influences of the S M and V M descriptors on the correlation coefficient (R 2) and the standard deviation (SE2) for the resulted 18 univariate models versus stationary phases are plotted in Fig. 2. Obviously, low to very high correlation coefficients were obtained for both semi-empirical topological descriptors. As can be shown, for nine stationary phases (S4, S6, S8, S10, S20, S22, S24, S26, and S27), the S M produced a better model as the V M, while for four stationary phases (S1, S14, S15, and S16) the V M produced a better model as the S M but for other stationary phases (S11, S12, S13, S17, and S21), the correlation coefficient (R 2) and the standard deviation (SE2) are almost equally. Being colinear of both descriptors (S M and V M), to derive the best QSRR models for all stationary phases, caused to eliminate one of them from the selected set of descriptors. According to the obtained results, the molecular volume semi-empirical descriptor should be removed. Therefore, the ability of the resulting QSRR regression models to enable accurate prediction of the relative retention time is not related to colinearity between the variables.

Table 2 Correlation matrix for the inter-correlation of various semi-imperial descriptors of 209 PCBs congeners
Fig. 2
figure 2

The influences of the molecular surface area (S M) and molecular volume (V M) descriptors on the correlation coefficient (R 2) and the standard deviation (SE2) for the resulted 18 univariate models versus stationary phases

3.2 Multilinear Regressions with Descriptors

The general purpose of multiple regressions is to quantify the relationship between several independent or predictor and dependent variables. A set of coefficients defines the single linear combination of independent variables (molecular descriptors) that best describes RRT values. The RRT values for each PCB on the stationary phase would then be calculated as a composite of each semi-empirical topological descriptor weighted by the respective coefficients. A multiple linear model can be represented as:

$$ y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} + \cdots + \beta_{k} X_{k} + \varepsilon $$
(1)

where \( \left\{ {X_{1} , \ldots ,X_{k} } \right\} \) are molecular descriptors, β 0 is the regression model constant, β 1– β k are the coefficients corresponding to the descriptors X 1X k and y is dependent variable. The values for β 0 – β k are chosen by minimizing the sum of squares of the vertical distances of the points from the hyperplane so as to give the best prediction of y from X. Regression coefficients represent the independent contributions of each calculated molecular descriptor. In matrix notation, we will write the MLR model is defined in Eq. 2 as:

$$ y = {\mathbf{X}}b + e $$
(2)

When X is of full rank the least squares solution is: \( \hat{b} \) = (X T X)−1 X T y where \( \hat{b} \) is the estimator for the regression coefficients in \( \hat{b}. \) The advantages of MLR are that it is simple to use and the derived models are easy to interpret. The sign of the coefficients β 0 − β k shows whether the molecular descriptors contribute positively or negatively to the target property and their magnitudes indicates the relative importance of the descriptors to the target property. Table 3 summarizes the best correlation between the computed semi-empirical molecular structure descriptors including o-Cl, m-Cl, p-Cl, and S M and RRT through MLR analysis for the total sets of PCB congeners on 18 GC capillary columns, where R 2 is the calibration squared correlation coefficients, RMS is the root mean square error, REP is the relative error of prediction, F is the Fisher’s criterion at the 95% level probability of the equations. The value after the symbol “±” in the parenthesis is the standard deviation related to the regression coefficient. As can be seen in this Table, the R 2 values above 0.9651 with the exception of system 27 (R 2 = 0.9272) and the RMS and REP below 0.0179, 3.8538; except for system 27 (RMS = 0.0335, REP = 7.2682), indicated that the MLR models have good statistical qualities with low prediction error and demonstrated an excellent predictive power of the obtained QSRR models for all stationary phases. In addition, taking into account the signs of the correlation coefficients, the following explanation of RRT-semi-empirical descriptors relationships can be given. The terms of o-Cl, m-Cl and p-Cl are positively while the S M term is negatively correlated with the RRT in all QSRR models, which indicate increasing the number of chlorine atoms substituted in ortho, meta and para positions on biphenyl rings caused to increase further interaction with the stationary phases possessing different polarities, while increasing the molecular size caused to increase its surface area and stronger intermolecular dispersion force will be achieved, therefore the PCB molecule with larger S M value does not tend to be adsorbed onto the stationary phase (the RRT become less).

Table 3 QSRR models and statistical parameters of GC-RRT values versus (52 + 180) for the total sets (n = 209) of polychlorinated biphenyl (PCB) congeners on 18 GC capillary columns

3.3 Model Prediction-Validation

Model validation is a critical component of QSRR development. A number of procedures have been established to determine the quality of QSRR models. Therefore, a leave-one-out cross-validation (LOO-CV), Y-randomization, and external validation (EV) procedures through an odd–even number and division of the entire data set into training and test sets are used to validate the predictive ability and check the statistical significance of the developed 18 QSRR models.

3.4 Cross-Validation

The most popular validation method is cross-validation (CV), known as jack-knifing or leave-one-out (LOO). This method systematically removes one data point at a time from the training set, and constructs a model with the reduced data set. Subsequently, the model is used to predict the data point that has been left out. By repeating the procedure for the entire data set, a complete set of predicted properties and cross-validated statistics can be obtained. It has been argued that the LOO procedure often overestimates the predictivity of the model and that, subsequently, the QSRR models are overoptimistic [48]. For cross-validated statistics, it has been suggested that prediction residual error sum of squares (PRESS), cross-validated square correlation coefficient (R 2cv ) and root mean square error in cross-validation (RMScv) are good estimates of the real prediction error of a model:

$$ \hbox{PRESS} = \sum\limits_{i = 1}^{N} {(y_{{{\text{pred}},i}} - y_{{{\text{obs}},i}} )^{2} } $$
(3)
$$ R_{cv}^{2} = 1 - {\frac{{\sum\nolimits_{i = 1}^{N} {(y_{{{\text{pred,}}i}} - y_{{{\text{obs,}}i}} )^{2} } }}{{\sum\nolimits_{i = 1}^{N} {(y_{{{\text{obs,}}i}} - } \bar{y}_{\text{obs}} )^{2} }}} = 1 - {\frac{\hbox{PRESS}}{{\sum\nolimits_{i = 1}^{N} {(y_{{{\text{obs,}}i}} - \bar{y}_{\text{obs}} )^{2} } }}} $$
(4)
$$ \hbox{RMS}_{\rm{cv}} = \sqrt {{\frac{{\sum\nolimits_{i = 1}^{N} ( y_{{{\text{pred,}}i}} - y_{{{\text{obs,}}i}} )^{2} }}{N}}} $$
(5)

where N is the number of training patterns, y obs,i and y pred,i are the experimental, and predicted RRTs of the left-out compound i, respectively and \( \bar{y}_{\text{obs}} \) is the average experimental RRT of left-in compounds different from i. Values of R 2cv can range from one to less than zero. A value of one indicates a perfect prediction, and a value of 0 means that the QSRR derived has no modeling power. Negative values arise from a situation where the derived QSRR is a poorer description of data than no model at all. The R 2cv values can be considered as a measure of the predictive power of a model: whereas R 2 can always be increased artificially by adding more parameters, R 2cv decreases if a model is over parameterized [49], and is therefore a more meaningful summary statistic for predictive models. The correlation coefficients (R 2cv ) and RMScv for each subset are presented in Table 3 and the resulted values are plotted in Fig. 3. The cross-validation results show that the R 2cv are higher than 0.9230 and RMSCV lower than 0.0345 for all GC stationary phases. Furthermore, in all cases, the cross-validated R 2cv values are very close to the corresponding R 2 values and the cross-validated RMScv values are only slightly larger than the corresponding RMS values. Clearly, the cross-validation demonstrates the final models to be statistically significant.

Fig. 3
figure 3

Plots of the RRTs estimated for the odd set (◆, A) and even set (○, B) samples by holdout model versus that observed RRTs experimentally for all stationary phases of PCBs

This method is not a very rigorous model predictivity test and suffers from two other major deficiencies: the time to carry out the cross-validation increases as the square of the size of training set; the method produces n final models (each corresponding to one of the training set molecules being left out) and it is not clear which is the ‘best’ model. To further check the prediction ability of the resulting QSRR models two better methods are applied here, one by Hawkins and co-workers in 2004 [50] namely the odd–even external validation and the other better method is to remove a percentage of the training set into a prediction set [49, 51].

3.5 Odd–Even External Validation

To validate and develop a credible QSRR model, it is not enough to build a model for the whole data set. So, the 209 (208 for System 15) data set PCBs for all stationary phases were sorted in the ascending order of RRT values and then divided into two sets namely “odd set” and “even set” RRTs [49, 51]. This way of splitting ensures that the distribution of RRT values of the two subsets were very similar. The QSRR models were fitted to the odd set and even set samples separately and the resulted fitness were assessed by applying QSRR models to both samples. To compare the estimation abilities of the models, two statistical parameters namely root mean squares error (RMSE) and R 2, were calculated. The same data set (i.e. ‘calibration set’) that was already used to fit the models was employed to determine resubstitution parameters, i.e. RMSERS and R 2RS , also to determine holdout parameters, i.e. RMSEHO and R 2HO for the other data set, which was not involved in the fitting. The resubstitution statistical parameters of the samples base their predictions on the regression fitted to those samples and this is while the holdout statistical parameters base their predictions on the regression fitted to the other samples. The plots of RRTs estimated by odd- and even-set QSRR models (holdout prediction) versus the RRTs observed experimentally are given in Fig. 3, also Table 4 summarizes these statistical parameters achieved by this approach. As can be seen, in the odd-and even-set samples, the resubstitution and holdout RMSE are very similar, indicating that the same sample and other sample predictions are equally precise for all stationary phases.

Table 4 Statistical parameters of the over-fitting and predictive ability of the models

3.6 Y-Randomization Test

Another procedure that is easy to perform is a randomization test called Y-randomization (randomization of response, i.e. in our case RRT). In this method for each column stationary phase, the output RRTs values of the compounds are shuffled randomly, and the resulting data set is examined by the QSRR method against real (unscrambled) input descriptors to determine the correlation and predictivity of the resulting “model” [43, 5254]. The whole procedure is repeated on many different scrambled data sets. The rationale behind this test is that the significance of the real QSRR model would be suspect if there is a strong correlation between the selected descriptors and the randomized response variables. The randomization was repeated ten times. If the statistical qualities of these models are much lower than the original model, it can be considered that the model is reasonable and had not been obtained by chance. The results are shown in last column of Table 2. Very low level of R 2max (in the interval of 0.0012 for S1 and 0.0030 for S27) indicates good results in our original models and is not due to a chance correlation or structural dependency of the training set for each stationary phase of the GC column.

3.7 Calibration and Prediction Sets

In this investigation, for further testing the predictive ability of the models for the external compounds without the models, part of the congeners are picked up from 209 (208 for System 15) PCBs to construct a training set which is used to develop a prediction model and then predict the values of RRTs in the remaining congeners. How to pick up the compounds in the training set is very important for developing of the predictive QSRR models. In this case, before each training run, all data sets were split randomly into two separate sub-matrices: the training set matrix and external testing set matrix. Out of 160 congeners (159 for System 15) (76.6%) were used for the training set and 49 congeners (23.4%) were used as external validation. The PCBs constituting the training and testing sets are clearly presented in Table S1. Moreover, the same divisions were repeated with corresponding RRTs values. The test examples are marked as bold font and training set was also used to obtain the best fit equation of MLR with four semi-empirical descriptors. Furthermore, the testing set was used to monitor overfitting the MLR models. The resulted MLR models for training set congeners were the same as those obtained for the entire set of all PCBs in each subset subject to use descriptors of all congener’s models supporting sufficient ability for the prediction set of 49 PCBs. The resulting regression equations of the training set for individual HRGC column stationary phases with the optimal four descriptors are indexed in Table 5, and results obtained are plotted in Fig. 4. Statistical parameters for the best-fitted models are also presented in Table 6. The correlation coefficients (R 2) of the obtained models are >0.96 for all the stationary phases except for system 27 (0.9157), and the highest one is 0.9915 for System 6. The root mean square errors (RMS) and relative error prediction (REP) of estimation ranged from 0.0124, 2.7646 of System 4 to 0.0198, 4.4217 of System 14 (except for System 27), respectively, also the F statistic values are >977.4 (except System 27). The LOO-CV method was used to examine the stability of QSRR models, and the values of R 2cv and RMScv for the models were above 0.9590 and in the range of 0.0129 and 0.0205 (except for System 27). The predicted RRTs versus the observed RRTs of the 160 (159 for System 15) PCB training sets are plotted in Fig. 4 (diamond). As shown in Table 6 and Fig. 4, the QSRR statistical results exhibit good estimation capacity and stability for internal training set PCB samples to individual stationary phases. High predictive ability of QSRR models for external examples is another criterion of a good QSRR model. The predicted RRTs of 49 PCBs in the external testing set by the models in Table 5 are also demonstrated in Fig. 4 (circle) versus the observed RRTs of 18 GC stationary phases. For all 18 HRGC stationary phases, the regression of the observed and predicted RRTs had a high agreement with the diagonal of each chart. The predicted correlation coefficients (R 2) over 0.9726 with the exception of System 27 (R 2 = 0.9497), the root mean square errors (RMS) and relative error prediction (REP) below 0.0209 and 4.603, respectively, except for System 27 (RMS = 0.0370, REP = 7.2079) demonstrated an excellent predictive power of the obtained QSRR models.

Table 5 QSRR models of GC-RRT values versus (52 + 180) for the training sets (n = 160) of polychlorinated biphenyl (PCB) congeners on 18 GC capillary columns
Fig. 4
figure 4

Plots of the RRTs estimated by the QSPR models in Table 5 vs. that observed for 160 (system 15 159) training set PCBs (◆) and 49 testing set ones (○) for all stationary phases

Table 6 Statistical parameters of the QSRR models obtained using different molecular descriptors

4 Conclusion

The HRGC-RRT values of PCBs on 18 capillary stationary phases (S1, S4, S6, S8, S10, S11, S12, S13, S14, S15, S16, S17, S20, S21, S22, S24, S26, S27), depend on four semi-empirical molecular descriptors, o-Cl, m-Cl, p-Cl and S M. MLR with non-redundant descriptors (NRD) produced more predictive, informative and significantly improved QSRR models. The validation and predictive ability of the models were examined by three methods of leave-one-out cross-validation, Y-randomization, and external validation. The methods indicated that the resulted multiparametric QSRR models have high prediction ability and low overfitting. All QSRR models but one related to the System 27 column stationary phase provide a reasonably well calibrated correlation coefficient (R 2 = 0.9619–0.9915) and the LOO cross-validation correlation coefficient (R 2CV  = 0.9590–0.9909).