Introduction

Quantitative structure–activity relationship (QSAR) analysis is a numerical method for finding relationships between chemical structure and drug properties, i.e., biological activity, in drug discovery [1]. Developing a QSAR model comprises several stages: (1) collecting data from the literature; (2) calculating parameters, e.g., with software packages such as Dragon or by image analysis (2D-QSAR), or with force field calculations based on three-dimensional structures (3D-QSAR); (3) developing the QSAR model with various statistical techniques, e.g., multiple linear regression, artificial neural networks and partial least squares; and (4) validating the model by internal (leave-one-out and leave-many-out) and external validation [2]. There are various critical points in QSAR studies that should be considered by researchers [3]. In particular, the challenge of selecting appropriate parameters for external validation is evident in the literature [4, 5].

In QSAR studies, training a model with linear or non-linear methods is not enough to confirm its prediction capability. The developed model should be applicable to compounds that have not yet been synthesized, for virtual screening and for designing new drug compounds. In this regard, a QSAR model can be called acceptable only when it predicts the activity of other compounds with reasonable accuracy. Therefore, external validation (splitting the data into training and test sets) is one of the major challenges in QSAR studies [6,7,8]. Various types of cross validation analysis, i.e., leave-one-out, leave-many-out and repeated double cross validation, are recommended in QSAR studies, especially when the available sample size is small [9, 10]. However, external validation is one of the most common criteria for evaluating the validity of a QSAR model [11,12,13].

Different criteria and rules have been proposed for evaluating the validity of QSAR models, most of which focus on external validation [13, 14]. Five criteria proposed in reputable journals were selected for this study; their details are described in the Methods section. They are highly cited, and several researchers have used them to evaluate the validity of QSAR models [15,16,17,18]. The designers of each criterion have demonstrated its advantages over the others for the external validation of QSAR models [5, 6, 19,20,21]. Some of them have certain defects from a statistical viewpoint, and different results are obtained depending on the software applied, e.g., for the correlation coefficient (r2) of regression through the origin [5]. Nevertheless, there is no comprehensive comparison between them for the evaluation of the external validity of QSAR models. The aim of this study is to compare these criteria for the external validation of QSAR models and to identify the advantages and disadvantages of each method.

Methods

Forty-four data sets (training and test sets), composed of experimental biological activities and the corresponding calculated activities (re-substitution values for the training sets) obtained from QSAR models built with various statistical approaches, were collected from published articles [22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48] indexed in the Scopus database (see Additional file 1 and Table 1). The absolute error (AE) of each datum (the absolute difference between the experimental and calculated values) was calculated. The external validation of these data sets was assessed with the following methods:

Table 1 The numerical values of the statistical parameters needed to calculate the mentioned external validation criteria for the 44 developed QSAR models

Proposed criteria by Golbraikh and Tropsha

I. r2 > 0.6, where r2 is the coefficient of determination between the experimental activities and the predicted values based on regression analysis.

II. 0.85 < K < 1.15 or 0.85 < K' < 1.15.

K and K' are the slopes of the regression lines through the origin of the experimental activities against the predicted values and vice versa, respectively.

III. \(\frac{r^{2}-r_{0}^{2}}{r^{2}} < 0.1\) or \(\frac{r^{2}-r_{0}^{\prime 2}}{r^{2}} < 0.1\)

\(r_{0}^{2}\) and \(r_{0}^{\prime 2}\) are the coefficients of determination between the experimental activities and predicted values and between the predicted and experimental activities, respectively, based on regression through the origin (linear least-squares regression without a constant term) [19].
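To make the three rules concrete, the following minimal Python sketch (not from the original publications; the function name is illustrative) evaluates them for a test set, assuming the experimental and predicted activities are available as NumPy arrays; r02 and r0′2 are computed here with the Eq. (3)/(4) convention discussed later in the text.

```python
import numpy as np

def golbraikh_tropsha(y_obs, y_pred):
    """Check the three Golbraikh-Tropsha rules for an external test set."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)

    # Ordinary r^2 between experimental and predicted values
    r2 = np.corrcoef(y_obs, y_pred)[0, 1] ** 2

    # Slopes of the regressions through the origin (obs ~ K*pred and pred ~ K'*obs)
    k = np.sum(y_obs * y_pred) / np.sum(y_pred ** 2)
    k_prime = np.sum(y_obs * y_pred) / np.sum(y_obs ** 2)

    # r0^2 and r0'^2 of the regression-through-origin lines (Eq. 3/4 convention)
    r0_2 = 1 - np.sum((y_obs - k * y_pred) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)
    r0p_2 = 1 - np.sum((y_pred - k_prime * y_obs) ** 2) / np.sum((y_pred - y_pred.mean()) ** 2)

    return {
        "rule_I": r2 > 0.6,
        "rule_II": (0.85 < k < 1.15) or (0.85 < k_prime < 1.15),
        "rule_III": ((r2 - r0_2) / r2 < 0.1) or ((r2 - r0p_2) / r2 < 0.1),
        "r2": r2, "K": k, "K_prime": k_prime, "r0_2": r0_2, "r0_prime_2": r0p_2,
    }
```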

Proposed criteria by Roy based on regression through origin (RTO)

Roy and coworkers suggested \(r_m^2\), which is calculated by Eq. 1 and is one of the most widely used metrics among QSAR experts in the literature [20, 49]:

$$r_{m}^{2} = r^{2} \left( {1 - \sqrt {r^{2} - r_{0}^{2} } } \right)$$
(1)

In this equation, the \(r_0^2\) value is computed using regression through the origin (RTO), i.e., linear least-squares regression without a constant term.
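A minimal sketch of Eq. 1 under the same assumptions as above (NumPy arrays of experimental and predicted values, Eq. (3) convention for r02; the function name is illustrative):

```python
import numpy as np

def rm2(y_obs, y_pred):
    """Roy's r_m^2 (Eq. 1), with r0^2 from regression through the origin."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    r2 = np.corrcoef(y_obs, y_pred)[0, 1] ** 2
    k = np.sum(y_obs * y_pred) / np.sum(y_pred ** 2)   # slope of the RTO line
    r0_2 = 1 - np.sum((y_obs - k * y_pred) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)
    diff = r2 - r0_2
    if diff < 0:
        # With some conventions for r0^2 (see Eq. 5 below) this difference can be
        # negative, in which case r_m^2 is undefined.
        return float("nan")
    return r2 * (1 - np.sqrt(diff))
```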

Concordance correlation coefficient (CCC)

Gramatica and coworker [4] suggested the concordance correlation coefficient (CCC) for external validation of a QSAR model:

$$\text{CCC} = \frac{2\sum_{i=1}^{n_{\text{EXT}}}\left(Y_{i} - \bar{Y}\right)\left(Y_{i}^{\prime} - \bar{Y}^{\prime}\right)}{\sum_{i=1}^{n_{\text{EXT}}}\left(Y_{i} - \bar{Y}\right)^{2} + \sum_{i=1}^{n_{\text{EXT}}}\left(Y_{i}^{\prime} - \bar{Y}^{\prime}\right)^{2} + n_{\text{EXT}}\left(\bar{Y} - \bar{Y}^{\prime}\right)^{2}}$$
(2)

\(Y_i\) is the experimental value, \(\bar{Y}\) is the average of the experimental values, \(Y_i^{\prime}\) is the predicted activity value and \(\bar{Y}^{\prime}\) is the average of the predicted values; \(n_{\text{EXT}}\) is the number of compounds in the external prediction (test) set. A model with CCC > 0.8 is accounted valid.
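A minimal sketch of Eq. 2, assuming NumPy arrays of the experimental and predicted test-set values (function name illustrative); a model is accounted valid when the returned value exceeds 0.8.

```python
import numpy as np

def ccc(y_obs, y_pred):
    """Concordance correlation coefficient (Eq. 2) for an external test set."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    n = len(y_obs)
    num = 2 * np.sum((y_obs - y_obs.mean()) * (y_pred - y_pred.mean()))
    den = (np.sum((y_obs - y_obs.mean()) ** 2)
           + np.sum((y_pred - y_pred.mean()) ** 2)
           + n * (y_obs.mean() - y_pred.mean()) ** 2)
    return num / den
```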

Statistical significance of the difference between the deviations of experimental and calculated data

In 2014, our research group challenged regression through the origin and proposed calculating the model errors for the training and test sets and comparing them as a reliable method for the external validation of QSAR models [5].
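A minimal sketch of this comparison, assuming the experimental and predicted activities of both sets are available as arrays; the use of SciPy's pooled-variance independent t-test is an assumption here, not necessarily the exact settings of the original work.

```python
import numpy as np
from scipy import stats

def compare_absolute_errors(y_obs_train, y_pred_train, y_obs_test, y_pred_test):
    """Compare the absolute errors (AE) of the training and test sets.

    A significant difference (independent t-test, p < 0.05) between the two
    groups of AEs is taken as a warning sign for the model.
    """
    ae_train = np.abs(np.asarray(y_obs_train, float) - np.asarray(y_pred_train, float))
    ae_test = np.abs(np.asarray(y_obs_test, float) - np.asarray(y_pred_test, float))
    t_stat, p_value = stats.ttest_ind(ae_train, ae_test)
    return {"AAE_train": ae_train.mean(), "AAE_test": ae_test.mean(), "p_value": p_value}
```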

Criteria based on training set range and the deviation between experimental and calculated data

Roy and coworkers [21], in an approach similar to our method (method 4), proposed new principles based on the training set range, the absolute average error (AAE), i.e., the average absolute difference between the experimental and predicted values of the test set, and the corresponding standard deviation (SD), as follows:

Good prediction: AAE ≤ 0.1 × training set range and AAE + 3 × SD ≤ 0.2 × training set range

Bad prediction: AAE > 0.15 × training set range or AAE + 3 × SD > 0.25 × training set range

A good model should pass both of the above criteria. Predictions that satisfy only part of the conditions can be considered moderately acceptable, as sketched below.
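The sketch below is a literal transcription of the two rule sets above, assuming the SD refers to the standard deviation of the test-set absolute errors; see [21] for the full procedure.

```python
import numpy as np

def classify_prediction_quality(y_obs_train, y_obs_test, y_pred_test):
    """Classify a model as GOOD / MODERATELY GOOD / BAD from the rules above."""
    training_range = float(np.max(y_obs_train) - np.min(y_obs_train))
    ae = np.abs(np.asarray(y_obs_test, float) - np.asarray(y_pred_test, float))
    aae, sd = ae.mean(), ae.std(ddof=1)

    good = aae <= 0.1 * training_range and aae + 3 * sd <= 0.2 * training_range
    bad = aae > 0.15 * training_range or aae + 3 * sd > 0.25 * training_range

    if good:
        return "GOOD"
    if bad:
        return "BAD"
    return "MODERATELY GOOD"
```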

Results and discussion

Table 1 lists the numerical values of the statistical parameters needed to calculate the mentioned criteria for the external validation of the 44 developed QSAR models.

From a statistical point of view, a key issue in the validation of QSAR models is that different equations exist even for calculating simple parameters such as r2 and r02 [22, 50], and these differences affect the comparison. In this work, r2 was calculated with SPSS software based on the correlation between experimental and calculated values. However, among the studied criteria there is controversy over the calculation of r02. The following equations were applied for the calculation of r02 and r0′2 in methods 1 and 2 and in Excel software [21]:

$$r_{0}^{2} = 1 - \frac{\sum \left( Y_{i} - Y_{\text{fit}} \right)^{2}}{\sum \left( Y_{i} - \bar{Y} \right)^{2}}, \qquad Y_{\text{fit}} = K\,Y_{i}^{\prime}$$
(3)
$$r_{0}^{\prime 2} = 1 - \frac{\sum \left( Y_{i}^{\prime} - Y_{\text{fit}}^{\prime} \right)^{2}}{\sum \left( Y_{i}^{\prime} - \bar{Y}^{\prime} \right)^{2}}, \qquad Y_{\text{fit}}^{\prime} = K^{\prime}\,Y_{i}$$
(4)

An alternative formula, recommended by statistical textbooks [51, 52], was proposed instead of Eqs. 3 and 4 because of statistical defects in the calculation of r2 for RTO [5, 22]:

$$r_{0}^{2} = r_{0}^{\prime 2} = \frac{\sum Y_{\text{fit}}^{2}}{\sum Y_{i}^{2}}$$
(5)

In addition to the statistical defects of Eqs. (3) and (4) for calculating r02 and r0′2, QSAR researchers may apply Eq. (5), which was proposed as the appropriate equation for r02 and is used by official statistical packages such as SPSS, and then fail to obtain reasonable results. Calculating \(r_m^2\) from an \(r_0^2\) computed with Eq. (5) (or with SPSS software) is not possible, because r2 is commonly less than \(r_0^2\) and therefore \(r^2 - r_0^2 < 0\). This is the most important defect of methods 1 and 2 for the external validation of QSAR models.
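The following sketch contrasts the two conventions for r02, assuming arrays of experimental and predicted values; because the Eq. (5) value is typically larger than r2, substituting it into Eq. (1) places a negative value under the square root, which is the defect described above.

```python
import numpy as np

def r0_squared_variants(y_obs, y_pred):
    """Contrast the two conventions for r0^2 discussed above (Eq. 3 vs Eq. 5)."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    k = np.sum(y_obs * y_pred) / np.sum(y_pred ** 2)   # slope of the RTO line
    y_fit = k * y_pred

    # Eq. (3): residuals of the RTO line measured against the mean of the experimental values
    r0_eq3 = 1 - np.sum((y_obs - y_fit) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)

    # Eq. (5): textbook definition of r^2 for regression through the origin
    r0_eq5 = np.sum(y_fit ** 2) / np.sum(y_obs ** 2)

    return r0_eq3, r0_eq5
```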

Seven of the studied models have r2 < 0.6 (Table 2) and therefore cannot be accounted valid. r2 is a simple parameter for evaluating the correlation between experimental and predicted values in QSAR studies and for estimating the correlation between concentration and response in analytical chemistry. It is only a primary criterion, and a QSAR model or an analytical method with a high r2 value does not necessarily have acceptable validity [53, 54]. In addition, squared metrics such as r2 cannot distinguish errors in one direction from errors in the other (overpredicted versus underpredicted values), and these two kinds of errors differ greatly in importance in toxicity and regulatory evaluation.

Table 2 Values of the proposed criteria (methods 1–5) for the external validation of the QSAR models

The numerical values of the other criteria proposed in method 1 show that all models have K or K' between 0.85 and 1.15. The third rule (\(\frac{r^{2}-r_{0}^{2}}{r^{2}} < 0.1\) or \(\frac{r^{2}-r_{0}^{\prime 2}}{r^{2}} < 0.1\)) is violated by only 7 models, 3 of which also have r2 < 0.6. Therefore, based on the principles suggested in method 1, 11 models are not valid.

Method 2 was proposed based on RTO, with r02 calculated by Eq. (3). Twenty-six models have \(r_m^2\) > 0.5, and the results are similar to those of method 1 (both methods are based on RTO). The models that are valid according to method 1 and have r2 > 0.75 also have \(r_m^2\) > 0.5, except model 27 with r02 = 0.101 (close to the threshold of 0.1).

The third studied method, named CCC, was proposed by Gramatica [4]. Twenty-nine models have CCC > 0.8, and all of them are valid models based on method 1. The results of methods 2 and 3 are very similar; only two models (20 and 27) have CCC > 0.8 while their \(r_m^2\) values lie near the threshold, i.e., 0.4 < \(r_m^2\) < 0.5. Method 3 is therefore comparable to the methods based on RTO. However, it does not suffer from the statistical defects and inconsistent r02 values that arise from the proposed equations (Eqs. (3) and (4) versus Eq. (5)) or from the software used (e.g., Excel or SPSS).

Method 4 is based on calculating the model errors for the training and test sets and comparing them, as a possible reliable approach to the external validation of models with r2 > 0.6 for the test set. The aim of developing a QSAR model is the prediction and elucidation of mechanisms of drug action, so the prediction capability for the training and test sets should clearly be comparable. Without considering the training set, the statistical parameters for external validation of the test set may appear acceptable, yet a significant difference (independent t-test) between the prediction power of the training and test sets can be a weakness of the model. Twenty-six models have r2 > 0.6 and no significant difference between the absolute errors (AE) of their training and test sets (p > 0.05). Twenty-three of them were also selected as valid by CCC (CCC > 0.8 and p > 0.05). Model 16 has CCC = 0.55, and the AAEs of its training and test sets are 0.412 ± 0.352 and 0.645 ± 0.489, respectively (p = 0.16). The high SD values caused by outlier data are the likely reason for the non-significant difference between the AEs, so this result cannot be taken as evidence of the model's validity. On the other hand, models 5, 24 and 25 have CCC > 0.9 and p < 0.01. The relative frequencies of the AEs for models 5, 24 and 25, sorted into three subgroups (< 0.1, 0.1–0.2 and > 0.2), are illustrated in Fig. 1. In these models the AAE values are low; however, there is a 50–250% difference between the AAEs of the training and test sets. For example, in model 5, 48% of the training set and 10% of the test set have AE less than 0.1, while 15% of the training set and 60% of the test set have AE more than 0.2. Similar patterns are observed in models 24 and 25. In addition, the residual plots for these models are shown in Fig. 2. These plots confirm that there is a significant difference between the prediction capability of the developed models for the training and test sets, which cannot be accepted as evidence of prediction capability for a QSAR model.

Fig. 1

Relative frequency of individual deviation (absolute error) for model 5 (a), model 24 (b) and model 25 (c)

Fig. 2

Residual plots for model 5 (a), model 24 (b) and model 25 (c)

The last method (method 5), proposed by Roy's research group, is based on the training set range and on the mean and standard deviation of the test set errors [21]. The models can be classified as GOOD, MODERATELY GOOD or BAD according to the proposed parameters. Most of the models were categorized as BAD (45%) or GOOD (39%), and only a few were MODERATELY GOOD (Table 2). The first point to consider is r2 > 0.6 as a necessary criterion; all models with r2 < 0.6 were classified as BAD. Moreover, good agreement is observed between CCC and the GOOD classification of method 5. However, model 11 is classified as GOOD although its CCC = 0.75 and there is a significant difference between the AEs of its training and test sets (AAEs of 0.05 and 0.13, respectively; p = 0.01). In comparison with method 4, models 5, 24 and 25 (classified as GOOD) show a large difference between the AAEs of their training and test sets (Fig. 1), which the principles proposed in method 5 could not detect. A model with a statistically significant difference between the AEs of its training and test sets should not be regarded as convincingly valid.

Furthermore, model 3 is classified as BAD although its CCC = 0.84 and the p-value for the difference between the AEs of its training and test sets is 0.18. The AAEs of the training and test sets are 0.167 ± 0.171 and 0.266 ± 0.244 (AAE ± SD), respectively. The high SD values of the training and test sets indicate the presence of outlier data, which could be taken into account through statistical parameters, e.g., the SD of the mean errors, in the external validation of QSAR models.

Typographic errors, non-uniformity of the data set applied for QSAR modeling, or mistakes in the determination of the biological activity of the studied compounds are common reasons for outlier data, which can decrease the prediction capability of a model. Docking studies of outlier cases and comparison with other compounds can help researchers detect outlier data when developing a QSAR model [55].

These results confirm previous findings that more than a single criterion is recommended to assess the real external predictivity of QSAR models [56]. Moreover, other recommended guidelines for developing QSAR models, such as cross validation, appropriate splitting of the training and test sets, variable allocation, and correlation coefficients adjusted for degrees of freedom, are additional important issues that should be considered by researchers [10, 57,58,59]. In addition, cross (internal) validation analyses, e.g., leave-many-out and leave-one-out, are recommended in QSAR studies, especially when the sample size is small [9, 10], and some reports have shown their superiority over external validation [60]. Therefore, both internal and external validation analyses, considering various criteria, are necessary to check the validity of a QSAR model.

Conclusion

The aim of developing a QSAR model is acceptable prediction of the activity of a compound before synthesis and biological evaluation; therefore, external validation is necessary. All of the developed methods for the external validation of a QSAR model are useful, and a good correlation was observed between the studied methods for the selected models. However, some differences were detected between the established methods. Methods 1 and 2 are valuable, but there are some questionable points in the equation applied for the calculation of \(r_0^2\). CCC is a valuable parameter, though in some cases it cannot detect outlier data; similar to methods 1 and 2, the training data set is not included in CCC. Methods 4 and 5 are established on both the training and test sets. They detected most invalid models, but method 5 considered some models as GOOD although the difference between the AEs of their training and test sets was substantial (p < 0.05). Conversely, high SD values in both the training and test sets may allow a model to pass the proposed criterion of method 4 even though it should be accounted invalid because of outlier data in the training and test sets. Finally, evaluating a model with any of the established methods is useful, but none of them alone necessarily establishes the validity or invalidity of a QSAR model. The results of this study show the importance of calculating the errors of the training and test sets and of detecting outliers for checking the validity of a model.