Introduction

The practical usefulness of every quantitative structure–activity and/or structure–property relationship (QSAR/QSPR) model depends on its realistic predictivity (i.e. the ability to accurately predict a given activity/property for chemical compounds that have not contributed to the model’s development). Data splitting can be considered a validation technique based on the division of the input data into a training set and a test set. The model is developed and internally validated employing the training set, while its predictive power is assessed on the basis of the differences between predicted and experimental values (residuals) determined for a sufficient number of representative test set compounds. The latter procedure is called ‘external validation’. Only properly trained and validated models are able to provide reliable predictions for novel compounds [1–5].

Data splitting performed at the initial stage of QSAR/QSPR development is particularly significant, as it determines which data are utilized to train (fit) the model and which are employed for its external validation. The quest to find the most appropriate methodology for selecting training and test set compounds has led to active investigations in this area. A wide range of recently published contributions focusing on the importance of data splitting, for example [6–9], highlights two major conditions that should be met: (i) representativeness of both the training and test sets and (ii) sufficient diversity of the training set. However, no model, even when properly validated and yielding “good” values of validation statistics, is able to provide reliable predictions for the entire universe of chemicals. A model usually works much better for compounds falling inside its applicability domain (typically defined by structural/mechanistic similarity) and within the range of activity/property values of the training set [10]. Hence, in the ideal modelling case, the chemical structures and predicted response values for the training and test sets should be as similar as possible: the representative objects in the training set should be close to the objects in the test set and vice versa [11]. In other words, the training and test sets should scatter over the whole range of the considered space, defined by the descriptors of molecular structure (X) and the response (y) values [12].

In practice, several algorithms are employed to split the input data. The most common ones are based on the endpoint (y) values only (e.g. the repeated test set technique, random selection or activity sampling) [13–16], while more sophisticated techniques also take into account the values of the molecular descriptors (X) (e.g. the maximum dissimilarity method, the Kennard–Stone algorithm, the duplex algorithm, Kohonen’s self-organising maps, D-optimal design or sphere exclusion) [3–5, 17–25]. Endpoint-value-based methods of data splitting generate even distributions of compounds along the endpoint values in both created sets. However, there is a danger that the application of such algorithms may be associated with a significant loss of information, as the resulting training sets do not necessarily represent the entire descriptor space of the input data. Consequently, the test set compounds may be distant from those included in the training set. In contrast, algorithms in which the X values contribute to the data splitting are more likely to generate representative sets consisting of compounds evenly distributed within the chemical space spanned by the values of both the y vector and the X matrix. Such an approach should ensure closeness between test and training set compounds [26]. Although opinions have been expressed in the academic literature, no firm, practical recommendations on dataset splitting have so far been available in any of the official guidelines for QSAR/QSPR modellers.

In the present research, we focused on the influence of data splitting on the external predictivity of QSAR/QSPR models. By comparing a series of models redeveloped using different splitting schemes (y-based, X-based, or y- and X-based) and particular splitting techniques, we have tried to formulate some general recommendations for QSAR/QSPR practitioners based on the observed trends.

Materials and methods

Six case study models considered to be of high quality were selected from the available literature, and then redeveloped and validated on the basis of five alternative training/test set splitting algorithms, namely: (i) a commonly used y-based algorithm we call ‘Z:1’, in which the compounds are sorted in ascending order according to the values of the response (y), and then every Zth (e.g. third) object is selected into the test set, while the remaining compounds form the training set; (ii–iv) three variations of the Kennard–Stone algorithm; and (v) the duplex algorithm. The external validation statistics reported for each model served as a basis for the final comparison of the investigated methodologies.

Case study models selection

Six QSAR/QSPR models, published in peer-reviewed journals, were chosen as the case studies. From a large number of published models, we selected only those of “good quality”, developed and documented according to the Organisation for Economic Co-operation and Development (OECD) principles for the validation of (Q)SAR/(Q)SPR models [26]. Thus, we considered only models that were internally/externally validated, yielded good statistics for goodness-of-fit, robustness and predictivity, and had a well-defined applicability domain. It should be emphasised that the purpose of this exercise was not to correct or criticise any of the existing models, but only to use them as illustrative examples of the relationship between predictive performance and the methodology of training/test set design.

Reproducibility of the original modelling procedures, appropriately documented by the models’ developers, was a crucial criterion for the case study selection. The reproducibility of a model itself is a very general concept. In practice, it depends on two main factors. The first concerns the availability of original data used for the model development and validation. Neither the model nor the training/test set should be proprietary, which means that the values of the dependent (y, response) and all independent (X, molecular descriptors) variables for each compound used in the model development should be disclosed. The second factor concerns the mathematical approach to the modelling itself. For the sake of reproducibility, models based on linear relationships, developed with more transparent techniques, such as (multiple) linear regression ((M)LR), would be more desirable. The MLR methodology is extensively described in numerous papers, e.g. [16, 27, 28]. In general, QSAR/QSPR equations developed with MLR consist of a relatively small number of independent variables and, as such, can be more readily interpreted. Moreover, the MLR modelling technique can be relatively easily repeated by other authors, also with software tools other than those originally used. Hence, in this study, we focused on MLR-derived QSAR/QSPR models.

Availability of the data and transparency of the mathematical algorithm are two necessary conditions, but they are not always sufficient to ensure the reproducibility of a QSAR/QSPR model. Another important factor is adequate and transparent documentation of the applied modelling procedure (i.e. a step-by-step protocol). In order to find as well-documented models as possible, we screened the QSAR Model Reporting Format (QMRF) Database developed by the European Commission’s Joint Research Centre, which is freely accessible online at http://ecb.jrc.ec.europa.eu/qsar/qsar-tools/index.php?c=QRF [29]. The QMRF Database is an inventory gathering information on several published QSAR/QSPR models, harmonised and structured according to the OECD (Q)SAR/(Q)SPR validation principles [26]. Many of the QMRF reports are supplemented with attachments (e.g. xls files or structure data (.sd) files) providing complete information on the training and test set compounds (their structures, y and X values, etc.). The QMRFs provide transparent descriptions of the subsequent steps of the modelling procedures used, as well as information on the statistical performance of the models and their applicability domains. For the purpose of the present investigation, we screened 56 documents published in the QMRF Database, filtering by the modelling algorithm (MLR).

Our intention was to compare the impact of different data splitting algorithms on the predictive abilities of the models in the broadest possible sense. As such, we considered both global and local MLR models of various sizes, covering diverse toxicological/environmental endpoints as well as physical/chemical properties. The only limitation was the practical possibility of reproducing the original model development.

Initially, six QSAR/QSPR models were selected as the case studies (Table 1). Two QSPR models (model 1 and model 2) originated from our previous work [16]. Four QSARs were selected from the JRC QMRF Database; they were related to toxicokinetic (model 3), toxicological (model 4) and eco-toxicological (model 5 and model 6) endpoints [29–35]. These models were well documented, providing all the necessary information on the training/test set compounds (we extracted the endpoint and descriptor values from .sd files attached to the individual QMRF reports) [32–35].

Table 1 QSAR/QSPR models selected for the study

Before the final selection of the case study models, we verified that reproducing the original calculations would lead to the same equation coefficients and validation statistics as provided by the original authors. Each of the tentatively selected models was re-developed and re-evaluated in MATLAB v. R2010b [36], employing the original training and test sets. Data from the training sets were used to determine the appropriate statistics describing goodness-of-fit, robustness and internal predictivity, namely: the squared correlation coefficient (R²); the root-mean-square error of calibration (RMSEC); the leave-one-out cross-validation coefficient (Q²CV); and the root-mean-square error of the leave-one-out cross-validation (RMSECV). Commonly used mathematical formulations of these statistics can be found elsewhere [16, 27]. The statistics obtained by using the test sets (the external validation coefficient, Q²EXT, and the root-mean-square error of prediction, RMSEP) were utilized to verify the external predictivity of the models and were of crucial importance in our comparisons. These parameters were calculated as follows:

$$ \text{RMSE}_{\text{P}} = \sqrt{\frac{\sum\nolimits_{i = 1}^{n_{\text{v}}} \left( y_{i}^{\text{obs}} - y_{i}^{\text{pred}} \right)^{2}}{n_{\text{v}}}} $$
(1)
$$ Q_{\text{EXT}}^{2} = 1 - \frac{\sum\nolimits_{i = 1}^{n_{\text{v}}} \left( y_{i}^{\text{obs}} - y_{i}^{\text{pred}} \right)^{2}}{\sum\nolimits_{i = 1}^{n_{\text{v}}} \left( y_{i}^{\text{obs}} - y_{\text{obs}}^{\text{mean}} \right)^{2}} $$
(2)

where n_v is the number of test set compounds, y_i^obs is the experimental response value for the ith test set compound, y_i^pred is the response value predicted by the model for the ith test set compound, and y_obs^mean is the mean observed value of the endpoint (y).
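For illustration, both statistics can be computed directly from the observed and predicted test set responses. The following is a minimal Python/NumPy sketch of Eqs. 1 and 2 (the function name is ours; the mean in Eq. 2 is taken over the observed test set responses, as the formula is written):

```python
import numpy as np

def external_validation_stats(y_obs, y_pred):
    """Compute RMSEP (Eq. 1) and Q_EXT^2 (Eq. 2) for an external test set."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_obs - y_pred
    press = np.sum(residuals ** 2)              # predictive residual sum of squares
    rmsep = np.sqrt(press / len(y_obs))         # Eq. 1: root of the mean squared residual
    tss = np.sum((y_obs - y_obs.mean()) ** 2)   # total sum of squares around the mean
    q2_ext = 1.0 - press / tss                  # Eq. 2: external validation coefficient
    return rmsep, q2_ext
```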

Re-development of the case study models with various data splitting methods

The essential step of the present investigation was the multiple re-development of each selected case study model with different training sets, designed by employing various data splitting algorithms, namely: (i) the Z:1 algorithm; (ii) the Kennard–Stone algorithm performed on the matrix of molecular descriptors (the X matrix); (iii) the Kennard–Stone algorithm performed on a matrix in which the molecular descriptors (X) were augmented by an additional column containing the response values (y); (iv) the Kennard–Stone algorithm performed on a matrix similar to that in (iii), but with the additional y vector (column) replicated k times to enhance the influence of the response on the splitting results; and (v) the duplex algorithm performed on the descriptor matrix (X) only.

Z:1 is the most commonly applied algorithm in QSAR/QSPR studies, mainly due to its simplicity. It does not utilize the values of the molecular descriptors; the splitting procedure involves the y (response) values only. As mentioned above, test compounds are selected in a systematic way based on their sorted response values. Such an approach produces two sets that accurately represent the distribution of the endpoint values [16, 30, 31].
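A minimal sketch of the Z:1 split in Python/NumPy (function and argument names are ours), assuming the test compounds are taken at every z-th position along the sorted response:

```python
import numpy as np

def z_to_1_split(y, z=3):
    """Z:1 split: sort compounds by response and place every z-th one in the test set."""
    y = np.asarray(y, dtype=float)
    order = np.argsort(y)               # ascending order of the response values
    test_mask = np.zeros(len(y), dtype=bool)
    test_mask[order[z - 1::z]] = True   # every z-th compound along the sorted response
    train_idx = np.flatnonzero(~test_mask)
    test_idx = np.flatnonzero(test_mask)
    return train_idx, test_idx
```

With z = 3, roughly one third of the compounds end up in the test set, evenly spread across the response range.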

In contrast, the Kennard–Stone algorithm takes into account only the values of the molecular descriptors (X) [20, 28]. Initially, the most representative, ‘central’ compound is selected into the training set: the algorithm searches for the single compound whose descriptor values are closest to the mean values calculated over the whole group of compounds. Then, a defined number (sufficiently large and determined by the developer) of the most dissimilar objects (chemicals) is also introduced into the training set. The dissimilarity measure in this case is the squared Euclidean distance between particular objects in the multidimensional space in which each descriptor defines a single dimension. Thus, the most dissimilar compounds are the most distant ones (i.e. those characterized by the maximal values of the squared Euclidean distance). The remaining compounds are incorporated into the test set.
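A sketch of this max–min selection in Python/NumPy, following the description above (the function name is ours; ties are broken arbitrarily):

```python
import numpy as np

def kennard_stone_split(X, n_train):
    """Kennard-Stone selection of n_train training compounds from the descriptor matrix X."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # seed with the most 'central' compound (closest to the descriptor means)
    seed = int(np.argmin(np.sum((X - X.mean(axis=0)) ** 2, axis=1)))
    selected = [seed]
    # smallest squared Euclidean distance from each compound to the selected set
    d_min = np.sum((X - X[seed]) ** 2, axis=1)
    while len(selected) < n_train:
        nxt = int(np.argmax(d_min))     # most dissimilar remaining compound
        selected.append(nxt)
        d_min = np.minimum(d_min, np.sum((X - X[nxt]) ** 2, axis=1))
    train_idx = np.array(selected)
    test_idx = np.setdiff1d(np.arange(n), train_idx)
    return train_idx, test_idx
```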

Usually, the Kennard–Stone algorithm is performed only on the X matrix. However, in our contribution we also tested two variations of this methodology. In the first one (Xy), we added the response vector (y) as an additional column to the matrix of k descriptors (X). In the second modification (X_k_y), we added the response vector k times (where k is equal to the number of descriptors in the X matrix), in order to enhance the impact of the response values on the data splitting results.
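Both variants then reduce to running the same selection on an augmented matrix. A sketch of the augmentation (the helper name is ours; the column-wise autoscaling is our assumption, added so that the appended response column is commensurate with the descriptors):

```python
import numpy as np

def augment_with_response(X, y, replicate=1):
    """Append the response vector to X: replicate=1 gives the Xy variant,
    replicate=k (k = number of descriptors) gives the X_k_y variant."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    A = np.hstack([X] + [y] * replicate)
    # autoscale columns so that no single variable dominates the Euclidean distances
    return (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)
```

For example, kennard_stone_split(augment_with_response(X, y, replicate=X.shape[1]), n_train) reproduces the X_k_y scheme using the sketch given earlier.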

These ways of data splitting, based on the Kennard–Stone algorithm and its modifications, should lead to the formation of two representative sets including all types of chemical structures. The two modifications of the Kennard–Stone method should, in principle, ensure that the training set compounds are distributed evenly not only within the space defined by the descriptors (X), but also along the response values (y). As such, the condition of closeness between test and training set compounds in both respects (X and y) should be satisfied [26].

The duplex algorithm utilizes the X values only. Its sequential methodology is based on maximizing the Euclidean distances between newly selected compounds and those already selected. In the first step, the two most distant (i.e. most dissimilar) objects are picked and incorporated into the training set. From the remaining compounds, the two most dissimilar ones are included in the test set. Then, from the remaining objects, the one furthest away from those previously selected for the training set is added to the training set; analogously, the object furthest from the current test set becomes the next test set compound. The two procedures are repeated alternately until a sufficient number (indicated by the developer) of training set compounds is chosen. Such a procedure leads to the formation of two balanced sets consisting of objects uniformly distributed within the whole descriptor (X) space [26, 37].
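A compact Python/NumPy sketch of this alternating selection (our implementation of the textbook duplex procedure; for large datasets the full pairwise distance matrix would be replaced by an incremental computation):

```python
import numpy as np

def duplex_split(X, n_train):
    """Duplex split of the descriptor matrix X into training and test indices."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # pairwise squared distances
    remaining = set(range(n))

    def farthest_pair(idx):
        idx = sorted(idx)
        sub = d2[np.ix_(idx, idx)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        return idx[i], idx[j]

    train, test = [], []
    for target in (train, test):        # seed each set with the most distant remaining pair
        a, b = farthest_pair(remaining)
        target.extend([a, b])
        remaining -= {a, b}

    add_to_train = True
    while len(train) < n_train and remaining:
        target = train if add_to_train else test
        rem = sorted(remaining)
        # pick the compound whose minimal distance to the target set is maximal
        pick = rem[int(np.argmax([min(d2[r, t] for t in target) for r in rem]))]
        target.append(pick)
        remaining.remove(pick)
        add_to_train = not add_to_train
    test.extend(remaining)              # leftovers complete the test set
    return np.array(train), np.array(test)
```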

All calculations within this step of the study were performed in MATLAB v. R2010b [36] with external codes (m-files) for the Kennard–Stone-based and duplex-based data splitting [12]. Individual models were developed by means of the MLR method [16]. Since the impact of training/test set size on the predictivity of the models was not investigated here, when re-splitting and re-developing the models we kept the ratio of training-to-test compounds proposed by the authors of the original contributions. Each newly designed training set was used for the QSAR/QSPR model development, while the corresponding test set was used for its external validation. The complete set of statistical parameters was calculated for each model (i.e. R², Q²CV, RMSEC, RMSECV). However, for the purposes of this study, we focused mainly on those related to the external predictivity, namely RMSEP and Q²EXT (Eqs. 1, 2).

Results and discussion

A positive outcome of the “reproducibility check” confirmed the consistency between the original and repeated calculations. An overview of the selected case study models, as well as the original (where available) and calculated (by the present authors) values of the external validation statistics, is provided in Table 2.

Table 2 Validation statistics of the models selected for the study

Each of the six selected QSARs/QSPRs was originally developed and validated with data sets split by the classical Z:1 algorithm. Since we re-developed the original models with four additional splitting algorithms, this yielded a set of 30 models in total to be compared. As mentioned above, we applied the additional splitting algorithms while keeping the original ratio of training-to-test set compounds. It should be highlighted that models 1, 2 and 6 had relatively large test sets (Table 2), whereas the test sets of models 3–5 were very small. This allowed us to observe additional, data splitting-related trends.

Interestingly, the external validation statistics of every original model could be improved by applying alternative data splitting methodologies based not only on the response (y) values but also on the molecular descriptors (X) (Table 3). We observed that such algorithms contribute to the formation of more balanced and homogeneous training and test sets. As such, the training/test set compounds were situated close to each other within the considered chemical space, and the condition for both sets to be representative was fulfilled. In the majority of cases, the best results (lowest RMSEP values) were observed for algorithms that use information on both the y and X values, namely the Kennard–Stone Xy and Kennard–Stone X_k_y variants. However, when considering Q²EXT as the measure of external predictivity, the best results (highest values) were obtained for the methods that take into account only information on the structural variance of the compounds (X) (duplex-based or Kennard–Stone X-based data splitting). Indeed, the information on X seems to have more influence on appropriate splitting than the response values (y).

Table 3 The impact of investigated data splitting algorithms on the statistical external validation parameters for selected case study models

Some additional observations can also be made. The external validation statistics of the models, when analyzed individually, exhibit different sensitivities to the replacement of the y-based data splitting methodology with the alternative ones (those taking into consideration the values of X). The differences in sensitivity can be observed when analyzing the values of ΔRMSEP and ΔQ²EXT (Table 3), which quantitatively describe the improvement in the external predictivity of the models. Moreover, in general, the sensitivity of Q²EXT depends much more on the size of the test set than the sensitivity of RMSEP does. This becomes evident when comparing the variances for models 1, 2 and 6 (having large test sets) with those for models 3–5 (having small test sets) (Table 4).

Table 4 Variances (s²) of the external validation parameters in comparison with the size of the test set

The observations above can be explained by the following reasoning. The predictivity of a particular QSAR/QSPR model is strongly driven by the distribution of the training set compounds in the chemical space defined by the X values on the one hand, and by the y values on the other. Ideally, the training set compounds should be evenly scattered over the whole space. Under such a condition, the model is well trained and the predictions of the response (y) are satisfactory. However, the ability to correctly predict the response for novel compounds (not used for training the model) must be verified with use of the external test set. To be representative, the test set should also evenly cover the whole chemical space. In practice, this condition can be fulfilled only for sufficiently large test sets. For small test sets there is a very high probability that their constituents will be unevenly distributed within the considered chemical space. In the extreme situation, the test set compounds form a small cluster situated in only one region of the chemical space covered by the training set. Such a test set is neither representative nor well balanced and leads to misleading results of the external validation.

Model 5 can serve as an illustrative example of such a situation. The unexpected values of its statistical validation parameters (Q²EXT < 0 for the Kennard–Stone-based data splitting) reflect the unusual localisation of the test set compounds in the corresponding chemical space. The test set covers only the lowest values of y, thus the whole space of the response/descriptors is not appropriately represented. The statistical external validation, when performed only on the basis of Q²EXT, suggests that such a model is completely externally unpredictive. Clearly, this is not entirely true. When the RMSEP value is considered, the lowest values of this statistic are observed after applying the Kennard–Stone-based splitting techniques. These contradictory results can be explained by looking at the mathematical formulas of Q²EXT and RMSEP (Eqs. 1 and 2). Both statistics are calculated from the sum of squared residual values (i.e. differences between the observed and predicted values of y). RMSEP is simply the root of the average squared residual in the test set. The calculation of the external validation coefficient, however, is more involved: the value of Q²EXT is the difference between 1 and the ratio of the sum of squared residuals (PRESS) to the sum of squared deviations of the observed values of y from the average y (TSS). In consequence, the influence of one or more unexpected predictions (unusually high residuals) on Q²EXT is stronger than on RMSEP, since in the latter case the squared residuals are averaged and the root is taken at the end, whereas Q²EXT operates on the raw ratio of sums of squares without averaging. Thus, when one or two residuals are extremely high and the test set is small, it is possible for the ratio PRESS/TSS to exceed 1, in which case the calculated external validation coefficient is negative. This also explains why we observed a strong influence of the test set size on the Q²EXT values.
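The effect is easy to reproduce numerically. In the hypothetical five-compound test set below (values invented purely for illustration), the observed responses cluster in a narrow range, so TSS is tiny; a single poor prediction then drives Q²EXT far below zero while RMSEP remains moderate:

```python
import numpy as np

# hypothetical five-compound test set with a narrow response range
y_obs  = np.array([1.0, 1.1, 1.2, 1.3, 1.4])
y_pred = np.array([1.1, 1.0, 1.3, 1.2, 2.4])   # one badly predicted compound

press  = np.sum((y_obs - y_pred) ** 2)          # 4 * 0.01 + 1.00 = 1.04
tss    = np.sum((y_obs - y_obs.mean()) ** 2)    # 0.10
rmsep  = np.sqrt(press / len(y_obs))            # ~0.46, still moderate
q2_ext = 1 - press / tss                        # 1 - 10.4 = -9.4, strongly negative
print(rmsep, q2_ext)
```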

This case study shows that, particularly for models evaluated on the basis of very small test sets, conclusions on the final external predictivity should not be drawn from one statistical parameter alone, but should also be related to the other relevant measures. Models with small test sets are much more sensitive to the choice of data splitting methodology, which means that the results obtained might be less robust and meaningful than those for models with large test sets. Consequently, the decision concerning the data splitting algorithm must be made with particular care.

When discussing the most appropriate choice of splitting algorithm, a significant comment concerning the reliability of our results for the X-based techniques must be added. In fact, truly external validation could be performed only for models 1 and 2, since both were developed on the basis of molecular descriptors selected a priori, on a mechanistic basis only. In the case of the remaining models, a reasonable number of independent variables was selected by the authors of the original contributions from broad “pools” of more than 1000 tentatively calculated descriptors. The selection of descriptors was performed on a statistical basis, for instance by using a genetic algorithm. This leads to a lack of reproducibility in the modelling procedure: in the majority of cases, the complete information on the descriptors forming the large “pool” was not available in the original publication. The available data sets contained only the values of the final variables selected on a statistical basis and incorporated into the model equation. Therefore, in the case of models 3–6 we were only able to perform the alternative data splitting and calculate the validation statistics on the basis of the pre-selected independent (X) variables. As a consequence, since the compounds labelled as ‘test’ in our study had previously been involved in the variable selection, the validation procedures with such test sets were not strictly ‘external’. In a real situation, when X variables need to be selected from a large pool of calculated descriptors, the test compounds should never be involved in the variable selection process. Moreover, it is highly probable that, when the splitting with X-based algorithms (i.e. Kennard–Stone or duplex) is performed on the whole pool of 1000 or more descriptors (before the final selection of variables), neither the training nor the test set would be sufficiently representative (i.e. evenly distributed in the space of the finally selected variables). This is a serious limitation of such splitting algorithms.

Our results are highly consistent with previous contributions to the area and supplement some of the findings. Leonard and Roy [9] demonstrated that the application of the K-means clustering technique (utilizing the descriptor X values) for input data splitting leads to much better external validation statistics of the resulting models than the random splitting and/or splitting methods that are based only on the response (y) values (i.e. ‘activity ranges algorithm’). Moreover, they highlighted that the splitting procedure should take into account the proximity of training and test compounds; both training and test sets should consist of the molecules representing the whole multidimensional descriptor space.

The authors [9] observed high values of the external validation coefficient (Q²EXT) irrespective of the size of the data set. It is worth noting, however, that the three models studied by Leonard and Roy [9] were externally validated with test sets of moderate or even large size (n_t = 9, 14 and 22), in comparison with the relatively small test sets of models 3, 4 and 5 investigated in our study. Thus, Leonard and Roy did not have a chance to observe the strong influence of test set size on the Q²EXT values elaborated here. Their results are in agreement with our suggestion that this influence becomes important when the test set is very small (contains fewer than 10 test compounds).

Leonard and Roy [9] also state that “The size of the test set is an important factor in identifying the predictive potential of the data set, so one may intend to explore the optimum size of the test set in relation to the size of the training set”. The same research group also investigated the impact of the training set size on the predictive ability of QSAR models [38]. They concluded that the optimum size of the training set depends on many factors: the particular data set, the number and types of descriptors, and the statistical analysis being used; no general rule can be formulated. In the context of our results, this leads to the recommendation that the experimental input data set should be large enough to ensure an appropriate number of training compounds (dependent on the factors mentioned above) and at least 10 test compounds. When the number of test compounds is much smaller than 10, an external validation is still possible, but one should expect the Q²EXT value to be strongly dependent on the splitting technique.

Gramatica [6] performed a broad evaluation of other statistical approaches for the validation of QSAR/QSPR models. As part of this project, she compared three data splitting algorithms, namely: (i) D-optimal experimental design, (ii) the Kohonen artificial neural network (K-ANN) and (iii) random splitting. The conclusions were similar to those from our study: models with small test sets were found to be more sensitive to the data splitting methodology than models validated with test sets containing many compounds. This confirms the importance of selecting the most appropriate splitting technique whenever a QSAR/QSPR model is developed and validated with a small set of data, for instance in the case of local models for particular congeneric groups of Persistent Organic Pollutants [39, 40].

Conclusions

In the present study we illustrated the impact of data splitting methodologies on the external predictivity of QSAR/QSPR models. We demonstrated that, although the results varied slightly for the selected models, it was possible to make some generalizations and identify several common trends.

The results of external validation are strongly dependent on the composition of the training and test sets. The application of splitting techniques that utilize the values of the molecular descriptors, alone (X) or in combination with the model response (y), always led to models with better external predictivity than those designed with methodologies based on the y values only.

In the case of models trained and validated with a very small number of compounds, the splitting methodology can strongly influence the external validation results. Since Q²EXT appears to be more sensitive to the splitting technique than RMSEP when the test set is small (between 5 and 10 compounds), we recommend that both statistics be taken into account when evaluating the external predictivity of such models.

Whenever the model input variables are selected by a statistical approach (e.g. with a genetic algorithm) from a large pool of calculated descriptors, y-based splitting techniques should be preferred, to ensure that a truly external validation remains possible and that the final QSAR/QSPR has the best predictive ability.

In our contribution we selected the most commonly used methodologies of data splitting, other than those previously evaluated by Leonard and Roy [9] and Gramatica [6]. However, taking into account the strong need for practical guidance on QSAR/QSPR development, further investigations should also include more sophisticated resampling methods, e.g. bootstrapping [41, 42].