Background

Natural toxins are toxic chemicals produced by living organisms including plants, algae, fungi and cyanobacteria. They encompass several different chemical classes and constitute a potential risk to human and environmental health [1]. Natural toxins are environmental contaminants of emerging concern, particularly in the context of climate change, which could drive changes in species distribution patterns and in metabolic pathways that lead to increased toxin production [2].

Natural toxins produced by plants (phytotoxins) pose a hazard in Europe, but the quantity and severity are unknown [1]. An example of a problematic plant-produced natural toxin is ptaquiloside, which is synthesized by bracken fern (P. aquilinum) and is known to leach into soil and reach surface water and groundwater that are sourced for drinking [3]. Another example is the aristolochic acids, which can leach from European birthwort (A. clematitis), contaminate groundwater, and are known to cause kidney damage [4]. Some recent studies have reported the presence and abundance of other groups of phytotoxins, such as quinolizidine alkaloids [5], pyrrolizidine alkaloids [6] and saponins [7], but field data of this kind are still scarce. Assessing phytotoxins based on their chemical properties to identify other potential persistent, mobile and toxic (PMT) chemicals that could impact source water for drinking is hence an important research priority [8].

Recently, Günthardt and co-workers published the Toxic Plants-PhytoToxins (TPPT) database [9], which includes 1586 toxins produced by plants in Switzerland. They scored the phytotoxins in the TPPT database for their potential to contaminate water based on their persistence and mobility using Log KOW (i.e., the log of the octanol–water partition coefficient) and biodegradation half-lives predicted with quantitative structure–property relationship (QSPR) models in EPI Suite [10].

The QSPR models in EPI Suite have been widely applied for predicting physico-chemical properties and half-lives of chemicals for screening-level hazard assessment (for example, [11]). QSPR models, which are examples of QSAR (Quantitative Structure Activity Relationship) models, use property measurements for a “training set” of compounds to build a model, and measurements for a “validation set” of compounds to quantify model performance. KOWWIN was developed to estimate measured Log KOW values [12], and BIOWIN to estimate half-lives from expert opinions [13]. Both models use a fragment approach, in which structural fragments and other molecular descriptors are the building blocks of regression models. The EPI Suite models were developed using property data measured for training sets of mostly anthropogenic chemicals, such as persistent organic pollutants and other organochlorine pesticides, modern pesticides, personal care products and industrial chemicals. Relatively few compounds of natural origin were included, and only 2% and 0.6% of the phytotoxins in the TPPT database are found in the training sets for KOWWIN and BIOWIN5, respectively.

It is crucial to define the applicability domain of QSPRs to evaluate the adequacy of the models to predict chemical properties [14]. The OECD principles for the validation of QSAR models highlight the need to have a defined domain of applicability which specifies the scope of the model [15]. Conceptually, a model’s predictions for chemicals within the applicability domain can be viewed as interpolations and those for chemicals outside the applicability domain can be viewed as extrapolations [16]. When QSPR models are extrapolated to predict properties for chemicals outside the applicability domain, the predictions are subject to unknown and unquantifiable uncertainties. However, there is no single accepted method to define the applicability domain of a QSPR. One way of evaluating the applicability domain of a QSPR model is by determining the locations of chemicals in the training set in a chemical space defined by the molecular descriptors used in the model, and defining an applicability domain boundary in that descriptor space [17].

In this work, we define model-specific applicability domain thresholds for KOWWIN and BIOWIN5 and assess whether the 1586 plant toxins in the TPPT database are within the applicability domains of these models using both our model-specific thresholds and generic threshold values that have been proposed in the literature. Our goal is to identify phytotoxins in the database that are outside the applicability domain of the QSPR models, and thus have unquantifiable uncertainties in their modeled persistence and mobility profile. These substances should be assigned a high priority for experimental determination of Log KOW and biodegradability to reduce uncertainties in their environmental fate profile, and to generate data that can be added to QSPR training sets to improve screening assessments of phytotoxins.

Materials and methods

The TPPT database details toxins produced by a variety of different plant species found in Switzerland and Central Europe [9]. The 1586 chemicals in the database are each assigned a “toxin number” (T1–T1586), and are characterized according to occurrence frequency in Switzerland, estimated toxicity and environmentally relevant properties, including half-lives and Log KOW values. For most toxins, Günthardt et al. included physico-chemical properties and half-lives in the TPPT database that were estimated from structure using EPI Suite v4.11 [10].

EPI Suite is a compilation of QSPR models that predict environmentally relevant properties of organic chemicals, such as Log KOW by KOWWIN and aerobic biodegradability by BIOWIN5. The QSPR models are based on molecular descriptors that represent the structure of molecules [18]. In EPI Suite, structural fragment counts and correction factors are used as molecular descriptors [19]. Each chemical’s structure is described as a numerical vector containing the number of times each descriptor used in the QSPR model is present. The models are formulated by multiple linear regression of the number of occurrences of each descriptor for the chemicals in the training set against the property of interest. KOWWIN has 2447 chemicals in the training set and 10,873 in the validation set, and BIOWIN5 has 589 chemicals in the training set and 295 in the validation set.
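The fragment approach described above can be illustrated with a minimal sketch. It assumes a made-up descriptor matrix with three hypothetical fragments; the counts, coefficients and property values are invented for illustration and are not KOWWIN or BIOWIN5 values.

```python
import numpy as np

# Hypothetical descriptor matrix: rows are chemicals, columns are counts of
# three made-up structural fragments (not real KOWWIN/BIOWIN5 descriptors).
X = np.array([
    [2, 0, 1],
    [1, 1, 0],
    [0, 2, 1],
    [3, 1, 0],
    [1, 0, 2],
], dtype=float)

# Invented fragment contributions and intercept, used only to generate
# internally consistent illustrative property values.
true_coef = np.array([0.5, -0.3, 0.2])
true_intercept = 0.1
y = X @ true_coef + true_intercept

# Multiple linear regression: append a column of ones for the intercept and
# solve the least-squares problem.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(fragment_counts):
    """Predict the property of one chemical from its fragment counts."""
    return float(np.dot(fragment_counts, coef[:-1]) + coef[-1])
```

Because the illustrative property values are generated without noise, the regression recovers the invented coefficients exactly; with measured data the fit would carry residual error.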

The entire list of molecular descriptors and coefficients for the QSPRs in EPI Suite is not directly accessible in the software, but the training and validation sets are available at http://esc.syrres.com/interkow/EpiSuiteData.htm. We obtained the molecular descriptors for the KOWWIN and BIOWIN5 training and validation sets by feeding the compound structures of these sets into the models and parsing the batch output files with Python v. 3.7.3 [20] and the pandas package [21]. In this way, we re-assembled the dataset that was used to build each model, and we then constructed matrices that contained all information about the molecular descriptors of the training and validation sets of KOWWIN and BIOWIN5. We used the same method to extract the descriptor counts for the phytotoxins and produce a matrix with the same format. The end results were, for each model, three matrices of molecular descriptors for the training, validation and TPPT sets of chemicals, and the property data for the training and validation sets of chemicals.
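As a rough sketch of the matrix-building step, the function below assembles descriptor-count matrices from already-parsed records. The record layout `(compound_id, descriptor_name, count)`, the function name and the example fragments are hypothetical; the actual EPI Suite batch output format is not reproduced here. Dropping descriptors that never occur in the training set is an assumption made for the sketch, not a documented step of the original workflow.

```python
def build_descriptor_matrix(records, columns=None):
    """Turn (compound_id, descriptor_name, count) records into a count matrix.

    If `columns` is given, the matrix is aligned to that descriptor order so
    that different compound sets share the same format; descriptors missing
    from `columns` are ignored (an assumption of this sketch).
    """
    counts = {}
    for compound, descriptor, count in records:
        row = counts.setdefault(compound, {})
        row[descriptor] = row.get(descriptor, 0) + count
    if columns is None:
        columns = sorted({d for row in counts.values() for d in row})
    compounds = sorted(counts)
    matrix = [[counts[c].get(d, 0) for d in columns] for c in compounds]
    return compounds, columns, matrix

# Invented example records for a training set and one phytotoxin.
train_records = [("A", "-CH3", 2), ("A", "-OH", 1), ("B", "-CH3", 1)]
toxin_records = [("T1", "-OH", 3), ("T1", "-NH2", 1)]

train_ids, descriptor_names, train_matrix = build_descriptor_matrix(train_records)
toxin_ids, _, toxin_matrix = build_descriptor_matrix(toxin_records, descriptor_names)
```

Reusing the training-set column order for the phytotoxin matrix is what keeps all three matrices in the same descriptor space.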

To define the applicability domain of KOWWIN and BIOWIN5, we measured the distance of chemicals to the centroid of the training set in multi-dimensional chemical descriptor space. To reduce the dimensionality of the chemical descriptor space, principal component analysis (PCA) was performed. In PCA, each component captures variability associated with several descriptors, which reduces the number of dimensions. We therefore calculated the position of compounds in each of the n PCA-reduced dimensions of molecular descriptor space. The unitless Euclidean distance (dE,x) of each compound (x) from the centroid of the space occupied by compounds in the training set was then calculated as:

$${d}_{E,x}=\sqrt{\sum_{i=1}^{n}{\left({p}_{i,x}-{\overline{p}}_{i}\right)}^{2}},$$

where n is the number of dimensions in the PCA-reduced space of the QSPR model, \({p}_{i,x}\) is the location of compound x in PCA-reduced dimension i, and \({\overline{p}}_{i}\) is the average location of the training set compounds in PCA-reduced dimension i. Greater dE values indicate greater structural dissimilarity to the compounds in the training set, and hence dE thresholds have been used to classify chemicals as being outside of the domain [22]. We calculated dE for every chemical in the TPPT set, the validation set and the training set itself with R v. 4.0.2 [23], the packages Rmarkdown [24] and papaja [25], and multiple helper functions [26,27,28,29,30].
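The distance calculation can be sketched as follows. This is a minimal Python illustration rather than the R code used in this work; it assumes the descriptors are centred on the training-set mean but not rescaled, and that PCA is fitted on the training set only. The function names and the randomly generated counts are hypothetical.

```python
import numpy as np

def fit_pca(train, n_components):
    """Return the training-set mean and the leading principal axes."""
    mean = train.mean(axis=0)
    # SVD of the centred matrix; the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]

def distance_to_centroid(X, mean, axes):
    """Euclidean distance of each row of X from the training centroid in PCA space."""
    # Scores are the locations p_{i,x}; because the data are centred on the
    # training mean, the training centroid lies at the origin of PCA space.
    scores = (X - mean) @ axes.T
    return np.sqrt((scores ** 2).sum(axis=1))

# Invented fragment counts standing in for training and query compounds.
rng = np.random.default_rng(0)
train = rng.poisson(1.0, size=(50, 6)).astype(float)
query = rng.poisson(3.0, size=(10, 6)).astype(float)

mean, axes = fit_pca(train, n_components=3)
d_train = distance_to_centroid(train, mean, axes)
d_query = distance_to_centroid(query, mean, axes)
```

Projecting the query compounds with the axes fitted on the training set keeps every distance in the same descriptor space that defines the model.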

Thresholds

We consider four threshold values of dE from the literature [14] to distinguish between toxins inside and outside the domain: (1) the maximum distance of a member of the training set (dEmax), (2) three times (3\(\overline{d}\)E) and (3) twice (2\(\overline{d}\)E) the average Euclidean distance of members of the training set (\(\overline{d}\)E), and (4) the 95th percentile distance (dE95) of the members of the training set.
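The four generic thresholds follow directly from the training-set distances, as in the sketch below. The function name, dictionary keys and example distances are invented for illustration, and the linear interpolation that `np.percentile` applies by default is an assumption; other percentile conventions would give slightly different values.

```python
import numpy as np

def literature_thresholds(d_train):
    """Four generic dE thresholds computed from training-set distances."""
    d_train = np.asarray(d_train, dtype=float)
    mean_d = d_train.mean()
    return {
        "dEmax": d_train.max(),               # maximum training-set distance
        "3x_mean": 3 * mean_d,                # three times the mean distance
        "2x_mean": 2 * mean_d,                # twice the mean distance
        "dE95": np.percentile(d_train, 95),   # 95th percentile distance
    }

d_train = np.array([1.0, 2.0, 3.0, 4.0, 10.0])  # invented distances
thresholds = literature_thresholds(d_train)

# A compound is flagged as out of domain when its distance exceeds the threshold.
d_query = np.array([0.5, 9.0, 12.0])
out_of_domain = {name: d_query > t for name, t in thresholds.items()}
```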

Furthermore, we estimated model-specific dE thresholds by exploring the performance of the models at calculating property values for members of the validation set as a function of dE. To this end, we plotted the difference in root mean squared error (∆RMSE) of chemicals outside a continuum of boundaries versus chemicals inside the boundaries by sorting the chemicals in the training set by increasing dE, then iteratively calculating ∆RMSE for boundaries where (n = 1,2,3,…x) of the training set chemicals with lowest dE were assumed to be “in domain”. We then plotted ∆RMSE of the validation set as a function of the percentile rank of the training set chemical in our ordered list that defined the dE threshold. For reference, and to confirm that dE is a meaningful metric for defining domain of applicability, we performed the same analysis 10 times with the training set ordered randomly instead of by increasing dE. We then identified applicability domain thresholds unique to KOWWIN and BIOWIN5 at local maxima in the plot of ∆RMSE calculated for the training set ordered by increasing dE.
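One way to read the procedure above is sketched below: each training-set distance in turn serves as a candidate threshold, and ∆RMSE is taken as the RMSE of validation-set prediction errors outside the threshold minus the RMSE inside it. The function names and example values are hypothetical, and the treatment of ties (`<=`) and of empty groups (returned as NaN) are assumptions of the sketch.

```python
import numpy as np

def rmse(errors):
    """Root mean squared error of an array of prediction errors."""
    return float(np.sqrt(np.mean(np.square(errors))))

def delta_rmse_curve(d_train, d_valid, valid_errors):
    """Return (percentile ranks, delta RMSE) for thresholds at each training-set dE.

    Delta RMSE is RMSE(outside) - RMSE(inside) for the validation set and is
    NaN where either group is empty.
    """
    d_train = np.sort(np.asarray(d_train, dtype=float))
    d_valid = np.asarray(d_valid, dtype=float)
    valid_errors = np.asarray(valid_errors, dtype=float)
    ranks = 100.0 * np.arange(1, len(d_train) + 1) / len(d_train)
    deltas = np.full(len(d_train), np.nan)
    for k, threshold in enumerate(d_train):
        inside = d_valid <= threshold
        if inside.any() and (~inside).any():
            deltas[k] = rmse(valid_errors[~inside]) - rmse(valid_errors[inside])
    return ranks, deltas

# Invented distances and prediction errors (predicted minus measured).
d_train = [1.0, 2.0, 3.0, 4.0]
d_valid = [0.5, 1.5, 2.5, 3.5, 5.0]
valid_errors = [0.1, -0.1, 0.2, -0.2, 1.0]
ranks, deltas = delta_rmse_curve(d_train, d_valid, valid_errors)
```

A model-specific threshold would then be read off at a local maximum of `deltas` plotted against `ranks`.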

Results

Euclidean distances for KOWWIN and BIOWIN5 for training, validation and phytotoxin sets

For KOWWIN, Euclidean distances of the compounds in the three datasets from the centroid of the training set (Fig. 1, left panel) range from 1.34 to 78.41 for the training set, from 1.31 to 98.03 for the validation set, and from 1.47 to 131.89 for the phytotoxin set. For BIOWIN5, they range from 0.75 to 31.51 for the training set, from 0.84 to 46.85 for the validation set, and from 0.79 to 77.24 for the phytotoxin set (Fig. 1, right panel).

Fig. 1

Distribution of dE values in the training, validation and phytotoxin compound sets of KOWWIN and BIOWIN5. Boxes encompass the interquartile range (IQR) and whiskers extend to 1.5 times IQR. Training and validation sets have similar dE values, while the phytotoxin set has consistently greater values of dE in the structural descriptor space of both models

Threshold estimation based on ∆RMSE

Results of our search for model-specific applicability domain boundaries at local maxima in prediction accuracy as a function of dE for KOWWIN and BIOWIN5 are shown in Fig. 2. The applicability domain of QSPR models is a property of the model and its training set. Therefore, for each model the primary horizontal axis of Fig. 2 is percentile of Euclidean distance of members of the training set from its centroid.

Fig. 2

Red: difference in root mean squared error of predictions (ΔRMSE) for validation set compounds with Euclidean distances below and above a continuum of thresholds set at percentiles of Euclidean distance of training set compounds from their centroid, in both models (top: KOWWIN, bottom: BIOWIN). Grey: ΔRMSE for 10 trials of randomly ordered lists of the validation set chemicals

For KOWWIN (top panel in Fig. 2), the RMSE of property predictions for validation set compounds outside the applicability domain threshold is consistently higher than for compounds inside the domain threshold for nearly the entire range of possible thresholds (red line). ∆RMSE decreases with a shallow slope from a peak at about the 5th percentile dE of the training set compounds until a first turning point at approximately the 85th percentile dE of the training set compounds, where ∆RMSE begins to increase. A local maximum of ∆RMSE occurs at the 94th percentile dE of the training set compounds, which is marked as our proposed model-specific applicability domain threshold (vertical dotted line in the top panel in Fig. 2). Above the 94th percentile, ∆RMSE first decreases, then swings randomly as the number of compounds outside the domain threshold becomes small. For reference, ∆RMSE for the continuum of thresholds in ten randomly ordered lists of the KOWWIN validation set compounds (grey lines in the top panel of Fig. 2) is close to zero across a wide range of thresholds, and deviates randomly only above the 94th percentile dE of the training set compounds.

For BIOWIN5 (bottom panel in Fig. 2), ∆RMSE for a continuum of thresholds of dE (red line) is close to zero and within the range of variability of thresholds defined in randomly ordered lists of compounds (grey lines) up to about the 80th percentile dE of the training set compounds. It rises to a local maximum at the 93rd percentile, which is marked in the bottom panel of Fig. 2 as our proposed applicability domain threshold (vertical dotted line). Above the 93rd percentile ∆RMSE plateaus, then increases with increasing dE, in a region where the high variability in the 10 randomly ordered lists indicates that ∆RMSE is determined by model performance for a small number of compounds classified as outside of domain (compare red and grey lines in the bottom panel of Fig. 2).

Comparison of chemicals outside and inside domain using different thresholds

Our model-specific applicability domain threshold for KOWWIN is at the 94th percentile dE of the training set, and is very close to the 95th percentile threshold suggested in previous literature [14] (Fig. 3 and Table 1). Our suggested model-specific threshold for KOWWIN is set at a local maximum of ∆RMSE, where Log KOW of validation set compounds inside the threshold is predicted 0.17 log units more accurately on average than Log KOW of compounds outside the threshold (Table 1). However, ∆RMSE is also positive and similar in size for three of the other thresholds that have been suggested in the literature, 3\(\overline{d}\) E, 2\(\overline{d}\) E and dE95q. The exception is dEmax, which has a ∆RMSE value close to zero that likely reflects random variability in prediction errors for the few validation set compounds that lie above this threshold (see Fig. 3).

Fig. 3

KOWWIN: Euclidean distance to the centroid of the training set (dE, plotted on a log scale on the horizontal axis) versus number density of chemicals in the training set (red), validation set (blue), and for the phytotoxins (green). Vertical grey lines correspond to thresholds from literature [14] (dEmax, 3 \(\overline{\mathrm{d}}\) E, 2 \(\overline{\mathrm{d}}\)E and dE95q) and our estimated threshold from Fig. 2 is represented as a black line, dEmodel

Table 1 Comparison of five different thresholds to define the domain of applicability of KOWWIN and BIOWIN5

Our model-specific threshold for BIOWIN5 is at the 93rd percentile dE of the training set, and is again close to the 95th percentile threshold suggested in previous literature (Fig. 4 and Table 1). Because ∆RMSE for the BIOWIN validation set increases with increasing dE above our model-specific threshold (Fig. 2), the ∆RMSE values for our model-specific threshold are not the highest of the group of possible thresholds (Table 1).

Fig. 4

BIOWIN: Euclidean distance to the centroid of the training set (dE, plotted on a log scale on the horizontal axis) versus number density of chemicals in the training set (red), validation set (blue), and for the phytotoxins (green). Vertical grey lines correspond to thresholds from literature [14] (dEmax, 3 \(\overline{\mathrm{d}}\) E, 2 \(\overline{\mathrm{d}}\) E and dE95q) and our estimated threshold from Fig. 2 is represented as a black line, dEmodel

Distribution of in and out-of-domain phytotoxins across chemical classes

Among the chemical classes of phytotoxins, steroids and saponins are most likely to be out of domain in both KOWWIN and BIOWIN when using our model-specific thresholds (Table 2). All saponins and 95% of steroids were classified as out of domain for BIOWIN5. On the other hand, 97% of the alkaloids and phenylpropanoids in the TPPT database are classified as within the applicability domain of KOWWIN, but higher percentages (46% and 19%, respectively) of these chemicals are out of domain for BIOWIN5 (Table 2).

Table 2 Distribution of in and out-of-domain phytotoxins across chemical classes using the model-specific thresholds (dEmodel)

Discussion

There is no universal method to establish the applicability domain of a QSPR model [14, 31, 32]. In this work, we define applicability domain using different Euclidean distance thresholds. Structural similarity is measured exclusively using the chemical space defined by structural descriptors that are part of the models. All Euclidean distance threshold approaches to define applicability domain are thus specific to the selected model and its descriptors since a different chemical space defined by different molecular descriptors would be distorted [31].

Our results show that up to 46% of the TPPT database predictions for half-lives reported by Günthardt and co-workers [9] are outside the applicability domain of BIOWIN and thus subject to unquantifiable errors in the predictions. Moreover, many members of this group are expected to occur as multiple stereoisomers and in ionized forms, neither of which is represented in the BIOWIN5 descriptor set.

The applicability domain of KOWWIN encompasses more of the natural toxins in the TPPT database. Only up to 21% of toxins in the database are outside the applicability domain of KOWWIN and thus subject to unquantifiable errors. Research to improve exposure hazard screening of phytotoxins should prioritize chemicals outside the applicability domain of KOWWIN and BIOWIN for experimental determination of properties, as these are the chemicals for which new measurements will constrain unquantified uncertainties and could also be used to expand the training sets of QSPRs in a way that expands their domain of applicability.

The TPPT database of phytotoxins is focused on toxins produced by plants that grow in Switzerland, and thus is not fully representative of plant biodiversity in Europe. However, the database from Günthardt et al. is the largest compendium of European plant toxins to date, and a good starting point for screening assessment. Nevertheless, more specific knowledge of biodiversity and occurrence of toxin-producing plants is needed to estimate potential phytotoxin hazard for a specific region.

Conclusions

Our distance approach for determining applicability domain of QSPR models shed light on the adequacy of using KOWWIN and BIOWIN5 models for physico-chemical property estimation in the TPPT database. The model-specific approach produced applicability domain thresholds close to the 95th percentile for both models. The 95th percentile threshold [14] could thus be a preferred choice for defining the applicability domain of other similar QSPR models.

Using the model-specific threshold, almost half of the TPPT phytotoxins were out of domain of BIOWIN, and steroids and saponins were nearly all out of domain. Saponins are the biggest out-of-domain group in both models, suggesting their structure is underrepresented in the training sets. Therefore, measuring property data for these substances should be prioritized. Model predictions for out-of-domain phytotoxins are extrapolations, and thus their predicted hazard properties carry unknown uncertainties and should be interpreted with caution.