Immobilized artificial membrane-chromatographic and computational descriptors in studies of soil-water partition of environmentally relevant compounds

Chromatographic retention factor log kIAM obtained from immobilized artificial membrane (IAM) HPLC with buffered, aqueous mobile phases and calculated molecular descriptors (molecular weight — log MW; molar volume — VM; polar surface area — PSA; total count of nitrogen and oxygen atoms -(N + O); count of freely rotable bonds — FRB; H-bond donor count — HD; H-bond acceptor count — HA; energy of the highest occupied molecular orbital — EHOMO; energy of the lowest unoccupied orbital — ELUMO; dipole moment — DM; polarizability — α) obtained for a group of 175 structurally unrelated compounds were tested in order to generate useful models of solutes’ soil-water partition coefficient normalized to organic carbon log Koc. It was established that log kIAM obtained in the conditions described in this study is not sufficient as a sole predictor of the soil-water partition coefficient. Simple, potentially useful models based on log kIAM and a selection of readily available, calculated descriptors and accounting for over 88% of total variability were generated using multiple linear regression (MLR) and artificial neural networks (ANN). The models proposed in the study were tested on a group of 50 compounds with known experimental log Koc values by plotting the calculated vs. experimental values. There is a good close similarity between the calculated and experimental data for both MLR and ANN models for compounds from different chemical families (R2 ≥ 0.80, n = 50) which proves the models’ reliability. Supplementary Information The online version contains supplementary material available at 10.1007/s11356-022-22514-x.


Introduction
The fate and transport of solutes in the environment depend on their physico-chemical properties such as lipophilicity, volatility, water solubility, and ability to partition between soil and water. The soil-water partition coefficient K oc (normalized to the soil organic carbon content to reduce the differences among soils) is a very important parameter governing the fate of compounds in the soil-water compartment (Doucette 2003). It influences the chemicals' mobility in soil, their environmental persistence, and the processes of their removal in wastewater management facilities (Andrić et al. 2016). A very high value of K oc means a compound is strongly retained by soil and organic matter and does not move throughout the soil. A very low value means a molecule is highly mobile in soil. K oc is a very important input parameter for estimating environmental distribution and environmental risk related to a chemical substance. A highly mobile substance is more likely to move to groundwater or surface water from soil.
Direct determination of K oc is based on studies of partitioning phenomena in biphasic soil-water systems using batch or continuous flow experiments. Such tests are, however, tedious and time-consuming, and the results tend to be inconsistent due to experimental difficulties -incomplete separation of phases and volatilization and biological or chemical degradation of solutes (Hodson and Williams 1988). The need to obtain K oc values very rapidly and for large groups of compounds has led to the development of other methods. Entirely computational approaches are based on molecular descriptors such as water solubility WS (Gawlik et al. 1997), octanol-water partition coefficient P ow (Karickhoff et al. 1979;Hodson and Williams 1988;Pussemier et al. 1990;Müller and Kördel 1996;Gawlik et al. 1997;Doucette 2003), molecular connectivity indices MCI (Gawlik et al. 1997;Tao et al. 2001), topological indices (Meylan et al. 1992;Gawlik et al. 1997), or solvation parameters (Gawlik et al. 1997;Poole and Poole 1999;Nguyen et al. 2005;Poole et al. 2013). Alternatively, it is possible to predict soil-water partition using chromatographic descriptors obtained by liquid chromatography on octadecyl-, cyano-, diol-, ethyl-, trimethylammonium-, or aminopropyl-modified silica or on sorbents containing immobilized humic acid or soil (Helling and Turner 1968;Praveen-Kumar et al. 1987;Vowles and Mantoura 1987;Jamet and Thoisy-Dur 1988;Pussemier et al. 1990;Szabo et al. 1990a, b;Kördel et al. 1993Kördel et al. , 1995aKördel et al. , b, 1997Christianson and Howard 1994;Müller and Kördel 1996;Gawlik et al. 1997Gawlik et al. , 1998Gawlik et al. , 2000Poole and Poole 1999;Szabó et al. 1999;Xu et al. 1999Xu et al. , 2002Ravanel et al. 1999;Guo et al. 2004;Bermúdez-Saldaña et al. 2006;Mrozik et al. 2008;Andrić et al. 2010;Bi et al. 2010;Hidalgo-Rodríguez et al. 2012;Poole et al. 2013;Sobańska 2021). Chromatographic methods of K oc determination are fast and relatively cheap, with the benefit of high reproducibility, especially when commercially available stationary phases are used. It is generally accepted that log K oc is connected with the chromatographic retention factor log k via a Collander-type linear relationship: log K oc = a + b log k (Bermúdez-Saldaña et al. 2006). Linear relationships have also been discovered between log K oc and thin-layer chromatographic descriptors: R M value defined by Bate-Smith and Westall: R M = log (1/R f − 1) (Bate-Smith and Westall 1950) and measured for a single composition of a mobile phase; R M 0 value obtained from a series of chromatographic experiments with mobile phases containing different concentrations φ of a water-miscible solvent (organic modifier) by extrapolation of R M vs. φ plots to zero concentration of the modifier (most frequently by using the linear Soczewiński-Wachmeister equation: R M = R M 0 + S φ (Soczewiński and Wachtmeister 1962)); the slope S taken from the Soczewiński-Wachmeister equation; the real or hypothetical concentration of the organic modifier furnishing R M = 0 (i.e., R f = 0.5): φ 0 = −R M 0 /S; PC 1 (the 1st principal component computed for thin layer chromatographic retention data). Chromatographic retention parameters are usually considered to be sufficiently good as sole predictors of log K oc , but some authors suggest that other descriptors (e.g., polar surface area, PSA) should be incorporated to obtain better fitting models (Sobańska 2021).
Biomimetic liquid chromatography on immobilized artificial membrane (IAM) stationary phases has been used to predict the physicochemical properties and bioavailability of compounds for several years (Sobanska and Brzezinska 2017). IAM chromatographic phases are excellent tools to model drug distribution because of their ability to mimic natural membranes, and the main applications of IAM chromatography are in drug discovery (Valko 2019). The more recent applications of IAM chromatography extend from medicinal to environmental chemistry. Studies of compounds' bioconcentration in aquatic organisms (Tsopelas et al. 2017(Tsopelas et al. , 2018, ecotoxicity of pesticides (Stergiopoulos et al. 2019), and soil-water sorption of herbicides (Hidalgo-Rodríguez et al. 2012) prove that IAM chromatography is a source of valuable information in many areas of research. The objective of the present study was to propose useful models of the soil-water partition coefficients normalized to organic carbon (K oc ) using a large group of structurally diverse organic compounds, based their IAM chromatographic and computational descriptors. The K oc models considered in this study should be applicable during early steps of a drug discovery process, when drugs' physico-chemical and biological properties are often studied in vitro using IAM chromatography. At this stage, high throughput is more important than accuracy and the chromatographic and computational data are collected for other purposes. It was anticipated that IAM chromatographic and computational studies of soil-water partition of compounds could be carried out concurrently with pharmacokinetic studies.

Calculated molecular descriptors
The molecular descriptors for compounds investigated during this study were calculated with HyperChem 8.0, utilizing PM3 semi-empirical method with Polak-Ribiere conjugate gradient algorithm for semi-empirical structure optimization (Polak and Ribiere 1969): total dipole moment -DM (D), molecular weight -M w (g mol −1 ), energy of the highest occupied molecular orbital -E HOMO (eV), and energy of the lowest unoccupied molecular orbital − E LUMO (eV). Other physicochemical parameters (octanol-water partition coefficient -log K ow , polar surface area -PSA (Å 2 ), H-bond donor count -HD, H-bond acceptor count -HA, polarizabilityα (Å 3 ), molar volume -V M (cm 3 ), freely rotable bond count -FRB) were calculated using ACD/Labs 8.0 software, using the SMILES codes of molecules as the input data. (N + O) -total nitrogen and oxygen atom count was calculated from the molecular structures. The Abraham's solvation parameters (A -hydrogen bond acidity; B -hydrogen bond basicity; S -dipolar interactions; E -excess molar refractivity; V -McGowan's molecular volume) were taken from the paper by Sprunger et al. (2007). The calculated molecular descriptors are given in Table 1 (Supplementary Materials).

IAM chromatography
The chromatographic retention factors log k IAM used in this study were compiled by Sprunger at al. (Sprunger et al. 2007). They were obtained on a IAM.PC.DD2 HPLC column using an aqueous mobile phase buffered at pH ≤ 3 for carboxylic acids (to suppress their ionization) and in the pH range 6.5 to 7.5 for other compounds.

MLR and ANN models
Multiple linear regression (MLR) models were generated using Statistica v. 13, stepwise forward regression mode (F to enter set at 1 and F to remove set at 0).
Multilayer perceptron (MLP) networks, with the number of inputs the same as the number of variables, the varying number of hidden units, and one output unit, were generated using Statistica v. 13 (regression mode, Automated Network Search -ANS module, 1000 networks to train, 5 networks to retain). The neuron activation functions were selected from the following group: identity, logistic, hyperbolic tangent, and exponential. BFGS (Broyden-Fletcher-Goldfarb-Shanno algorithm) was used to train the network together with the sum of squares (SOS) error function.

Calculated log K oc coefficients
A soil-water partition coefficient normalized to the organic carbon content (K oc ) is an important parameter influencing the fate of solutes in the soil-water compartment. In this study, because of the lack of experimental data for the whole studied group of compounds, K oc was initially calculated using the four most commonly accepted computational approaches (Andrić et al. 2016) (Table 2, Supplementary  Materials): where MCI is the first-order molecular connectivity index; log K ow is the estimated octanol-water partition coefficient (KOWWIN); and A, B, S, E, and V are the Abraham's solvation parameters (A -hydrogen bond acidity; B -hydrogen bond basicity; S -dipolar interactions; E -excess molar refractivity; V -McGowan's molecular volume). Log K oc (1) and log K oc (2) values were calculated using the EpiSuite software (EpiWeb 4.1, KOCWIN module) (US EPA O 2015).
At this point, attention turned to log K oc (4) as a reference soil-water partition coefficient value readily available for the whole group of studied solutes. The known theoretical log K oc models (1) to (4) have their limitations: e.g., linear models based on log K ow (log K oc (2) ) are not particularly useful for polar or ionizable solutes (Franco and Trapp 2008). However, Eq. (4) appears to be applicable to compounds across a relatively wide range of chemical families, including ionizable or polar molecules (e.g., ionizable or polar compounds within the studied group, whose experimental log K oc values are available (US EPA O 2015) -ethanol, 35; acetic acid, 98; phenylacetic acid, 122 and benzoic acid, 144).

Calculated log K oc coefficients vs. IAM chromatographic descriptors
The values of log K oc calculated according to Eq. (4) were plotted against the IAM retention factor log k IAM . The linear relationship between log K oc (4) and log k IAM (Eq. (5a)) accounts for 81% of total log K oc variability.
(1) log K oc (1) = 0.52 MCI + 0.60 + corrections The results of log K oc modeling using a single chromatographic descriptor (log k IAM ) obtained in this study (Eq. (5a)) are slightly poorer than those reported by Hidalgo-Rodríguez et al. (2012), who found the IAM retention factor superior to other chromatographic measures of soil-water partition. However, Hidalgo-Rodrigues et al. generated their IAM-chromatographic model (Eq. (5b)) using a significantly smaller group of compounds (n = 38) of much more limited structural diversity: The group of compounds considered in the current study contains solutes of very different physicochemical properties, e.g., molecules that are relatively strong organic acids and bases, from different chemical families and of different usage. The studied group contains simple organic chemicals (e.g., solvents), drugs (including, e.g., steroids, non-steroid anti-inflammatory drugs, topical anesthetics, beta-blockers, antibacterial and antiparasitic drugs), and components of fragrances/ essential oils. Some of the studied compounds are likely to be released to the environment (e.g., by washing off the skin or with feces) and seriously affect the wildlife (e.g., by interacting with the reproductive processes of aquatic animals).

Multiple linear regression studies
In the current study, compared to the previous investigations (Hidalgo-Rodríguez et al. 2012), the focus is on the possibility of making the chromatographic method of K oc prediction using the IAM retention factor more universal by incorporating additional, easily calculated molecular descriptors, and by considering both linear and non-linear relationships between the independent variables and the studied property, rather than on selecting the most effective chromatographic system. Equation (5a) accounts for only 81% of total log K oc variability; attempts were therefore made to improve a model by incorporating additional independent variables. However, this was not an easy task, since log k IAM already encodes some properties, responsible for partition and transport phenomena (especially lipophilicity expressed as log K ow -the correlation between log K ow and log k IAM is linear, with R 2 = 0.84). When lipophilicity and redundant variables defining the molecular size were excluded from the analysis, Eq. The group of 175 studied compounds was divided into two subsets: a training set (n = 125) and a test set (n = 50, compounds with known log K oc exp values). Equation (7), generated for the training set, and containing the same variables as Eq. (6), is as follows: The values of log K oc (7) were calculated for the test set according to Eq. (7) and plotted against the reference log K oc (4) values to furnish a linear relationship (R 2 = 0.86, or R 2 = 0.93 when thiourea was removed as an outlier). The model (6) was also tested on the same test set of 50 compounds, whose log K oc exp values were available (US EPA O 2015). The resulting relationship between log K oc (6) and log K oc exp is linear, with R 2 = 0.85. Equation (6), despite encouraging results of validation, was found unsatisfying because it seems unpractical and over-parameterized -it contains 10 independent variables whose contributions (apart from log k IAM and log M w ) are not very significant. At this point, it was decided to simplify Eq. (6) as much as possible; the attempts to achieve it furnished  Eqs. (8), (9), and (10), based on reduced sets of independent variables (Figs. 2, 3 and 4).
The group of 175 studied compounds was divided into two subsets: a training set (n = 125) and a test set (n = 50, compounds with known log K oc exp values). Equations (11), (12), and (13) generated for the training set, and containing the same variables as Eqs. (8) The relationships between the log K oc values calculated for the test set according to Eqs. (11), (12), and (13) and the reference values -log K oc (4) are linear (R 2 = 0.90, 0.89, and 0.90, respectively). The models (8), (9), and (10) were also tested on the same test set of 50 compounds, whose log K oc exp values were available (US EPA O 2015). The resulting relationships between the predicted and experimental log K oc values are linear, with R 2 = 0.83, 0.80, and 0.80, respectively.
Equation (9) contains, apart from log k IAM and log M W , the total N and O atom count (N + O). This descriptor is a relatively good measure of polarity and H-bonding activity of molecules, and it has been used to predict their ADME properties, e.g., the blood-brain barrier permeability (Clark 2003) or skin permeability ). However, it plays a rather minor role in soil-water partition modeling using MLR and Eq. (10) seems to be sufficient for rapid log K oc predictions.

Artificial neural network analysis
At this point, attention turned to artificial neural network (ANN) models. Artificial neural networks are widely used to predict drugs' bioavailability (Carracedo-Reboredo et al. 2021) or properties such as an affinity for phospholipids using IAM chromatography and calculated descriptors (Ciura et al. 2021). The great advantages of neural networks compared to MLR are the possibility of utilizing both linear and non-linear relationships between input data and a predicted parameter and the ability of ANNs to learn these relationships directly from the data being modeled.
In this study, the ANN models were built for the same group of compounds (n = 175) that were used in the MLR analysis. 125 substances without the known log K oc exp data were used as a training set; the remaining compounds, whose log K oc exp were available (US EPA O 2015) were divided into a test set (n = 25) and a validation set (n= 25).
ANNs make it possible to process a large number of descriptors that can be easily obtained using readily available software. The selection of ANN input data is an important step because if the number of parameters is excessive considering the number of cases, overfitting may occur. An over-fitted model fits perfectly the data used as a training set but is likely to be inapplicable to additional cases (a simple example of an over-fitted model is a polynomial of a degree equal or close to the number of data points which fits the existing data very well but is not very good at generalization while extrapolated beyond the fitted data). In this study, a decision was made to consider, apart from the IAM retention factor log k IAM , a relatively small number of parameters, especially those that are known to be good predictors of environmental phenomena (Mamy et al. 2015). According to Mammy et al., 686 different molecular descriptors are found in 790 QSAR equations generated to predict 90 environmental parameters, but the descriptors that are most widely used are E HOMO , E LUMO , M w , α, and DM. The input data used in this part of the study were selected from the following group, involved earlier in multiple regression analysis: log k IAM , log M w , PSA, (N + O), HD, HA, α, DM, E HOMO , and E LUMO , and the predicted variable was log K oc (4) . In the case of every ANN model, 1000 networks were generated and five with the smallest error were retained for further examination in search for a network that would give the results being in the best agreement with the reference data (log K oc (4) ) for the whole group of studied compounds (n = 175) and with the experimental data (log K oc exp ) for a subgroup of cases (n = 50). The ANN models investigated at this stage of the study involved the following sets of independent variables, used earlier in the multiple regression analysis (Eqs. (6), (8), (9), and (10)) ( The details of all the ANNs considered at this stage of the study are given in Table 4 (Supplementary Materials). The networks ANN 1 to ANN 4 were studied using a tool known as global sensitivity analysis (GSA), which rates the importance of the models' input variable by computing sums of squared residuals for the model when the respective predictor is eliminated, compared to the full model. When an input variable scores 1 or less in GSA, it means that this particular network would perform better if this variable is omitted; however, no such situation occurred for the networks ANN 1 to ANN 4 . All the 20 most promising networks were analyzed by plotting the predicted log K oc ANN values against the reference log K oc (4) values obtained for the whole group of 175 studied compounds. The predicted log K oc ANN values were also plotted against the experimental log K oc exp values for 50 compounds, whose log K oc exp values are available. The resulting dependences between log K oc ANN and log K oc (4) and between log K oc ANN and log K oc exp are linear (see Table 4, Supplementary Materials). It was found that the networks of the ANN 2 family have the predictive abilities comparable to those of ANN 1 networks, but since they involve a smaller number of input variables and hidden layers, they are likely to be less prone to overfitting. ANN 4 networks, based on just two input variables (log k IAM and log M w ), give the prediction results comparable to those obtained using the ANN 3 networks; similar to MLR models (Eqs. (9) and (10)), it seems that just two variables, log k IAM and log M w , account for ca. 88-89% of the total log K oc variability (depending on the ANN's architecture).
The differences and similarities between the reference values (log K oc (4) ) and the predicted values obtained according to MLR and ANN models described above were studied briefly using Cluster Analysis (k-means method). 25 log K oc values (log K oc (4) and the values predicted using MLR and ANN models: log K oc (6) , log K oc (8) to log K oc , log K oc ANN1 , log K oc ANN2 , log K oc ANN3 , and log K oc ANN4 ) obtained for 175 compounds were separated between 5 clusters with minimum within-cluster variances and maximum variances between clusters (Table 5, Supplementary Materials). It was established that the reference values (log K oc (4) ) form a cluster with the values calculated using ANN 2-1 , ANN 2-2 , ANN 2-3 , and ANN 2-4 networks (Cluster 5). The mean log K oc values obtained for clusters 1 to 5 were plotted against the experimental log K oc exp values for 50 compounds whose experimental log K oc exp values are available. The resulting relationships are linear (R 2 = 0.84, 0.79, 0.85, 0.80, and 0.84, respectively). The differences between the prediction results for clusters 1 to 5 are relatively small (Fig. 5); for the great majority of studied compounds 1 to 175, the mean values of log K oc calculated for all the clusters are similar (S.D. below 0.15, Table 6, Supplementary Materials). However, a network that attracts particular attention is ANN 2-2 ( Fig. 6) based on 6 input variables and giving the prediction results superior to those obtained using the MLR model involving the same set of independent variables (Eq. (8)). The values of log K oc ANN2-2 calculated using this network are in agreement with the reference values log K oc (4) (R 2 = 0.95, n = 175) and with the experimental values log K oc exp (R 2 = 0.86, n = 50).

Conclusions
In this study, new models of the soil-water partition coefficient K oc were developed using IAM chromatographic retention factors and calculated physico-chemical parameters of 175 structurally diverse compounds. The relationships between log K oc and the independent variables were either linear (multiple linear regression, MLR models) or nonlinear (artificial neural network, ANN models).
Due to the limited availability of experimental permeability data for the solutes investigated in this study, the reference soil-water partition coefficients were calculated according to a widely accepted theoretical model based on Abraham's solvation parameters. The values of log K oc obtained using this model are in agreement with the experimental data (where available).
Both MLR and ANN models proposed in this study have certain advantages. Compared to the MLR model, the ANN model gives slightly better prediction results which indicate non-linearity and complexity of correlations between the studied descriptors and the soil-water partition coefficient. The main descriptors, responsible for the variability of log K oc in MLR equations, are log k IAM and log M w . When other independent variables are incorporated, the ANN models gain more in predictive power than MLR equations. Nonlinear modeling is promising in terms of predicting compounds' mobility in soil with high accuracy. However, linear relationships (Eq. (10)) give sufficiently good results and have the benefit of simplicity. Linear models are also more intuitive and can be easily applied during early steps of a drug discovery process. At this stage, high throughput is essential; IAM chromatographic and computational descriptors are collected to study drugs' pharmacokinetic properties, so acquisition of input data for soil mobility studies would not require any additional effort. IAM chromatographic and computational studies of soil-water partition of compounds may therefore be a valuable extension of ADMET studies.
Data availability Data used in this manuscript are given as Supplementary Materials or in references cited in the text.

Ethical approval Not applicable
Studies in humans and animals No human subjects or animal experiments were involved during the preparation of this manuscript.

Consent for publication Not applicable
Competing interests The authors declare no competing interests. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.