1 Introduction

The octanol–air equilibrium partitioning ratio (KOA) describes the partitioning of a chemical between octanol and air and is commonly used to approximate the partitioning between various organic phases and the gas phase, including soil organic matter [1, 2] plant foliage [3, 4], atmospheric particulate matter [5, 6], and milk and blood [7]. Previous work on contaminants in outdoor environments (e.g., [8,9,10]) and in various terrestrial organisms (e.g., [11]) has shown the importance of KOA in understanding the distribution, fate, and bioaccumulation potential of volatile and semi-volatile organic chemicals in the environment. However, these processes often occur at temperatures other than standard temperatures (i.e., 25 °C) and accurate understanding of the partitioning of a chemical in these systems may require a temperature correction.

The temperature dependence of log10 KOA can be described by the internal energy of phase transfer from octanol to air (\(\Delta {U^{ \circ}_{\text{OA}}}\), kJ·mol−1) using the van’t Hoff equation:

$${\mathrm{log}}_{10}\frac{{K}_{\mathrm{OA}} \mathrm{at} {T}_{2}}{{K}_{\mathrm{OA}} \mathrm{at} {T}_{1}}=- \frac{U^{ \circ}_{\mathrm{OA}}}{R}\cdot \left( \frac{1}{{273.15+T}_{2}}-\frac{1}{{273.15+T}_{1}} \right)\cdot {\mathrm{log}}_{10}e$$
(1)

where R is the ideal gas constant (8.314 ·10–3 kJ·K−1·mol−1), T1 and T2 are two temperatures (°C), and log10 e (i.e., 0.43) is applied for the logarithm base change. While \(\Delta {U^{ \circ}_{\text{OA}}}\) is itself temperature dependent, the derivation of Eq. 1 requires it to be constant over small temperature ranges [12] and therefore a \(\Delta {U^{ \circ}_{\text{OA}}}\) can be derived by regressing log10 KOA against reciprocal temperature (273.15 + T)−1.

The concentration of a solute in the gas phase can be expressed volumetrically (e.g., in units of mol·m−3) or as a partial pressure (in Pa). If the KOA is calculated using volumetric gas phase concentrations, a regression of log10 KOA against reciprocal temperature yields \(\Delta {U^{ \circ}_{\text{OA}}}\). However, when KOA is defined in terms of partial pressure, the enthalpy of phase transfer (\(\Delta {H^{ \circ}_{\text{OA}}}\), kJ·mol−1) is obtained. \(\Delta {U^{ \circ}_{\text{OA}}}\) or \(\Delta {H^{ \circ}_{\text{OA}}}\) can be converted into each other using [12, 13]:

$$\Delta {U^{\circ }_{\mathrm{OA}}}=\Delta {H^{\circ }_{\mathrm{OA}}}+{RT}$$
(2)

The direct determination of the KOA and its temperature dependence using various experimental approaches can be challenging and time consuming. Recent work by Baskaran et al. [14] found that, compared to other KOA prediction techniques, poly-parameter linear free energy relationships (ppLFERs) using Abraham descriptors are a reliable, fast and easy-to-use method for estimating KOA. Traditionally ppLFERs using Abraham descriptors combine six system constants and five solute descriptors to describe how a chemical interacts and partitions between two phases [15]. The system constants, represented by lower-case letters, are determined using a multiple linear regression of the property against the solute descriptors, expressed with upper-case letters. These solute descriptors include: E (excess molar refraction), S (polarizability/dipolarity), A (hydrogen bond acidity), B (hydrogen bond basicity), V (McGowan molar volume), and L (log10 of the hexadecane–air partition ratio). The product of a solute descriptor and a system constant, such as sS or bB, describes the energetic contribution of one particular type of intermolecular interaction to the property [15].

Properties describing the interaction of a solute between the gas phase and a condensed phase can be calculated using two different ppLFERs which differ in the use of either the V or E parameter.

$${\mathrm{log}}_{10}\;K=c+eE+sS+ aA+bB+lL$$
(3)
$${\mathrm{log}}_{10}\;K=c+sS+aA+bB+vV+lL$$
(4)

Equation 3 is intended to exclusively describe partitioning between the gas phase and a condensed phase, while Eq. 4 can be applied to any two phases, including two condensed phases [15].

In previous work [14] we presented a ppLFER without either V or E to estimate the KOA at 25 °C, that performed as well as the 5-parameter equations by Abraham and Acree [16] and Endo and Goss [17]. Jin et al. [18] also developed a ppLFER equation for log10 KOA which directly incorporates temperature in the multiple linear regression but used a small and limited data set [14]. In order to predict KOA values at temperatures other than 25 °C, Baskaran et al. [14] recommend using the 4-parameter ppLFER equation for KOA and a ppLFER equation for \(\Delta {H^{ \circ}_{\text{OA}}}\) by Mintz et al. [19]. The \(\Delta {H^{ \circ}_{\text{OA}}}\) and the log10 KOA at 25 °C estimated this way were found to be highly correlated (R2 > 0.98) [14], in analogy to previous work that has shown enthalpies of absorption (\(\Delta {H^{ \circ }_{\text{ads}}}\)) and the logarithm of the adsorption constants (log10 Kads) to be linearly correlated (R2 > 0.91) [20]. Likewise, strong correlations between \(\Delta {H^{ \circ }_{\text{vap}}}\) and PL [20, 21] have been observed and Goss and Schwarzenbach [20] note that previous work (e.g., [22]) indicated strong relationships between enthalpies and partitioning ratios. The high correlation between log10 KOA and \(\Delta {H^{ \circ }_{\text{OA}}}\) suggests that the temperature dependence of KOA can be estimated from KOA directly.

Other quantitative structure property relationships (QSPRs) for log10 KOA use different molecular, topographic, geometric, and quantum-chemical descriptors [12], which require commercial software or intensive computational power. The Open (Quantitative) Structure-activity/property Relationship App (OPERA) model by Mansouri et al. [23], uses two molecular descriptors computed with the PaDEL descriptor software [24] to estimate log10 KOA at 25 °C. PaDEL calculates 1, 2, and 3 dimensional (1D, 2D, and 3D) molecular descriptors, whereby an increase in the dimensionality corresponds to the complexity of the encoded information [24]; we collectively refer to these descriptors as PaDEL descriptors. The OPERA model utilizes the PaDEL-predicted log10 hexadecane-air partition ratio (LPaDEL) and the number of hydrogen-bond donor atoms (nHBDon), both 2D descriptors [25]. The LPaDEL is identical to the L Abraham solute descriptor, while the nHBDon is somewhat similar to the A Abraham solute descriptor, because they both describe the capacity of the chemical to donate protons.

Both the Abraham and PaDEL descriptors are easily acquired. Two types of Abraham descriptors, experimental and estimated, are available from the UFZ-LSER website [26]. Experimental descriptors are measured directly or through chromatographic retention time techniques [15]. Solute descriptors can also be predicted from a chemical’s Simplified Molecular Input Line Entry System (SMILES) notation using the IFS-QSAR models [27] integrated into the UFZ-LSER website and the EAS-E Suite platform [28] or directly from the standalone python package available on GitHub [29]. All PaDEL descriptors can be estimated from the SMILES strings using either a stand-alone software [24] or through QSARINS [30].

Since the development of the \(\Delta {H^{ \circ }_{\text{OA}}}\) ppLFER by Mintz et al. [19, 31], a large database of log10 KOA has been assembled [12]. This database can be used to create a comprehensive dataset of \(\Delta {U^{ \circ }_{\text{OA}}}\) for chemicals with measured log10 KOA data at multiple temperatures. In this work we utilize a newly assembled and curated \(\Delta {U^{ \circ }_{\text{OA}}}\) database to develop linear regression models for \(\Delta {U^{ \circ }_{\text{OA}}}\) using Abraham solute descriptors, PaDEL descriptors, and/or the log10 KOA at 25 °C.

2 Methods

2.1 Data Curation

The development of a reliable model for \(\Delta {U^{ \circ }_{\text{OA}}}\) relies on acquiring empirical data and processing the data through several data curation steps to reduce errors. We use measured log10 KOA values from the KOA database [12] to derive experimental \(\Delta {U^{ \circ }_{\text{OA}}}\) values. We also searched for directly measured \(\Delta {H^{ \circ }_{\text{OA}}}\) and \(\Delta {U^{ \circ }_{\text{OA}}}\) values, obtained using calorimetric techniques, to build our training and external validation datasets.

2.1.1 Data from the K OA Database

All measured values of log10 KOA were extracted from the KOA database and filtered to remove any measurements made (i) using indirect techniques (data from gas chromatographic retention times in octanol-filled columns were considered directly obtained), (ii) using water-saturated octanol, (iii) for a mixture or a chemical with an ambiguous structure, (iv) for inorganic and labelled compounds. We also removed duplicate values and values considered unreliable within the database. Finally, chemicals with measured KOA values at less than four different temperatures were removed, which left 149 chemicals. Log10 KOA values were linearly regressed against inverse absolute temperature to obtain \(\Delta {U^{ \circ }_{\text{OA}}}\) from the slope. We then considered the strength of the correlation based on R2, the standard error of \(\Delta {U^{ \circ }_{\text{OA}}}\) as derived from the error of the slope, and how the \(\Delta {U^{ \circ }_{\text{OA}}}\) calculated from this regression compared with published data.

We eliminated 10 chemicals from the dataset because the log10 KOA values used to calculate \(\Delta {U^{ \circ }_{\text{OA}}}\) values were obtained from a single reference and the \(\Delta {U^{ \circ }_{\text{OA}}}\) values calculated in our regression disagreed with the published value from that reference. In many instances, the published \(\Delta {U^{ \circ }_{\text{OA}}}\) was calculated from a regression of three log10 KOA values at different temperatures. Including a fourth log10 KOA measurement (obtained via personal communication with the authors, see [12]) caused the calculated \(\Delta {U^{ \circ }_{\text{OA}}}\) to deviate from the reported value. We assume that the authors of these original measurement did not hold strong confidence in these unpublished log10 KOA measurements. Thus, data for chlorinated dibenzo-p-dioxin (CDD) 48, 50, 54, 66 and 73, and brominated diphenyl ether (BDE) 183 were removed from the \(\Delta {U^{ \circ }_{\text{OA}}}\) dataset. For o,pʹ-dichlorodiphenyltrichloroethane, p,pʹ-dichlorodiphenyldichloroethane, cis-nonachlor and endrin, removing individual log10 KOA values did not lead to an agreement between previously published and calculated \(\Delta {U^{ \circ }_{\text{OA}}}\) values, which suggests that there may be some disparity in the log10 KOA values used in the regressions (also obtained via direct communication with the authors) and the published data. Four chemicals (isopropyl ether, methane, BDE 99, and BDE 153) in the dataset were removed because the R2 was below 0.95 and removing any outlier left only three datapoints in the regression. The regressions of the log10 KOA against inverse temperature for these chemicals are shown in Figs. SI 1–SI 3.

Although there have been multiple measurements of log10 KOA for perfluoroalkyl substances (PFAS), agreement between data from different studies is poor. As such, all PFAS compounds (perfluorooctane sulfonamido ethanols, perfluorooctane sulfonamide, fluorotelomer alcohols, and fluorotelomer acrylates) were excluded from the \(\Delta {U^{ \circ }_{\text{OA}}}\) dataset. The discrepancies between these values is further discussed in Sect. 3.1.1.

By assessing the temperature regression plots for a few chemicals, it was clear that some measured log10 KOA data deviated from others. In many cases these outliers were from the same paper and often used older and/or objectively less precise measurement techniques (e.g., [32,33,34,35,36,37,38,39,40]). In these instances, we removed the outliers and recalculated the \(\Delta {U^{ \circ }_{\text{OA}}}\) using the remaining log10 KOA data. In the case of propanol, after removing the obvious outliers from Eger et al. [34], we took the average \(\Delta {U^{ \circ }_{\text{OA}}}\) value calculated from regressing temperature dependent log10 KOA data from Lei et al. [41] and Gruber et al. [42], which subsequently caused the elimination of a single datapoint published by Abraham et al. [43]. Plots of these regressions, including the outliers, are available in Figs. SI 5 and SI 6 in the supporting information.

For some compounds the R2 for individual sets of temperature dependent log10 KOA values was high, however, individual log10 KOA values at specific temperatures deviated by 0.3 to 1 log10 units between papers. Therefore, we took the average \(\Delta {U^{ \circ }_{\text{OA}}}\) value calculated from log10 KOA values from each paper, in other words, we essentially took the average of the two slopes. This applied to hexachlorobenzene (HCB), and the polychlorinated biphenyls (PCBs) 153 and 180 (see Fig. SI 7).

After this step, we were left with 123 chemicals with reliable \(\Delta {U^{ \circ }_{\text{OA}}}\) values from the KOA database.

2.1.2 Direct Measurements of \(\Delta {H^{ \circ }_{\text{OA}}}\) and \(\Delta {U^{ \circ }_{\text{OA}}}\)

Mintz et al. [19, 31] used \(\Delta {H^{ \circ }_{\text{OA}}}\) values for 138 chemicals to develop ppLFER equations for the enthalpy of phase change between wet and dry octanol and air. \(\Delta {H^{ \circ }_{\text{OA}}}\) values used to calibrate those models were cross-referenced against the original sources cited by Mintz et al. [31]. During this process, we found more directly measured \(\Delta {H^{ \circ }_{\text{OA}}}\) values in the literature. The \(\Delta {H^{ \circ }_{\text{OA}}}\) values were subsequently converted to \(\Delta {U^{ \circ }_{\text{OA}}}\) using Eq. 2. We took the average of the \(\Delta {U^{ \circ }_{\text{OA}}}\) value when more than one literature value was available.

Any chemical from the Mintz et al. dataset of \(\Delta {H^{ \circ }_{\text{OA}}}\) values that already had a \(\Delta {U^{ \circ }_{\text{OA}}}\) value calculated from the KOA database was excluded, because these \(\Delta {H^{ \circ }_{\text{OA}}}\) values were direct measurements made using calorimetric measurement or were based on log10 KOA data already included in the KOA database [12]. This way preference was given to values obtained from multiple direct measurements of log10 KOA over individual calorimetric measurements. 72 chemicals with directly measured \(\Delta {H^{ \circ }_{\text{OA}}}\) values (converted to \(\Delta {U^{ \circ }_{\text{OA}}}\)) were included in our dataset.

2.2 Chemical Identifiers, Descriptors, and Data Splitting

The chemical names, acronym, CAS number and SMILES notation for all 195 chemicals in the \(\Delta {U^{ \circ }_{\text{OA}}}\) dataset were obtained from either the KOA database or from the CompTox Dashboard. We considered three types of descriptors for estimating \(\Delta {U^{ \circ }_{\text{OA}}}\): (i) the log10 KOA at 25 °C, (ii) Abraham descriptors, and (iii) PaDEL descriptors.

Abraham solute descriptors (E, S, A, B, V, and L) were obtained from the UFZ-LSER database [26]. For experimental solute descriptors, we gave preference to UFZ pre-selected values over ABSOLV values. Estimated solute descriptors for all chemicals were calculated from their SMILES using the IFS-QSARs built into EAS-E Suite [28, 29]. 178 chemicals had a full set of experimental solute descriptors, 4 chemicals were only missing experimental E values, and 13 chemicals had only estimated solute descriptors.

1444 1D and 2D PaDEL descriptors and 881 PubChem fingerprints (v2.21) were obtained from the chemicals’ SMILES notation using QSARINS [30]. 3D molecular descriptors were not included, because while the chemicals in the \(\Delta {U^{ \circ }_{\text{OA}}}\) dataset are structurally simple, requiring 3D geometrically optimized chemical structures (e.g., SDF or MOL) and 3D optimization cannot be efficiently scaled up for high throughput applications and as such is not ideal for property estimations in the context of chemical risk assessment. Filters were applied to the PaDEL descriptors, wherein descriptors that had pair-wise correlation greater than 95% or where more than 80% of the descriptor was constant for the whole dataset were excluded. We also removed all descriptors that had missing values. In the end 531 PaDEL descriptors and fingerprints were considered, which included PaDEL calculated Abraham descriptors and 32 PubChem fingerprints. PubChem Fingerprints were excluded from model development and only used in a cluster analysis for data splitting as described later in this section.

Directly measured log10 KOA values were extracted from the KOA database [12] and the average used when more than one experimental value existed. Estimated log10 KOA values were calculated using the experimental and calculated Abraham solute descriptors and the 4-parameter ppLFER for KOA [14].

The curated \(\Delta {U^{ \circ }_{\text{OA}}}\) dataset was split into a model development and an external validation dataset based on the availability of descriptors for each chemical. 54 chemicals that had either some missing experimental Abraham solute descriptors or an estimated log10 KOA values were set aside to be used for external validation. The remaining 141 chemicals were used to develop multiple linear regression (MLR) models using the descriptors.

In order to reduce bias in descriptor selection, we split the development dataset using four different splitting techniques, with a ratio of 3:1 (i.e., 75% of chemicals were used in the training set and 25% in the validation set) [44]. In the first and second splits, chemicals were ordered by \(\Delta {U^{ \circ }_{\text{OA}}}\) and log10 KOA at 25 °C, respectively, and the chemicals with the highest and lowest value were included in the training dataset. In the random split, 75% of chemicals were randomly selected to be a part of the training dataset. We used cluster analysis (Ward’s method and Tanimoto distance) using PubChem fingerprints and Principal Component Analysis (PCA) of experimental Abraham solute descriptors to group chemicals in the model development dataset into 4 structural clusters (see Figs. SI 11 and SI 12). From each cluster we randomly selected 75% of the chemicals to be used as a training dataset. Figure 1 shows how the chemical datasets were split and used to develop \(\Delta {U^{ \circ }_{\text{OA}}}\) models. Table SI 2 contains \(\Delta {U^{ \circ }_{\text{OA}}}\) values, descriptors, and splits for all chemicals in the \(\Delta {U^{ \circ }_{\text{OA}}}\) dataset.

Fig. 1
figure 1

A scheme of the source of \(\Delta {U^{ \circ }_{\text{OA}}}\) data and how this dataset is split and used for model development and validation. 3:1 split indicates that 75% of the chemicals in the dataset were used for the training data and the remaining 25% of chemicals were included in an internal validation process

2.3 Model Calculation and Selection

Model calculation and selection were completed using QSARINS [30] on the four different splits of the model development dataset. For each split we developed different models based on the different kind of molecular descriptors. First, we explored the traditional ppLFER equations assessing all combinations (all-subsets with 1 to 5 descriptors) using experimental Abraham descriptors. Then, we substituted the Abraham L with the experimental log10 KOA at 25 °C, as both variables describe partitioning between the gas phase and a hydrophobic condensed phase. Given the high number of PaDEL descriptors, we used the all-subset (up to 2 variables) and the Genetic Algorithm (up to 4 variables, parameters are included in the SI) built into QSARINS to identify the best PaDEL descriptors combination. Finally, we assessed the performance of the model using the two PaDEL descriptors used by the OPERA model for estimating log10 KOA, namely the number of hydrogen bond donors (nHBDon) and log10 hexadecane-air partition ratio (L) [25].

During model development and selection, we filtered out models with p-values for the regression coefficients higher than 0.05, as we could not be confident that the values of these coefficients were different from 0. We restricted the maximum number of variables (i.e., descriptors) to be included in a regression equation to 4 to ensure models remained relatively simple and to avoid overfitting issues. Model performance across all four splits was assessed using different variations of the determination coefficient (\({\text{R}^{2} _{\text{adj}}}\), \({Q^{2} _{\text{Loo}}}\), \({Q^{2} _{\text{LMO}}}\), \({Q^{2} _{\text{F1}}}\), \({Q^{2} _{\text{F2}}}\), \({Q^{2} _{\text{F3}}}\)), the concordance correlation coefficients (CCCEXT), root mean squared errors (RMSETR, RMSEEXT) and mean absolute errors (MAETR, MAEEXT) on the training set and on the prediction/external set [45,46,47,48]. Table SI 1 summarises these statistics for the models within each split.

After identifying the best models for different types of descriptors, we used all 141 \(\Delta {U^{ \circ}_{\text{OA}}}\) values as the training set to develop new equations using the same descriptors (i.e., the so-called full model). These models were assessed using the 54 chemicals in the external validation dataset. It is important to note that in the external validation set, not all chemicals have empirical descriptors namely for A, L, and log10 KOA at 25 °C. When experimental A and L values were not available, IFS-QSAR estimated solute descriptors were used. Log10 KOA values at 25 °C were estimated using the 4-parameter ppLFER equation for KOA [14] using experimental solute descriptors. All PaDEL descriptors are calculated from the SMILES notation of a chemical and so all descriptors are considered estimates.

3 Results and Discussion

3.1 Data Availability and the General Applicability Domain

Analysis of the log10 KOA database showed a bimodal distribution in the experimental data [12]. Given the high correlation between log10 KOA and \(\Delta {U^{ \circ }_{\text{OA}}}\) [14], it is not surprising that the same trend is observed for the curated \(\Delta {U^{ \circ }_{\text{OA}}}\) dataset (Fig. 2). This pattern arises because most \(\Delta {U^{ \circ }_{\text{OA}}}\) values in this dataset are derived from temperature dependent measurements of the log10 KOA and thus limitations on the experimental techniques for log10 KOA are extended to the available \(\Delta {U^{ \circ }_{\text{OA}}}\) measurements. Direct measurements of \(\Delta {U^{ \circ }_{\text{OA}}}\) using the calorimetric techniques are applied to small volatile compounds and the \(\Delta {U^{ \circ }_{\text{OA}}}\) of these compounds are less negative and fall within a limited range. The most common class of chemicals in the \(\Delta {U^{ \circ }_{\text{OA}}}\) dataset are PCBs (Fig. SI 13), followed by polychlorinated naphthalenes (PCNs), alcohols, and ketones.

Fig. 2
figure 2

Distribution of \(\Delta {U^{ \circ}_{\text{OA}}}\) data used to train and validate the models. Chemicals were included in the training data when all experimental Abraham solute descriptors and an experimental log10 KOA value at 25 °C were available

Most chemicals in the \(\Delta {U^{ \circ }_{\text{OA}}}\) dataset have A values of 0 and B values less than 0.6 (Fig. SI 14), however only 36 (~ 18%) chemicals have both A and B values equal to 0. The polarity of a chemical (defined as aA + bB, where the system descriptors are from the four parameter ppLFER equation for KOA [12, 14]), shows there is some variability in the hydrogen bonding ability with octanol of the chemicals in the \(\Delta {U^{ \circ }_{\text{OA}}}\) dataset. However, there are few very polar compounds. Descriptor values for E and S are more widely distributed, while the distributions of the L and V descriptors resembles the distribution of \(\Delta {U^{ \circ }_{\text{OA}}}\) values.

3.1.1 Perfluoroalkyl Substances (PFAS)

The partitioning properties of PFAS compounds are extremely difficult to measure because they can act as surfactants and likely have a very low solubility in octanol. Data from different studies regularly display divergent results and it is not possible to establish which ones are correct.

Most fluorotelomer alcohols (FTOHs) have log10 KOA values around 5, which is at the lower limits of the generator column technique. They are also too polar to apply the gas-chromatographic retention time technique [49]. In combination, this means obtaining reliable measurements for the KOA of FTOHs is quite difficult. Figure SI 15 compares the \(\Delta {U^{ \circ}_{\text{OA}}}\) values of FTOHs calculated from KOA measurements reported by Goss et al. [50] and Thuens et al. [51] (temperature regression plots in Fig. SI 10). We would expect that \(\Delta {U^{ \circ}_{\text{OA}}}\) values for FTOHs become more negative with increasing chain length, because the KOA values increase with chain length. While this occurs for \(\Delta {U^{ \circ}_{\text{OA}}}\) values from Goss et al. [50], \(\Delta {U^{ \circ}_{\text{OA}}}\) declines with increasing chain length when using the data from Thuens et al. [51]. As the \(\Delta {U^{ \circ}_{\text{OA}}}\) values from Goss et al. [50] have been calculated from measurements at only two (4:2 FTOH) and three (6:2 FTOH and 8:2 FTOH) temperatures, there is insufficient data to include \(\Delta {U^{ \circ}_{\text{OA}}}\) values for FTOHs.

The temperature dependence of the KOA for methyl and ethyl perfluorooctane sulfonamido ethanols (Me-FOSE and Et-FOSE) reported by two studies [52, 53] and therefore also the \(\Delta {U^{ \circ}_{\text{OA}}}\) derived from those data are quite different (Fig. SI 8). Dreyer et al. [52] also published KOA data on methyl and ethyl perfluorooctane sulfonamides (Me-FOSA and Et-FOSA), whereby the plot suggests that the values obtained at very low and high temperatures may be outliers. Given the uncertainty of the KOA values for Me- FOSA, Et-FOSA, and Et-FOSE, we have also excluded those for Me-FOSE as there is no way for us to tell whether they are valid. Similar to Et-FOSE, the log10 KOA reported by Dreyer et al. [52] for 6:2 and 8:2 fluorotelomer acrylates (6:2 FTAc and 8:2 FTAc) also display poor linearity with reciprocal temperature (R2 < 0.9) and therefore we chose to exclude also the KOA data for 10:2 FTAc from the \(\Delta {U^{ \circ}_{\text{OA}}}\) dataset (Fig. SI 9). Ultimately, we decided to exclude all PFAS compounds from the data set.

3.2 Model Selection

3.2.1 Model Prioritization

As not all chemicals with \(\Delta {U^{ \circ}_{\text{OA}}}\) values had all available descriptors, a subset of 141 chemicals was used to test and identify the best performing models within each split. We selected five models to examine in detail. Those are the regressions using (i) A and L, (ii) only the log10 KOA at 25 °C, (iii) A and log10 KOA at 25 °C, (iv) nHBDon and LPaDEL, and (v) nHBDon, number of carbons, nC, and number of halogens, nX. As initially hypothesized, the log10 KOA at 25 °C alone proved to be a very good predictor of \(\Delta {U^{ \circ}_{\text{OA}}}\) (Table SI 1). Meanwhile the model using the Abraham solute descriptors A and L performed almost as well. The log10 hexadecane-air partition ratio L can describe the non-polar interactions between the compound and the solvent while the hydrogen bond acidity A describes the potential for a chemical to donate a hydrogen. These two parameters describe similar chemical interactions as the two PaDEL descriptors LPaDEL and nHBDon, used to predict log10 KOA in the OPERA model.

The model selected with the Genetic Algorithm using only PaDEL descriptors uses nHBDon, nC, and nX. These descriptors are not able to describe chemicals to the same extent Abraham descriptors do and their selection is a consequence of the data used to train the models. This is discussed in more detail in the following section.

All five models performed well during internal and external validation processes across all four splits (Table SI 1). The \({\text{R}^{2} _{\text{adj}}}\) describes the R2 while correcting for the number of descriptors in a model. The RMSETR and MAETR was in almost all cases slightly smaller than the RMSEEXT and MAEEXT, however given the range of the \(\Delta {U^{ \circ}_{\text{OA}}}\) values used in model development, the difference is likely unremarkable. Internal cross validation using leave-one-out (\({Q^{2} _{\text{LOO}}}\)) and leave-more-out (\({Q^{2} _{\text{LMO}}}\)) was very similar to the R2 and \({\text{R}^{2} _{\text{adj}}}\) for all models (> 0.90). However, the external predictive ability of the models (\({Q^{2} _{\text{F1}}}\), \({Q^{2} _{\text{F2}}}\), \({Q^{2} _{\text{F3}}}\)) were slightly lower than the R2. The CCCEXT, describing both the precision and accuracy of all models was almost always greater than 0.90.

3.2.2 External Validation

Of the 54 chemicals in the external validation dataset, 17 chemicals had experimental log10 KOA values and 41 chemicals had experimental A and L solute descriptors that were retrieved from the UFZ-LSER database. All chemicals without experimental log10 KOA values had all the experimental solute descriptors necessary to estimate log10 KOA using the four parameter ppLFER equation [14]. As the PaDEL descriptors for all chemicals were estimated, we did not differentiate between the source of A, L, and log10 KOA at 25 °C used to externally validate the model, however wherever possible we used experimental log10 KOA values and solute descriptors. Full model equations are listed in Table 1.

Table 1 Proposed equations for calculating the \(\Delta {U^{ \circ}_{\text{OA}}}\) (kJ·mol−1) of an organic compound

By using all 141 chemicals to train the models and 54 chemicals to externally validate the models we see that Model 3 using log10 KOA and A had the best overall performance (Tables 2, 3). Model 3 has the highest overall R2, \({\text{R}^{2} _{\text{adj}}}\), \({Q^{2} _{\text{F1}}}\), \({Q^{2} _{\text{F2}}}\), \({Q^{2} _{\text{F3}}}\), and CCCEXT. Model 2, using log10 KOA, performs almost as well as Model 3, followed by Model 1. Models 4 and 5 had the poorest performance.

Table 2 Internal validation results for the five selected models with all 141 chemicals in the training dataset
Table 3 External validation results for the five selected models with all 141 chemicals in the training dataset and 54 chemicals used for external validation

In Fig. 3, we can see that all models generally perform better at higher \(\Delta {U^{ \circ}_{\text{OA}}}\) (i.e., less negative) values. Models 4 and 5 appear to have a larger number of outliers (> 10 kJ·mol−1) at lower \(\Delta {U^{ \circ}_{\text{OA}}}\) values relative to Models 1, 2, and 3; however, it is important to recognize that this relative difference between predicted and experimental values is smaller for increasingly negative \(\Delta {U^{ \circ}_{\text{OA}}}\) values. An error of − 10 kJ·mol−1 for a \(\Delta {U^{ \circ}_{\text{OA}}}\) of − 90 kJ·mol−1 corresponds to ~ 11% whereas the same error for a \(\Delta {U^{ \circ}_{\text{OA}}}\) of − 30 kJ·mol−1 is ~ 33%.

Fig. 3
figure 3

Estimated \(\Delta {U^{ \circ}_{\text{OA}}}\) values plotted against experimental values for the training and external validation compounds. 1:1 lines are shown in black. Grey dashed and dotted lines indicate ± 5 and ± 10 kJ·mol−1, respectively

In Fig. SI 16 we present Williams plots of the standardized residual plotted against the leverage. The standardized residual (sri) of a chemical i is obtained by correcting residuals (ri) using the standard deviation (sd) of the model and the leverage of a prediction (hii):

$${sr}_{i}=\frac{{r}_{i}}{\mathrm{sd}\cdot \sqrt{1-{h}_{ii}}}$$
(5)

Standardized residual values are considered high when the absolute value is greater than 2.5 [44]. The leverage of a training chemical indicates how much influence it has on the regression; chemicals with high leverages have a large influence on the model [44]. A chemical has a high leverage value if the leverage is greater than the warning leverage (h*) which is a function of the number of chemicals in the training dataset (n) and the number of parameters in the model (k) [44]:

$${h}^{*}=3\cdot \frac{k+1}{n}$$
(6)

Standardized residuals and leverage values were obtained from QSARINS.

Considering only the number of chemicals with high leverage and high standardized residuals (Table 3), Model 2 has no chemical in the training set with a large leverage value and it has only 7 chemicals with high standardized residuals. The largest standardized residual was for 1,2,3,4-tetramethylbenzene in the external validation dataset; however, the log10 KOA (4.4) measured using the variable headspace ratio technique is known to be at the upper limit of the technique [41]. Thus, it likely that this high standardized residual value is due to the high uncertainty of the reported KOA, \(\Delta {U^{ \circ}_{\text{OA}}}\) or both, rather than an error in the model.

Models 1 and 3 had the same chemicals (external validation and training chemicals) with high leverage values. This high leverage is driven entirely by the A parameter used in both models. Chemicals with A greater than 0.35 have high leverage values. This is because most chemicals in the \(\Delta {U^{ \circ}_{\text{OA}}}\) dataset have A values equal to 0. For example, methanoic acid (in the external validation dataset) has both a high leverage values and high standardized residuals in Models 1 and 3 and has the highest A value of 0.76.

Many of the same chemicals with high leverage values in Models 1 and 3 also had high leverage values in Model 4, driven by the nHBDon values. While nHBDon is a countable parameter, the \(\Delta {U^{ \circ}_{\text{OA}}}\) dataset only includes chemicals with either nHBDon values equal to 0 or 1. Thus, all chemicals with an nHBDon value equal to 1 had a high leverage value. Nonetheless, these chemicals had low standardized residuals. The three chemicals with high leverage in Model 5 (2 training and 1 external validation compounds) all have long carbon chains. However, they are not the only chemicals of that size.

There was also an overlap in chemicals with high standardized residuals between models. Fluorene had a high standardized residual in all models. PCB 155 and 1,2,3,4-tetramethylbenzene had high standardized residuals in all models except Model 1 and 4, respectively. The standardized residuals for BDE 154 and trans-nonachlor was high in Models 1, 2, and 3, while β-HCH and δ-HCH had high standardized residuals in Models 1, 4, and 5. Heptachlor had high standardized residuals in model 1 and 5, and PCBs 61 and 77 had high standardized residuals in models 2 and 3.

The internal validation results presented in Table 2 show Models 1, 2, and 3 perform better than Models 4 and 5 based on R2, \({\text{R}^{2} _{\text{adj}}}\), RMSETR, MAETR, RMSECV, MAECV, \({Q^{2} _{\text{LOO}}}\), and \({Q^{2} _{\text{LMO}}}\). Model 5 has the highest RMSEEXT followed by Model 1. Model 5 also has the highest MAEEXT and lowest \({Q^{2} _{\text{F1-F3}}}\) and CCCEXT. Despite Model 4’s comparatively poor internal validation results, the external validation results (RMSEEXT, MAEEXT, \({Q^{2} _{\text{F1-F3}}}\), CCCEXT) for this model are similar if not better than results for Models 1, 2, and 3.

While Models 1 and 3 have excellent internal and external validation results, the applicability domain of these models is limited by the A descriptor. Model 4’s internal validation results are poor compared to the other models, and are limited to chemicals with nHBDon values of 0. Model 5 was the worst performing model with high internal and external errors (RMSE and MAE values). Model 2 is consistently one of the best models when considering the internal and external validations, the number of outliers (high standardized residuals) and the number of chemicals with high leverage. Thus, the log10 KOA at 25 °C (Model 2) is generally sufficient to make a quick and reliable estimation of the \(\Delta {U^{ \circ}_{\text{OA}}}\).

3.3 Comparison with Other \(\Delta {U^{ \circ}_{\text{OA}}}\) Models

The \(\Delta {H^{ \circ}_{\text{OA}}}\) ppLFER model by Mintz et al. [19] has been shown to reliably predict the temperature dependence of KOA [14]. We compared the performance of this ppLFER for \(\Delta {H^{ \circ}_{\text{OA}}}\) [19], converted to \(\Delta {U^{ \circ}_{\text{OA}}}\), with Model 2. Since both models are built on similar datasets and an evaluation with chemicals that had been used to train and develop the models would bias the results, we rely only on chemicals not used in the training dataset for either model. This limits the comparison to 28 chemicals including several BDEs and CDDs. The training data for the Mintz et al. [19] model includes 138 chemicals and experimental solute descriptors and experimental \(\Delta {H^{ \circ}_{\text{OA}}}\) values for all chemicals are provided in a previous publication [31]. During our data curation process we found small errors in the reported \(\Delta {H^{ \circ}_{\text{OA}}}\) and misidentification of a few chemicals which means some of the descriptors used are erroneous. The Mintz et al. [19] model also includes inorganic chemicals (e.g., nitrogen and xenon) which were excluded from our model development. Due to these differences, we compare the statistics of the model residuals (Table 4) and the residual of each prediction (Fig. 4).

Table 4 Statistics on the residual (kJ·mol−1) for \(\Delta {U^{ \circ}_{\text{OA}}}\) predictions made using Model 2 (log10 KOA) and the Mintz et al. [19] model for chemicals not included in the training set of either model (n = 28) and all chemicals in the \(\Delta {U^{ \circ}_{\text{OA}}}\) dataset (n = 195)
Fig. 4
figure 4

Comparing the performance of Model 2 (log10 KOA) with the Mintz et al. model [19] using chemicals not included in the training set of both models. Grey dashed lines indicate errors of ± 5 kJ·mol−1 and grey dotted lines indicate errors of ± 10 kJ·mol−1

For both models we use the best available descriptors (i.e., experimental values wherever possible and otherwise IFS-QSAR estimated values). In this combined external validation process (Table 4), Model 2 has a smaller bias (average residual), MAE, standard deviation (SD), and RMSE. The Mintz et al. [19] model has a larger number of chemicals with errors exceeding 10 kJ·mol−1 and Model 2 has more \(\Delta {U^{ \circ}_{\text{OA}}}\) estimates with absolute errors less than 5 kJ·mol−1 (Fig. 4).

If we compare the performance of the Mintz et al. [19] model and Model 2 when using all 195 chemicals in the \(\Delta {U^{ \circ}_{\text{OA}}}\) dataset, we see that Model 2 performs better than the Mintz et al. [19] model (Table 4). While there is little difference between the statistics on the residuals, fewer chemicals have residuals greater than 10 kJ·mol−1 when using Model 2. Both models overestimated the \(\Delta {U^{ \circ}_{\text{OA}}}\) by at least 10 kJ·mol−1 for fluorene, PCN 57, cis-chlordane, trans-chlordane, and trans-nonachlor—all chemicals within the training data set of Model 2, apart from trans-nonachlor. The models also underestimated the \(\Delta {U^{ \circ}_{\text{OA}}}\) for CDD 1, 1,2,3,4-tetramethylbenzene, BDE 17, BDE 28, BDE 154, PCB 61, PCB 155, of which CDD 1 and BDEs 17, 28, and 154 were in the external validation dataset for Model 2.

In addition to the model by Mintz et al. [19], there is a ppLFER for wet-octanol air \(\Delta {H^{ \circ}}\) values [31] and models for \(\Delta {H^{ \circ}_{\text{OA}}}\) using a support vector machine method, artificial neural networks, and MLRs based on various molecular descriptors [54]. The wet-octanol air \(\Delta {H^{ \circ}_{\text{OA}}}\) model is based on the same dataset used to develop the Mintz et al. [19] model above [31]. The latter three models also use the same dataset as a basis for model development but randomly split their dataset of 127 chemicals into a training dataset of 89 chemicals and validation dataset of 38 chemicals [54]. The reported correlation coefficients for these models [31, 54] are similar to, or greater than what we report for Model 2 and the standard errors of the models developed using a support vector machine method and artificial neural networks is smaller than the RMSE of Model 2. However, these models cannot be used easily in high throughput applications because acquiring the molecular descriptors requires multiple software applications [54], which can be time-consuming and lends itself to increased chances of human error.

3.4 Sources for Descriptors

While we used only experimentally derived solute descriptors to develop and train the models in this work, the source of solute descriptors may impact model performance. Sources for the descriptors include experimental values from the UFZ-LSER database [26], IFS-QSAR estimates from the standalone IFS-QSAR software [27, 29] or EAS-E Suite [28], or PaDEL estimates made using QSARINS [30]. The first two are easily obtained from non-commercial web-based software. In Fig. 5 we compare Model 2’s estimated \(\Delta {U^{ \circ}_{\text{OA}}}\) if only experimental, IFS-QSAR predicted, or PaDEL predicted solute descriptors are used. The log10 KOA at 25 °C used by Model 2 is calculated using the ppLFER equation with the 4 parameters S, A, B, and L [14]. Because not all chemicals have experimental solute descriptors, fewer estimations are available for comparison. Table 5 shows that the IFS-QSAR-estimated solute descriptors perform almost as well as experimental solute descriptors. Regardless of which set of descriptors is used, \(\Delta {U^{ \circ}_{\text{OA}}}\) values of 1-decanol, β-HCH, δ-HCH, fluorene, dieldrin, BDE 29, PCB 77, and PCN 57 are overestimated and \(\Delta {U^{ \circ}_{\text{OA}}}\) values of 1,2,3,4-tetramethylbenzene, BDE 154, PCB 61, PCN 180, and PCB 155 are underestimated. Clearly, the IFS-QSARs are preferable over PaDEL for predicting solute descriptors, considering the smaller associated MAE, standard deviation, and RMSE for the former.

Fig. 5
figure 5

Residuals between observed \(\Delta {U^{ \circ}_{\text{OA}}}\) and the \(\Delta {U^{ \circ}_{\text{OA}}}\) predicted by Model 2 depending on whether experimental, IFS-QSAR-predicted, or PaDEL-predicted Abraham descriptors were used. Grey dashed lines indicate errors of ± 5 kJ·mol−1 and grey dotted lines indicate errors of ± 10 kJ·mol−1

Table 5 Statistics on the residuals for the performance of Model 2 using estimated KOA values using experimental, IFS-QSAR-predicted, and PaDEL-predicted Abraham descriptors

3.5 The Relationship Between log10 K and ΔU°

The strong relationship between log10 KOA and \(\Delta {U^{ \circ}_{\text{OA}}}\) in Model 2 mirrors similar relationships between vapour pressure (log10 PL) and the enthalpy of vaporization (\(\Delta {H^{ \circ}_{\text{vap}}}\)) and between adsorption constants (log10 Kads) and the enthalpies of adsorption (\(\Delta {H^{ \circ}_{\text{ads}}}\)) [20]. Goss and Schwarzenbach cite other such relationships and provide a thermodynamic explanation for their occurrence [20]. The equation relating \(\Delta {H^{ \circ}_{\text{vap}}}\) and PL [20] can be combined with a linear equation relating a chemical’s PL with its log10 KOA [55] and a conversion to \(\Delta {U^{ \circ}_{\text{vap}}}\) (\(\Delta {U^{ \circ}_{\text{vap}}}\) = \(\Delta {H^{ \circ}_{\text{vap}}}\) − RT) [13]:

$$\Delta {U^{\circ }_{\mathrm{vap}}}=8.90\cdot {\mathrm{log}}_{10}{K}_{\mathrm{OA}}+10.42 - {RT}$$
(7)

The slope in this equation is very similar to the absolute value of the slope in Model 2 (− 8.75). This is apparent in a vertical shift when estimated \(\Delta {U^{ \circ}_{\text{OA}}}\) and \(- \Delta {U^{ \circ}_{\text{vap}}}\) values are plotted against experimental \(\Delta {U^{ \circ}_{\text{OA}}}\) values (Fig. SI 18). The difference in Eq. 2 in Table 1 and Eq. 7 corresponds to the internal energy of dissolution of the chemical in octanol (i.e., \(\Delta {U^{ \circ}_{\text{dissol}}}\) = \(\Delta {U^{ \circ}_{\text{OA}}}\) + \(\Delta {U^{ \circ}_{\text{vap}}}\)). This analysis suggests that the \(\Delta {U^{ \circ}_{\text{dissol}}}\) is very small and is relatively constant for compounds with a wide range in KOA. The observed relationship between KOA and \(\Delta {U^{ \circ}_{\text{OA}}}\) can perhaps be extrapolated to other partitioning systems such as KOW and \(\Delta {U^{ \circ}_{\text{OW}}}\). Indeed, if \(\Delta {U^{ \circ}_{\text{dissol}}}\) is a small and constant for a wide range of compounds it may be possible to estimate \(\Delta {U^{ \circ}_{\text{OW}}}\) from the enthalpy of dissolution of a chemical in water or even the water solubility of a compound. However large datasets with experimental values of each property are necessary before such a relationship can be verified.

4 Conclusion

We developed new models for predicting the \(\Delta {U^{ \circ}_{\text{OA}}}\) which use fewer descriptors than earlier models. The best performing model (Model 2) relies only on the log10 KOA at 25 °C and additional parameters do not notably improve model performance. This model has similar or slightly improved performance relative to previous estimation techniques for \(\Delta {H^{ \circ}_{\text{OA}}}\) and \(\Delta {U^{ \circ}_{\text{OA}}}\), which relied on more or more complex descriptors. Our work parallels previous findings that that \(\Delta {H^{ \circ}_{\text{vap}}}\) can be predicted quite well just from PL [20]. Further measurements of log10 KOA and \(\Delta {U^{ \circ}_{\text{OA}}}\) values could improve the applicability domain of these models particularly for chemicals with log10 KOA values between 4 and 6 (\(\Delta {U^{ \circ}_{\text{OA}}}\) values between − 70 and − 50 kJ·mol−1), and for more polar compounds with more complex hydrogen bonding abilities.