Predicting the Temperature Dependence of the Octanol–Air Partition Ratio: A New Model for Estimating $\Delta U^{\circ}_{\text{OA}}$

The octanol–air partition ratio ($K_{\text{OA}}$) describes the partitioning of a chemical between air and octanol and is often used to approximate other partitioning phenomena in environmental chemistry (e.g., blood–air, atmospheric particulate matter–air, polyurethane foam–air). Such partitioning processes often occur at environmental temperatures other than 25 °C. Enthalpies $\Delta H^{\circ}_{\text{OA}}$ or internal energies $\Delta U^{\circ}_{\text{OA}}$ of phase transfer are used to express the temperature dependence of the $K_{\text{OA}}$. Existing poly-parameter linear free energy relationships (ppLFERs) for predicting $\Delta H^{\circ}_{\text{OA}}$ were developed using a relatively small dataset. In this work we utilize a recently developed comprehensive $K_{\text{OA}}$ database to create and curate a $\Delta U^{\circ}_{\text{OA}}$ dataset containing 195 chemicals and use this dataset in the development of new predictive equations.
Using the QSAR development platform QSARINS, we evaluate the use of Abraham descriptors, other molecular descriptors, and the $\log_{10} K_{\text{OA}}$ at 25 °C as variables in different multilinear regression equations for $\Delta U^{\circ}_{\text{OA}}$. The $\Delta U^{\circ}_{\text{OA}}$ of neutral organic chemicals can be reliably predicted using only the $\log_{10} K_{\text{OA}}$ (RMSE$_{\text{EXT}}$ = 6.86 kJ·mol$^{-1}$, $R^{2}_{\text{adj}}$ = 0.94), only the solute's hydrogen acidity A and the logarithm of the hexadecane–air partition ratio L (RMSE$_{\text{EXT}}$ = 7.23 kJ·mol$^{-1}$, $R^{2}_{\text{adj}}$ = 0.93), or A and $\log_{10} K_{\text{OA}}$ (RMSE$_{\text{EXT}}$ = 6.76 kJ·mol$^{-1}$, $R^{2}_{\text{adj}}$ = 0.95).


Introduction
The octanol-air equilibrium partitioning ratio (K OA ) describes the partitioning of a chemical between octanol and air and is commonly used to approximate the partitioning between various organic phases and the gas phase, including soil organic matter [1,2], plant foliage [3,4], atmospheric particulate matter [5,6], and milk and blood [7]. Previous work on contaminants in outdoor environments (e.g., [8][9][10]) and in various terrestrial organisms (e.g., [11]) has shown the importance of K OA in understanding the distribution, fate, and bioaccumulation potential of volatile and semi-volatile organic chemicals in the environment. However, these processes often occur at temperatures other than the standard temperature of 25 °C, and an accurate understanding of the partitioning of a chemical in these systems may require a temperature correction.
The temperature dependence of log 10 K OA can be described by the internal energy of phase transfer from octanol to air ( ΔU • OA , kJ·mol −1 ) using the van't Hoff equation:

$$\log_{10} K_{\text{OA}}(T_2) = \log_{10} K_{\text{OA}}(T_1) - \frac{\Delta U^{\circ}_{\text{OA}}}{R}\left(\frac{1}{273.15 + T_2} - \frac{1}{273.15 + T_1}\right)\log_{10} e \tag{1}$$

where R is the ideal gas constant (8.314·10 −3 kJ·K −1 ·mol −1 ), T 1 and T 2 are two temperatures (°C), and log 10 e (i.e., 0.43) is applied for the logarithm base change. While ΔU • OA is itself temperature dependent, the derivation of Eq. 1 assumes it to be constant over small temperature ranges [12], and therefore a ΔU • OA can be derived by regressing log 10 K OA against reciprocal temperature (273.15 + T) −1 .
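As a minimal sketch of how such a temperature correction might be applied, the following Python function shifts a log 10 K OA value from one temperature to another using the van't Hoff form with a constant ΔU • OA . The function name and the example chemical are hypothetical, and the sign convention (a negative ΔU • OA makes K OA increase as temperature falls) is an assumption chosen to match the negative ΔU • OA values reported in this work.

```python
import math

R = 8.314e-3  # ideal gas constant in kJ·K^-1·mol^-1, as given in the text


def adjust_log_koa(log_koa_t1, delta_u_oa, t1_c, t2_c):
    """Shift log10 K_OA from t1_c to t2_c (both in °C).

    Assumes delta_u_oa (kJ/mol) is constant over the temperature range
    and negative for chemicals whose K_OA increases on cooling.
    """
    inv_t1 = 1.0 / (273.15 + t1_c)
    inv_t2 = 1.0 / (273.15 + t2_c)
    return log_koa_t1 - (delta_u_oa / R) * (inv_t2 - inv_t1) * math.log10(math.e)


# Hypothetical chemical: log10 K_OA = 8.0 at 25 °C and ΔU_OA = -80 kJ/mol
log_koa_5c = adjust_log_koa(8.0, -80.0, 25.0, 5.0)
print(round(log_koa_5c, 2))
```

For this hypothetical chemical, cooling by 20 °C raises log 10 K OA by about one log unit, which illustrates why a temperature correction can matter in cold environments.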
The concentration of a solute in the gas phase can be expressed volumetrically (e.g., in units of mol·m −3 ) or as a partial pressure (in Pa). If the K OA is calculated using volumetric gas phase concentrations, a regression of log 10 K OA against reciprocal temperature yields ΔU • OA . However, when K OA is defined in terms of partial pressure, the enthalpy of phase transfer ( ΔH • OA , kJ·mol −1 ) is obtained. ΔU • OA or ΔH • OA can be converted into each other using [12,13]:

$$\Delta H^{\circ}_{\text{OA}} = \Delta U^{\circ}_{\text{OA}} - RT \tag{2}$$

where T is the absolute temperature (K).
The direct determination of the K OA and its temperature dependence using various experimental approaches can be challenging and time consuming. Recent work by Baskaran et al. [14] found that, compared to other K OA prediction techniques, polyparameter linear free energy relationships (ppLFERs) using Abraham descriptors are a reliable, fast, and easy-to-use method for estimating K OA . Traditionally, ppLFERs using Abraham descriptors combine six system constants and five solute descriptors to describe how a chemical interacts and partitions between two phases [15]. The system constants, represented by lower-case letters, are determined using a multiple linear regression of the property against the solute descriptors, which are expressed with upper-case letters. These solute descriptors include: E (excess molar refraction), S (polarizability/dipolarity), A (hydrogen bond acidity), B (hydrogen bond basicity), V (McGowan molar volume), and L (log 10 of the hexadecane-air partition ratio). The product of a solute descriptor and a system constant, such as sS or bB, describes the energetic contribution of one particular type of intermolecular interaction to the property [15].
Properties describing the partitioning of a solute between the gas phase and a condensed phase can be calculated using two different ppLFERs, which differ in the use of either the E or the V descriptor:

$$\log_{10} K = c + eE + sS + aA + bB + lL \tag{3}$$

$$\log_{10} K = c + sS + aA + bB + vV + lL \tag{4}$$

Eq. 3 is intended to exclusively describe partitioning between the gas phase and a condensed phase, while Eq. 4 can be applied to any two phases, including two condensed phases [15].
In previous work [14] we presented a ppLFER without either V or E to estimate the K OA at 25 °C that performed as well as the 5-parameter equations by Abraham and Acree [16] and Endo and Goss [17]. Jin et al. [18] also developed a ppLFER equation for log 10 K OA which directly incorporates temperature in the multiple linear regression but used a small and limited data set [14]. In order to predict K OA values at temperatures other than 25 °C, Baskaran et al. [14] recommend using the 4-parameter ppLFER equation for K OA and a ppLFER equation for ΔH • OA by Mintz et al. [19]. The ΔH • OA and the log 10 K OA at 25 °C estimated this way were found to be highly correlated (R 2 > 0.98) [14], in analogy to previous work that has shown enthalpies of adsorption ( ΔH • ads ) and the logarithm of the adsorption constants (log 10 K ads ) to be linearly correlated (R 2 > 0.91) [20]. Likewise, strong correlations between ΔH • vap and P L [20,21] have been observed, and Goss and Schwarzenbach [20] note that previous work (e.g., [22]) indicated strong relationships between enthalpies and partitioning ratios. The high correlation between log 10 K OA and ΔH • OA suggests that the temperature dependence of K OA can be estimated from K OA directly.
Other quantitative structure property relationships (QSPRs) for log 10 K OA use different molecular, topographic, geometric, and quantum-chemical descriptors [12], which require commercial software or intensive computational power. The Open (Quantitative) Structure-activity/property Relationship App (OPERA) model by Mansouri et al. [23] uses two molecular descriptors computed with the PaDEL descriptor software [24] to estimate log 10 K OA at 25 °C. PaDEL calculates one-, two-, and three-dimensional (1D, 2D, and 3D) molecular descriptors, whereby an increase in the dimensionality corresponds to an increase in the complexity of the encoded information [24]; we collectively refer to these descriptors as PaDEL descriptors. The OPERA model utilizes the PaDEL-predicted log 10 hexadecane-air partition ratio (L PaDEL ) and the number of hydrogen-bond donor atoms (nHBDon), both 2D descriptors [25]. The L PaDEL is identical to the L Abraham solute descriptor, while the nHBDon is somewhat similar to the A Abraham solute descriptor, because they both describe the capacity of the chemical to donate protons.
Both the Abraham and PaDEL descriptors are easily acquired. Two types of Abraham descriptors, experimental and estimated, are available from the UFZ-LSER website [26]. Experimental descriptors are measured directly or through chromatographic retention time techniques [15]. Solute descriptors can also be predicted from a chemical's Simplified Molecular Input Line Entry System (SMILES) notation using the IFS-QSAR models [27] integrated into the UFZ-LSER website and the EAS-E Suite platform [28] or directly from the standalone Python package available on GitHub [29]. All PaDEL descriptors can be estimated from the SMILES strings using either stand-alone software [24] or QSARINS [30].
Since the development of the ΔH • OA ppLFER by Mintz et al. [19,31], a large database of log 10 K OA has been assembled [12]. This database can be used to create a comprehensive dataset of ΔU • OA for chemicals with measured log 10 K OA data at multiple temperatures. In this work we utilize a newly assembled and curated ΔU • OA database to develop linear regression models for ΔU • OA using Abraham solute descriptors, PaDEL descriptors, and/or the log 10 K OA at 25 °C.

Data Curation
The development of a reliable model for ΔU • OA relies on acquiring empirical data and processing the data through several data curation steps to reduce errors. We use measured log 10 K OA values from the K OA database [12] to derive experimental ΔU • OA values. We also searched for directly measured ΔH • OA and ΔU • OA values, obtained using calorimetric techniques, to build our training and external validation datasets.

Data from the K OA Database
All measured values of log 10 K OA were extracted from the K OA database and filtered to remove any measurements made (i) using indirect techniques (data from gas chromatographic retention times in octanol-filled columns were considered directly obtained), (ii) using water-saturated octanol, (iii) for a mixture or a chemical with an ambiguous structure, or (iv) for inorganic and labelled compounds. We also removed duplicate values and values considered unreliable within the database. Finally, chemicals with measured K OA values at fewer than four different temperatures were removed, which left 149 chemicals. Log 10 K OA values were linearly regressed against inverse absolute temperature to obtain ΔU • OA from the slope. We then considered the strength of the correlation based on R 2 , the standard error of ΔU • OA as derived from the error of the slope, and how the ΔU • OA calculated from this regression compared with published data.
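The regression step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is hypothetical, the data are synthetic (generated from an exactly linear relationship), and the slope is converted to ΔU • OA by assuming the van't Hoff form in which the slope of log 10 K OA against reciprocal absolute temperature equals −ΔU • OA ·log 10 e/R.

```python
import math

R = 8.314e-3  # ideal gas constant in kJ·K^-1·mol^-1


def delta_u_from_regression(temps_c, log_koa):
    """Least-squares fit of log10 K_OA against 1/(273.15 + T).

    Returns (delta_u, standard_error, r_squared) with energies in kJ/mol.
    """
    if len(set(temps_c)) < 4:
        raise ValueError("measurements at four or more temperatures are required")
    n = len(temps_c)
    x = [1.0 / (273.15 + t) for t in temps_c]
    x_mean = sum(x) / n
    y_mean = sum(log_koa) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in log_koa)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, log_koa))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, log_koa))
    r_squared = 1.0 - ss_res / syy
    se_slope = math.sqrt(ss_res / (n - 2) / sxx)
    to_energy = R * math.log(10.0)  # converts a log10 slope (in K) to kJ/mol
    return -slope * to_energy, se_slope * to_energy, r_squared


# Synthetic, exactly linear data with a slope of 4000 K (not measured values)
temps = [5.0, 15.0, 25.0, 35.0]
log_koa = [10.0 + 4000.0 * (1.0 / (273.15 + t) - 1.0 / 298.15) for t in temps]
delta_u, delta_u_se, r2 = delta_u_from_regression(temps, log_koa)
print(round(delta_u, 2), round(r2, 4))
```

With real measurements, the returned standard error and R 2 would be the quantities used to judge whether a chemical's ΔU • OA is reliable enough to retain.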
We eliminated 10 chemicals from the dataset because the log 10 K OA values used to calculate ΔU • OA were obtained from a single reference and the ΔU • OA values calculated in our regression disagreed with the value published in that reference. In many instances, the published ΔU • OA was calculated from a regression of three log 10 K OA values at different temperatures. Including a fourth log 10 K OA measurement (obtained via personal communication with the authors, see [12]) caused the calculated ΔU • OA to deviate from the reported value. We assume that the authors of these original measurements did not have strong confidence in the unpublished log 10 K OA values. Thus, data for chlorinated dibenzo-p-dioxins (CDDs) 48, 50, 54, 66, and 73 and brominated diphenyl ether (BDE) 183 were removed from the ΔU • OA dataset. For o,pʹ-dichlorodiphenyltrichloroethane, p,pʹ-dichlorodiphenyldichloroethane, cis-nonachlor, and endrin, removing individual log 10 K OA values did not lead to an agreement between previously published and calculated ΔU • OA values, which suggests that there may be some disparity between the log 10 K OA values used in the regressions (also obtained via direct communication with the authors) and the published data. Four chemicals (isopropyl ether, methane, BDE 99, and BDE 153) in the dataset were removed because the R 2 was below 0.95 and removing any outlier left only three data points in the regression. The regressions of the log 10 K OA against inverse temperature for these chemicals are shown in Figs. SI 1-SI 3.
Although there have been multiple measurements of log 10 K OA for perfluoroalkyl substances (PFAS), agreement between data from different studies is poor. As such, all PFAS compounds (perfluorooctane sulfonamido ethanols, perfluorooctane sulfonamide, fluorotelomer alcohols, and fluorotelomer acrylates) were excluded from the ΔU • OA dataset. The discrepancies between these values are further discussed in Sect. 3.1.1.
By assessing the temperature regression plots for a few chemicals, it was clear that some measured log 10 K OA data deviated from others. In many cases these outliers were from the same paper and often used older and/or objectively less precise measurement techniques (e.g., [32][33][34][35][36][37][38][39][40]). In these instances, we removed the outliers and recalculated the ΔU • OA using the remaining log 10 K OA data. In the case of propanol, after removing the obvious outliers from Eger et al. [34], we took the average ΔU • OA value calculated from regressing temperature dependent log 10 K OA data from Lei et al. [41] and Gruber et al. [42], which subsequently caused the elimination of a single datapoint published by Abraham et al. [43]. Plots of these regressions, including the outliers, are available in Figs. SI 5 and SI 6 in the supporting information.
For some compounds the R 2 for individual sets of temperature-dependent log 10 K OA values was high, but individual log 10 K OA values at specific temperatures deviated by 0.3 to 1 log 10 units between papers. In these cases we calculated a ΔU • OA value separately from the log 10 K OA values of each paper and took the average; in other words, we essentially took the average of the two slopes. This applied to hexachlorobenzene (HCB) and the polychlorinated biphenyls (PCBs) 153 and 180 (see Fig. SI 7).
After this step, we were left with 123 chemicals with reliable ΔU • OA values from the K OA database.

Direct Measurements of ΔH • OA and ΔU • OA
Mintz et al. [19,31] used ΔH • OA values for 138 chemicals to develop ppLFER equations for the enthalpy of phase change between wet and dry octanol and air. The ΔH • OA values used to calibrate those models were cross-referenced against the original sources cited by Mintz et al. [31]. During this process, we found more directly measured ΔH • OA values in the literature. The ΔH • OA values were subsequently converted to ΔU • OA using Eq. 2. We took the average of the ΔU • OA values when more than one literature value was available. Any chemical from the Mintz et al. dataset of ΔH • OA values that already had a ΔU • OA value calculated from the K OA database was excluded, because these ΔH • OA values were either direct calorimetric measurements or were based on log 10 K OA data already included in the K OA database [12]. In this way, preference was given to values obtained from multiple direct measurements of log 10 K OA over individual calorimetric measurements. In total, 72 chemicals with directly measured ΔH • OA values (converted to ΔU • OA ) were included in our dataset.
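A minimal Python sketch of the conversion and averaging step is given here for illustration. It assumes that the conversion takes the common form ΔU • OA = ΔH • OA + RT, with both quantities negative for transfer from air to octanol, and that it is evaluated at 25 °C. Both assumptions, the function names, and the example values are ours and are not taken from the curated dataset.

```python
R = 8.314e-3  # ideal gas constant in kJ·K^-1·mol^-1


def delta_h_to_delta_u(delta_h, temp_c=25.0):
    """Convert an enthalpy of phase transfer to an internal energy (kJ/mol).

    Assumes delta_u = delta_h + R*T, with T the absolute temperature.
    """
    return delta_h + R * (273.15 + temp_c)


def mean_delta_u(delta_h_values, temp_c=25.0):
    """Average the converted values when several literature values exist."""
    converted = [delta_h_to_delta_u(dh, temp_c) for dh in delta_h_values]
    return sum(converted) / len(converted)


# Hypothetical literature enthalpies for one chemical (kJ/mol)
print(round(mean_delta_u([-60.0, -62.0]), 2))
```

At 25 °C the RT term is about 2.5 kJ·mol −1 , so the conversion is a small but systematic shift.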

Chemical Identifiers, Descriptors, and Data Splitting
The chemical names, acronyms, CAS numbers, and SMILES notations for all 195 chemicals in the ΔU • OA dataset were obtained from either the K OA database or the CompTox Dashboard. We considered three types of descriptors for estimating ΔU • OA : (i) the log 10 K OA at 25 °C, (ii) Abraham descriptors, and (iii) PaDEL descriptors.
Abraham solute descriptors (E, S, A, B, V, and L) were obtained from the UFZ-LSER database [26]. For experimental solute descriptors, we gave preference to UFZ pre-selected values over ABSOLV values. Estimated solute descriptors for all chemicals were calculated from their SMILES using the IFS-QSARs built into EAS-E Suite [28,29]. 178 chemicals had a full set of experimental solute descriptors, 4 chemicals were only missing experimental E values, and 13 chemicals had only estimated solute descriptors.
1444 1D and 2D PaDEL descriptors and 881 PubChem fingerprints (v2.21) were obtained from the chemicals' SMILES notation using QSARINS [30]. 3D molecular descriptors were not included because they require geometrically optimized 3D chemical structures (e.g., SDF or MOL files). Although the chemicals in the ΔU • OA dataset are structurally simple, 3D optimization cannot be efficiently scaled up for high-throughput applications and as such is not ideal for property estimation in the context of chemical risk assessment. Filters were applied to the PaDEL descriptors: descriptors that had a pair-wise correlation greater than 95% with another descriptor, or that were constant for more than 80% of the chemicals in the dataset, were excluded. We also removed all descriptors that had missing values. In the end 531 PaDEL descriptors and fingerprints were considered, which included PaDEL-calculated Abraham descriptors and 32 PubChem fingerprints. PubChem fingerprints were excluded from model development and only used in a cluster analysis for data splitting as described later in this section.
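The three descriptor filters can be illustrated with a short Python sketch. It is not the QSARINS implementation: the function names, the toy descriptor table, and the rule for deciding which member of a highly correlated pair to drop (here, the one encountered later) are assumptions made for the example.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    syy = sum((b - y_mean) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def filter_descriptors(table, max_constant=0.80, max_corr=0.95):
    """Keep descriptors without missing values that are neither
    near-constant nor highly correlated with an already kept descriptor."""
    kept = {}
    for name, values in table.items():
        if any(v is None for v in values):
            continue  # descriptor has missing values
        most_common_count = Counter(values).most_common(1)[0][1]
        if most_common_count / len(values) > max_constant:
            continue  # same value for more than 80% of the chemicals
        if any(abs(pearson(values, other)) > max_corr for other in kept.values()):
            continue  # pair-wise correlation above 95%
        kept[name] = values
    return kept


# Toy table with six chemicals; names and values are purely illustrative
table = {
    "L": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "L_copy": [2.0, 4.0, 6.0, 8.0, 10.0, 12.1],
    "flag": [0, 0, 0, 0, 0, 1],
    "gap": [1.0, None, 3.0, 4.0, 5.0, 6.0],
    "A": [0.0, 0.5, 0.0, 0.3, 0.0, 0.6],
}
kept = filter_descriptors(table)
print(list(kept))
```

Because the near-constant check runs before the correlation check, no zero-variance descriptor reaches the correlation calculation.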
Directly measured log 10 K OA values were extracted from the K OA database [12] and the average used when more than one experimental value existed. Estimated log 10 K OA values were calculated using the experimental and calculated Abraham solute descriptors and the 4-parameter ppLFER for K OA [14].
The curated ΔU • OA dataset was split into a model development and an external validation dataset based on the availability of descriptors for each chemical. The 54 chemicals that had either some missing experimental Abraham solute descriptors or an estimated log 10 K OA value were set aside to be used for external validation. The remaining 141 chemicals were used to develop multiple linear regression (MLR) models using the descriptors.
In order to reduce bias in descriptor selection, we split the development dataset using four different splitting techniques, with a ratio of 3:1 (i.e., 75% of chemicals were used in the training set and 25% in the validation set) [44]. In the first and second splits, chemicals were ordered by ΔU • OA and log 10 K OA at 25 °C, respectively, and the chemicals with the highest and lowest value were included in the training dataset. In the random split, 75% of chemicals were randomly selected to be a part of the training dataset. We used cluster analysis (Ward's method and Tanimoto distance) using PubChem fingerprints and Principal Component Analysis (PCA) of experimental Abraham solute descriptors to group chemicals in the model development dataset into 4 structural clusters (see Figs. SI 11 and SI 12). From each cluster we randomly selected 75% of the chemicals to be used as a training dataset. Figure 1 shows how the chemical datasets were split and used to develop ΔU • OA models.
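The splitting strategies can be sketched in Python as follows. This is an illustration only: QSARINS provides its own splitting routines, and the details here (assigning every fourth chemical of the ordered list to the validation set, the rounding of the 75% fraction, the fixed random seed, and the function names) are assumptions.

```python
import random


def ordered_split(values, every=4):
    """Order chemicals by a response and send every fourth one to validation.

    The chemicals with the lowest and highest value always stay in training.
    Returns (training_indices, validation_indices).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    validation = [
        idx for pos, idx in enumerate(order)
        if pos % every == every // 2 and 0 < pos < len(order) - 1
    ]
    chosen = set(validation)
    training = [idx for idx in order if idx not in chosen]
    return training, validation


def random_split(n_items, train_fraction=0.75, seed=0):
    """Randomly select about 75% of the chemicals for training."""
    rng = random.Random(seed)
    training = sorted(rng.sample(range(n_items), round(train_fraction * n_items)))
    chosen = set(training)
    validation = [i for i in range(n_items) if i not in chosen]
    return training, validation


def clustered_split(cluster_labels, train_fraction=0.75, seed=0):
    """Randomly select about 75% of the chemicals within each cluster."""
    rng = random.Random(seed)
    groups = {}
    for idx, label in enumerate(cluster_labels):
        groups.setdefault(label, []).append(idx)
    training = []
    for members in groups.values():
        training.extend(rng.sample(members, round(train_fraction * len(members))))
    chosen = set(training)
    validation = [i for i in range(len(cluster_labels)) if i not in chosen]
    return sorted(training), validation


# Hypothetical responses for 141 chemicals (not values from the dataset)
responses = [-(20.0 + 0.5 * i) for i in range(141)]
train_idx, val_idx = ordered_split(responses)
print(len(train_idx), len(val_idx))
```

Using several splits of this kind makes it less likely that a descriptor is selected only because of the particular chemicals that happened to fall into one training set.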

Model Calculation and Selection
Model calculation and selection were completed using QSARINS [30] on the four different splits of the model development dataset. For each split we developed different models based on the different kinds of molecular descriptors. First, we explored the traditional ppLFER equations, assessing all combinations (all-subsets with 1 to 5 descriptors) using experimental Abraham descriptors. Then, we substituted the Abraham L with the experimental log 10 K OA at 25 °C, as both variables describe partitioning between the gas phase and a hydrophobic condensed phase. Given the high number of PaDEL descriptors, we used the all-subset approach (up to 2 variables) and the Genetic Algorithm (up to 4 variables; parameters are included in the SI) built into QSARINS to identify the best combination of PaDEL descriptors. Finally, we assessed the performance of the model using the two PaDEL descriptors used by the OPERA model for estimating log 10 K OA , namely the number of hydrogen bond donors (nHBDon) and the log 10 hexadecane-air partition ratio (L) [25]. During model development and selection, we filtered out models with p-values for the regression coefficients higher than 0.05, as we could not be confident that the values of these coefficients were different from 0. We restricted the maximum number of variables (i.e., descriptors) included in a regression equation to 4 to ensure models remained relatively simple and to avoid overfitting. Model performance across all four splits was assessed using different variations of the determination coefficient (R 2 , R 2 adj , Q 2 LOO , Q 2 LMO , Q 2 F1 , Q 2 F2 , Q 2 F3 ), the concordance correlation coefficient (CCC EXT ), root mean squared errors (RMSE TR , RMSE EXT ), and mean absolute errors (MAE TR , MAE EXT ) on the training set and on the prediction/external set [45][46][47][48]. Table SI 1 summarises these statistics for the models within each split.
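For readers unfamiliar with these error statistics, a minimal Python sketch of the RMSE, MAE, and concordance correlation coefficient is given here. The CCC is written in the form of Lin's concordance correlation coefficient; the function names and the example values are illustrative and are not output from QSARINS.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def ccc(observed, predicted):
    """Concordance correlation coefficient (precision and accuracy combined)."""
    n = len(observed)
    o_mean = sum(observed) / n
    p_mean = sum(predicted) / n
    cross = sum((o - o_mean) * (p - p_mean) for o, p in zip(observed, predicted))
    ss_obs = sum((o - o_mean) ** 2 for o in observed)
    ss_pred = sum((p - p_mean) ** 2 for p in predicted)
    return 2.0 * cross / (ss_obs + ss_pred + n * (o_mean - p_mean) ** 2)


# Hypothetical experimental and predicted ΔU_OA values (kJ/mol)
observed = [-30.0, -50.0, -70.0, -90.0]
predicted = [-32.0, -49.0, -73.0, -90.0]
print(round(rmse(observed, predicted), 2), mae(observed, predicted), round(ccc(observed, predicted), 4))
```

Unlike R 2 , the CCC is lowered by a systematic offset between predicted and experimental values, which is why it is useful for external validation.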
After identifying the best models for different types of descriptors, we used all 141 ΔU • OA values as the training set to develop new equations using the same descriptors (i.e., the so-called full model). These models were assessed using the 54 chemicals in the external validation dataset. It is important to note that not all chemicals in the external validation set have empirical values for the descriptors A, L, and log 10 K OA at 25 °C. When experimental A and L values were not available, IFS-QSAR estimated solute descriptors were used. Log 10 K OA values at 25 °C were estimated using the 4-parameter ppLFER equation for K OA [14] using experimental solute descriptors. All PaDEL descriptors are calculated from the SMILES notation of a chemical and so all descriptors are considered estimates.
Fig. 1 A scheme of the source of ΔU • OA data and how this dataset is split and used for model development and validation. 3:1 split indicates that 75% of the chemicals in the dataset were used for the training data and the remaining 25% of chemicals were included in an internal validation process

Data Availability and the General Applicability Domain
Analysis of the log 10 K OA database showed a bimodal distribution in the experimental data [12]. Given the high correlation between log 10 K OA and ΔU • OA [14], it is not surprising that the same trend is observed for the curated ΔU • OA dataset (Fig. 2). This pattern arises because most ΔU • OA values in this dataset are derived from temperature dependent measurements of the log 10

Perfluoroalkyl Substances (PFAS)
The partitioning properties of PFAS compounds are extremely difficult to measure because they can act as surfactants and likely have a very low solubility in octanol. Data from different studies regularly display divergent results and it is not possible to establish which ones are correct.
Most fluorotelomer alcohols (FTOHs) have log 10 K OA values around 5, which is at the lower limit of the generator column technique. They are also too polar to apply the gas-chromatographic retention time technique [49]. In combination, this means that obtaining reliable measurements for the K OA of FTOHs is quite difficult. Figure SI 15 Fig. SI 10). We would expect that ΔU • OA values for FTOHs become more negative with increasing chain length, because the K OA values increase with chain length. While this occurs for ΔU • OA values from Goss et al. [50], ΔU • OA declines with increasing chain length when using the data from Thuens et al. [51]. As the ΔU • OA values from Goss et al. [50] have been calculated from measurements at only two (4:2 FTOH) and three (6:2 FTOH and 8:2 FTOH) temperatures, there are insufficient data to include ΔU • OA values for FTOHs. The temperature dependence of the K OA for methyl and ethyl perfluorooctane sulfonamido ethanols (Me-FOSE and Et-FOSE) reported by two studies [52,53], and therefore also the ΔU • OA derived from those data, are quite different (Fig. SI 8). Dreyer et al. [52] also published K OA data on methyl and ethyl perfluorooctane sulfonamides (Me-FOSA and Et-FOSA), whereby the plot suggests that the values obtained at very low and high temperatures may be outliers. Given the uncertainty of the K OA values for Me-FOSA, Et-FOSA, and Et-FOSE, we have also excluded those for Me-FOSE, as there is no way for us to tell whether they are valid. Similar to Et-FOSE, the log 10 K OA values reported by Dreyer et al. [52] for 6:2 and 8:2 fluorotelomer acrylates (6:2 FTAc and 8:2 FTAc) also display poor linearity with reciprocal temperature (R 2 < 0.9), and therefore we also chose to exclude the K OA data for 10:2 FTAc from the ΔU • OA dataset (Fig. SI 9). Ultimately, we decided to exclude all PFAS compounds from the data set.

Model Prioritization
As not all chemicals with ΔU • OA values had all available descriptors, a subset of 141 chemicals was used to test and identify the best performing models within each split. We selected five models to examine in detail. Those are the regressions using (i) A and L, (ii) only the log 10 K OA at 25 °C, (iii) A and log 10 K OA at 25 °C, (iv) nHBDon and L PaDEL , and (v) nHBDon, the number of carbons (nC), and the number of halogens (nX). As initially hypothesized, the log 10 K OA at 25 °C alone proved to be a very good predictor of ΔU • OA (Table SI 1). The model using the Abraham solute descriptors A and L performed almost as well. The log 10 hexadecane-air partition ratio L can describe the non-polar interactions between the compound and the solvent, while the hydrogen bond acidity A describes the potential for a chemical to donate a hydrogen. These two parameters describe similar chemical interactions as the two PaDEL descriptors L PaDEL and nHBDon, used to predict log 10 K OA in the OPERA model.
The model selected with the Genetic Algorithm using only PaDEL descriptors uses nHBDon, nC, and nX. These descriptors are not able to describe chemicals to the same extent Abraham descriptors do and their selection is a consequence of the data used to train the models. This is discussed in more detail in the following section.
All five models performed well during internal and external validation processes across all four splits (Table SI 1). The R 2 adj describes the R 2 while correcting for the number of descriptors in a model. The RMSE TR and MAE TR were in almost all cases slightly smaller than the RMSE EXT and MAE EXT ; however, given the range of the ΔU • OA values used in model development, the difference is likely unremarkable. Internal cross validation using leave-one-out ( Q 2 LOO ) and leave-more-out ( Q 2 LMO ) was very similar to the R 2 and R 2 adj for all models (> 0.90). However, the external predictive ability of the models ( Q 2 F1 , Q 2 F2 , Q 2 F3 ) was slightly lower than the R 2 . The CCC EXT , describing both the precision and accuracy of all models, was almost always greater than 0.90.
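To make these statistics concrete, the following Python sketch computes the adjusted R 2 and the three external validation coefficients. The definitions used are the ones commonly given in the QSAR literature and are stated here as an assumption, since the formulas are not spelled out in this section; the function names and example values are hypothetical.

```python
def adjusted_r2(r2, n_chemicals, n_descriptors):
    """R2 corrected for the number of descriptors in the model."""
    return 1.0 - (1.0 - r2) * (n_chemicals - 1) / (n_chemicals - n_descriptors - 1)


def external_q2(y_train, y_ext, y_ext_pred):
    """Return (Q2_F1, Q2_F2, Q2_F3) for an external prediction set."""
    train_mean = sum(y_train) / len(y_train)
    ext_mean = sum(y_ext) / len(y_ext)
    press = sum((y - p) ** 2 for y, p in zip(y_ext, y_ext_pred))
    ss_ext_about_train = sum((y - train_mean) ** 2 for y in y_ext)
    ss_ext_about_ext = sum((y - ext_mean) ** 2 for y in y_ext)
    ss_train = sum((y - train_mean) ** 2 for y in y_train)
    q2_f1 = 1.0 - press / ss_ext_about_train
    q2_f2 = 1.0 - press / ss_ext_about_ext
    q2_f3 = 1.0 - (press / len(y_ext)) / (ss_train / len(y_train))
    return q2_f1, q2_f2, q2_f3


# Hypothetical ΔU_OA values (kJ/mol), not taken from the dataset
y_train = [-30.0, -50.0, -70.0, -90.0]
y_ext = [-40.0, -80.0]
y_ext_pred = [-42.0, -77.0]
print(round(adjusted_r2(0.95, 141, 2), 4), external_q2(y_train, y_ext, y_ext_pred))
```

The three Q 2 variants differ only in the reference variance against which the external prediction errors are scaled, which is why they usually give similar but not identical values.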

External Validation
Of the 54 chemicals in the external validation dataset, 17 chemicals had experimental log 10 K OA values and 41 chemicals had experimental A and L solute descriptors that were retrieved from the UFZ-LSER database. All chemicals without experimental log 10 K OA values had all the experimental solute descriptors necessary to estimate log 10 K OA using the four-parameter ppLFER equation [14]. As the PaDEL descriptors for all chemicals were estimated, we did not differentiate between the source of A, L, and log 10 K OA at 25 °C used to externally validate the model; however, wherever possible, we used experimental log 10 K OA values and solute descriptors. Full model equations are listed in Table 1.
By using all 141 chemicals to train the models and 54 chemicals to externally validate the models, we see that Model 3, using log 10 K OA and A, had the best overall performance across the external validation statistics, including CCC EXT . Model 2, using log 10 K OA , performs almost as well as Model 3, followed by Model 1. Models 4 and 5 had the poorest performance.
In Fig. 3, we can see that all models generally perform better at higher (i.e., less negative) ΔU • OA values. Models 4 and 5 appear to have a larger number of outliers (> 10 kJ·mol −1 ) at lower ΔU • OA values relative to Models 1, 2, and 3; however, it is important to recognize that the relative difference between predicted and experimental values is smaller for increasingly negative ΔU • OA values. An error of − 10 kJ·mol −1 for a ΔU • OA of − 90 kJ·mol −1 corresponds to ~ 11%, whereas the same error for a ΔU • OA of − 30 kJ·mol −1 is ~ 33%.
In Fig. SI 16 we present Williams plots of the standardized residual plotted against the leverage. The standardized residual (sr i ) of a chemical i is obtained by correcting residuals (r i ) using the standard deviation (sd) of the model and the leverage of a prediction (h ii ):

$$sr_i = \frac{r_i}{sd\,\sqrt{1 - h_{ii}}}$$

Standardized residual values are considered high when the absolute value is greater than 2.5 [44]. The leverage of a training chemical indicates how much influence it has on the regression; chemicals with high leverages have a large influence on the model [44]. A chemical has a high leverage value if the leverage is greater than the warning leverage (h * ), which is a function of the number of chemicals in the training dataset (n) and the number of parameters in the model (k) [44]:

$$h^{*} = \frac{3\,(k + 1)}{n}$$

Standardized residuals and leverage values were obtained from QSARINS.
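The quantities behind a Williams plot can be sketched in Python as follows. The sketch assumes the commonly used definitions sr i = r i /(sd·√(1 − h ii )) and h * = 3(k + 1)/n, with k counting the descriptors and excluding the intercept; these forms, the function names, and the example numbers are assumptions and do not reproduce QSARINS output.

```python
import math


def relative_error_percent(error, value):
    """Absolute error expressed as a percentage of the experimental value."""
    return abs(error / value) * 100.0


def standardized_residual(residual, sd, leverage):
    """Residual scaled by the model standard deviation and the leverage."""
    return residual / (sd * math.sqrt(1.0 - leverage))


def warning_leverage(n_training, n_descriptors):
    """Leverage threshold above which a chemical is considered influential."""
    return 3.0 * (n_descriptors + 1) / n_training


def is_flagged(residual, sd, leverage, n_training, n_descriptors):
    """True if a chemical has a high standardized residual or a high leverage."""
    high_residual = abs(standardized_residual(residual, sd, leverage)) > 2.5
    high_leverage = leverage > warning_leverage(n_training, n_descriptors)
    return high_residual or high_leverage


# The same 10 kJ/mol error is a larger fraction of a less negative ΔU_OA
print(round(relative_error_percent(-10.0, -90.0)), round(relative_error_percent(-10.0, -30.0)))
# Warning leverage for a two-descriptor model trained on 141 chemicals
print(round(warning_leverage(141, 2), 4))
```

For a model trained on 141 chemicals the warning leverage is small, so only chemicals whose descriptor values lie far from the bulk of the training set exceed it.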
Considering only the number of chemicals with high leverage and high standardized residuals (Table 3), Model 2 has no chemical in the training set with a large leverage value and has only 7 chemicals with high standardized residuals. The largest standardized residual was for 1,2,3,4-tetramethylbenzene in the external validation dataset; however, the log 10 K OA (4.4) measured using the variable headspace ratio technique is known to be at the upper limit of the technique [41]. Thus, it is likely that this high standardized residual is due to the high uncertainty of the reported K OA , ΔU • OA , or both, rather than an error in the model. There was also an overlap in chemicals with high standardized residuals between models. Fluorene had a high standardized residual in all models. PCB 155 and 1,2,3,4-tetramethylbenzene had high standardized residuals in all models except Models 1 and 4, respectively. The standardized residuals for BDE 154 and trans-nonachlor were high in Models 1, 2, and 3, while β-HCH and δ-HCH had high standardized residuals in Models 1, 4, and 5. Heptachlor had high standardized residuals in Models 1 and 5, and PCBs 61 and 77 had high standardized residuals in Models 2 and 3.
The internal validation results presented in Table 2 show that Models 1, 2, and 3 perform better than Models 4 and 5 based on R², R²_adj, RMSE_TR, MAE_TR, RMSE_CV, MAE_CV, Q²_LOO, and Q²_LMO. Model 5 has the highest RMSE_EXT, followed by Model 1. Model 5 also has the highest MAE_EXT

Comparison with Other ΔU°OA Models
The ΔH°OA ppLFER model by Mintz et al. [19] has been shown to reliably predict the temperature dependence of KOA [14]. We compared the performance of this ppLFER for ΔH°OA [19], converted to ΔU°OA, with Model 2. Since both models are built on similar datasets, and an evaluation with chemicals that had been used to train and develop the models would bias the results, we rely only on chemicals not used in the training dataset of either model. This limits the comparison to 28 chemicals, including several BDEs and CDDs. The training data for the Mintz et al. [19] model includes 138 chemicals; experimental solute descriptors and experimental ΔH°OA values for all of these chemicals are provided in a previous publication [31]. During our data curation process we found small errors in the reported ΔH°OA values and the misidentification of a few chemicals, which means that some of the descriptors used are erroneous. The Mintz et al. [19] model also includes inorganic chemicals (e.g., nitrogen and xenon), which were excluded from our model development. Due to these differences, we compare the statistics of the model residuals (Table 4) and the residual of each prediction (Fig. 4).
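The conversion from ΔH°OA to ΔU°OA, and the use of ΔU°OA to adjust KOA to another temperature, can be sketched as follows. The sketch assumes ideal-gas behaviour for the air phase, so that ΔU°OA = ΔH°OA + RT for transfer from air to octanol, and a van 't Hoff-type relationship with ΔU°OA constant over the temperature range. The function names, the reference temperature of 298.15 K, and the example numbers are illustrative assumptions, not values from the article.

```python
import math

R = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1


def delta_u_from_delta_h(delta_h_oa, temp_k=298.15):
    """Convert an air-to-octanol enthalpy (kJ/mol) to an internal energy.

    Assumes dU = dH + RT, i.e. one mole of ideal gas is lost on transfer
    from air to octanol.
    """
    return delta_h_oa + R * temp_k


def log_koa_at_temperature(log_koa_ref, delta_u_oa, temp_k, temp_ref_k=298.15):
    """Adjust log10 KOA from temp_ref_k to temp_k (van 't Hoff form).

    Assumes delta_u_oa (kJ/mol) does not change between the two temperatures.
    """
    return log_koa_ref - delta_u_oa / (math.log(10) * R) * (
        1.0 / temp_k - 1.0 / temp_ref_k
    )


# Hypothetical chemical: log10 KOA = 8.0 at 25 degC, dH_OA = -80 kJ/mol.
delta_u = delta_u_from_delta_h(-80.0)
log_koa_5c = log_koa_at_temperature(8.0, delta_u, temp_k=278.15)
print(round(delta_u, 2), round(log_koa_5c, 2))
```

Because ΔU°OA is negative, the adjusted log10 KOA increases as the temperature decreases.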
For both models we use the best available descriptors (i.e., experimental values wherever possible and otherwise IFS-QSAR estimated values). In this combined external validation process (Table 4), Model 2 has a smaller bias (average residual), MAE, standard deviation (SD), and RMSE. The Mintz et al. [19] model has a larger number of chemicals with errors exceeding 10 kJ·mol⁻¹, and Model 2 has more ΔU°OA estimates with absolute errors less than 5 kJ·mol⁻¹ (Fig. 4).
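The residual statistics named above can be computed with a short sketch such as the following. It assumes that the residual is the predicted minus the experimental value and that SD is the sample standard deviation of the residuals; the function name, dictionary keys, and example numbers are hypothetical.

```python
import math
import statistics


def residual_summary(predicted, experimental, large=10.0, small=5.0):
    """Summarize prediction residuals (all values in kJ/mol)."""
    # Residual convention assumed here: predicted minus experimental.
    residuals = [p - e for p, e in zip(predicted, experimental)]
    n = len(residuals)
    return {
        "bias": sum(residuals) / n,  # average residual
        "MAE": sum(abs(r) for r in residuals) / n,
        "SD": statistics.stdev(residuals),  # sample SD (assumed)
        "RMSE": math.sqrt(sum(r * r for r in residuals) / n),
        "n_abs_gt_large": sum(abs(r) > large for r in residuals),
        "n_abs_lt_small": sum(abs(r) < small for r in residuals),
    }


# Hypothetical predicted and experimental internal energies (kJ/mol).
predicted = [-60.0, -70.0, -80.0, -45.0]
experimental = [-58.0, -82.0, -79.0, -50.0]
summary = residual_summary(predicted, experimental)
print(summary)
```

Applying the same function to both models on the same set of chemicals gives directly comparable statistics.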
If we compare the performance of the Mintz et al. [19] model and Model 2 using all 195 chemicals in the ΔU°OA dataset, Model 2 performs better than the Mintz et al. [19] model (Table 4). While there is little difference between the statistics of the residuals, fewer chemicals have residuals greater than 10 kJ·mol⁻¹ when using Model 2. Both models overestimated the ΔU°OA by at least 10 kJ·mol⁻¹ for fluorene, PCN 57, cis-chlordane, trans-chlordane, and trans-nonachlor; all of these chemicals except trans-nonachlor are within the training dataset of Model 2. The models also underestimated the ΔU°OA for CDD 1, 1,2,3,4-tetramethylbenzene, BDE 17, BDE 28, BDE 154, PCB 61, and PCB 155, of which CDD 1 and BDEs 17, 28, and 154 were in the external validation dataset for Model 2.

In addition to the model by Mintz et al. [19], there is a ppLFER for wet-octanol–air ΔH° values [31], and there are models for ΔH°OA based on a support vector machine method, artificial neural networks, and MLRs using various molecular descriptors [54]. The wet-octanol–air ΔH°OA model is based on the same dataset used to develop the Mintz et al. [19] model above [31]. The latter three models also use the same dataset as a basis for model development but randomly split their dataset of 127 chemicals into a training dataset of 89 chemicals and a validation dataset of 38 chemicals [54]. The reported correlation coefficients for these models [31,54] are similar to or greater than what we report for Model 2, and the standard errors of the models developed using a support vector machine method and artificial neural networks are smaller than the RMSE of Model 2. However, these models cannot easily be used in high-throughput applications because acquiring the molecular descriptors requires multiple software applications [54], which can be time-consuming and increases the chance of human error.

Sources for Descriptors
While we used only experimentally derived solute descriptors to develop and train the models in this work, the source of solute descriptors may impact model performance. Sources for the descriptors include experimental values from the UFZ-LSER database [26], IFS-QSAR estimates from the standalone IFS-QSAR software [27,29] or EAS-E Suite [28], or PaDEL estimates made using QSARINS [30]. The first two are easily obtained from non-commercial web-based software. In Fig. 5

The Relationship Between log10 K and ΔU°
The strong relationship between log10 KOA and ΔU°OA in Model 2 mirrors similar relationships between vapour pressure (log10 PL) and the enthalpy of vaporization (ΔH°vap) and between adsorption constants (log10 Kads) and the enthalpies of adsorption (ΔH°ads) [20]. Goss and Schwarzenbach cite other such relationships and provide a thermodynamic explanation for their occurrence [20]. The equation relating ΔH°vap and PL [20] can be combined with a linear equation relating a chemical's PL with its log10 KOA [55] and a conversion from enthalpy to internal energy to give:

$$\Delta U^{\circ}_{\text{vap}} = 8.90 \cdot \log_{10} K_{\text{OA}} + 10.42 - RT \qquad (7)$$

The slope in this equation is very similar to the absolute value of the slope in Model 2 (−8.75). This is apparent in a vertical shift when estimated ΔU°OA and −ΔU°vap values are plotted against experimental ΔU°OA values (Fig. SI 18). The difference between Eq. 2 in Table 1 and Eq. 7 corresponds to the internal energy of dissolution of the chemical in octanol (i.e., ΔU°dissol = ΔU°OA + ΔU°vap). This analysis suggests that ΔU°dissol is very small and is relatively constant for compounds with a wide range in KOA. The observed relationship between KOA and ΔU°OA can perhaps be extrapolated to other partitioning systems such as KOW and ΔU°OW. Indeed, if ΔU°dissol is small and constant for a wide range of compounds, it may be possible to estimate ΔU°OW from the enthalpy of dissolution of a chemical in water or even the water solubility of a compound. However, large datasets with experimental values of each property are necessary before such a relationship can be verified.

Conclusion
We developed new models for predicting ΔU°OA that use fewer descriptors than earlier models. The best performing model (Model 2) relies only on the log10 KOA at 25 °C; additional parameters do not notably improve model performance. This model has similar or slightly improved performance relative to previous estimation techniques for ΔH°OA and ΔU°OA, which relied on more, or more complex, descriptors. Our work parallels previous findings that ΔH°vap can be predicted quite well just from PL [20]. Further measurements of log10 KOA and ΔU°OA values could improve the applicability domain of these models, particularly for chemicals with log10 KOA values between 4 and 6 (ΔU°OA values between −70 and −50 kJ·mol⁻¹) and for more polar compounds with more complex hydrogen bonding abilities.

Data Availability
All data generated or analysed during this study are included in this published article, its supplementary information files, and in previously published cited works.

Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Ethical Approval This work was funded by the European Chemical Industry Council (CEFIC) through project ECO 41 of the Long-range Research Initiative (LRI).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.