Introduction

Considerable research interest has been devoted to understanding the principal mechanism of reversed-phase (RP) chromatographic separations, which has not been fully elucidated due to its complexity and a multitude of parameters that greatly affect the chromatographic experiment (Poole 2019). Poole (2015) described a qualitative interphase model which aimed at simplifying the complexity of the retention mechanism. He described the active chromatographic separatory area inside the analytical column and segmented it into three main domains: the silica particle surface, the stationary phase (SP), and the bulk mobile phase (MP). At a constant composition of the MP, the SP becomes solvated. Analytes approaching from the bulk MP can interact with the SP in a combination of two retention mechanisms: a partition and an adsorption mechanism (Poole 2019, 2015). The predominant mechanism, by which retention of the analyte occurs, is a function of MP composition, column temperature, condition and age of the SP, and the flow rate (Vailaya 2005). From a modelling perspective, this complexity presents the greatest challenge, as certain simplifications are unavoidable.

Retention prediction and modelling in chromatography

Haddad et al. (2021) recently published an overview of modern approaches to modelling chromatographic retention data. They can be divided into two main classes: statistical and physicochemical modelling. Statistical approaches include quantitative structure-retention relationships (QSRR) (Amos et al. 2018) and design of experiments (DoE) (Sahu et al. 2018), while, physicochemical models encompass solvatophobic models (Horváth et al. 1976; Moldoveanu and David 2015), solvent strength models (Neue and Kuss 2010; Tyteca and Desmet 2015), and linear free-energy relationships (LFER) (Vailaya 2005), which can be further divided into exothermodynamic relations (Ovčačíková et al. 2016), hydrophobic subtraction models (Snyder et al. 2004) and solvation parameter models (Abraham et al. 2004).

Gilar et al. (2020) previously compared the nonlinear and linear solvent strength models and found that the linear model can serve as an estimate for the prediction of analyte retention. The nonlinear model was implemented to try to correct the experimentally observed nonlinear concave behaviour in the \(\ln k_{i}\) versus solvent strength plots. The origin of this nonlinear behaviour in RP liquid chromatography is still not fully understood.

The analytes used in modelling are usually divided into two sets: a training set containing analytes that are directly integrated into the model, and a test set with analytes that are not used to construct the model. The latter serve as data, used to evaluate the generalization error of a model. The training error is determined using the training set. Together, these two metrics provide evaluation about the performance of the model. An ideal theoretical model has a low training error for analytes used in training the model and a low degree of generalization error for analytes added to the model for testing (Spears et al. 2018).

Solubility parameters

Hansen solubility parameters (HSP) are empirical thermodynamic parameters that aim to quantify the notion of “like dissolves like” (Hansen 1967). The core idea of HSP is summarized in Eq. 1, where a clear distinction of three basic molecular interaction types can be seen: dispersion interactions (d), polar interactions (p), and hydrogen bonding interactions (h).

$$\begin{array}{*{20}c} {\delta^{2} = 4\delta_{d}^{2} + \delta_{p}^{2} + \delta_{h}^{2} } \\ \end{array}$$
(1)

The sum of the three partial solubility parameters represents the total solubility parameter \(\delta\), which is defined as the square root of the cohesion energy density i.e. the energy originating from intermolecular interactions divided by the molar volume, Eq. 2, (Hildebrand 1949).

$$\begin{array}{*{20}c} {\delta = \sqrt {\frac{{E_{{{\text{coh}}{.}}} }}{{V_{{\text{m}}} }}} } \\ \end{array}$$
(2)

Hansen distance is a numerical value that quantifies the thermodynamic similarity of two analytes based on the strength of the constituent interaction types. A Hansen distance for two analytes can be calculated from their respective HSP, Eq. 3.

$$\begin{array}{*{20}c} {\lambda_{i,j} = 4\left( {\delta_{d,i} - \delta_{d,j} } \right)^{2} + \left( {\delta_{p,i} - \delta_{p,j} } \right)^{2} + \left( {\delta_{h,i} - \delta_{h,j} } \right)^{2} } \\ \end{array}$$
(3)

HSP’s were developed mainly for predicting the solubility of paints, resins, and polymer components (Hansen 1967). They can be used in the development and study of the mechanical properties of thermoplastic-lignin composites (Zhao et al. 2018b, a). Their use in the pharmaceutical industry includes predicting the solubility of pharmaceutical excipients (Adamska et al. 2007) and the compatibility of biopolymers (Adamska et al. 2016). Sánchez-Camargo et al. have reviewed the use of HPS to predict solubilities for the selection of more environmentally acceptable extraction solvents (Sánchez-Camargo et al. 2019). The work of Schoenmakers et al. (1982), Schoenmakers and de Galan (1981) shows that there is a mathematical relationship between the solubility parameter and the chromatographic activity coefficients that allows a theoretical derivation of retention as a function of solubility parameters for all analytes and phases in chromatographic separations.

Aim of the study

The aim of this study is twofold. First, to formally derive and investigate a novel LFER physicochemical model for modelling and predicting RP chromatographic separations based on Hansen solubility parameters using the experimental retention data obtained with a series of three homologous phthalates. Secondly, to investigate the training error and generalization error of the model using a retention information dataset of chemically different analytes that have a similar aromatic structural component with different functional groups, in order to evaluate the model as a tool for routine retention prediction of analytes.

Results and discussion

Derivation of the model equation

The primary objective of the study is to develop and provide a detailed mathematical and theoretical derivation of the model. The model derivation begins with the mathematical treatment of retention equilibria occurring during RP chromatographic separations. After the analyte, denoted by the index \(i\), is introduced into the MP, the dynamic equilibrium between the activities of the analyte, corresponding to both phases, denoted by \({\mathcal{S}}\) for the SP and \({\mathcal{M}}\) for the MP, is formed, Eq. 4.

$$\begin{array}{*{20}c} {a_{{{\mathcal{S}},i}} \rightleftharpoons a_{{{\mathcal{M}},i}} } \\ \end{array}$$
(4)

The dynamic equilibrium can be described by its equilibrium constant, Eq. 5.

$$\begin{array}{*{20}c} {K_{i} = \frac{{a_{{{\mathcal{M}},i}} }}{{a_{{{\mathcal{S}},i}} }}} \\ \end{array}$$
(5)

The retention factor is a function of the chromatographic equilibrium constant, Eq. 6.

$$\begin{array}{*{20}c} {K_{i} = k_{i} \cdot \frac{{V_{{\mathcal{S}}} }}{{V_{{\mathcal{M}}} }}} \\ \end{array}$$
(6)

Combining Eqs. 5 and 6, with the formal definition for activity being the product of the molar concentration and the activity coefficient, results in Eq. 7.

$$\begin{array}{*{20}c} {k_{i} = \frac{{\gamma_{{{\mathcal{M}},i}} }}{{\gamma_{{{\mathcal{S}},i}} }} \cdot \frac{{n_{{\mathcal{S}}} }}{{n_{{\mathcal{M}}} }}} \\ \end{array}$$
(7)

Schoenmakers et al. (1982), Schoenmakers and de Galan (1981) have previously related the chromatographic activity coefficient \(\gamma_{i}\) with solubility parameters \(\delta\), Eq. 8, with the subscript f indicating the phase.

$$\begin{array}{*{20}c} {\ln \gamma_{f,i} = \frac{{V_{m,i} }}{RT} \cdot \left( {\delta_{i} - \delta_{f} } \right)^{2} } \\ \end{array}$$
(8)

Combination of Eqs. 7 and 8 and replacement of the molar volume and the universal gas constant with the cavitation volume and Boltzmann constant, due to the theoretical treatment of single molecules during the DFT cavitation volume calculations, results in

$$k_{i} = \frac{{\exp \left( {\frac{{V_{\kappa ,i} }}{{k_{B} T}} \cdot \left( {\delta_{i} - \delta_{{\mathcal{M}}} } \right)^{2} } \right)}}{{\exp \left( {\frac{{V_{\kappa ,i} }}{{k_{B} T}} \cdot \left( {\delta_{i} - \delta_{{\mathcal{S}}} } \right)^{2} } \right)}} \cdot \frac{{n_{{\mathcal{S}}} }}{{n_{{\mathcal{M}}} }}$$

and after simplification

$$k_{i} = \exp \left( {\frac{{V_{\kappa ,i} }}{{k_{B} T}} \cdot \left( {\left( {\delta_{i} - \delta_{{\mathcal{M}}} } \right)^{2} - \left( {\delta_{i} - \delta_{{\mathcal{S}}} } \right)^{2} } \right)} \right) \cdot \frac{{n_{{\mathcal{S}}} }}{{n_{{\mathcal{M}}} }}$$

The natural logarithm of ki is presented in Eq. 9.

$$\begin{array}{*{20}c} {\ln k_{i} = \frac{{V_{\kappa ,i} }}{{k_{B} T}} \cdot \left( {\left( {\delta_{i} - \delta_{{\mathcal{M}}} } \right)^{2} - \left( {\delta_{i} - \delta_{{\mathcal{S}}} } \right)^{2} } \right) + \ln \frac{{n_{{\mathcal{S}}} }}{{n_{{\mathcal{M}}} }}} \\ \end{array}$$
(9)

Then Hansen solubility parameters as Hansen distances are introduced

$$\begin{aligned} & \lambda_{{{\mathcal{M}},i}} = \left( {\delta_{i} - \delta_{{\mathcal{M}}} } \right)^{2} \\ & \lambda_{{{\mathcal{S}},i}} = \left( {\delta_{i} - \delta_{{\mathcal{S}}} } \right)^{2} \\ \end{aligned}$$

each encompassing the thermodynamic similarity between the analyte and the chromatographic phase. The Hansen distance for the analyte and SP is trivial to calculate, using the formal definition, Eq. 10.

$$\begin{array}{*{20}c} {\lambda_{{{\mathcal{S}},i}} = 4(\delta_{d,i} - \delta_{{d,{\mathcal{S}}}} )^{2} + (\delta_{p,i} - \delta_{{p,{\mathcal{S}}}} )^{2} + (\delta_{h,i} - \delta_{{h,{\mathcal{S}}}} )^{2} } \\ \end{array}$$
(10)

As the mobile phase is composed of a mixture of water and ACN, the corresponding MP-analyte Hansen distance is more complicated to calculate. It was assumed that HSP of mixtures are equal to the weighted sums of all HSP of pure substances. The molar fraction is used as a weight. Using this, the Hansen distance for the combination of analyte and the MP becomes Eq. 11.

$$\begin{array}{*{20}c} {\lambda_{{{\mathcal{M}},i}} = 4\left( {\delta_{d,i} - \mathop \sum \limits_{j = 1}^{n} x_{j} \delta_{{d,{\mathcal{M}},j}} } \right)^{2} + \left( {\delta_{p,i} - \mathop \sum \limits_{j = 1}^{n} x_{j} \delta_{{p,{\mathcal{M}},j}} } \right)^{2} + \left( {\delta_{h,i} - \mathop \sum \limits_{j = 1}^{n} x_{j} \delta_{{h,{\mathcal{M}},j}} } \right)^{2} } \\ \end{array}$$
(11)

The molar fraction of a bicomponent mixture adds up to unity, thus enabling the transformation of Eq. 11 into Eq. 12, where x denotes the molar fraction of ACN.

$$\begin{aligned} \lambda_{{{\mathcal{M}},i}} & = 4\left( {\delta_{d,i} - x\delta_{d,1} - \left( {1 - x} \right)\delta_{d,2} } \right)^{2} + \left( {\delta_{p,i} - x\delta_{p,1} - \left( {1 - x} \right)\delta_{p,2} } \right)^{2} \\ & \quad + \left( {\delta_{h,i} - x\delta_{h,1} - \left( {1 - x} \right)\delta_{h,2} } \right)^{2} \\ \end{aligned}$$
(12)

Expanding the equation and factoring out the molar fraction, with further algebraical manipulation described in detail in the Additional file 1: section S1, leads to Eq. 13.

$$\begin{array}{*{20}c} {\lambda_{{{\mathcal{M}},i}} = \Lambda_{\alpha } x^{2} + \Lambda_{\beta } x + \Lambda_{\gamma } } \\ \end{array}$$
(13)

By combining the Hansen distances and Eq. 9, we arrive at Eq. 14.

$$\begin{array}{*{20}c} {\ln k_{i} = \frac{{V_{\kappa ,i} }}{{k_{B} T}} \cdot \left( {\lambda_{{{\mathcal{M}},i}} - \lambda_{{{\mathcal{S}},i}} } \right) + \ln \frac{{n_{{\mathcal{S}}} }}{{n_{{\mathcal{M}}} }}} \\ \end{array}$$
(14)

Then the thermodynamic descriptor as Eq. 15 can be defined.

$$\begin{array}{*{20}c} {\Xi_{i} = \frac{{V_{\kappa ,i} }}{{k_{B} T}} \cdot \left( {\lambda_{{{\mathcal{M}},i}} - \lambda_{{{\mathcal{S}},i}} } \right)} \\ \end{array}$$
(15)

With the assumption that the second logarithm in Eq. 14 is constant and by implementing the thermodynamic descriptor, the central theoretical model Eq. 16 is derived.

$$\begin{array}{*{20}c} {\ln {\text{k}}_{{\text{i}}} = \beta_{0} + \beta_{1} {\Xi }_{i} } \\ \end{array}$$
(16)

The model is based on Eq. 16a simple linear regression, with regression coefficients denoted as \(\beta\). The Hansen distances are calculated for each combination of SP, MP composition and analyte type, to obtain unique values of the thermodynamic descriptor. The analyte cavitation volume is calculated using theoretical DFT calculations, as described in the Materials and methods section. The model relates the theoretical thermodynamic descriptor \({\Xi }_{i}\), which encompasses the variables of column temperature, Hansen distances and analyte cavitation volumes, with the experimentally determined retention factor \(\ln k_{i}\).

Statistical analysis and general observations of the model

Plotting the theoretical thermodynamic descriptor \({\Xi }_{i}\) against the experimental \(\ln k_{i}\) reveals a pattern, which is presented in Fig. 1. Each point represents an average of three experimentally determined retention factors.

Fig. 1
figure 1

Plot of experimental \(\ln k_{i}\) values for the training set analytes (DMP, DEP, DBP) on two different SP (C8 and C18) vs. the thermodynamic descriptor \({\Xi }_{i}\)

A clear distinction of the retention data of individual analytes (DMP, DEP and DBP) is observed including the formation of two groups corresponding to the two studied SP (C8 and C18). Furthermore, differently oriented bands of individual analyte within the SP groups are observed. The results indicate the lack of generality of the thermodynamic descriptor \({\Xi }_{i}\) for the complete description of structural variability as different retention behaviours for the investigated analytes are observed. A cumulative model, which includes all the data points, is therefore not theoretically justified. Hence six separate linear regression models were created, one for each analyte and the corresponding SP (singular models) and two cumulative models for all three analytes within the training set, but for separate SP.

Statistical data processing of the results is presented in Table 1 and complemented with regression coefficients presented in the Additional file 1: Table S1.

Table 1 Results of the statistical analysis of the theoretical model. p-values below 0.05 denote a statistical significance of the whole model

Residual analysis was performed to investigate their possible correlations, the results of which are presented in Fig. 2. The correlation between the residual values could be observed as a trend indicated by the yellow line, deviating from the horizontal zero line. The trends have the minima at data points corresponding to the experiments at the MP composition of 60 vol. % of ACN. The cumulative model (Fig. 2d) reveals even more pronounced residual correlations. Thus, according to these observations, the models are not linear, even though a positive p-value test suggests a linear relationship. The same trend was observed in the experiments with C18 SP.

Fig. 2
figure 2

Residual analysis of all models on SP C8. a DMP model, b DEP model, c DBP model, d cumulative model. The yellow line indicates the trend of the residuals

Evaluation of the training error and generalization error

Regardless of the observed nonlinearity, the model’s training error was investigated. The predicted \(\ln k_{i}\) values were compared with experimentally gained values for DMP, DEP and DBP. Relative error and root mean squared error of prediction (RMSEP) of the retention time were calculated (Table 2).

Table 2 Evaluation of the model’s training error

The singular models that included only one analyte underestimated retention times, whereas the cumulative model for all three analytes at selected SP overestimated the retention times with larger deviations.

The evaluation of the generalization error with analytes from the test set revealed even greater differences (Table 3), which minimized the ability to predict retention times for analytes not used in the construction of the model.

Table 3 RMSEP values for the test set analytes

The model can therefore hardly be generalized for retention prediction for chemically diverse analytes that are not intrinsically included in the model.

Comparison with solvent strength models

Unlike conventional retention modelling approaches that use isothermal experiments (Haddad et al. 2021), the HSP-based model includes column temperature as an independent variable used to calculate the thermodynamic descriptor. The residual standard error (RSE) value was calculated for each model (Fig. 3) to compare the novel theoretical model with conventional linear and nonlinear solvent strength models. All metrics of model quality and regression coefficients are presented in Additional file 1: Tables S2–S4. Data obtained at 25 °C were used, except for the model that included all three column temperatures. The solvent strength models achieve a lower RSE value for singular analyte models. Their RSE values for cumulative models are much higher than the values of the HSP model. The HSP model is therefore more suitable compared to the conventional solvent strength model when several analytes are modelled simultaneously. The low values of RSE for singular solvent strength models and their higher values for the cumulative models could indicate a greater degree of overfitting of the retention data.

Fig. 3
figure 3

A comparison of the RSE values for the investigated models. NLSS: nonlinear solvent strength model, LSS: linear solvent strength model, HSP_1T: HSP model with one temperature, HSP_3T: HSP model with three temperatures

Exploring beyond the theoretical model

Nonlinearity was further investigated with multiple linear regression. In addition to the thermodynamic descriptor, the column temperature and the total adsorption of ACN on the C18 SP, based on data from Buntz et al. (Buntz et al. 2012), were used as independent variables. The nonlinear model is described by Eq. 17.

$$\begin{array}{*{20}c} {\ln k_{i} = \beta_{0} + \beta_{1} \Xi_{i} + \beta_{2} A + \beta_{3} T } \\ \end{array}$$
(17)

where A denotes the total molar adsorption of ACN on the SP. Since data for the total adsorption of ACN was only available for the C18 and not for the C8 SP, the models were built only for data collected with the corresponding C18 column.

The results of the statistical analysis of the multiple regression model, presented in Table 4, reveal an improvement in the statistical regression measure of quality, mainly as a decrease in the RSE values compared to values presented in Table 1. Regression coefficients are reported in the Additional file 1: Table S5. A consecutive statistical residual analysis (reported in the Additional file 1: Figure S1) revealed a nonlinear correlation of the residuals, indicating that the independent variables used in this model do not fully describe the chromatographic separation model, analogous to the theoretical model equation using a simple linear regression.

Table 4 Results of the statistical analysis of the multiple linear regression model. p values below 0.05 denote a statistical significance of the whole model

A statistical investigation of the experimental-only retention data provided insight into the observed nonlinear anomaly. The differences in retention between the three temperatures at a fixed MP composition are presented in Fig. 4. First, a general trend is observed in the effect of temperature and MP composition on retention. Increasing the ACN content in MP decreases the extent to which a change in temperature affects the \(\ln k_{i}\) for all three analytes in the training set, which is evident from the interpolated trends. Second, an anomalous retention is observed at 60 vol. % of ACN. The anomaly coincides with the minima observed in simple linear and multiple linear regression models. The trends for DMP and DEP are similar, while DBP shows a slightly different pattern with increasing ACN content. The general trend is observed for both investigated SP.

Fig. 4
figure 4

The differences in the experimental retention data calculated at constant mobile phase compositions and three different temperatures on both investigated SP

Conclusion

In this paper, we present a novel theoretical model for predicting the retention of analytes in RP chromatographic separations, together with a complete mathematical derivation of the central model equation. To date, this is the first study to use Hansen solubility parameters as possible thermodynamic descriptors for predicting retention behaviour in RP HPLC. We have found that Hansen solubility parameters cannot satisfactorily describe multiple chemical interactions that occur in RP chromatographic separations. The theoretical model combining HSP with theoretically derived solute cavitation volumes is not applicable for the routine prediction of analyte retention times, due to an unfavourable training and generalization error. Moreover, the assumed linear relationship proves to be statistically unjustifiable. Incorporating the thermodynamics of dissolution and mixing, as suggested by Louwerse et al. (Louwerse et al. 2017), could lead to the development of a more accurate HSP based theoretical model. Although, a comparison with conventional solvent strength models reveals that the HSP model is better at simultaneously predicting the retention of multiple analytes with only one regression model. This represents a shift from various model development methods in which multiple regression equations corresponding to each analyte under study are created to describe retention behaviour (McEachran et al. 2018). A multiple linear regression model was also performed, including additional independent variables such as column temperature and total adsorption of ACN to the SP, to investigate the origin of the nonlinear behaviour deviating from the theory. The nonlinearity was still observed. To verify that the observed anomaly is not an artefact of the model, the numerical differences in the experimental retention data were calculated. A correlation was found between change in MP composition and temperature on retention. The anomaly at the MP composition of 60 vol. % ACN is again observed, raising an open research question about the origin of the surveyed nonlinear anomaly.

Materials and methods

Chemicals

Diethyl phthalate (DEP) (99.5%), dimethyl phthalate (DMP) (≥ 99%), dibutyl phthalate (DBP) (99%), 3-methylbenzoic acid (99%), 2,4,6-triiodophenol (97%), uracil (99%) and benzaldehyde (≥ 99%) were purchased from Sigma-Aldrich (Germany) and vanillin (99%) from Merck (Germany). Acetonitrile (ACN) (Fischer Scientific, UK, ≥ 99.9%) and deionised water, purified with the Mill-Q® system, were used as chromatographic solvents.

1 mg mL−1 stock solutions of DMP, DEP, DBP, benzaldehyde, 2,4,6-triiodophenol, 3-methylbenzoic acid and vanillin were prepared in ACN. A 1 mg mL−1 stock solution of uracil was prepared in deionised water. 10 mg L−1 solutions of investigated analytes were prepared by dilution with ACN. To each solution, uracil was added at a concentration of 10 mg L−1, which was used as a marker for the void time. Its determination using uracil was previously described as an acceptable estimation technique (Bidlingmeyer et al. 1991; Rimmer et al. 2002).

Instrumental conditions

The HPLC experiments were carried out on an Agilent Technologies system 1100 with degasser, quaternary pump, auto-sampler, column thermostat and a diode array detector. Two RP columns were used; YMC Triart C18 (150 × 4.6 mm, 5 μm, pore size of 12 nm) and YMC Triart C8 (150 × 4.6 mm, 5 μm, pore size of 12 nm). All chromatographic separations were isocratic, with a constant flow of 1 mL min−1. Absorption spectra were recorded with a diode array detector in the spectral range from 210 to 400 nm at a collection frequency of 5 Hz.

Experimental design for data acquisition

Retention data for uracil, DMP, DEP and DBP using the solution in ACN were collected using a full factorial experimental design by varying the column temperature (25, 35 and 45 °C), MP composition (40, 50, 60, 70, 80 vol. % ACN) and the SP (C8 and C18). Each data point was collected three times, by repeating the full factorial experimental sequence.

The solutions of analytes in the training and test sets were used in HPLC experiments at 25 °C on the C8 SP at 65, 75 and 85 vol. % of ACN in the MP for the collection of data, used in the investigation of the training error and the generalization error of the model. All retention data are presented in the Additional file 1: Tables S6–S12.

Ab initio density functional theory calculations

Density functional theory (DFT) calculations were implemented using GAMESS (version R2 September 30, 2020 for Microsoft Windows) (Barca et al. 2020) and the graphical interface software Winmostar (Winmostar—Student V10.1.3 for 64bit Windows). The B3LYP hybrid functional was used. The geometry was preliminarily optimized using a smaller 3-21G* basis set, followed by an aug-cc-pVTZ basis set for the final optimization. Water was simulated using the polarizable continuum model (Mennucci 2012) for the determination of solute cavitation volumes. The calculated analyte cavitation volumes are listed in the Additional file 1: Table S13.

HSP calculation

Analyte HSP were calculated according to the group contribution method described by Stefanis and Panayiotou (2012, 2008). Utilizing this technique, each analyte molecule is broken down into discrete first-order molecular fragments i.e. aromatic ArC-H, methyl -CH3 and ester COO fragments. After identifying all first-order molecular fragments, second-order fragments can be identified. These include resonance stabilized functional groups i.e. aromatic ester ArCCOO fragments. Once a molecule is described by first- and second-order fragments, the dispersion, polar and hydrogen bond contributions for each fragment are summed. These sums are consecutively modified by adding terms, derived by regression of a large set of experimentally determined HSP parameters. n-octane and n-octadecane were used as SP approximations and their HSP were calculated using the same method. The calculated HSP are presented in the Additional file 1: Table S14. HSP for deionised water and ACN, Additional file 1: Table S15, were collected from tabulated experimental data, provided by Hansen (2007).

Statistical analysis and model construction

All statistical analyses were implemented using R (ver. 4.0.2) within Rstudio (ver. 1.3.1056). R was also used in the production of the figures. The retention prediction model was implemented using Python (ver. 3.6.1) within PyCharm (2019.3.5 Professional Edition).