FormalPara Take-home message

This multicenter external evaluation in critically ill patients provides a selection of different pharmacokinetic models for piperacillin as promising candidates for successful clinical use following implementation in an open-access dosing software. The selected models should be used in combination with timely therapeutic drug monitoring (Bayesian forecasting) to achieve most reliable and precise optimization of individual antibiotic therapy.

Introduction

Infections increase the length of stay and mortality of patients in the intensive care unit (ICU) [1, 2]. Timely administration of antibiotics and dosing that results in effective exposure are of paramount importance to treatment outcomes [3,4,5]. However, critically ill patients commonly exhibit pathophysiological changes, altering the volume of distribution (Vd) and the elimination of antibiotics, making dosing and pharmacodynamic (PD) target attainment challenging [6,7,8]. While excessive drug/multidrug concentrations might cause life-threating toxicity, subtherapeutic concentrations increase the risk of treatment failure and emerging antibacterial resistance [9, 10]. To avoid these scenarios, individualized dosing approaches are urgently needed [11].

Piperacillin (PIP) is the most frequently prescribed β-lactam antibiotic in German ICUs [12]. Its time-dependent antibacterial activity is characterized by the time (T) during which free (f) concentrations exceed the minimum inhibitory concentration (MIC) of the pathogen (fT>MIC) [13]. Standard doses of PIP have been reported to result in poor target attainment [14, 15]. Dose adjustments using traditional therapeutic drug monitoring (TDM) have therefore been investigated. Hagel et al. showed increased, but still poor attainment of target concentrations during continuous infusion (CI) in 249 critically ill patients (37.3 vs. 14.6%, odds ratio (OR) 4.5, 95% confidence interval 2.9–6.9, p < 0.001) [16]. Along with others, the authors subsequently proposed to investigate the additional benefit of using dosing software [16, 17]. Model-informed precision dosing (MIPD) is a promising predictive technology that supports dosing and combination with TDM results. It is available through dosing software and, thus, convenient to use [18,19,20]. Prior to widespread bedside use, a thorough evaluation of the population pharmacokinetic (PopPK) model(s) underlying the dosing software is strongly recommended by the United States (US) Food and Drug Administration [21], and by an expert panel defining research priorities towards antibiotic precision dosing [11]. Similarly, Cotta et al. called for “externally validated and clinically appropriate PopPK models” in their presentation of “ideal characteristics of MIPD software in the ICU” [22].

Numerous PopPK models have been developed for PIP, most of which are based on small monocentric studies [23] and may not be generalizable to other populations. A recent evaluation of six PIP models in 30 ICU patients receiving CI demonstrated large inter-model variability regarding predictability [24]. The transferability of these results to other populations is uncertain due to the limited number of patients and the monocentric setting. Furthermore, a clinically oriented model assessment in conjunction with TDM (Bayesian forecasting) was lacking [25]. The aim of the present study was to evaluate the predictive performance of available PIP PopPK models with and without TDM using an external multicenter dataset to facilitate model selection for MIPD in critically ill patients.

Methods

Evaluation dataset

Clinical data of 561 ICU patients treated with PIP (3654 samples) were available from four previous studies including eleven different German centers [16, 26,27,28]. Details are described in electronic supplementary material (ESM_Main, page 2) and summarized in Table 1.

Table 1 Demographic and clinical characteristics of patients in the evaluation dataset

Evaluation of population pharmacokinetic models for piperacillin

The systematic literature review in PubMed, including search terms and a detailed flowchart of PopPK model screening, as well as the software used for (i) model reconstruction, (ii) prediction of concentrations, and (iii) output processing, is presented in ESM (ESM_Main, page 3–5, Fig. S1). An overview of the twenty-four identified models and underlying studies is provided (ESM_Main Tables S1–2). The report was guided by the TRIPOD checklist [31]. If models were based on free PIP concentrations (PIPunbound), 70% of the total PIP concentrations (PIPtotal) available in the evaluation dataset were assumed for model assessment [24, 32]. Three prediction scenarios were examined for each PK model, two of which investigated a Bayesian approach, i.e., the combination of PopPK model and TDM results based on the Bayes’ theorem:

  1. (1)

    A priori (A): prediction of all PIP concentrations based on dosing history and patient covariates only (npredicted_samples_A = 3654)

  2. (2)

    Bayesian 1 (B1): prediction of concentrations considering the first TDM sample for each patient (npredicted_samples_B1 = 3093)

  3. (3)

    Bayesian 2 (B2): prediction of concentrations considering the first and second TDM sample for each patient (npredicted_samples_B2 = 2532).

By considering TDM data, model parameter variability is taken into account to refine the initial PopPK parameters for the individual patient. The integrated samples for Bayesian forecasting were different, however, the large number and wide range of concentrations increased the comparability of predictions.

Model evaluation included statistical and graphical methods, employing prediction- and simulation-based diagnostics as recommended [25]. First, numerical comparisons between predicted and observed PIP plasma concentrations (cpred and cobs) served to quantify model-specific relative prediction errors (rPE) based on the total evaluation dataset and additionally for each individual study. The median relative prediction error (MPE) and median absolute relative prediction error (MAPE) were calculated to reflect accuracy and precision.

$${\mathrm{rPE}}_{\mathrm{ij}} =\frac{{\mathrm{ c}}_{\mathrm{pred},\mathrm{i},\mathrm{j}}-{\mathrm{c}}_{\mathrm{obs},\mathrm{i},\mathrm{j}}}{{{ (\mathrm{c}}_{\mathrm{pred},\mathrm{i},\mathrm{j}}+\mathrm{c}}_{\mathrm{obs},\mathrm{i},\mathrm{j}})/2},$$
(1)

i: individual patient, j: measurement.

$$\mathrm{MPE }\left[\mathrm{\%}\right]=\mathrm{ median}(\left\{{\mathrm{rPE}}_{\mathrm{1,1}}, \dots , {\mathrm{rPE}}_{\mathrm{i},\mathrm{j}}\right\})*100,$$
(2)
$$\mathrm{MAPE }\left[\mathrm{\%}\right]=\mathrm{median}\left(\left\{{|\mathrm{rPE}}_{\mathrm{1,1}}|, |\dots |, |{\mathrm{rPE}}_{\mathrm{i},\mathrm{j}}|\right\}\right)*100.$$
(3)

To facilitate clinical interpretation, an illustration based on the rPE (A/B1/B2) translated to PIP concentrations was performed:

$${\mathrm{c}}_{\mathrm{rPEi},\mathrm{j}}={\mathrm{c}}_{\mathrm{target}}-\left({\mathrm{c}}_{\mathrm{target}}*{\mathrm{rPE}}_{\mathrm{i},\mathrm{j}}\right).$$
(4)

The target concentration (ctarget) was exemplarily defined as 64 mg/L, which is related to the 4xMIC-epidemiological cutoff (ECOFF) for Pseudomonas aeruginosa according to the European Committee on Antimicrobial Susceptibility Testing (EUCAST). Density plots illustrate the range of the expected concentrations; i.e., the deviations from ctarget due to model-specific inaccuracy and imprecision are simultaneously reflected according to the identified rPE. A symmetric target range of 32–96 mg/L (2-6xMIC-ECOFF) was considered clinically acceptable and attainment rates (%) were quantified.

Furthermore, the evaluation was stratified by CI and II to account for different PIP infusion regimens (studies 1–3: CI, nsamples_CI = 1819; study 4: II, nsamples_II = 1835).

Apart from numerical evaluation, cpred and cobs were compared graphically using goodness-of-fit (GOF) plots and prediction-corrected visual predictive checks (pcVPC) (nsimulations = 1000) [33].

The models that best represented the total evaluation dataset numerically and graphically were used to further investigate factors influencing the predictive performance:

  1. (1)

    To explore the precision of Bayesian predictions as a function of time after integration of TDM data, the absolute rPE (B1/B2) vs. time after TDM (TaTDM) was assessed by descriptive statistics and the Kendall rank correlation coefficient (τ). This analysis was performed using studies 1 and 2, as these provided observation periods > 7 days (489 patients, CI, npredicted_samples_B1/B2 = 1306/817).

  2. (2)

    To quantify the differences between (i) males vs. females, and (ii) non-renal replacement therapy (non-RRT) vs. RRT patients, point-biserial correlations (rpb_corrected) were calculated between the rPE (A) and each binomial group.

Besides, all 24 models underwent independent subgroup evaluations using the external dataset stratified by sex and RRT, and the above-described numerical analysis and GOF plots.

Results

Predictive performance of population pharmacokinetic models for piperacillin

The predictive performance markedly varied between all models (total dataset, A/B1/B2: MPE − 135.6–78.3/− 77.3–72.3/− 55.8–65.3%; MAPE 35.7–135.6/26.2–77.7/23.8–67.3%). Overall, data of each study population were predicted similarly to the pooled dataset, except that centers with CI performed slightly better in the model evaluation than that with II (see below). For most models, a common predictive trend (e.g., MPE < / > 0) and comparable performance per study site was observed (ESM_Main Tables S3–5). Figure 1 illustrates the accuracy and precision of the models based on the total dataset. Differences in performance were most evident for population predictions (A). Underestimation (MPE < 0; nmodels_A/B1/B2: 17/15/11) was more frequent than overestimation (MPE > 0; nmodels_A/B1/B2: 7/9/13). Consideration of TDM samples improved predictions for all models (ESM_Main Table S5). Models developed merely based on ICU patients did not generally appear superior to models without underlying ICU populations. However, taking together minimum inaccuracy (MPE) and imprecision (MAPE), the models by Kim (2022) [34], Klastrup [35] and Udy et al. [36], all originating from ICU patient data, performed best (A/B1/B2: |MPE|< 13/7.7/7.5%; MAPE < 37.5/28.0/28.4%). Graphical GOF plots agreed with these numerical results.

Fig. 1
figure 1

Predictive performance of the evaluated population pharmacokinetic models. Median relative prediction error (MPE) [%] as a measure of accuracy and median absolute relative prediction error (MAPE) [%] reflecting precision of the predictions, comparing the observed plasma piperacillin (PIP) concentrations for each model; A priori (A): Prediction of all concentrations based on dosing history and patient covariates only; Bayesian 1 (B1): Prediction of concentrations considering the first therapeutic drug monitoring (TDM) sample for each patient; Bayesian 2 (B2): Prediction of concentrations considering the first and second TDM sample for each patient; models were ordered (from left to right) according to the model-specific sum (A + B1 + B2) of (absolute) MPE and MAPE values, respectively

Based on the rPE of the three best-predicting models, Fig. 2 illustrates the range of expected concentrations targeting 64 mg/L (see ESM_Main Fig. S2 for all model results). A symmetric range with a median concentration near ctarget indicates high predictive performance, with improved model precision and accuracy associated with more concentrations approaching 64 mg/L. Over- or underestimation is reflected by the frequency of out-of-range concentrations. Overall, higher target range attainment rates were observed for Bayesian approaches, and the model by Kim et al. showed the best result for the pooled dataset (A/B1/B2: 62.7/77.6/80.6%).

Fig. 2
figure 2

Expected target attainment of the population pharmacokinetic models by Kim, Klastrup and Udy et al. by translating the relative prediction errors for the total evaluation dataset to piperacillin plasma concentrations. A priori (A): Prediction of all concentrations based on dosing history and patient covariates only; Bayesian 1 (B1): Prediction of concentrations considering the first therapeutic drug monitoring (TDM) sample for each patient; Bayesian 2 (B2): Prediction of concentrations considering the first and second TDM sample for each patient; red dashed lines represent the lower and upper limits of the target concentration range; percentages indicate the proportion of predicted concentrations within (black color) and outside (red color) the target range

The model evaluation revealed overall better predictions for data collected during CI compared to II (A: |MPE|median_all_models: 27.8 vs. 35.7%; MAPEmedian_all_models 46.2 vs. 50.9%). Integration of a second TDM sample (B2) substantially improved the predictions for II, but not for CI (ESM_Main Table S5). The model by Klastrup et al. showed the best numerical accuracy and precision for CI but was inferior to Kim and Udy et al. regarding II. With respect to the 95% confidence intervals of MPE and MAPE, however, the performance of the latter models partially overlapped (ESM_Main Fig. S4). GOF plots and pcVPCs confirmed the numerical evaluation results (ESM_Main Tables S3–5, Figs. S3–6).

A significant, albeit small positive correlation between the precision of Bayesian predictions and TaTDM was detected, i.e., higher imprecision given longer time (Kim, Klastrup, Udy et al.: τB1 = 0.08, 0.11, 0.13; τB2 = 0.10, 0.10, 0.10; pB1/B2 < 0.001). Comparison of MAPE values vs. TaTDM further indicated decreasing precision (e.g., Udy et al. B1: MAPE24h/MAPE168h: 17.4/29.9%; B2: MAPE24h/MAPE168h: 22.3/40.2%, Fig. 3). Moreover, higher imprecision and underestimation were observed in (i) females compared to males (Kim/Klastrup/Udy et al.: rpb_corrected = − 0.01/− 0.09/− 0.08; p = 0.79/ < 0.001/ < 0.01) and (ii) patients undergoing RRT (Kim/Klastrup/Udy et al.: rpb_corrected = − 0.35/− 0.26/− 0.17; p < 0.001), as detailed in ESM_Main Fig. S7, Table S6.

Fig. 3
figure 3

Precision of the models by Kim, Klastrup and Udy et al. as a function of time since the last therapeutic drug monitoring. Absolute relative prediction error (Absolute rPE) [%] of Bayesian predictions versus time interval [h] since the last considered therapeutic drug monitoring (TDM) sample; values based on the pooled evaluation dataset of studies 1 and 2; Bayesian 1 (B1): Prediction of concentrations considering the first TDM sample for each patient; Bayesian 2 (B2): Prediction of concentrations considering the first and second TDM sample for each patient; n: number of binned concentrations for each boxplot (binning: {24, 48, 72…} + −12 h); boxplots represent the 25th, 50th and 75th percentiles, whiskers the 1.5-fold interquartile range; red dashed lines: linear regression through binned median absolute prediction errors (MAPEs)

Stratified subgroup evaluations of all 24 models are presented in ESM_Subgroup Figs. S1–5, Tables S1–3. Separation by sex revealed overall higher accuracy and precision for men compared to women (A: |MPE|median_all_models: 28.7 vs. 35.8%; MAPEmedian_all_models 46.7 vs. 49.3%); however, the difference was not apparent in Bayesian predictions. Notably, the model by Kim et al. performed best in women (A/B1/B2: MPEfemale − 11.6/− 6.4/0.8%; MAPEfemale 39.3/29/24.5%). Stratification by non-RRT/RRT patients disclosed more models underestimating PIP concentrations in the RRT group (nA/B1/B2: 19/19/14), and predictions appeared worse overall (A: |MPE|median_all_models: 28.7 vs. 47.7%; MAPEmedian_all_models 47.6 vs. 53.7%). While the model by Kim et al. was favorable in non-RRT patients (A/B1/B2: MPE − 1/− 2.4/1.6%; MAPE 34.7/27.2/23.8%), the model by Roberts et al. (2015) [37] produced the most accurate and precise population predictions in RRT patients (A: MPE 0.8%; MAPE 33.3%). When considering TDM data, adequate predictions for RRT patients were achieved with several models (e.g., Tamme et al. [38], B1/B2: MPE 0.6/− 2.9%; MAPE 22.1/20.9%). Further subgroup analyses (e.g., female + RRT) are available in ESM (ESM_Subgroup Figs. S3–5, Tables S1–3).

Taken together all evaluation results, a flowchart showing the best-performing models for each subgroup and clinical target attainment within our external data is presented (ESM_Subgroup Figs. S6–7). The candidate models were then implemented in the open-access TDMx dosing software (www.TDMx.eu), which allows to explore PIP MIPD scenarios relevant for clinical practice (Fig. 4).

Fig. 4
figure 4

Screenshot of the TDMx dosing software for piperacillin (open access web-browser application) with sample patient data (left), an excerpt from the dosing record (bottom), and visualization of a priori (blue line) and Bayesian (orange line) predicted piperacillin concentrations based on the underlying population pharmacokinetic model by Kim et al. Concentration–time profile (top) and probability of target attainment over time (bottom) are displayed in the diagram

Discussion

This study investigates the external validity of 24 PopPK models for PIP in a multicenter cohort of 561 ICU patients and highlights MIPD candidates appropriate for specific situations and patient groups (e.g., CI/II administration, male/female, RRT/non-RRT), considering a priori and Bayesian predictions.

Acceptable values for model accuracy (e.g., |MPE|< 20%) and imprecision (e.g., MAPE < 30%), originating from a propofol PK model evaluation by Miyabe-Nishiwaki and their feasibility for models of time-dependent antibiotics have been debated [24, 39]. However, alternative thresholds to assess the suitability of a model are lacking. The highest predictive performance within our external dataset was observed for the models by Kim, Klastrup, and Udy et al., but none of these achieved the above-mentioned thresholds for a priori predictions (A). When considering TDM measurements, however, several models met the defined thresholds, highly suggesting the combination of model-based therapy and TDM. Similar to our work, Chai et al. recently assessed the accuracy of the ID-ODS dosing software using external data from 75 critically ill patients, concluding overall improved predictive performance/dosing using Bayesian forecasting [40]. While our study showed no substantial benefit of considering two vs. one TDM sample for predicting CI data, two samples were superior to one for II data. However, Bayesian precision has been shown to decrease over time, suggesting the consideration of timely TDM samples during CI. Fixed time intervals cannot be recommended based on our data and require further investigation. Continuous drug monitoring using biosensors with direct feedback of real-time antibiotic concentrations to PK models represents a future scenario, enabling fully automated closed-loop techniques between drug sensor, dosing software, and self-adaptive infusion pumps [41,42,43].

Our evaluation results are largely consistent with a previous study conducted in France using external monocentric data from 30 critically ill patients [24]. Despite methodological differences, the authors also recommended the models by Udy and Klastrup et al. as potential candidates for MIPD (Kim et al. was not published at that time), making our results likely transferable to other centers/countries. In our evaluation, the model by Udy et al. best predicted data collected during II and Klastrup et al. best predicted CI data, which seems plausible since the models were built upon II and CI data, respectively. Both models showed similar performance to that of Kim et al., which revealed most appropriate for the pooled dataset (CI + II), also due to less gender bias and superior performance in non-RRT patients.

Our analyses demonstrated that PIP predictions were overall more accurate and precise for males, presumably as women were underrepresented in all datasets underlying the 24 investigated models (ESM_Main Table S1). The PK parameters estimated (e.g., Vd, CL) may, thus, have been driven by men. A smaller hydrophilic Vd due to a relatively lower muscle mass and a higher proportion of adipose tissue may explain higher/accumulating concentrations in women. Moreover, renal clearance is physiologically about 10% lower than in men [44]. Although some models accounted for sex by integrating renal function using the Cockcroft-Gault (CG) equation [30], this did not appear sufficient to avoid differences in predictions. RRT patients showed considerably worse predictions than non-RRT patients (A). Some models were built upon data excluding RRT patients (e.g., Klastrup and Udy et al.) or were derived from predominantly non-RRT patients (e.g., Kim et al.: 79%), which may explain lowered predictive performance. In contrast, some models were developed exclusively based on RRT patients (ESM_Main Table S1); however, only the model by Roberts et al. (2015) performed well regarding a priori predictions for RRT patients. RRT is associated with highly variable patient PKs due to multiple influences like residual diuresis, type of RRT, dialysis membrane, duration, and intensity settings [41]. Unfortunately, RRT-related variables were not covered by the evaluation dataset, which precluded a detailed exploration of predictions in RRT patients.

A recent randomized controlled trial by Ewoldt et al. investigated a potential benefit of MIPD in 388 critically ill patients receiving ciprofloxacin and β-lactams [45]. Such investigations are urgently needed to link an impact of novel dosing strategies to clinical outcomes. Surprisingly, the authors did not find increased target attainment when using a commercial MIPD software. The applied model for PIP (Andersen et al. [46]) was developed based on clinical data from 22 septic non-ICU patients and published evidence of its transferability to external populations seems lacking. In fact, our study revealed overestimation and mediocre performance of the respective model in the investigated critically ill patients (A/B1/B2: MPE 24.9/18.6/22.8%; MAPE 43.5/34.6/36.9%). Thus, it can only be speculated whether model selection had a relevant influence on the study results, also because the trial included several β-lactams, whereas our study included only PIP [22, 47, 48]. However, both studies combined highlight the key role of externally evaluating PopPK models prior to clinical implementation.

Some limitations of this study shall be acknowledged. First, external evaluation depends on the quality of the evaluation dataset and clinical collection may be error prone, potentially distorting model predictions [49]. To limit this shortcoming, our combined dataset was carefully reviewed and each model assessment was conducted identically to enhance inter-model comparability. Second, although this was a multicenter study, all patients were admitted to national ICUs (Germany), and the evaluation for II dosing is based on a single monocentric trial, limiting extrapolation to other settings. Third, we only included parametric models and cannot exclude similar or better performance of models built with less common modeling approaches. Fourth, our evaluation considered PIP concentrations only; however, recent studies support a holistic PIP/tazobactam stewardship, particularly in severe cases of sepsis. Considering tazobactam may improve target attainment and antibacterial effectiveness for both drugs simultaneously, and future evaluation of a combined model would, thus, be desirable [24, 50]. Fifth, the clinical dataset covered PIPtotal, whereas some models were developed based on PIPunbound. We adjusted the available concentrations according to the protein binding stated by the manufacturer (30%) [24]. However, the unbound fraction may vary in critically ill patients, and possible skewing of the results can therefore not be excluded [27, 32, 51]. Sixth, our study exemplarily followed a recommended PIP ctarget of 64 mg/L, but for β-lactams “the optimal PK/PD target remains debated” according to Novy et al. [47]. A range of 32–96 mg/L was considered acceptable. Lower concentrations are at risk of being ineffective due to variable protein binding and insufficient penetration at the infection site, while higher concentrations are associated with significantly increased 28-day mortality [16]. Seventh, some minor covariate adjustments were necessary due to unavailable covariates (e.g., eGFR instead of measured CLCR) (ESM_Main Table S1). As several characteristics are included in the CG equation [30], a multivariate positive effect on model performance has been discussed [24]. Last, external evaluations (and model selection) include a subjective component, yet we followed a systematic approach with defined criteria to make the evaluation as objective as possible.

In conclusion, the predictive performance of published PopPK models for PIP in critically ill patients varied remarkably. Selecting an appropriate model is essential for high-quality MIPD. Models should be combined with TDM as soon as possible to improve predictability. Studies on the prospective application of MIPD of PIP at the bedside using Bayesian dosing software like TDMx is desired to assess a clinical benefit for patients.

Finally, the authors would like to emphasize that this evaluation is not intended to discredit any PK model, but rather to extrapolate potential candidates for cross-center clinical application in critically ill patients, to move from conventional dosing of PIP toward a more individualized approach.