Introduction

In bioequivalence (BE) studies with pharmacokinetic (PK) endpoints (for generics), or PK similarity studies (for biologicals), we aim to compare the exposure after administration of different drug formulations by comparing two PK parameters of interest: the area under the curve (AUC) of the plasma concentration as a function of time, and the maximal concentration (\(C_{max}\)).

BE studies are an essential part of drug development and still an active research field. Currently, a key science and research priority at the U.S. Food and Drug Administration (FDA) is to “improve quantitative pharmacology and BE trial simulation to optimise the design of BE studies for generic drug products and establish a foundation for model-based BE study designs” [1].

The classical statistical test used to assess BE is the Two One-Sided Tests (TOST) procedure proposed by Schuirmann in 1987 [2]. It consists of two one-sided t tests on the PK parameters of interest, comparing the estimated treatment effect to a threshold \(\delta\). The FDA as well as the European Medicines Agency (EMA) fix the equivalence limits to \(log(0.8)\) and \(log(1.25)\), i.e., \(\delta =log(1.25)\) [3, 4].

FDA and EMA recommend estimating BE treatment effects via non-compartmental analysis (NCA) for both crossover and parallel study designs [3, 4]. However, assessment of PK equivalence may be challenging for PK BE studies with sparse sampling, such as in participants receiving ophthalmic or oncology drug products. PK BE studies for ophthalmic drug products typically involve a sparse design with one sampling time point per subject (or per treatment group per subject in a crossover design). In such studies, FDA recommends that BE be assessed using a non-parametric bootstrap NCA-based approach or a parametric method [5, 6]. This type of sparse study design may be useful for certain drug products or may result from study interruptions due to the COVID-19 pandemic or other causes.

An alternative proposed by Dubois et al. [7] is to use a model-based (MB) approach, relying on the empirical Bayes estimates (EBE) of the individual parameters of a non-linear mixed effects model instead of NCA parameters. They showed that this method leads to an increase in type I error when the EBE shrinkage is above 20%, which is frequent with sparse designs. Dubois et al. [8] also proposed a MB approach, this time based on inference on the population parameters. They showed that this MB approach works as well as the NCA on rich designs and can be applied to sparser designs. Currently, it is unclear when MBBE methods would be preferred over traditional BE approaches. As such, FDA has actively supported research focused on MBBE approaches for PK BE studies with sparse designs [9,10,11]. Indeed, MB tests can lead to an inflation of the type I error because of an underestimation of the standard error (SE) of treatment effects on sparse designs in the presence of large variability, which led Loingeville et al. to propose and evaluate methods for correcting the standard errors in MB studies [10]. Shen et al. [12] also proposed a MB alternative to traditional BE tests. In this MBBE approach, rich individual PK profiles are simulated from the model and NCA is performed to estimate individual AUC and \(C_{max}\) values. Since the TOST was based on individual predicted values, the authors assessed distributional assumptions.

MB approaches involve the selection of a PK model to fit the data, which raises the question of the impact of model misspecification on the results of the equivalence tests.

In this study, we define a ”sparse” design as any study with only a few sampling points per subject, which challenges the identifiability of the model; the sparse nature of the data thus depends on the complexity of the model of interest.

Our work was based on data collected during the development of gantenerumab, a monoclonal antibody for the treatment of Alzheimer’s disease. As this drug has a very long half-life, the clinical trials were conducted using a parallel design with more than 13 weeks of follow-up, rather than the crossover design classically used in PK equivalence studies.

In this real case, we compared the PK data gathered in participants treated with two formulations of gantenerumab. Then, we evaluated the performance of the MB approach on simulations based on data from this study and assessed the impact of study design, model misspecification, and the relevance of a model selection step. Although this assessment was based on PK data from a monoclonal antibody, our novel method may potentially be used to evaluate BE studies in generic drug development when there is sparse PK sampling.

We first present the theoretical background, i.e., the NCA and MB approaches for equivalence TOST tests. We then describe the observed data, the methodology used to analyse them and the results of this real case study. We finally present the design, methods and results of the simulation study, and discuss our findings in the last section.

Theoretical background

Two One-Sided Tests

Showing the PK equivalence of two drug formulations, one reference (R) and one test (T), means showing their exposure is equivalent.

In PK BE studies, drug exposure is typically characterised by two PK parameters derived from the plasma concentration versus time profiles: the Area Under the Curve (AUC), which can be computed from 0 to the last sampling point (\(AUC_{tlast}\)) or extrapolated to infinity (\(AUC_\infty\)), and the maximum plasma concentration (\(C_{max}\)). Treatment effects on AUC and \(C_{max}\), namely \(\theta _{AUC}\) and \(\theta _{C_{max}}\), are defined as the difference of the expectations of the log individual values of these variables under the test and reference treatments. For instance:

$$\begin{aligned} \theta _{AUC}={\mathbb {E}}(log(AUC_T)) - {\mathbb {E}}(log(AUC_R)) \end{aligned}$$
(1)

Since we wish to reject the assumption that the two formulations have different exposures, we write the null hypothesis as [2]:

$$\begin{aligned} H_0: \{ \theta \le - \delta \text { or } \theta \ge \delta \} \end{aligned}$$
(2)

where \(\delta\) is the tolerance. The regulatory guidances for equivalence studies fix the threshold \(\delta =log(1.25)\) [3, 4].

By decomposing this null hypothesis in two, we perform Two One-Sided Tests (TOST):

$$\begin{aligned} H_{0, -\delta }: \{ \theta \le - \delta \} \text { and } H_{0,\delta }: \{ \theta \ge \delta \} \end{aligned}$$
(3)

The two one-sided null hypotheses are rejected at \(\alpha =5\%\) if:

$$\begin{aligned} Z_{-\delta } = \frac{\theta + \delta }{SE(\theta )} \ge q_{1-\alpha } \text { and } Z_\delta = \frac{\theta - \delta }{SE(\theta )} \le q_\alpha \end{aligned}$$
(4)

with \(q_\alpha\) the quantile of order \(\alpha\) of a reference distribution.

Equivalently, we can reject the null hypothesis if the \(100(1-2\alpha )\%\) confidence interval of \(\theta\) is within \([-\delta , \delta ]\), that is, if the confidence interval of the exponential of \(\theta\) is within [0.8; 1.25]. The exponential of \(\theta\) is often reported in the results of the test and is called the geometric mean ratio (GMR).

Non-compartmental analysis

The standard method for PK equivalence studies is to compute individual AUC and \(C_{max}\) and use an ANOVA or a linear mixed model to estimate the treatment effect. \(AUC_{tlast}\) can be computed using the trapezoidal method and \(AUC_{\infty }\) can be estimated by linear extrapolation. For this, FDA recommends that sampling continue for three or more terminal elimination half-lives of the drug, with at least three sampling points after the peak [3]. \(C_{max}\) is defined as the maximal concentration measured among the study sampling times.
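A minimal sketch of these NCA computations, assuming a mono-exponential terminal phase and using the last three sampling points to estimate the terminal slope; the sampling times and concentrations below are synthetic.

```python
import numpy as np

def nca_auc(t, c, kel=None):
    """AUC_tlast by the linear trapezoidal rule; AUC_inf adds C_tlast/kel,
    with kel estimated by log-linear regression on the last three points
    (FDA recommends at least three points after the peak)."""
    t, c = np.asarray(t, float), np.asarray(c, float)
    auc_tlast = ((c[1:] + c[:-1]) / 2.0 * np.diff(t)).sum()
    if kel is None:
        kel = -np.polyfit(t[-3:], np.log(c[-3:]), 1)[0]  # terminal slope
    return auc_tlast, auc_tlast + c[-1] / kel

# synthetic mono-exponential decline: true AUC from 0 to infinity = C0/kel = 10
t_obs = np.array([0, 1, 2, 4, 8, 12, 24], float)
c_obs = np.exp(-0.1 * t_obs)
auc_tlast, auc_inf = nca_auc(t_obs, c_obs)
cmax = c_obs.max()   # Cmax: maximum observed concentration
```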

Depending on the study design, there can be a period and a sequence effect on the variables of interest. In parallel studies, there is only one period: each group of participants receives one treatment only. Our present work focuses on a drug with a long half-life which warrants a parallel study design instead of the classical crossover design for PK equivalence studies. In this case, there is no period or sequence effect and intra-individual variability cannot be properly evaluated. The models to fit are simply:

$$\begin{aligned} log(AUC_i)= & {} \mu _{AUC}+\theta _{AUC} T_i+\epsilon _{AUC_i} \end{aligned}$$
(5)
$$\begin{aligned} log(C_{max_i})= & {} \mu _{C_{max}}+\theta _{C_{max}}T_{i}+\epsilon _{C_{max_i}} \end{aligned}$$
(6)

with:

  • \(\mu\): mean value of variable for the reference treatment;

  • \(T_i\): treatment covariate variable for individual i;

  • \(\theta\): coefficient of treatment effect;

  • \(\epsilon _{i} \sim {\mathcal {N}}(0,\sigma ^2)\): residual error.

The treatment effects on the variables of interest and their standard errors are obtained directly from the linear model inference.

The geometric mean ratio is, e.g. for AUC:

$$\begin{aligned} GMR&= \displaystyle \frac{\displaystyle exp({\mathbb {E}}(log(AUC_{T})))}{\displaystyle exp({\mathbb {E}}(log(AUC_{R})))} \\&= \displaystyle \frac{exp(\mu _{AUC}+\theta _{AUC})}{exp(\mu _{AUC})}\\&= exp(\theta _{AUC}) \end{aligned}$$

In non-compartmental PK equivalence analyses (hereafter called NCA-TOST), the standard error is obtained from the Fisher Information Matrix (FIM), whose inverse is asymptotically the lower bound of the variance-covariance matrix of the regression coefficients. With balanced groups, the reference distribution to use in NCA-TOST is a Student’s t distribution with N-2 degrees of freedom, N being the number of participants in the study.
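The NCA-TOST procedure for a parallel design can be sketched end-to-end; the pooled two-sample formulation below is equivalent to the OLS fit of Eq. (5), and the log(AUC) values fed to it would be purely illustrative.

```python
import numpy as np
from scipy import stats

def nca_tost(log_auc_ref, log_auc_test, alpha=0.05, delta=np.log(1.25)):
    """NCA-TOST for a parallel design: theta is the difference of group means of
    log(AUC); the pooled variance matches the OLS fit of Eq. (5); the reference
    distribution is Student's t with N-2 degrees of freedom."""
    nR, nT = len(log_auc_ref), len(log_auc_test)
    theta = np.mean(log_auc_test) - np.mean(log_auc_ref)
    s2 = ((nR - 1) * np.var(log_auc_ref, ddof=1)
          + (nT - 1) * np.var(log_auc_test, ddof=1)) / (nR + nT - 2)
    se = np.sqrt(s2 * (1.0 / nR + 1.0 / nT))
    q = stats.t(nR + nT - 2).ppf(1 - alpha)
    ci = np.exp([theta - q * se, theta + q * se])   # 90% CI of the GMR
    equivalent = bool(ci[0] > 0.8 and ci[1] < 1.25)
    return np.exp(theta), ci, equivalent
```

With balanced groups, checking that this 90% confidence interval of the GMR lies within [0.8; 1.25] reproduces the TOST decision at \(\alpha =5\%\).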

Model-based approach

Regulatory requirements may not be met in studies with sparse sampling designs, and NCA-TOST may then become less accurate. Indeed, it can be hard to compute individual AUC and \(C_{max}\) with only a few points per subject. To leverage population data over time to inform predictions for individuals, a model-based alternative has been proposed [8, 10], in which a structural PK model is built and a non-linear mixed effects model (NLMEM) is used to estimate the treatment effect. The corresponding statistical model can be written as follows in the case of parallel studies:

$$\begin{aligned} y_{ij}= & {} f(t_{ij},\phi _{i})+g(t_{ij},\phi _{i}) \epsilon _{ij}\end{aligned}$$
(7)
$$\begin{aligned} log(\phi _{il})= & {} log(\mu _l) + \theta _l T_{i} + \eta _{il}\end{aligned}$$
(8)

with:

  • \(t_{ij}\): time j for individual i;

  • \(y_{ij}\): concentration for individual i at time \(t_{ij}\);

  • \(\phi _{i}\): vector of parameters for individual i (typically of size 3 to 10);

  • \(f(t_{ij},\phi _{i})\): non-linear structural PK model depending on \(\phi _i\);

  • \(g(t_{ij},\phi _{i})\): error model;

  • \(\epsilon _{ij} \sim {\mathcal {N}}(0,1)\): residual error;

  • \(\mu _{l}\): fixed effect for parameter l;

  • \(T_i\): treatment covariate variable;

  • \(\theta _l\): coefficient of treatment effect for parameter l;

  • \(\eta _{il} \sim {\mathcal {N}}(0,\omega _l)\): between subject random effect for parameter l;

  • \(\omega _l\): standard deviation of the inter-individual random effect for parameter l.

g() describes the error model, with usual models being:

  • Additive error model: \(g(t_{ij},\phi _{i})= \sigma _a\) ;

  • Multiplicative error model: \(g(t_{ij},\phi _{i}) = \sigma _b \ f(t_{ij},\phi _{i})\) ;

  • Combined error model: \(g(t_{ij},\phi _{i})= \sigma _a + \sigma _b \ f(t_{ij},\phi _{i})\) .
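A simulation sketch of the data-generating model of Eqs. (7)-(8), using a one-compartment model with first-order absorption and a multiplicative error for simplicity; all parameter values are hypothetical, not those of the case study.

```python
import numpy as np

def one_cpt(t, ka, cl_f, v_f, dose=225.0):
    """f(t, phi): one-compartment model, first-order absorption and elimination."""
    ke = cl_f / v_f
    return dose * ka / (v_f * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

def simulate_arm(n, times, mu, omega, theta=None, sigma_b=0.2, treated=0, seed=0):
    """Simulate one arm per Eqs. (7)-(8): log-normal individual parameters with
    an optional treatment effect theta, and a multiplicative error model."""
    rng = np.random.default_rng(seed)
    times = np.asarray(times, float)
    theta = np.zeros_like(mu) if theta is None else theta
    conc = np.empty((n, len(times)))
    for i in range(n):
        eta = rng.normal(0.0, omega)                       # eta_il ~ N(0, omega_l)
        phi = np.exp(np.log(mu) + theta * treated + eta)   # Eq. (8), log scale
        f = one_cpt(times, *phi)
        conc[i] = f * (1.0 + sigma_b * rng.normal(size=times.size))  # y = f + g*eps
    return conc

# hypothetical values loosely evoking a slowly absorbed, slowly cleared antibody
mu = np.array([0.3, 0.2, 3.0])        # ka (1/day), CL/F (L/day), V/F (L)
omega = np.array([0.3, 0.25, 0.2])    # inter-individual SDs
y = simulate_arm(24, [0.25, 3, 7, 20, 84], mu, omega)
```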

In the context of BE studies, we usually have previous knowledge on the underlying PK characteristics of the reference product, which could be described by a subset of structural PK models f().

In this study, we only fitted and compared PK models that differed in terms of number of compartments, order of absorption, and presence of an absorption delay. A description of all the models used in this study can be found in Appendix 1, which defines the vector \(\mu\) of parameters for each model.

Computation of standard errors

In this study, we used and compared three different methods of computing the SE in the MB approach, described below and called “Asympt”, “Gallant” and “Post”. These three methods have also been evaluated in the context of BE studies by Loingeville et al. [10].

Asympt

AUC and \(C_{max}\) are secondary PK parameters of the models, i.e., functions of the direct PK model parameters, and their treatment effects are likewise functions of the direct PK model parameters and treatment effects: \(\theta = h(\mu _{PK}, \theta _{PK})\). For instance, for all PK models with a linear elimination, \(AUC_\infty =\displaystyle \frac{FD}{CL}\), where D is the dose administered, F the bioavailability of the drug and CL the clearance, so the treatment effect on \(AUC_\infty\) can be simply derived from the model as \(\theta _{AUC_\infty }=-\theta _{CL/F}\) and \(SE(\theta _{AUC_\infty })=SE(\theta _{CL/F})\). In one-compartment models, there are analytical solutions for all secondary PK parameters, so the delta method can be used to compute the standard errors of treatment effects. In two-compartment models, there is no analytical solution for \(C_{max}\), so \(\theta _{C_{max}}\) and its standard error must be computed by simulation. This method consists of sampling parameters from a multivariate normal distribution with the maximum likelihood estimates as the mean vector and the inverse of the FIM as the variance-covariance matrix, to simulate rich concentration profiles for the reference and test treatments (see Appendix 2 for a more precise description of the method).
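This simulation-based computation can be sketched as follows, using an illustrative one-compartment model in place of a two-compartment one, with hypothetical estimates and a small diagonal matrix standing in for the inverse FIM.

```python
import numpy as np

def one_cpt(t, ka, cl, v, dose=225.0):
    """Illustrative one-compartment model with first-order absorption."""
    ke = cl / v
    return dose * ka / (v * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

def cmax_effect_se(mle, cov, t_grid, n_samp=1000, seed=0):
    """theta_Cmax and its SE by simulation: draw population parameters from
    N(MLE, inverse FIM), compute rich reference and test typical profiles,
    and summarise the log Cmax differences over the draws.
    mle stacks [log mu_ka, log mu_CL, log mu_V, theta_ka, theta_CL, theta_V]."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(mle, cov, size=n_samp)
    effects = np.empty(n_samp)
    for k, d in enumerate(draws):
        log_mu, theta = d[:3], d[3:]
        c_ref = one_cpt(t_grid, *np.exp(log_mu))
        c_test = one_cpt(t_grid, *np.exp(log_mu + theta))
        effects[k] = np.log(c_test.max()) - np.log(c_ref.max())
    return effects.mean(), effects.std(ddof=1)

# hypothetical MLEs (no treatment effect) and a small diagonal covariance
t_grid = np.linspace(0.01, 84, 2000)
mle = np.array([np.log(0.3), np.log(0.2), np.log(3.0), 0.0, 0.0, 0.0])
theta_cmax, se_cmax = cmax_effect_se(mle, np.eye(6) * 1e-4, t_grid)
```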

In this approach (which will be designated hereafter by MB-TOST Asympt), the standard error computed in NLMEM is also obtained with the FIM, using a linearisation of the PK model.

The reference distribution we use in MB-TOST Asympt is a Gaussian distribution with zero mean and a standard deviation equal to 1.

In the MB approach, an underestimation of the asymptotic standard errors of the treatment effects has been observed which resulted in an inflation of type I error when performing PK equivalence tests [8]. To address this, several methods of correction of the asymptotic standard errors have been suggested. Here, we use two methods of correction, designated Gallant and Post, which were proposed for equivalence tests by Loingeville et al. [10].

Gallant

The Gallant correction [13] (MB-TOST Gallant) aims to account for the number of parameters estimated relative to the available data, to correct for the underestimation of the standard errors of treatment effects. It involves re-weighting the standard errors using the following formula:

$$\begin{aligned} SE_{Gallant} = SE \ \sqrt{\frac{N}{N - p}} \end{aligned}$$
(9)

with N the number of participants in the study and p the number of fixed and covariate effects (here, we only have the treatment as a covariate).

We also switch the reference distribution used in the tests from a Gaussian distribution to a Student’s t distribution with \(N-p\) degrees of freedom.
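In code, the correction of Eq. (9) and the switch of reference distribution amount to the following; the N and p values used in the example are hypothetical.

```python
from math import sqrt
from scipy import stats

def gallant_correction(se, n, p):
    """Gallant re-weighting of the asymptotic SE (Eq. 9), returned together with
    the Student's t(N-p) distribution to use in place of the Gaussian reference."""
    return se * sqrt(n / (n - p)), stats.t(n - p)

# hypothetical: N = 48 participants, p = 12 fixed and treatment effects
se_g, ref_dist = gallant_correction(0.08, 48, 12)
```

Since the t(N-p) quantiles are larger than the Gaussian ones, both the inflated SE and the heavier-tailed reference make the test more conservative.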

Post

This method (MB-TOST Post) uses posterior distribution samples to compute the standard errors of treatment effects [10].

Samples of population parameters are generated by Bayesian inference, with the Hamiltonian Monte Carlo algorithm. Maximum likelihood estimates obtained with NLMEM are used as initial values. Uniform priors are used for the fixed and treatment effects and Half-Cauchy distributions with zero mean and a standard deviation equal to 1 for the random effects and residual error variance parameters.

When the data are not informative enough given the number of model parameters to estimate, these priors can result in chains with low \(N_{eff}\) and high \({\hat{R}}\). When \(N_{eff} \le 400\) and \({\hat{R}} \ge 1.05\), log-normal priors can be used for the fixed effects, with mean equal to the maximum likelihood estimate and a standard deviation equal to 0.5, and normal priors with zero mean and standard deviation equal to 0.5 for the treatment effects, as in [10].

The standard errors of treatment effects are computed using samples from the posterior distribution.

The reference distribution, as for MB-TOST Asympt, is a Gaussian distribution with zero mean and a standard deviation equal to 1.

Fig. 1

Individual concentration versus time profiles, in log scale, in studies S1 and S2 per dose (105 and 225 mg), in the reference (HCLF G3) and test (LyoF G2) treatment arms (colour figure online)

Case study: gantenerumab

Data

In our analysis, PK data were collected from two phase I randomised clinical trials in healthy male and female subjects aged 40–70 years. These trials investigated the relative bioavailability, tolerability, and dose-exposure relationship of a high concentration liquid formulation (HCLF G3) versus a lyophilised formulation (LyoF G2) of gantenerumab, a monoclonal antibody used for the treatment of Alzheimer’s disease. Hereafter we considered the high concentration liquid formulation as the reference formulation. Both formulations were administered by subcutaneous injection. The first study (NCT01636531, here called S1) was composed of five parallel arms with 24 participants each: three reference arms at different dose levels (105, 225 and 300 mg) and two test arms (105 and 225 mg). In the second study (NCT02133937, here called S2), composed of one reference arm of 25 participants and one test arm of 23 participants, the dose tested was 225 mg. PK sampling was performed in participants for up to 13 weeks using the following scheme: 0.25, 1, 2, 3, 4, 7, 13, 20, 42, 63, and 84 days post dose. There was one additional sampling time in S2, one hour post dose (0.04 days). We evaluated PK equivalence of the two formulations in terms of \(C_{max}\) and \(AUC_\infty\).

Methods

We performed separate analyses for each study and dose tested, hereafter called S1-105, S1-225 and S2-225, discarding the 300 mg arm of S1 as this study did not include a test treatment arm at this dose.

On the original rich design data (11 sampling points per subject), different structural PK models and residual error models were fitted on the reference arms, and compared for selection purposes. The structural PK models tested differed in terms of number of compartments (one or two), order of absorption (zero or one) and presence of an absorption delay. A description of all these models can be found in Appendix 1. As we work on a drug administered by subcutaneous injection, the parameters of the PK models used are apparent parameters scaled by the bioavailability of the drug F. Inter-individual variability followed a log-normal distribution for all parameters. Three types of error models were tested: additive, multiplicative and combined. Models were compared using the Bayesian Information Criterion (BIC) computed by Importance Sampling, combined with a second criterion of a relative SE (RSE) below 50% for all parameters. Inter-individual variability parameters that did not meet this second criterion were removed. We also explored the relevance of adding a correlation between the inter-individual variabilities. Goodness of fit was assessed with Visual Predictive Checks (VPC) and Normalised Prediction Distribution Errors (NPDE) [14]. The selected PK model was then fitted on both the reference and test arms and treatment effects were estimated on all parameters. We compared the results of MB-TOST, using only the Asympt computation method for the SE, with results obtained with NCA-TOST, which usually performs well on such rich designs.

MB analyses were also run on a sparse subset of the data to explore the impact of the study design. The sparse subset for each study contained 5 points per subject, matching the maximum number of population parameters we needed to estimate, so as to keep the model identifiable. These points were obtained by optimisation of the design with PFIM [15] (Population Fisher Information Matrix, an algorithm for the evaluation and optimisation of designs), using the model fitted on the rich reference and test arms. Given that this manuscript focuses on the investigation of MB methods as an alternative for sparse designs, we tested the PK equivalence only with MB-TOST, again selecting the PK structural model on the reference arm. Three methods to compute the SE were used: Asympt, Gallant and Post.

Implementation

Analyses were run in R version 4.0.2. Parameters of the PK models were estimated by maximising the likelihood using the Stochastic Approximation of Expectation Maximisation algorithm (SAEM) [16], in the saemix R package [17] (development version: https://github.com/saemixdevelopment/saemixextension). For NCA-TOST, \(AUC_{\infty }\) was computed by extrapolation with the PKNCA R package [18] version 0.9.4, using the observed concentration at \(t_{last}\). Sampling points for the sparse designs were chosen with the PFIM [15] R package version 4.0, which enables optimisation of population designs using the Fedorov–Wynn algorithm.

Results

Figure 1 shows spaghetti plots of the plasma concentrations of gantenerumab versus time in log-scale, for the two lower doses in each study.

The same model, a two-compartment model (\(V_1/F\): apparent volume of the principal compartment, \(V_2/F\): apparent volume of the peripheral compartment, Q/F: apparent inter-compartmental clearance) with linear absorption (ka: absorption constant) and elimination (CL/F: apparent clearance constant) with an absorption delay (\(T_{lag}\)), was selected as the best (among the considered candidates) at describing the drug PK across studies/arms (taken as three separate datasets). A treatment effect was estimated on all 6 parameters (\(\theta _{Tlag}\), \(\theta _{ka}\), \(\theta _{CL/F}\), \(\theta _{V1/F}\), \(\theta _{Q/F}\), and \(\theta _{V2/F}\)). On all datasets, based on BIC, the inter-individual random effect on \(V_2/F\) was withdrawn, and a correlation between the inter-individual random effects of CL/F and \(V_1/F\) was estimated. On S1-105 and S1-225, the error model was multiplicative. On S1-225, no inter-individual random effect was kept on Q/F. On S2-225, the error model was combined. The models selected were therefore very similar. Table 4 in Appendix 3 gives the parameter estimates obtained across datasets. As shown in Fig. 2, illustrating the GMR and their confidence intervals in the different datasets investigated, the different methods gave consistent results: for S1-105, with both NCA-TOST and MB-TOST Asympt, the 90% confidence interval of the GMR of AUC and \(C_{max}\) fell within [0.8; 1.25], but for S1-225, equivalence could not be shown on \(C_{max}\) with either of the two methods. On S2-225, equivalence could not be shown on \(C_{max}\) with either method. For AUC, equivalence was shown using MB-TOST but not using NCA-TOST, although the estimates were close (MB-TOST Asympt: 90% CI=[0.801;1.218], p-value=0.049; NCA-TOST: 90% CI=[0.782;1.205], p-value=0.070). The data used to produce Fig. 2 are provided in Table 5 in Appendix 3.

The sparse design optimised using PFIM led to the following sampling scheme: 0.25, 3, 7, 20, 84 days post dose for S1-105, 0.25, 4, 20, 42, 84 days for S1-225, and 0.04, 4, 13, 42, 84 days post dose for S2-225. The selected PK model was a one compartment model with linear absorption and an absorption delay on the two S1 datasets, and a one compartment model with zero order absorption and no absorption delay on S2. Again, a treatment effect was estimated on all apparent parameters in each case. On all datasets, a correlation between the inter-individual random effects of CL/F and V/F was selected. On S1-105 and S1-225, the error model selected was multiplicative. On S2-225, the error model selected was combined. On S1-225 and S2-225, no inter-individual random effect was kept on \(T_{lag}\). Table 4 in Appendix 3 gives the parameters estimated on all these subsets. Although the PK models selected on the sparse data were different from the ones selected on the observed data, the results of the equivalence study using MB-TOST were consistent, across all computation methods of SE, and comparable to those obtained on rich design (Fig. 2).

Fig. 6 shows the VPC and Fig. 7 assesses the normality of residuals for the S1-225 original and sparse designs. These goodness-of-fit plots were also checked for S1-105 and S2 (not shown).

Fig. 2

Geometric mean ratios (GMR) and their 90% confidence intervals for AUC and \(C_{max}\), with NCA-TOST and MB-TOST Asympt on observed data and with MB-TOST Asympt, Gallant and Post on sparse data. S1-105 denotes Study 1 with dose=105 mg reference and treatment arms, and similarly for S1-225 and S2-225. Grey lines are the limits of the null hypothesis interval, \(GMR=0.8\) and \(GMR=1.25\), and the black line represents \(GMR=1\). PK equivalence is shown as green intervals while blue intervals highlight the parameters and datasets for which PK equivalence was not established

Fig. 3

Boxplots of estimation errors (EE) (top row) and standard errors (SE) (bottom row) of the treatment effects estimated on AUC and \(C_{max}\), on (a) rich design simulations with NCA-TOST and MB-TOST Asympt, using the simulated PK structural model and treatment effects estimated and all apparent parameters (2cpt_par) or only on ka and F (2cpt_F), and (b) sparse design simulations with MB-TOST Asympt using the simulated PK structural model (2cpt_par) or a misspecified one compartment model (1cpt_par), with treatment effects estimated on all apparent parameters

Fig. 4

Type I errors for AUC and \(C_{max}\), under \(H_{0:0.8}\) and \(H_{0:1.25}\), on (a) rich design simulations with NCA-TOST and MB-TOST Asympt, and on (b) sparse design simulations with MB-TOST Asympt, Gallant and Post

Fig. 5

Study power for AUC and \(C_{max}\), under \(H_{1:0.9}\), \(H_{1:1}\) and \(H_{1:1.11}\), on (a) rich design simulations with NCA-TOST and MB-TOST Asympt, and on (b) sparse design simulations with MB-TOST Asympt, Gallant and Post

Simulation study

Methods

The real case study inspired our simulation settings with rich and sparse designs. We simulated parallel studies with reference and test treatment arms, 24 participants per arm. The vector of rich sampling times was taken from S1-225: 0.25, 1, 2, 3, 4, 7, 13, 20, 42, 63, and 84 days post dose.

The PK model used to simulate data was the one selected to describe the data of the reference arm of S1-225, corresponding to a two-compartment model with linear absorption and elimination. We removed the absorption delay. Moreover, the simulation study was performed prior to the availability of the data for publication; at the time, we only had access to scaled dose values (doses divided by 15). Table 1 gives a graphical representation of the model simulated, and Table 2 gives the values of the fixed, random and error parameters simulated, which were taken from the fit of S1-225.

Table 1 Graphical representation of the model simulated and the models fitted on the rich and sparse design simulations, with the corresponding fixed and treatment effects and inter-individual variability parameters
Table 2 Fixed coefficient values for fixed effects and standard deviations of the inter-individual random effects and residual errors, under which data were generated in the simulation study

Different levels of treatment effects were simulated on the apparent parameters, in order to get a treatment effect on AUC and \(C_{max}\) at the desired levels. To compute type I errors, we simulated data with treatment effects on AUC and \(C_{max}\) at boundaries of the null hypothesis, log(0.8) and log(1.25). These scenarios are denoted as \(H_{0:0.8}\) and \(H_{0:1.25}\), respectively. To study the power, we simulated data with treatment effects on AUC and \(C_{max}\) at and close to 0 (log(0.9), log(1) and log(1.11)). These scenarios are denoted as \(H_{1:0.9}\), \(H_{1:1}\) and \(H_{1:1.11}\). The treatment effects were simulated on clearance (CL/F) and central volume (\(V_1/F\)), with no treatment effect on ka, Q/F and \(V_2/F\). In practice, the treatment effect on CL/F was fixed (e.g. \(\theta _{CL/F} = log(0.8)\) to get \(\theta _{AUC} = log(1.25)\)) and then the treatment effect on V1/F was varied to obtain the desired treatment effect on \(C_{max}\) without impacting the treatment effect on AUC. Table 3 gives the values of the different levels of treatment effects simulated. For each of the 5 treatment effects, 1000 datasets were simulated.
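The mapping from a treatment effect on apparent clearance to the targeted effect on AUC can be checked directly, since \(AUC_\infty = FD/CL\); the clearance value below is hypothetical.

```python
import numpy as np

# Since AUC_inf = F*D/CL, a treatment effect theta_CL/F = log(0.8) on apparent
# clearance gives theta_AUC = -theta_CL/F = log(1.25), whatever the other parameters.
dose = 15.0                    # scaled dose (225 mg / 15), as in the simulation study
cl_ref = 0.2                   # hypothetical apparent clearance CL/F (L/day)
cl_test = cl_ref * np.exp(np.log(0.8))
gmr_auc = (dose / cl_test) / (dose / cl_ref)
```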

Table 3 Treatment effects simulated on CL/F and \(V_1/F\) and GMR obtained on AUC and \(C_{max}\) on each simulation scenario

On rich design simulations, we compared the performances of NCA-TOST and MB-TOST Asympt in terms of type I error and study power. We first fitted the simulated structural PK model, estimating treatment effects on all 5 apparent parameters (referred to as model 2cpt_par). We also explored the performance of MB-TOST Asympt when modelling the treatment effects differently: a two-compartment model with the treatment effect estimated only on the absorption parameter ka and on an additional scale/bioavailability parameter F, with \(\mu _F\) fixed to 1 and \(\omega _F\) estimated (called hereafter 2cpt_F). Table 1 represents the structure of both models fitted to the rich design data.

In a second step, we also analysed sparse optimal design subsets, using PFIM: we selected 5 time-points, assuming 2cpt_par was true. The same 5 time points were selected regardless of the level of treatment effect considered: 0.25, 7, 20, 42, and 84 days post dose. On these sparse design simulations, we challenged MB-TOST by exploring the impact of a structural PK model misspecification: the model used to fit the data was either 2cpt_par or a misspecified one-compartment model with treatment effects estimated on all apparent parameters (1cpt_par). Table 1 represents the models fitted on sparse design simulations. As in the case study, three methods of computation of the SE were used on the sparse design simulations: Asympt, Gallant and Post.

We also explored the relevance of a PK model selection step on the reference arm, based on the BIC, prior to the equivalence test, on rich and sparse design simulations (two models to compare in each case). We assessed the impact of this approach in terms of type I error.

Estimation Errors (EE) and Standard Errors (SE) of treatment effects were computed to evaluate the agreement between the NLMEM estimates and the true values under which the data were simulated. Empirical SE were computed as the standard deviation of the 1000 estimates of each parameter in each scenario.

Implementation

A script detailing the analysis of one simulated dataset with saemix and stan is available on Zenodo (https://doi.org/10.5281/zenodo.6500556).

Results

Rich design

Figure 3a shows the boxplots of estimation errors (EE, top) and standard errors (SE, bottom) of the treatment effects on AUC and \(C_{max}\) in the different simulation scenarios with a rich design. The treatment effects estimated with 2cpt_par (whose structure matches the simulated model, except that treatment effects are estimated on all parameters) showed no bias and good precision: the EE were close to 0 and the estimated SE were close to the empirical SE. As expected on this rich design, NCA also provided good estimations of the treatment effects.

Figure 4a shows the type I errors of the TOST for AUC and \(C_{max}\) using NCA or a MB approach on rich design. The type I errors obtained with MB-TOST Asympt, using 2cpt_par, were similar to those obtained with NCA-TOST and close to the nominal value of 5%.

When we modelled the treatment effects differently from how they were simulated (i.e., using the misspecified model 2cpt_F), the model misspecification led to unsatisfactory results: the graph of EE (Fig. 3a, top) shows that the treatment effect on AUC was underestimated. In the scenario \(H_{0:0.8}\), the relative bias in the estimation of the treatment effect on AUC is − 0.038, 0.016, and 0.016 for 2cpt_F, 2cpt_par, and NCA, respectively. In the scenario \(H_{0:1.25}\), the relative bias in the estimation of the treatment effect on AUC is − 0.104, − 0.021, and − 0.030 for 2cpt_F, 2cpt_par, and NCA, respectively. The asymptotic SE boxplots appear lower than the empirical SE, though the relative root mean square errors (RMSE) are approximately − 0.35 for 2cpt_par, 2cpt_F, and NCA alike. This increased bias led to the inflated type I errors seen in Fig. 4a.

A selection step using the BIC on the reference data, prior to the test, helped correct the bias. Indeed, the difference in BIC between 2cpt_par and 2cpt_F ranged from −22.1 to 10.3 with a median of −3.4, and the selection procedure recovered the simulation model in 85% of cases. Consequently, the type I error of MB-TOST was within the 95% prediction interval of the nominal value of 0.05 for each simulated level of treatment effect.
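The BIC selection step can be sketched as below. The log-likelihoods, parameter counts, and number of observations are hypothetical, and the BIC shown is the standard definition (variants exist for mixed-effects models):

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: -2 log L + k log n (lower is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# Hypothetical fits of the two candidate models on the reference arm only.
candidates = {
    "2cpt_par": bic(log_likelihood=-410.2, n_params=12, n_obs=240),
    "2cpt_F":   bic(log_likelihood=-420.0, n_params=9,  n_obs=240),
}

# Retain the model with the smallest BIC before running the equivalence test.
selected = min(candidates, key=candidates.get)
print(selected)
```

Fitting and comparing the candidates on the reference arm only, as in the paper, avoids using test product data to choose the model later used in the BE assessment.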

Study power was low for each design, whether using NCA-TOST or MB-TOST, owing to the parallel design of the clinical trial and the sample size (N = 24 per arm; see Fig. 5).

Sparse design

In the simulations with the sparse design, the treatment effects were still well estimated using 2cpt_par (Fig. 3b). Figure 4b shows the type I errors for the sparse design: with 2cpt_par, MB-TOST Asympt yielded type I errors close to the 95% prediction interval of the nominal value of 0.05.

When the structural PK model was misspecified, using a single distribution compartment instead of two, we observed a large inflation of the type I error on \(C_{max}\), which we infer from Fig. 3b to be due to an underestimation of both the treatment effect and its SE. Indeed, in the scenario \(H_{0:0.8}\), the relative bias in the estimation of the treatment effect on \(C_{max}\) is −0.079 and 0.037 for 1cpt_par and 2cpt_par, respectively; in the scenario \(H_{0:1.25}\), it is −0.139 and −0.015. Similarly, the relative RMSE in the estimation of the treatment effect on \(C_{max}\) is −0.40 and −0.48 in the scenario \(H_{0:0.8}\), and 0.40 and 0.47 in the scenario \(H_{0:1.25}\), for 1cpt_par and 2cpt_par, respectively.

MB-TOST Post gave results similar to MB-TOST Asympt (Fig. 4b). MB-TOST Gallant partly corrected the inflation of type I errors but could not correct for the bias in the estimates.

The numbers used to produce Figs. 4 and 5 are provided in Tables 6 and 7 in Appendix 3.

Here, a selection step using the BIC on the reference data, prior to the test, to choose the number of compartments of the structural PK model led to the selection of the simulated structural model in most cases (at least 99.0%). The difference in BIC between 2cpt_par and 1cpt_par ranged from −69.1 to 6.0 with a median of −20.8. This allowed the type I error of MB-TOST to be controlled.

We checked the assumption of normality of the test statistics under the null with Asympt in both the rich and sparse designs (data not shown).

Discussion

In this article, we analyse PK data gathered from participants treated with two formulations of gantenerumab, a monoclonal antibody for the treatment of Alzheimer’s disease. These data were originally collected to study the relative bioavailability of the two formulations. In this work, we use them to compare the conventional NCA-TOST with the MB-TOST approach for PK equivalence testing. The data come from a parallel design rather than the crossover design more conventional in equivalence studies. We then use them to build a simulation study exploring the impact of sparse designs and of model misspecification on the MB approach to testing PK equivalence.

After finding a dose effect on the pooled data, we performed the analyses separately for each study and dose evaluated. In our evaluation of these PK BE studies, we assumed that the PK characteristics of the reference drug are well known and that the change of treatment does not affect the underlying structural PK model. In our simulations, we assumed that the residuals are independent of the treatment covariate in the model. We also assumed that the study population would be adequately randomised to avoid imbalance between the treatment arms, so we did not evaluate the impact of covariates in our MB approaches. However, it is important to acknowledge that covariates would likely have a greater impact on a PK BE study with a parallel design than with a crossover design. Moreover, adding covariates that affect the PK would decrease the between-subject variability. Thus, future research may be warranted in this area. Using the real data, we evaluated the selected models to assess the assumptions made on the residuals as part of the model-building process. The distributions of the MB-TOST statistics under the null from our simulations were also verified, as recommended by Shen et al. [12].

On the original data, the NCA-TOST and MB-TOST approaches generally provided consistent results with the original rich design, and the MB-TOST approach remained consistent after sparsifying the data.

Previous studies by Dubois et al. and Reijers et al. have shown that MB approaches evaluating studies with a crossover design [19] and a parallel design [20], respectively, have performed as well as NCA methods for biosimilarity studies in the case of rich sampling. Dubois et al. also explored MB approaches on a sparse version of their data.

In the present study, we performed a simulation study to explore the influence of the design and model specification on the performance of the approaches and the relevance of the model selection.

Here, as in the previous works [8, 10], we only considered average BE. With average BE, by contrast with individual and population BE [21], only the average treatment effect at the population level is taken into account. Population BE would also account for the variability of this effect, and individual BE for the within-subject and subject-by-formulation variabilities. In this parallel study, population BE could have been considered, since average BE does not correctly account for variability. Individual BE requires replicated crossover studies, so this approach would not be feasible with our data.

In the simulation study, when using the simulated model, MB-TOST Asympt achieved controlled type I errors that were similar to those obtained with NCA-TOST on rich designs. These results complement previous studies showing the efficiency of MB approaches for equivalence tests [8]. In general, regulatory authorities recommend that PK sampling include 12–18 samples with at least three sampling points after the peak [3, 4]. These recommendations present unique challenges for PK studies with sparse designs. Indeed, the sparse design we extracted from the full design did not comply with those requirements; consequently, we did not apply NCA-TOST to the datasets simulated with the sparse design. In this setting, we used MB-TOST as it relies on NLMEM, which has been shown to provide more accurate estimates, in particular for sparse designs [22]. However, Dubois et al. [8] showed that MB tests can lead to an inflation of the type I error because of an underestimation of the standard error of treatment effects when it is estimated asymptotically on a sparse design with high variability. As such, Loingeville et al. proposed and evaluated methods of correction of the standard errors in MB studies, with satisfactory results [10]. Notably, they compared the three methods we present here, along with a bootstrap method, but considered only one model (one-compartment) and did not explore the value of model selection. One of the correction methods for SE in MB studies, Gallant, has been used outside the context of BE: Bertrand et al. [23] considered various methods of correcting the number of degrees of freedom of a Student distribution and found that the Gallant correction was a good compromise in NLMEM to handle the information carried by the number of subjects. In this research, the use of Gallant leads to the same reference distribution in MB-TOST Gallant as in NCA-TOST, instead of the Gaussian distribution used in MB-TOST Asympt and MB-TOST Post.
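As a rough numerical sketch of why the Gallant correction is more conservative (hypothetical values; the degrees of freedom 2N − 2 for a two-arm parallel design and the t quantile, taken from standard tables, are our assumptions for this illustration):

```python
from statistics import NormalDist

alpha = 0.05
n_per_arm = 24                # sample size of the simulated parallel design
df = 2 * n_per_arm - 2        # assumed df: total subjects minus the two arm effects

# One-sided critical values at level alpha.
z_crit = NormalDist().inv_cdf(1 - alpha)  # Gaussian, as in MB-TOST Asympt
t_crit = 1.679                            # Student t with df = 46, standard tables

# A borderline test statistic: rejected under the Gaussian reference,
# but not under the wider Student t reference of MB-TOST Gallant.
statistic = 1.66
print(statistic > z_crit, statistic > t_crit)
```

The wider t quantile shifts borderline cases toward non-rejection, which is the mechanism behind the partial correction of type I error inflation observed with MB-TOST Gallant.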

Our results showed that MB-TOST Asympt was adequate with sparse designs, with a slightly conservative type I error for \(C_{max}\) that was corrected using MB-TOST Post. Here, the Post method was used only as an alternative way to produce SE. This algorithm is sensitive to the choice of prior distribution, which could be further investigated. Nevertheless, the performance of the different MB methods was very similar on the sparse design in our work: the asymptotic SE were close to the empirical SE, which explains why the test results were not affected by the correction methods.

As the treatment effect on \(C_{max}\) is not directly linked to the model parameters, we estimated it via simulations. We used an approximation simulating the treatment effect on a profile using the mean parameters; in Appendix 2, we provide a more computationally intensive method. In this example, the first approximation gave equivalent results, but the second should be used in the presence of higher variability.
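The mean-parameter approximation can be sketched as below, assuming, purely for illustration, a one-compartment model with first-order absorption and hypothetical treatment effects \(\theta_{ka}\) and \(\theta_F\) on the log scale (the study itself used a two-compartment model for gantenerumab):

```python
import math

def concentration(t, dose, ka, cl, v):
    """One-compartment model with first-order absorption (illustrative only)."""
    ke = cl / v
    return dose * ka / (v * (ka - ke)) * (math.exp(-ke * t) - math.exp(-ka * t))

def cmax(dose, ka, cl, v):
    """Cmax approximated as the maximum over a dense time grid."""
    grid = [i / 10 for i in range(1, 2001)]
    return max(concentration(t, dose, ka, cl, v) for t in grid)

# Hypothetical mean parameters for the reference arm.
ref = dict(dose=100.0, ka=1.0, cl=5.0, v=50.0)

# Apply hypothetical estimated treatment effects (log scale) to ka and F;
# an effect on F scales the amount reaching the circulation.
theta_ka, theta_F = 0.10, -0.05
test = dict(dose=ref["dose"] * math.exp(theta_F),
            ka=ref["ka"] * math.exp(theta_ka),
            cl=ref["cl"], v=ref["v"])

# Treatment effect on Cmax, approximated from the mean-parameter profiles.
effect_cmax = math.log(cmax(**test) / cmax(**ref))
print(round(effect_cmax, 3))
```

The more computationally intensive alternative of Appendix 2 would instead simulate many individual profiles and average the resulting effects, which matters when the between-subject variability is large.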

A sparse design is commonly seen in PK BE studies for ophthalmic drug products, where only one sample of aqueous humor is collected from the eye at a single time point. Currently, FDA recommends a non-parametric bootstrap NCA-based approach or a parametric method for the BE assessment of these drug products [5, 6]. In our assessment, we evaluated a study design with only five sampling points, which were optimally selected using PFIM. One limitation of this work is that we did not evaluate the performance of the classical NCA-TOST approach on the sparse design, as our focus was to evaluate the MB-TOST approach. The limitation of few sampling points per subject applies to both approaches: the NCA-TOST approach may become less accurate with few sampling points, whereas the MB-TOST approach may select a wrong structural PK model. Indeed, in our application study, the model parameter estimates varied considerably between the rich and sparse designs (see Table 4 in Appendix 3).

The MB approach was previously evaluated only in simulations assuming the true model to be known [8,9,10]. In our present study, we investigate this question by fitting PK models different from the one used to simulate the data. The two-compartment model with treatment effects estimated on ka and F only, fitted on the rich designs, is the same structural PK model as the simulated one but with an alternative way of parameterising the treatment effects. It has already been used in other studies as the simulated model [11]. Here, it cannot properly fit the data as \(\theta _F\) reflects a treatment effect on all distribution and elimination parameters, which does not agree with the way we simulated the data (i.e., without an effect on the peripheral clearance and volume). This explains why the effect on AUC is underestimated. With biosimilars, differences in the PK characteristics of a drug may be due to factors other than differences in the absorption phase. In contrast, the misspecified one-compartment model with treatment effects estimated on all apparent parameters, fitted on sparse designs, is a different PK structural model than the simulated one. The choice of the number of compartments is an essential step in structural PK model building. It is very sensitive to the study design and is therefore highly susceptible to misspecification. A less complex model would more likely be selected on a real study in case of non-optimised sparse design because of the lack of information. The treatment effect on AUC is still quite well estimated because it is a mean PK parameter, unlike \(C_{max}\) which is more sensitive to the misspecification because it is driven by only one point.

Adding a model selection step on the reference data allowed the simulated model to be selected in most cases. When the simulated model was not selected, the difference in BIC between the models was very small; in this case, we assume that the misspecified model can adequately describe the data, because the overall type I errors are controlled after the selection step. Most importantly, we mimic a real model development setting, where model selection is always part of the PK analysis. The selection of the model is based on data from the reference product only, in order to avoid biasing the MBBE evaluation by using test product data to fit the model used in the BE assessment. However, it is possible that using the reference arm for the model selection, and then for the assessment of a treatment effect, could inflate the type I error of the BE assessment; this issue may warrant further investigation. Moreover, this can cause a problem if the underlying PK model is different in the test arm. Another limitation of our simulation study is that we only selected between two different PK models. We could extend this approach to test and compare more features of the structural PK model (absorption and elimination phases) and/or of the variability model (random effects and residual errors), as we did in the real case study. We could also consider more complex data exhibiting, for example, double peaks, which can be very challenging to evaluate, or a magnitude of variability that depends on the treatment arm. It is likely that, in this case, the simulated model would not be recovered as often, potentially affecting the type I errors. However, the impact may not be very large if there were more candidate models in the selection step, as the models retained would have adequate goodness of fit. Hence, the estimated AUC and \(C_{max}\) would all be acceptable despite the diversity of underlying structural PK models.
It would therefore be interesting to further evaluate the impact of small model variations on the model selection process and on the ensuing ability to estimate \(C_{max}\), AUC, and the associated treatment effects. Competing models could also be taken into account via model averaging, which has been shown to perform at least as well as model selection in dose-finding studies using NLMEM [24, 25], as it accounts for the uncertainty in the model.

The methods presented in our study may be applied to PK similarity studies for large molecules (i.e., biologics) as well as PK BE studies for small molecules. By re-scaling the time frame, we could transpose our simulation settings and results to a BE study framework. In both cases, the test product or new drug contains the same active substance as the reference product, for which the PK is likely well characterised. To shorten the development phase of the new drug, it is recommended to demonstrate that there is no difference in treatment effect on the PK. In both cases, MB approaches may serve as an alternative to NCA for sparse designs and are thus increasingly explored [26]. However, it is acknowledged that the performance of both NCA and MB methods will drop in case of large inter-individual variability in PK or deviations from the working assumptions.

Thus, we propose the use of MB-TOST when NCA-TOST may not be feasible or reasonable, as MB approaches are more informative and flexible than NCA. This is consistent with recent proposals for MB approaches to serve as an alternative BE approach in generic drug development in situations for which conventional BE approaches are not feasible [27].

Conclusions

Our novel MBBE approach appears to be a robust alternative to the conventional NCA approach, provided that the PK model is correctly specified and the test drug has the same structural PK model as the reference drug. Our simulation studies show that the selection of the PK model is a key step in the implementation of a model-based approach for PK equivalence studies. However, MB methods rely on numerous assumptions, which need further investigation to determine when MB could offer a viable alternative to NCA in the context of PK BE studies.