Improved Confidence in a Confirmatory Stage by Application of Item-Based Pharmacometrics Model: Illustration with a Phase III Active Comparator-Controlled Trial in COPD Patients

Purpose The current study aimed to illustrate how a non-linear mixed effect (NLME) model-based analysis may improve confidence in a Phase III trial through more precise estimates of the drug effect. Methods The FULFIL clinical trial was a Phase III study that compared 24 weeks of once daily inhaled triple therapy with twice daily inhaled dual therapy in patients with chronic obstructive pulmonary disease (COPD). Patient reported outcome data, obtained by using The Evaluating Respiratory Symptoms in COPD (E-RS:COPD) questionnaire, from the FULFIL study were analyzed using an NLME item-based response theory model (IRT). The change from baseline (CFB) in E-RS:COPD total score over 4-week intervals for each treatment arm was obtained using the IRT and compared with published results obtained with a mixed model repeated measures (MMRM) analysis. Results The IRT included a graded response model characterizing item parameters and a Weibull function combined with an offset function to describe the COPD symptoms-time course in patients receiving either triple therapy (n = 907) or dual therapy (n = 894). The IRT improved precision of the estimated drug effect compared to MMRM, resulting in a sample size of at least 3.64 times larger for the MMRM analysis to achieve the IRT precision in the CFB estimate. Conclusion This study shows the advantage of IRT over MMRM with a direct comparison of the same primary endpoint for the two analyses using the same observed clinical trial data, resulting in an increased confidence in Phase III. Supplementary Information The online version contains supplementary material available at 10.1007/s11095-022-03194-1.


Introduction
The primary reason for Phase III clinical drug development failure is insufficient drug efficacy (55%), followed by safety (14%) and strategic reasons (14%) (1). Although the development of "ineffective" or "unsafe" compounds should stop during early stages of drug development, the proportion of failure owing to insufficient efficacy is currently larger in Phase III compared to Phase II (55% vs. 48%) (1). The causes of such failures may include not only insufficient drug efficacy but also insufficient knowledge of the treatment effect in the target population (2) leading to an insufficient sample size (e.g. sample size calculated from a commonly overestimated true treatment effect in Phase II (3)), and therefore high uncertainty in the efficacy estimate. A model-informed drug development (MIDD) decisionmaking framework has been proposed to increase Phase III trials probability of success through more precise estimates of the drug effect. This may result in an increased proportion of successful Phase III and IV trials (4).
Chronic Obstructive Pulmonary disease (COPD) is an inflammatory disease of the respiratory system, accounting for 54.9% of chronic respiratory diseases in 2017 (5). It was the third leading cause of death in 2016 and is projected to remain among the five leading causes of death by 2030 (6). The Global Initiative for Chronic Obstructive Lung Disease (GOLD) strategy (7) states that commonly used maintenance medications in COPD are short or long acting β2 agonists (SABA and LABA, respectively), short or long acting anticholinergic (SAMA and LAMA, respectively) or combination therapies with/without addition of inhaled corticosteroids (ICS). The GOLD strategy recommends that patients with advanced COPD and persistent symptoms who are at risk of exacerbations use an inhaled triple therapy (ICS/ LABA/LAMA) (7). The Lung Function and Quality of life Assessment in COPD with Closed Triple Therapy (FULFIL) is the first Phase III study that compared a once-daily singleinhaler triple therapy with a twice-daily inhaled dual therapy (8). Patient-Reported Outcomes (PROs) from the FULFIL trial (8,9) were obtained from different tools, such as the Evaluating Respiratory Symptoms in COPD (E-RS:COPD) (10), to compare the effect of fluticasone furoate/umeclidinium/vilanterol (FF/UMEC/VI) and budesonide/formoterol (BUD/FOR) on patient's respiratory symptoms. PROs are an important aspect of drug development as they contribute directly with information on the patient's own perceptions of their disease. The E-RS:COPD consists of 11 items from the 14-item Exacerbations of COPD Tool (EXACT), intended to capture information specifically related to respiratory symptoms. This tool has been derived through the application of item and Rasch model analysis (11).
The results of the statistical analysis of E-RS:COPD total score (RS-Total) from the FULFIL trial (8,9) showed that FF/UMEC/VI significantly improved patient respiratory symptoms (assessed via E-RS:COPD tool) in comparison to BUD/FOR (9). This statistical analysis was performed using a mixed model repeated measures (MMRM) approach, a standard statistical method used in longitudinal clinical trials (12). This type of analysis may result in a loss of information and therefore a loss of precision in the efficacy estimate since MMRM analyzes the total score only, ignoring the contribution of each item in the questionnaire to the disease state. Moreover, in MMRM, time is handled as a discrete value rather than a continuous variable resulting in a further loss of information. To include all information from the data and thus increase power and precision to detect a drug effect, a MIDD approach can be applied to analyze PRO data such as RS-Total using a longitudinal non-linear mixed effect (NLME) analysis based on item-level data (i.e. item response theory based model -IRT) (13). In contrast to MMRM, IRT utilizes all components of the composite observations by relating its items to an underlying disease state that varies among individuals and changes with time. IRT thus utilizes all information captured by the questionnaire instead of only the total score as MMRM does.
In the planning of Phase III, different metrics are of interest for the design and analysis of trials such as the probability of study success or the probability of making a correct decision (14). An MIDD framework can provide a more informative approach to explore these metrics and therefore to improve late-stage clinical development productivity (4). This analysis aims to illustrate how a new methodology to analyze PRO data using an item-based pharmacometrics model would improve confidence in a confirmatory stage by comparing the precision around the primary efficacy endpoint obtained with IRT and MMRM.

Data and Patients
The FULFIL clinical trial (NCT02345161) was a Phase III, randomized, double-blind, double dummy, parallel-group, multicenter study that compared 24 weeks of once daily FF/UMEC/VI (100 μg/62.5 μg/25 μg) inhalation powder with twice daily BUD/FOR (400 μg/12 μg) in patients with COPD. Patients were randomized in a 1:1 ratio to FF/ UMEC/VI or BUD/FOR. The study inclusion and exclusion criteria were reported elsewhere (8). In this analysis, nine patients were excluded from the intention to treat population for the following reasons: absence of E-RS:COPD score data for the whole study period (4 patients), dispensing errors (4 patients), and missing recorded time (1 patient). Daily data records were obtained by the completion of the E-RS:COPD tool using an electronic diary, which did not allow patients to skip individual items, although missing days (where patients did not provide answer for any of the items) were possible (10). The E-RS:COPD tool contains 11 items (a subset of the EXACT) that capture information related to respiratory symptoms of COPD (breathlessness, cough, sputum production, chest congestion and chest tightness), with seven items including five ordered categorical options and four items with four categories (Table I).

IRT Development
The IRT was developed in two steps: 1) characterization of the item characteristic functions (ICFs) and 2) the development of the longitudinal model (a flow chart of the analysis is shown Supplementary Fig. 1).

Item Characteristics Functions
The properties of each item were determined by non-linear functions (the ICFs) linking the unobserved patient's disease status, the latent variable θ, to the probability of giving a particular response for an item. Item parameters such as discrimination and difficulty parameters were estimated using an "independent occasion" approach (15). Under this approach, observations from each measurement occasion were treated as belonging to separate individuals, assuming a normal distribution with fixed mean and variance N(0,1) of θ at baseline, and estimated mean and variance N(μ,ω 2 ) at later observations.
A logistic transformation was used to model each item with the probability of rating a score of at least k [P(y ij ≥ k)] and the probability of rating exactly score k [P(y ij = k)] as shown in Eq.1 and Eq. 2, respectively. These probabilities are functions of θ i , the latent variable of patient i, and the estimated fixed effect item parameters such as discrimination (a j ) and difficulty (b j,k ) parameters for item j; more specifically b j,k is the difficulty parameter for the item-score k, and y ij corresponds to the observed data (E-RS:COPD scores) for an individual i and an item j.

Longitudinal Model
In this step, ICF parameters were fixed to the values obtained in step 1, and a data reconciliation was needed to include each individual's time-course data (original IDs). This was done since in step 1 each time point was treated as belonging to separate individuals. For the longitudinal model, the estimates of θ i from the IRT, using ICF fixed parameters from step 1 were considered as the dependent variable (15) taking into account their uncertainty. These estimates of θ i were obtained from an intermediate step where a large uncertainty for θ was considered to ensure that the total scores translated from the estimates of θ i described the raw data well (as shown in Supplementary Fig. 2). The NON-MEM control stream for this intermediate step to obtain θ i for step 2 is available in the Supplementary material. A pre-specified Weibull function (Eq. 3) was used to describe changes in individual symptoms-time course (θ i ) for both 1 + e (aj(θi−bj,k)) (2) P y ij = k = P y ij ≥ k − P y ij ≥ k + 1 treatment arms. Parameters such as disease progression time (T progi ), maximum response (R maxi ) and baseline latent variable (θ i,t = 0 ) are subject-specific parameters with inter-individual variability (IIV), including a random variable with a mean of 0 and variance of ω 2 . T progi was assumed to be lognormally distributed with IIV modeled using an exponential function, whereas all other parameters were assumed to be normally distributed with an additive IIV model. Time t is in years and γ is the gamma value that governs the steepness of the curve. Different parameters per treatment arm were considered. Additionally, effects of smoking status and geographical regions on θ i,t = 0 were included in the model as these covariates were considered in the MMRM analysis (9).

Internal Model Evaluation
Non-parametric ICF smooth plots were developed to assess ICF fit (16), and the predictive performance of the model was assessed by using Visual Predictive Check plots (VPCs). The 2.5th, 50th and 97.5th percentile of the observed data were compared to the 95%CI for the 2.5th, 50th and 97.5th percentiles of the simulated (N = 500) data. At this stage, further changes to the Weibull model were considered if it was deemed necessary.

Clinical Endpoint Definition
The clinical endpoint was the change from baseline (CFB) in RS-Total over 4-week intervals for each treatment arm (FF/UMEC/VI and BUD/FOR). To provide context for the mean differences between groups and being able to compare Breathless with activity 0-3 9 Short of breath -personal care 0-4 10 Short of breath -indoor activities 0- 3 11 Short of breath -outdoor activities 0-3 2 Cough frequency 0-4 Cough and Sputum 3 Mucus quantity 0-3 4 Difficulty with mucus 0-4 1 Congestion 0-4 Chest Symptoms 5 Discomfort 0-4 6 Tightness 0-4 changes in PRO scores between the treatment arms, a clinically meaningful CFB was defined as a change equal to or greater than a minimal clinically important difference (MCID) of 2 units (9). This means that a decrease of at least 2 points from baseline in RS-Total was deemed clinically significant.

Precision in Clinical Endpoint Estimate
The point estimate and the precision in the estimated clinical endpoint was obtained through model simulations, with inclusion of the uncertainty in the estimated longitudinal IRT parameters. RS-Total, linked to the individual patient disease status (θi), were simulated using the final IRT parameter estimates. The derived relationship between θ and RS-Total ( Fig. 1) was used as a basis in the simulations. These stochastic simulations included parameter uncertainty from the estimated asymptotic variance-covariance matrix of the estimates by using the $PRIOR functionality in NON-MEM. Specifically, the NWPRI subroutine was used where prior of fixed and random effects were assumed to be normally and inverse Wishart distributed, respectively. Degrees of freedom for the inverse Wishart distribution were calculated based on standard error (SE) of the obtained parameter estimates (17). RS-Total were simulated (N = 2000) over a period of 24 weeks for each treatment arm, using a large number of virtual subjects (Nsubj = 15,000 per arm). The individual average RS-Total at 4-week intervals was calculated and subtracted from the individual RS-Total at baseline. The distribution of CFB in RS-Total for each simulation (including 15,000 subjects) are illustrated in Fig. 1. The median, 2.5th, and 97.5th percentiles of the average CFB from each distribution were used to represent mean (95%CI) CFB in RS-Total (Fig. 1). These IRT derived values were compared with those (published values) obtained using the MMRM analysis (9).

Sample Size
The relative sample size (N) of a study analyzed using MMRM versus IRT can be calculated based on the precision (CI) of estimates for MMRM (95%CI length -CIMMRM) and IRT (95%CI length -CIIRT) for the same sample size as shown in Eq. 4. An N larger than one indicates that a MMRM analysis requires a larger study size to achieve the same precision as an IRT analysis.   Table II. One-thousand six-hundred forty-three (91%) patients provided data up to at least week 24 with a median (range) missing days of 5 (0-101), whereas 158 patients (9%) stopped filling out the questionnaire after 84 (1-160) days with 3 (0-86) missing days.

IRT and Simulations Including Parameter Uncertainty
ICF parameters were estimated with good precision (Supplementary Table I). Item characteristic curves that illustrate the relationship between disease status and probability of giving a certain score for all items are shown in Supplementary Fig. 3. Addition of an offset drug effect value and correlation between θi,t = 0, Rmax and offset improved the description of data. The offset drug effect (Eq.5) value, that is different between treatment arms, was assumed to be normally distributed with IIV modeled using an additive function.
Different parameters per arm were estimated with a median (range) relative standard error (RSE) of 0.13 (0.02-0.75) (Table III). This model also showed an acceptable fit to the total score data for both treatment arms. Specifically, the model predicts satisfactorily the median observed total score data, as seen with agreement between     Fig. 2. According to the IRT, in the BUD/FOR arm, the CFB in RS-Total score did not achieve the MCID at any time point, whereas in the FF/UMEC/VI arm, the mean CFB in RS-Total achieved the MCID from week 9 onwards, which is in agreement with the observed data (Fig. 2). NON-MEM control stream and snippet of data for both models ICFs and longitudinal are provided in the Supplementary material. Based on simulations including parameter uncertainty, the IRT considerably improved the precision of the drug effect compared to the MMRM at every time point (Fig. 3 and Table IV

Discussion
Using data from a Phase III study, a longitudinal IRT improved the precision in the efficacy endpoint compared to MMRM. The use of NLME analysis based on itemlevel data (IRT) was recently proposed as an alternative to MMRM for evaluation of efficacy in the analysis of data from a Phase II study, where IRT improved the precision of the estimated drug effect considerably in comparison to MMRM (24). The benefits of using an IRT over other statistical methods such as least-square mean analysis have already been demonstrated, highlighting its higher power to detect a drug effect (13,25). This could be explained by the fact that IRT is a more informative approach that not only describe the longitudinal aspects of the data but also accounts for information from the scores given to the individual items of the questionnaire. In contrast, MMRM ignores this information in the data since i) it is an analysis of total score data only and ii) each visit is modeled independently as a factor with earlier time points contributing less information about the end-of-treatment response than what is the case for NLME model. MMRM analysis is designed for robustness rather than efficiency (12), hence it could be expected that IRT analysis is more efficient than MMRM. However, this study provides an insight into the magnitude of improvement in precision that an IRT model may provide and the consequences this can have on the required sample size.
The primary objective of the Phase III study analysis is the confirmation of efficacy (and safety) profiles of an investigational new drug. Overall, an increased precision around the efficacy estimate was observed with IRT compared to  MMRM. This analysis highlights that a smaller sample size would have been required to confirm the drug effect if a NLME model-based approach were to have been used instead of the MMRM approach. A median relative sample size over the 4-week intervals of 4.0 (FF/UMEC/VI) and 4.9-fold (BUD/FOR) larger would have been required for the MMRM analysis to achieve the IRT precision. These values are comparable with a 3.5-fold smaller study size with IRT compared to MMRM analysis using RS-Total data from a Phase II trial (24). This is particularly important in cases where a large number of patients cannot be recruited for pivotal trials (e.g. rare diseases) with presumably a clearer benefit of using the IRT approach due to the increased utilization of the information contained in the item-level data. Thus, the increased precision in the efficacy estimate with IRT can lead to a higher probability of making a correct decision which appears important in light of the many failures in Phase III attributed to underpowered trials.
This analysis illustrates how simulations using an IRT with parameter uncertainty included can be used to obtain the precision of the primary efficacy endpoint. A large number of virtual subjects (15,000) were used in the simulations to make sure that uncertainty comes mainly from the model parameter estimates rather than the sample size. The SEs associated with each parameter estimates were obtained from the variance-covariance matrix in NONMEM. To further investigate how SEs obtained from different techniques such as bootstrap or sampling importance resampling would have impacted the precision in the efficacy endpoint was not in the scope of this study; however, it would be of interest to investigate how this would affect the calculated sample size.
This analysis is not exempt of limitations, which are described as follow. A predetermined (decided before start analyzing the data) longitudinal model (Weibull function) was considered; however, the addition of an extra parameter (offset) was required to describe the rapid onset of bronchodilator effects in both treatment arms, and thus improve the predictive performance of the model (based on the VPC plot). The absence of this parameter in the model would have led to a poorer description of the data, which is not necessarily related to a worse or better precision and/or relative sample size (26). The authors acknowledge that, in order to use a pharmacometric model as primary analysis in confirmatory trials, this model need to be pre-specified to avoid the risk of type 1 error inflation due to multiple testing during model building. One benefit of MMRM in this aspect is its flexibility and the fact that can adequately be pre-specified, though some assumptions are still required (12). Furthermore, in this analysis, model uncertainty (e.g. uncertainty from the structural part of model or from the random effects) and its impact on the precision around the efficacy endpoint was also not investigated. To mitigate model uncertainty in a NLME model-based analysis, it has been proposed to conduct model averaging (27,28). Model averaging would cover the model space by assigning a goodness-of-fit derived weight to the different proposed structural models, thereby including model uncertainty for a pre-specified analysis. Lastly, it can be argued that a simpler NLME model for the total score as one continuous endpoint could achieve similar results, in terms of precision and power, than an IRTbased model analysis when assessing treatment effect on the CFB of total scores. A formal comparison between these two types of analyses was not performed in this study; however, previous work has suggested that IRT-based models are more informative and require sample size that are approximately 20-40% smaller than analysis of total score data (13,(29)(30)(31). Despite these limitations, this analysis shows the advantage of NLME analysis with a direct comparison of the same primary endpoint for the two methods (IRT and MMRM) using observed clinical trial data rather than focusing on drug effect parameters which are often unobserved. Based on simulations, it has been shown already that a NLME analysis can be more powerful than MMRM in some (albeit not all) scenarios (32).

Conclusions
The positive impact of using a NLME model-based approach in decision-making during drug development has already been shown (4,33). This analysis shows the advantage of using a NLME model based on item level data over a standard approach used today in drug development (MMRM) for the same endpoint, increasing the precision in the efficacy estimate and thereby significantly reducing the required sample size to confirm drug effect.
ACKNOWLEDGMENTS AND DISCLOSURES. The authors would like to acknowledge Maggie Tabberer for her contributions to the data analysis and interpretation. Trademarks are owned by or licensed to EVIDERA. CL-P, ELP and MOK declare that they have no conflict of interest. CA, SY and MB are GSK employees and hold GSK shares.
Funding Statement GSK funded this research in the form of a Research payment to Uppsala University. Funding for this research was also provided by the Swedish Research Council Grant 2018-03317.
Author Contributions CL-P, ELP and MOK designed and carried out the analysis. SY, CA and MB were responsible for providing the data and review the analysis. CL-P drafted the manuscript, which were critically revised and received final approval by all authors.
Funding Open access funding provided by Uppsala University.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.