A Novel Method for Analysing Frequent Observations from Questionnaires in Order to Model Patient-Reported Outcomes: Application to EXACT® Daily Diary Data from COPD Patients

Chronic obstructive pulmonary disease (COPD) is a progressive lung disease with approximately 174 million cases worldwide. Electronic questionnaires are increasingly used for collecting patient-reported-outcome (PRO) data about disease symptoms. Our aim was to leverage PRO data, collected to record COPD disease symptoms, in a general modelling framework to enable interpretation of PRO observations in relation to disease progression and potential to predict exacerbations. The data were collected daily over a year, in a prospective, observational study. The e-questionnaire, the EXAcerbations of COPD Tool (EXACT®) included 14 items (i.e. questions) with 4 or 5 ordered categorical response options. An item response theory (IRT) model was used to relate the responses from each item to the underlying latent variable (which we refer to as disease severity), and on each item level, Markov models (MM) with 4 or 5 categories were applied to describe the dependence between consecutive observations. Minimal continuous time MMs were used and parameterised using ordinary differential equations. One hundred twenty-seven COPD patients were included (median age 67 years, 54% male, 39% current smokers), providing approximately 40,000 observations per EXACT® item. The final model suggested that, with time, patients more often reported the same scores as the previous day, i.e. the scores were more stable. The modelled COPD disease severity change over time varied markedly between subjects, but was small in the typical individual. This is the first IRT model with Markovian properties; our analysis proved them necessary for predicting symptom-defined exacerbations. Electronic supplementary material The online version of this article (10.1208/s12248-019-0319-9) contains supplementary material, which is available to authorized users.


INTRODUCTION
Chronic obstructive pulmonary disease (COPD) is an inflammatory disease of the lung, characterised by airflow obstruction that progresses with time. The most important risk factor associated with COPD is considered smoking, but risk factors also include other exposures (e.g. air pollution, occupational dusts and chemicals) and host factors, such as α1-antitrypsin deficiency (1). COPD is associated with emphysema and mucus hypersecretion, and its progression is punctuated with acute periods of a temporary increase in symptoms, also called exacerbations (2,3).Historically, exacerbations are defined by a clinic visit or hospitalisation with medical treatment (clinically confirmed); however, recently, questionnaires have been validated as useful for symptomdefined exacerbations (4)(5)(6). Exacerbations contribute to an accelerated decline of pulmonary function, higher risk of cardiovascular events (7) and worse quality of life (8) and are a major cause of COPD-related hospital admissions, morbidity and mortality; therefore, also increasing healthcare costs. Approximately 174 million people had COPD in 2015 (9), and around three million die from it every year (10). The disease burden of COPD is third worldwide (11,12), but may even increase in the future, due to an ageing society.
For slowly progressing diseases, such as COPD, clinical trials that investigate symptom-based exacerbations and/or disease progression are often relatively long (e.g. a year, or more), making collection of multiple observations possible. Modelling techniques facilitate using all the information from the repeated measurements and can therefore provide additional insight over and above the output from traditional statistical methods, where typically only end-of-treatment information is used. Such longitudinal analysis has the advantage of quantifying any trend in response and may therefore be helpful in describing and predicting future exacerbations, which may be especially useful in early clinical trials. Most models published so far are either Markov models, focusing on the cost-effectiveness of the COPD treatment intervention (13)(14)(15)(16)(17), or logistic regression models, predicting the probability of exacerbation within the next 24 months (18) or COPD-patient hospital admissions (19). Some linear and non-linear regression models were also developed, aiming to predict disease progression (20)(21)(22). To our knowledge, none of the models published so far described the longitudinal COPD progression using daily patientreported-outcome (PRO) data, which might be able to reflect symptoms earlier than, for example, clinicians' reports (23). A model that would use daily PRO data in its entirety to predict changes in patient disease severity would therefore be valuable in assessing disease progression.
PRO data generally reflect health status reports that come directly from the patient and are being increasingly used to inform clinical decisions and assess improvements in a patient's health status, and also in drug development (23)(24)(25). The standard method for analysing longitudinal PRO data from questionnaires (26) is the total instrument score, calculated from the individual item scores; hence, a single continuous variable is analysed. An alternative approach is to use PRO data in its entirety by developing a longitudinal mixed-effects item response theory (IRT) model (26), where the contribution of each individual item score is modelled separately over time, but related to a common underlying hypothetical latent variable (i.e. in IRT terminology and hereafter referred to as 'disease severity'), which varies between individuals and over time (27). This can help one understand change in disease severity over time (i.e. symptom-based disease progression), particularly in early clinical development. Furthermore, as each individual item score is modelled separately giving rise to a unique 'item characteristic curve' (ICC), IRT modelling can provide knowledge on the item that is most informative for a population with a certain disease severity, which can be valuable information for clinicians applying the questionnaire.
Frequent observations of categorical data make the presence of the Markovian elements, typically identified as many consecutive same-item-score observations, likely. With daily data collection in studies, through the use of electronic devices, such correlations are likely to be manifested. In previously published implementations (28)(29)(30)(31)(32)(33)(34)(35), IRT models assume independence under the structural model. This means that the probability distribution of outcomes are driven by the latent variable, and the ICCs only and observations, even when close to each other in time, are independent realisations. This is found often to not be the case with subjective scoring, and models to account for the dependence between consecutive observations in the data would be valuable.
In the present work, we analysed the PRO data, collected in the Acute Exacerbation and Respiratory Infec-tionS in COPD (AERIS) study (36,37), using the EXAcerbations of COPD Tool (EXACT®) (38,39), with two aims: (i) to create a model that can be useful as a basis for assessing disease progression including predictions of symptom-defined exacerbations and effect of treatment on these clinical trials, and (ii) to extend the IRT modelling methodology to incorporate the situation where underlying data display Markovian features.

Data
The EXACT® PRO data were collected in the AERIS study (36,37), a prospective 2-year longitudinal observational study where adult and elderly patients were only receiving standard-of-care treatment, no investigational drug. The study was conducted at the Southampton General Hospital, UK (clinicaltrials.gov number NCT01360398). Details of study conduct and inclusion and exclusion criteria are reported elsewhere (36,37,40). Only the first-year data were available and analysed in the present work. The baseline characteristics of the patients included in the analysis were the following: age, gender, educational and professional status, smoking status, severity of COPD according to the Global Initiative for Chronic Obstructive Lung Disease (GOLD) staging, number of years with COPD, forced vital capacity (FVC), forced expiratory volume in the first second (FEV1).
The EXACT® questionnaire, completed on an electronic diary, was used daily to record patients' answers to the 14 questions (9 with 5 ordered categorical response options (hereafter referred to as 'categories') and 5 with 4 categories, see Table I), to collect data on symptoms suggestive of an exacerbation of COPD. A higher item score indicated a more severe symptom, for example 0 (not at all), 1 (slightly), 2 (moderately), 3 (severely) and 4 (extremely). Due to the design of the study (specifically the device including the electronic questionnaire), partial completion of the questionnaire was not possible, but a missing day where a subject gave no answer to any of the items was possible. No data were excluded from the analysis. Full question for each item from the questionnaire is provided elsewhere (41). A higher score denotes a more severe symptom

Model Building
An IRT framework was combined with the Markov component, and the joint model was fitted to all available data simultaneously. For ease of model explanation, we firstly focus on the IRT part and then on how it is connected to the Markov part of the model.
An IRT model was developed to relate the responses from each individual item (items were ordered categorical data with 4 or 5 categories (Table I)) to an unobserved latent variable (42), i.e. the underlying COPD disease severity. All longitudinal data were used when modelling and also when establishing the item characteristic curves.
A logistic transformation was used, where the (steadystate) probability P of a subject i reporting a response at or above category (or, item-score) k is expressed as follows (Eq. 1). The steady-state probability of an item response being exactly k was given as described in Eq. 2.
where D i is the disease severity of subject i, and a j and b j are parameters specific to item j; more specifically, a j is the slope (or discrimination parameter), and b j,k is the difficulty parameter for the item-score k, which was constrained to be increasing for increasing scores of the same item, namely b j,k + 1 ≥ b j,k . The parameter describing the COPD severity (D i ) was not bounded and was assumed to follow a normal distribution N(0,1) at baseline (D i,t = 0 ). The longitudinal changes in the COPD progression were modelled using a linear change in disease severity, as shown below in Eq. 3.
where slope i is a subject-specific parameter with interindividual variability assumed to follow a normal distribution, and t is time in days.
To allow for the lack of independence between neighbouring observations, the Markov models were used on an individual item level. First-order MM was assumed, i.e. the next observation depended only on the current observation. The distribution of probabilities across the 4/5 categories were described using a set of 4/5 corresponding states. Changes in probabilities with time were described using a set of 4/5 ordinary differential equations (ODEs) where firstorder transfer constants governed the changes in probability between adjacent states. As previously described (43), a reparameterisation allowed the Markovian component across all state transitions to be described in a single parameter, the mean equilibrium time (MET), which is the time when the dependency between observations does not change anymore. As the Markovian features are expressed through one parameter only, and because time is treated dynamically, this type of model has been referred to as a 'minimal Continuous Time Markov Model (mCTMM)' (43). The probability of the first observed score of an item was estimated as P ss (Y ij = k) at steady state given by Eqs. 1 and 2 above, i.e. without an assumption on the previous score. For all subsequent observations, the probabilities were reset immediately after each observation to represent the known distribution of probabilities, i.e. 1 for the currently observed state and 0 for all other states. The relation between MET and the rate constants (λ) and the steady-state probabilities (P ss ) is given in Eqs. 4-6 below. Since transition times and rate constants are in inverse relationship, this means that as MET increases, the λ decreases, and therefore, the probability of transitions also decreases. In other words, this means that with an increase in the MET parameter, the Markovian properties of the system are becoming more important.
The ODEs used to describe the model are provided in Eqs. 7-11, with simplified notation for probabilities.
where, in the case of items with only 4 possible scores, rate constants λ (k = 3,k = 4) and λ (k = 4,k = 3) were assumed to be 0. In the model, it was assumed that there was interindividual variability in MET expressed through an exponential distribution. A schematic representation of the IRT and Markov model(s) is shown in Fig. 1.
To obtain informative initial estimates for the parameters in the 4/5-compartment ODE model, an analytical solution (AS) for a 3-compartment ODE system was used. The data from each item were merged into 3 categories in 3 different combinations ((a) 0, 1, 2-4; (b) 0-1, 2, 3-4; and (c) 0-2, 3, 4), then fitted with the 3-compartment AS models, and the final estimates used as initial values in the 4/5compartment ODE model.
To confirm that the Markov elements were needed in the model, a model without the Markov elements was tested, by fixing MET to a small value (0.1 days). Furthermore, itemspecific MET, time-dependent MET (in a linear and a power fashion), and a Box-Cox transformed MET interindividual variability were tried as extensions to the model.  (46) and R packages, such as, dplyr (47), Xpose4 (48) and ggplot2 (49) were used for data management, summary statistics and graphical examination of NONMEM outputs.

Model Discrimination and Evaluation
To discriminate between the models and evaluate the final model, goodness-of-fit and visual predictive checks (VPC) were used, and the uncertainty on the model parameters was inspected. The models were also compared according to the objective function value (OFV) provided by NONMEM, where the difference in OFV (ΔOFV) is χ 2 distributed, i.e. a ΔOFV > 3.84 corresponds to p < 0.05, for a one degree of freedom difference between two models. For each VPC, 1000 datasets were simulated using parameter estimates from the model, and 95% confidence intervals around key percentiles were computed. VPCs were produced on the individual item-score level, stratified and non-stratified by individual items, and on the total score level (versus time on study, and versus age), also stratified by covariates, such as gender. The total score was obtained as specified in the EXACT® manual (41), i.e. a 0-100 logit transformation was used. Additionally, a VPC of transitions was made, taking into account both the current and the previous state in order to evaluate the Markov part of the model.

Model Application
To quantify the correlation between items, item-specific residuals (RES ij ) were calculated as specified in the following Eqs. 12 and 13: where DV ij is the response of subject i to item j, and IPRED ij is the corresponding weighted prediction (32). Furthermore, the Fisher information (i.e. the second derivative of the log-likelihood) (50) for each item was obtained, in order to examine which item provides the most information for the following: (a) a typical subject from this study, (b) the 'healthier' part of the population (i.e. 5th percentile) and (c) the 'sicker' part (i.e. 95th percentile) of the population.

Simulations
To further evaluate the model, we simulated clinical outcomes used in COPD treatment, such as a symptom-defined exacerbation event. Due to complexity, a simplified version of the definition of a symptom-defined event from the EXACT® manual (41) was used. Specifically, a symptom-defined exacerbation event was defined as an increase in total score over baseline of at least 12 points over 2 days, or of at least 9 points over 3 days; if baseline was zero, there was no event. Baseline was reset every 4 weeks. The baseline for the first block of 4 weeks was determined as the mean total score in the first week of the study, and for subsequent 4-week-block baselines, the mean total score of the last week of the previous 4-week block was used. If a subject completed the questionnaire for fewer than 4 days in the 'last week', baseline was not reset. The cumulative proportion of subjects that already had a symptom-defined exacerbation event was computed for both observed and simulated data, and the agreement between model-predicted (from n = 100 simulations) and observed symptom-defined exacerbations was assessed.

Data
Data from 127 COPD patients (median age 67 years, 54% male, 39% current smokers at study initiation) were available and included approximately 40,000 observations per item. Baseline characteristics of the study participants are presented in Table II. Before the end of the first year, 36 subjects (28%) stopped filling in their questionnaire, 22 of which discontinued the study due to reasons not directly related to the disease severity (37). Five subjects had no missing days (i.e. days when they did not fill in the e-questionnaire), and the remaining 122 subjects had a median  (range) 13 (1-221) total missing days. The median (range) consecutive missing days for all subjects was 3 (0-74) days.

Modelling
An IRT model with 14 (4/5-compartment) the Markov submodels (supplementary material) was successfully fitted to the data (Figs. 2, 3 and4). The model without the Markov elements performed much worse than the model with the Markov elements, i.e. it was not able to predict several transitions as seen on the transition VPC (Fig. S1, supplementary material).
The base model was an IRT model with the Markov elements, and a single non-changing MET parameter. Among the extensions to the model, linear time-dependency on MET improved the visual diagnostics the most and resulted in the biggest OFV drop (ΔOFV = 8727, compared to the base model) among the tested models. For example, when using the model with 14 item-specific METs, the visual diagnostics did not improve, and the OFV drop was lower (ΔOFV = 4678); similarly for the models with a Box-Cox transformation of the variability on MET (ΔOFV = 13.7), or a power time-dependency on the MET parameter (ΔOFV = 5902), all compared to the base model. Therefore, the model with a linear time-dependency on MET was chosen as the final model.
Final parameter estimates are presented with uncertainty in Table S2 (supplementary material). The mean (standard error) equilibrium time was estimated as 1.2 (0.07) days at the beginning of the study (i.e. day 0), and 5.1 (0.57) days at the end of the study (i.e. day 365). The mean (standard error) slope on disease severity representing disease progression was 0.007 (0.08) latent variable scale units per year, with substantial interindividual variability, 122 %CV (Table S2).
Visual predictive checks on the item score level showed satisfactory fit to the data ( Fig. 2; stratified by an individual item, Figures S3a-d, supplementary material). A VPC of transitions is shown in Fig. 3, and a VPC of the total scores plotted against time in Fig. 4 (also stratified by gender, and versus age are shown in Figures S4, and S5,

Model Application
The correlation plot showed quite a strong correlation among some of the items, specifically, between items 1, 5 and 6, items 7 and 11 (Fig. 5) and also between items 2 and 3, corresponding to different domains in the EXACT® questionnaire (Table I). Additionally, a relatively high correlation was also observed between two items from different domains (Table I), specifically item 12 and 13 (Fig. 5).
The items providing the most information in characterising the typical patient (i.e. patient with the typical disease severity) at baseline were 'Breathless-personal care' and 'Breathless-indoor activities' (Fig. 6). Some other items, however, proved to be less informative (e.g. 'Sputum difficulty') (Fig. 6). Also, in general, all items (except item 3, 'Mucus quantity') appeared to be more informative for the 'sicker', or a typical subject from the studied population, compared to the 'healthier' part of the population (Fig. 7). The item characteristic curves (ICC) for all 14 items are shown in Fig. S6 (supplementary material), and the values of the ICC parameters are reported in Table S2.
A comparison of model-simulated and observed cumulative proportion of subjects who already had a symptomdefined exacerbation event, from a model with the Markov elements and a model without the Markov elements is shown in Fig. 8 and indicated that the addition of the Markov elements improved model performance.

DISCUSSION
A combination of an item response theory model and 14 item-specific longitudinal Markov models was successfully developed for the first time to our knowledge. This integrated modelling approach proved to be able to describe frequently collected and therefore correlated composite score data, as was exemplified here using daily EXACT® patient reported outcome data from patients with COPD receiving standard of care only.
The importance of developing a model, where IRT is used together with the longitudinal Markov elements, is twofold. Firstly, using the IRT methodology has several advantages over the total-score approach. For example, as it uses all individual item scores and not just a total score, it prevents information loss, which might otherwise result in model misspecification. Additionally, it does not ignore the categorical nature of the data, which occurs when modelling the total score as a single continuous variable. Secondly, by including the longitudinal Markov elements (on an item level), the developed model also provides a way to describe the dependence between frequently collected longitudinal data such as those obtained nowadays from patient-reported diaries which are now filled in at home, using an electronic device.
The mean equilibrium time was estimated longer than a day (1.2 days at start of the study, and 5.1 days at the end of the study, Table S2), and a clear misfit of the model without the Markov elements was observed (Fig. 8, Fig. S1), which both confirmed that the addition of the Markov elements was needed. The values of the MET parameter estimates also show that the variability in patients reporting outcomes was decreasing with time on study, since a higher MET value indicates fewer transitions, i.e. more stable scores. This might be because when a patient is recruited to a study, they might be more compliant with their medications and have more healthcare interactions.
The graphical evaluation of the model showed that the final model can describe the data adequately. More specifically, the model was able to describe the proportions of itemscores ( Fig. 2 and Fig. S3a-d), the proportions of transitions between previous and current item-scores (Fig. 3), and also the total scores (Fig. 4). There were some underpredictions in the situation when an item score would not change from 1 (Fig. 2), and also in the median total scores (Fig. 4); however, these misspecifications were not that apparent, with the model adequately capturing the time trends and the 95% prediction intervals, and none of the additional changes to the model that were tested provided an improvement.
When correlations between items were investigated, it was confirmed that the grouping of the items in the COPD disease domains (as captured in the EXACT® questionnaire) was appropriate, since the correlation between items 1, 5 and 6; items 7 and 11; and items 2 and 3 (belonging to the chest domain, breathlessness domain and cough and sputum domain, respectively, Table I) were much stronger than between other items (Fig. 5). Additionally, although not belonging to the same domain, items 12 and 13 also showed to be correlated, which can be expected, given their description-i.e. 'Tired or weak', and  (Table I).

COPD severity
Fisher Information Fig. 6. The Fisher information content per item, plotted against chronic obstructive pulmonary disease (COPD) severity, represented in percentiles of the data (e.g. 50 indicates a typical subject from this population). Grey area is the 95% confidence interval around the baseline disease severity. More detailed item descriptions are given in Table I. 'Disturbed sleep', respectively (Table I). Since there were items belonging to obvious domains, one could argue that several latent variables should be included in the model; however, in our example, this was not possible due to too few items belonging to separate domains.
Since visual diagnostics confirmed that the model could describe the data satisfactorily, i.e. there were no major misspecifications, further potential improvements to the model were not investigated. For example, we used minimal Markov models, meaning that the MET was assumed to not differ between compartments, and a dropout model was also not investigated. Additionally, we did not evaluate any of the available covariates in this work, nevertheless, the model was able to describe the data when stratified by gender (Fig. S4) and also when plotted against age (Fig. S5), indicating that these covariates might not provide an improvement to the fit. Another potential limitation might be that we did not have information on whether the patients were at any time admitted to a hospital and treated with another drug (not standard of care), which may affect their disease progression.
In our model, continuous time Markov models were used, where, in contrast with discrete time Markov models, the probability of transition can change according to the time difference between two consecutive observations. Therefore, if a subject had missing observations for a study day, this did not affect the fit.
When the final model's ability to predict symptom-defined exacerbations was tested, the model could predict the general trend of the cumulative proportions of the subjects with an exacerbation; however, some overprediction was observed when the study day was greater than approximately 180 days (Fig. 8), which might be due to fewer study participants continuing to report their symptoms. However, the model with the same structure without the Markov elements greatly underpredicted The Fisher information content for the 'healthier' (5th percentile, white circles), 'typical' (50th percentile, grey circles) and 'sicker' (95th percentile, black circles) parts of the studied population, for each item (Table I). the symptom-defined exacerbations even from the beginning of the study. The poor performance of the model to predict symptom-defined exacerbations without the Markov elements can be understood from the definition of such exacerbation, i.e. the total score had to be increased for at least two or three consecutive days. As the variability in transitions between scores was exaggerated in the model without the Markov elements (Fig.  S1), a stable increase in a score did not occur; hence, a symptomdefined exacerbation was not defined.
Future work could include adding a treatment effect to the model, and although the model was developed using EXACT® data, it could also be applied to daily data from other questionnaires. With a treatment effect added to the model, the model could be utilised for simulation of study designs in terms of evaluating treatment effects, sample size and duration. Additionally, the effect of patients' characteristics on response over time, especially important in early drug development, could also be investigated and perhaps help shorten proof-of-concept studies.

Conclusion
A longitudinal mixed-effects IRT model with Markov elements was developed for the first time to our knowledge and applied to real data from an observational study. The model was able to handle both composite scores and frequent observations, which was exemplified in this analysis using COPD item score and symptom-defined exacerbation data from the EXACT® questionnaire. The developed model also showed it could serve as a platform model for predicting symptom-defined exacerbations.