Introduction

Glucagon-like peptide-1 receptor agonists (GLP-1 RAs) are often recommended for treatment of type 2 diabetes (T2D) [1]. Medications in this class have demonstrated efficacy for glycemic control, along with a low risk of hypoglycemia and the potential benefit of weight loss [2,3,4,5]. The injectable GLP-1 RAs vary in terms of injection devices and treatment administration procedures, which could have an impact on patient preference.

Therefore, two patient-reported outcome (PRO) measures have been developed to assess patient perceptions of injection devices used to administer these non-insulin injectable medications: the Diabetes Injection Device Experience Questionnaire (DID-EQ) and the Diabetes Injection Device Preference Questionnaire (DID-PQ) [6]. The DID-EQ was designed to assess perceptions of a single injection device, and it has demonstrated reliability and validity in patients treated with GLP-1 RAs [7]. The DID-PQ was designed to assess preference between two non-insulin injection devices. This questionnaire has been used in two previous studies [7, 8]. In both studies, however, it was completed by a relatively small subset of patients who had used two non-insulin injection devices (n = 27 and n = 58). Therefore, it was not possible to draw conclusions about construct validity of the DID-PQ from these previous datasets.

In a recent crossover study with a larger sample, people with T2D performed mock injections with two non-insulin injection devices, and all participants completed the DID-PQ to report preferences between the devices [9]. Data from this study provide the first opportunity to examine performance of the DID-PQ in a larger sample. The purpose of the current analysis was to assess construct validity of the DID-PQ and demonstrate one way to test whether there is a significant preference for one injection device over another.

Methods

Study design

Data were from an open-label, multicenter, randomized, crossover study (ClinicalTrials.gov identifier: NCT03724981) [9, 10] assessing patient preference for the dulaglutide single-use pen [11] and the semaglutide single-patient-use pen among injection-naïve patients with T2D [12]. The devices used in the study were those commercially available in the United States. The study design is illustrated in Fig. 1. Study participants were recruited at 13 clinical sites across the US, including nine general practice clinics and four endocrinology clinics. After providing consent to participate in the study, participants were randomly assigned to one of the two device orders (i.e., either dulaglutide or semaglutide first, followed by the other device). After being trained to use each device based on device instructions for use (IFU), participants performed all steps of injection preparation and administered mock injections into an injection pad. Further details of the study design, inclusion/exclusion criteria, and methods have been published previously [9].

Fig. 1 Crossover Study Design

Measures

After completing training and performing mock injections with both devices, participants completed the measures described below. Both questionnaires were administered on paper forms and used the brand names (Trulicity for dulaglutide; Ozempic for semaglutide). The questionnaires included color images of the injection devices at the top of the page to avoid any confusion regarding which device corresponded to each question and response option.

Global preference item

The global preference item evaluated patient preference between the devices. The item asked “Overall, which device do you prefer?” Response options were Ozempic, Trulicity, or No Preference. All participants completed the global preference item before completing the DID-PQ.

Diabetes Injection Device Preference Questionnaire (DID-PQ)

The DID-PQ was designed to assess patient preferences between two non-insulin injection devices [6, 7]. The 10 questionnaire items were developed based on qualitative research with patients. Items 1 to 7 focus on preference related to specific characteristics of injection delivery systems. Items 8 to 10 are global items assessing preference based on overall satisfaction, ease of use, and convenience of the injection devices. Each item has five response options, which allow respondents to indicate that they prefer or strongly prefer one device, prefer or strongly prefer the other device, or have no preference. Because the five response options are categorical, mean scores are not calculated.

Statistical analysis

Analyses were performed using data from participants who had (1) been randomized to a device order, (2) been exposed to both devices regardless of whether they successfully completed the mock injection, and (3) completed the global preference item. No imputations were performed for missing data. All statistical tests were two-sided with a significance level of 5%. Descriptive statistics (mean, standard deviation, range, and frequency) were used to summarize demographic and clinical characteristics, as well as responses to questionnaires.

The categorical response options of the DID-PQ cannot be treated as continuous scores. Therefore, correlations with a criterion measure that would typically be conducted to examine construct validity of PRO instruments cannot be used. Instead, the 10 DID-PQ items were compared to the global preference item using categorical analyses so that concordance between the two instruments could be assessed. For these analyses, the five DID-PQ response options were collapsed into three categories by combining the “prefer” and “strongly prefer” response options. Thus, the DID-PQ and global preference items had the same three levels of response: prefer dulaglutide device, prefer semaglutide device, and no preference between devices.
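The collapsing step described above can be sketched in a few lines of Python. This is a minimal illustration only: the response labels, the `COLLAPSE` mapping, and the function name `collapse_item` are hypothetical and are not taken from the study's analysis code or from the wording of the questionnaire.

```python
# Hypothetical labels for the five DID-PQ response options (illustrative
# only; the questionnaire itself used the brand names of the devices).
COLLAPSE = {
    "strongly prefer dulaglutide": "dulaglutide",
    "prefer dulaglutide": "dulaglutide",
    "no preference": "no preference",
    "prefer semaglutide": "semaglutide",
    "strongly prefer semaglutide": "semaglutide",
}


def collapse_item(responses):
    """Collapse five-option responses for one item into three levels.

    Missing responses (None) are kept as None, since no imputation
    is performed for missing data.
    """
    return [None if r is None else COLLAPSE[r] for r in responses]


# Toy example with made-up responses from four participants
example = ["strongly prefer dulaglutide", "prefer semaglutide", None, "no preference"]
print(collapse_item(example))
```

After this step, each DID-PQ item has the same three levels as the global preference item, so the two can be cross-tabulated directly.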

These three-level responses were compared to responses on the global preference item in three ways: (1) percent agreement, (2) Gwet’s AC1 statistic [13, 14], and (3) the prevalence-adjusted and bias-adjusted Kappa (PABAK) statistic [15]. Gwet’s AC1 and PABAK were used to assess concordance instead of the traditional Kappa statistic because Kappa is sensitive to uneven data distributions [16]. For example, when agreement is high but responses are unevenly distributed across the response options (e.g., one response option has high prevalence), Kappa may not accurately represent concordance [16]. Gwet’s AC1 is similar to Kappa, but it uses a different definition of chance agreement, with the more realistic assumption that only a portion of the observed ratings will potentially lead to agreement by chance [13]. Thus, it is more robust to an uneven distribution of data. The PABAK statistic incorporates both a bias index and a prevalence index into its estimate of chance agreement, thereby mitigating potential effects of rater bias and overall prevalence [15]. The Gwet’s AC1 and PABAK statistics were interpreted using benchmarks commonly applied to agreement statistics. For example, values over 0.80 are thought to indicate “almost perfect” or “very good” agreement [17, 18].
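As a rough sketch of how the three concordance measures relate to one another, the following Python function computes them for one DID-PQ item against the global preference item. It assumes the commonly cited two-rater form of Gwet’s AC1 and the multi-category generalization of PABAK, (q × Po − 1) / (q − 1), where q is the number of response categories and Po is the observed proportion of agreement; the study's exact computational details are not given in the text, so these formulas are assumptions. The function name, the category labels, and the toy data are hypothetical. Cohen’s Kappa is included only for comparison.

```python
from collections import Counter

# Hypothetical labels for the three collapsed response levels
CATEGORIES = ("dulaglutide", "semaglutide", "no preference")


def agreement_stats(item, global_item, categories=CATEGORIES):
    """Percent agreement, Gwet's AC1, PABAK, and Cohen's Kappa for paired ratings.

    Pairs with a missing value (None) on either measure are dropped.
    """
    pairs = [(a, b) for a, b in zip(item, global_item)
             if a is not None and b is not None]
    n = len(pairs)
    q = len(categories)

    # Observed proportion of agreement
    po = sum(a == b for a, b in pairs) / n

    item_counts = Counter(a for a, _ in pairs)
    global_counts = Counter(b for _, b in pairs)

    # Gwet's AC1: chance agreement based on the average marginal proportions
    pi = [(item_counts[c] + global_counts[c]) / (2 * n) for c in categories]
    pe_ac1 = sum(p * (1 - p) for p in pi) / (q - 1)
    ac1 = (po - pe_ac1) / (1 - pe_ac1)

    # PABAK (assumed multi-category form): chance agreement fixed at 1/q
    pabak = (q * po - 1) / (q - 1)

    # Cohen's Kappa: chance agreement from the product of the marginals
    pe_kappa = sum((item_counts[c] / n) * (global_counts[c] / n)
                   for c in categories)
    kappa = (po - pe_kappa) / (1 - pe_kappa) if pe_kappa < 1 else float("nan")

    return {"percent_agreement": 100 * po, "ac1": ac1,
            "pabak": pabak, "kappa": kappa}


# Toy data (not study data): nine of ten pairs agree, all on one option
toy_item = ["dulaglutide"] * 9 + ["semaglutide"]
toy_global = ["dulaglutide"] * 10
print(agreement_stats(toy_item, toy_global))
```

In this toy example, percent agreement is 90%, yet Cohen’s Kappa is 0 because nearly all responses fall in one category. AC1 (approximately 0.895) and PABAK (0.85) remain close to the observed agreement, which illustrates why these statistics are preferred when responses are unevenly distributed.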

To determine whether significantly more participants preferred one device over the other with regard to each item of the DID-PQ, comparisons between devices were performed according to the following steps: (1) participants who provided a neutral response for an item were dropped from analysis of that item; (2) for each item, responses were grouped into two categories (prefer dulaglutide device or prefer semaglutide device); and (3) a two-sided binomial test was performed to determine whether the difference in preference between the devices was statistically significant. This test assessed whether the proportion indicating preference for one of the two devices differed from 0.5. For each DID-PQ item, the null hypothesis was that the probability of preferring one of the devices was 0.5, which would indicate that an equal number of respondents preferred each device. If the binomial test yielded a significant p-value, then the null hypothesis could be rejected, which would mean that significantly more participants preferred one device over the other.
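The three steps above can be illustrated with a short Python sketch that uses only the standard library. It is not the study's analysis code: the function name, the response labels, and the toy data are hypothetical. Because the null probability is 0.5, the binomial distribution is symmetric, so the exact two-sided p-value is taken here as twice the smaller tail probability, capped at 1.

```python
from math import comb


def preference_binomial_test(responses, device_a="dulaglutide",
                             device_b="semaglutide"):
    """Exact two-sided binomial test of H0: P(prefer device_a) = 0.5.

    Neutral ("no preference") and missing responses are excluded, so
    only participants who expressed a preference are counted.
    Returns (n preferring device_a, n preferring device_b, p-value).
    """
    n_a = sum(r == device_a for r in responses)
    n_b = sum(r == device_b for r in responses)
    n = n_a + n_b
    if n == 0:
        raise ValueError("no participant expressed a preference")

    # Probability of a split at least as extreme as the smaller count,
    # doubled for a two-sided test (valid because the null is symmetric)
    k = min(n_a, n_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return n_a, n_b, p_value


# Toy data (not study data): 9 prefer one device, 1 prefers the other,
# 3 have no preference, and 1 response is missing
toy = ["dulaglutide"] * 9 + ["semaglutide"] + ["no preference"] * 3 + [None]
print(preference_binomial_test(toy))
```

With these toy data, 10 participants expressed a preference, and the p-value is 22/1024 (about 0.021), so the null hypothesis of equal preference would be rejected at the 5% level.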

Results

Sample characteristics

A total of 310 participants were included in the sample, with half (n = 155) randomized to each group (i.e., either dulaglutide or semaglutide device first). Detailed demographic and clinical information has been previously published for this sample [9], and a selection of participant characteristics is presented in Table 1.

Table 1 Demographic and Clinical Characteristics

Validity of the DID-PQ

There were minimal missing data on the DID-PQ, as shown in Table 2. There was strong concordance (percent agreement > 78%) between the global preference item and nine of the 10 DID-PQ items (Table 2). Percent agreement was particularly high (> 91%) for the three DID-PQ global items assessing preference related to overall satisfaction, ease of use, and convenience (items 8, 9, and 10). The only DID-PQ item that did not have strong concordance with the global preference item was item 6, which asks about preference related to needle size (percent agreement = 59.7%). The Gwet’s AC1 and PABAK statistics were consistent with percent agreement, with results indicating strong agreement between the global preference item and all DID-PQ items except item 6 (Table 2).

Table 2 Agreement Between DID-PQ Items and the Global Preference Item Assessing Preferences between Two GLP-1 Receptor Agonist Injection Devices (N = 310)

Significance testing of preferences between devices

For each item of the DID-PQ, a two-sided binomial test was performed to determine whether significantly more participants preferred one device over the other (Table 3). There was a statistically significant difference (p < 0.0001) in preference on every item of the DID-PQ with significantly more participants reporting a preference for the dulaglutide injection device.

Table 3 Significance Testing for Difference in Preference between Devices on Each Item of the DID-PQ (N = 310)

Discussion

Patient preference has been recommended as a “major factor driving the choice of medication” in a consensus report by the American Diabetes Association and the European Association for the Study of Diabetes [1]. To collect and interpret patient preference data, well-designed and valid measurement tools are needed. Current findings suggest that the DID-PQ may be a useful tool for providing insight into preferences of people with T2D using GLP-1 receptor agonists. While a single global item can be used to assess injection device preference, the DID-PQ can provide a more detailed assessment of factors contributing to this preference, including ease of use, convenience, overall satisfaction, and details of the injection experience.

Concordance with the global preference item supports the construct validity of the DID-PQ. Item 6 of the DID-PQ, which assesses preference related to needle size, had the lowest concordance (59.7% agreement). Although needle size is an important factor for some patients [6], this item may not have yielded consistent data because participants were injecting into an injection pad rather than injecting themselves. Therefore, they did not personally experience the feeling of injecting with either needle, and the factors that participants considered when responding to this question are unclear and may have varied widely. Future research involving actual injections rather than mock injections may be necessary to assess validity of DID-PQ item 6.

In addition to examining validity, the study provides a parsimonious and easily interpretable method for examining whether preference for one device over another is statistically significant (Table 3). This analysis approach excludes neutral (i.e., no preference) responses. For situations when it may be important to consider the number of neutral responses (which were relatively rare in the current study; Table 3), the Prescott test can be used to determine whether there was a statistically significant difference in preference while accounting for the frequency of respondents with no preference [19, 20]. The Prescott test was used in the original analysis of data from the current study, with similar statistically significant results favoring the dulaglutide device on all 10 items of the DID-PQ [9].

The structure of the DID-PQ does not allow for typical psychometric analyses, and the resulting limitations need to be considered. Unlike a PRO measure of symptoms or health-related quality of life, the items of the DID-PQ do not have ordinal response options ranging from lowest to highest on a particular construct, and item scores cannot be aggregated into subscales for analysis of continuous data. Instead, DID-PQ items yield categorical data representing preference. Therefore, it is not possible to assess internal consistency reliability with Cronbach’s alpha, test-retest reliability with intra-class correlations, or convergent validity with Spearman correlations. Furthermore, there are no generic instruments or validated gold standard criterion measures that could be used to assess construct validity. While the current categorical analyses support construct validity of the DID-PQ via comparisons to a single item assessing global preference, it is not possible to thoroughly investigate reliability or validity of the instrument using common psychometric methods. Future research with the DID-PQ may provide further confidence in its validity.

There may also be limitations associated with the mock injection procedures. Participants were trained on each device before performing a mock injection with each pen into an injection pad. Participants did not inject themselves with medication. Some aspects of the injection experience, such as comfort related to needle size or liquid volume, were not apparent during these procedures. It is possible that some DID-PQ responses could have been different if participants had injected themselves instead of the injection pad. Still, participants were thoroughly trained on both injection devices, and they performed all parts of the injection process. Therefore, their DID-PQ responses were likely based on a good understanding of both devices.

Despite these limitations, the DID-PQ represents a step forward for assessment of patient preference between injection devices. For preference to inform clinical decisions, measurement tools focusing on comparisons between treatments will be necessary. Since the DID-PQ has been useful in several studies, perhaps it could be a model for development of questionnaires designed to assess preference among other treatments across a range of medical conditions.