Glaucoma is the leading cause of permanent visual disability worldwide [1]. With the global prevalence estimated to increase from 64.3 million in 2013 to 111.8 million in 2040 [2], a populous continent like Asia with a rapidly ageing population will likely see a substantial rise in cases. Evidence suggests that glaucoma patients report reduced health-related quality of life (HRQoL) [3], including independence and mobility [4], increased risk of falls [5], and poor emotional well-being [6], and that HRQoL may decline over time [7]. Treatment-related side-effects and economic burden [8] can also negatively impact on patients’ HRQoL, beyond the burden of the disease process itself [9]. Hence, it is important to understand the impact of glaucoma and the effectiveness of treatment therapies from the patients’ perspective, especially as healthcare moves towards a value-based care model [10].

Measurement of patient-reported outcomes is often done using patient-reported outcome measures (PROMs). However, most PROMs in Ophthalmology are fixed-length meaning all items (questions) must be administered regardless of whether they are targeted to participants’ underlying level of the construct (e.g., visual functioning), making them burdensome to administer. Moreover, despite most vision-related PROMs comprising more than 20 items, they generally only provide measurement of two or three HRQoL domains. Modern psychometric methods of instrument development, such as item banking and computerized adaptive testing (CAT), offer a solution to overcome these shortcomings [11]. In item banking, a collection of items that measure a latent construct is calibrated according to level of difficulty on an interval-level scale [11]. Items from the calibrated item bank (IB) are then administered by CAT using an algorithm to selectively present to participants items that provide the greatest amount of information until a stopping criterion (e.g., measurement precision, based on standard error of measurement [SEM]) is reached [12]. Capitalizing on these efficiency gains, CATs usually require only 7–10 items (dependent on the stopping rule) to obtain a score, which is substantially fewer items than required by most fixed-length PROMs. CATs therefore provide a fast, yet precise means of estimating patients’ level of HRQoL [13, 14].

While an IB for glaucoma was developed by Matsuura and colleagues in 2017 [15], a CAT is not yet available. Our group has developed, validated and implemented a glaucoma IB and CAT system (GlauCAT™-Western) [16,17,18]; however, the content was developed in patients from Australia and the UK and, as such, may not be relevant to an Asian population where lifestyles, healthcare systems and access and illness perceptions may differ from the West.

Our team has recently developed domains and items for a glaucoma-specific instrument HRQoL item bank (IB) and CAT based on qualitative information from Asian patients with glaucoma (GlauCAT™-Asian), which differs from the GlauCAT™-Western in terms of number of domains (7 vs. 12, respectively) and content (e.g., more items about glaucoma management in GlauCAT™-Asian) [19]. Indeed, in a head-to-head comparison of the GlauCAT-Western and GlauCAT-Asian instruments, we found that only 57% of content in GlauCAT-Western was similar to GlauCAT-Asian, and that 25% of the content in the new GlauCAT instrument was unique (Online Resource 1). The current study reports on the psychometric properties of the GlauCAT™-Asian IBs in a multi-ethnic, clinical sample glaucoma patients and explores the functionality of the final calibrated IBs using CAT simulations.

Methods

Sample population

Patients aged ≥ 40 years (English- or Mandarin-speaking) with a primary diagnosis of glaucoma (i.e., primary open angle glaucoma (POAG) or primary angle closure glaucoma (PACG)) in at least one eye were recruited from the Singapore National Eye Centre (SNEC) between December 2019 and January 2021. Patients were excluded if they had other ocular comorbidities including secondary glaucomas, severe cataract, neurological conditions affecting vision, and/or hearing or cognitive impairment (assessed using the 6-CIT questionnaire [20]). During recruitment, we utilised specific target quotas to ensure we included patients across the spectrum of ethnicity, gender, age, and glaucoma severity) in order to obtain a diverse sample and enhance the applicability of our calibrations.

The study protocol was approved by the SingHealth Centralised Institutional Review Board (CIRB #2018/2459) and conducted in accordance with the Declaration of Helsinki. Written informed consent was obtained from participants prior to study participation.

Assessment of glaucoma, visual fields, and visual acuity

Glaucoma subtype, Snellen visual acuity (VA) and visual field (VF) data (both eyes) were extracted from patients’ files. We also conducted binocular Esterman tests using the Humphrey Visual Field Analyzer-3 (Carl Zeiss AG, Jena, Germany). Details on grading of glaucoma severity and definitions of vision impairment (VI) are provided in Online Resource 2.

Development of the glaucoma IBs

The content development process of the GlauCAT™-Asian IBs has been described in detail elsewhere [19] and more information is provided in Online Resource 3. In brief, the final instrument comprised 221 items under seven QoL domains with 4 to 5 response options each rated on a Likert-type scale with a non-applicable option available when the task or issue was not relevant to the participant (Online Resource 4): Activity Limitation (AL; n = 72); Lighting (LT; n = 15); Mobility (MB; n = 19); Psychosocial (PSY; n = 55); Ocular Comfort Symptoms (OS; n = 19); Glaucoma Management (GM; n = 28); and Work (WK; n = 13).

Psychometric evaluation of the IBs

We performed Rasch analysis separately on each IB using Winsteps software (version 4.50; Chicago, IL, USA) using the Andrich single rating-scale model [21]. Rasch analysis is a probabilistic model that estimates the relative difficulty of items (item measures) and relative abilities of respondents (person measures) and aligns them on a common scale [22] enabling transformation of ordinal data into estimates of interval-level data, expressed in log of the odds units (logits). Rasch analysis was chosen over Item Response Theory (IRT) models such as the Graded Response Model as, unlike IRT models [23, 24], Rasch analysis enables stable item calibrations and person measure estimates in small sample sizes [25]. Rasch analysis allows a thorough assessment of a scale’s psychometric properties, which are outlined in detail below. Similarly, a set of analytic performance criteria guiding decision making relating to retaining or dropping domains (i.e., adequate fit statistics, plus Applicability, Ceiling effect, Test Information Function (TIF), Item Separation Index (ISI), Measurement range, and Clinical importance) and items (misift, Differential Item Functioning (DIF), Applicability, Discrimination, Clinical importance and Patient importance; Online Resource 2) was taken into account by the research team, comprising members with content development and psychometric (EF, RM, BL, EL); and/or clinical (MB, MN) expertise.

Rating-scale assessment

To evaluate rating-scale performance, the average observed measures for the person sample were investigated. The average observed measures should increase monotonically as the rating-scale increases meaning that, as person measures increase, each category in turn should be more probable than any one of the others to be observed, as shown by a distinct peak on the category probability curve graphs (Online Resources 12–20) [26]. Disordering may indicate that categories are underutilized, are poorly defined, or that there are too many categories for respondents to sensibly distinguish [27]. Category disordering can be resolved by collapsing adjacent categories if sensible to do so and this results in improvements to other Rasch metrics [26].

Precision

Person separation indicates if the scale can adequately distinguish between different person groups according to their level of performance. Values of > 2.0 and > 0.8 for person separation index (PSI) and person reliability (PR), respectively, suggest that the scale can distinguish at least three different person groups [22]. Extreme scores (i.e., minimum or maximum) were removed a priori from the analysis (range 29 [Lighting] to 126 [Mobility]), as these provide minimal information when estimating item measures and can reduce precision [22].

Unidimensionality

We utilized principal components analysis of residuals (PCA) to assess the dimensionality of the IBs [28]. An eigenvalue ≥ 3 and raw variance explained by measures < 50% was suggestive of multidimensionality. If PCA targets were not met, we then considered four other metrics [28]: 1. A high number of misfitting items, which may indicate poor fit by a group of items to the underlying construct rather than problems with the items per se; 2. The ratio of the raw variance explained by items (i.e., % ‘Rasch’ dimension) and the % unexplained variance in the first contrast. If the Rasch dimension was bigger than the secondary dimension by several-fold, this indicated the secondary dimension was noticeable but not problematic; 3. The standardized residual loadings of the first contrast to determine if loading items (> 0.4) formed a potential second dimension (i.e., they formed a conceptually relevant alternative construct; 4. The disattenuated correlations of each potential cluster to ascertain if they were measuring an independent construct (disattenuated correlation < 0.82), hence providing evidence for the scale to be split, or whether they could be assessing a different “strand” of the same construct (disattenuated correlation ≥ 0.82) [29].

Item fit statistics

Item fit was explored using infit and outfit MnSq statistics (acceptable range 0.5–1.5) [22]. Before deleting misfitting items, we explored the z-residuals of individual participant responses according to the process outlined by Boone and colleagues (p. 185) [22]. Scores >|4| (i.e., > 4 and < − 4) indicated a high likelihood of an erroneous/unpredictable response. Such responses were given a weightage of zero and item fit statistics reassessed. If misfit persisted, we iteratively checked for z-residuals >|3| and then >|2| if needed, until satisfactory item fit statistics were achieved [22]. Item deletion was only considered if this process did not resolve item misfit.

We also considered discrimination when considering item performance. Items with discrimination values > 1.0 and < 1.0 means that the item discriminates between high and low performers more and less than expected for an item of this difficulty, respectively. Over-discriminating items tend to classify people as either ‘highly impaired’ or ‘highly functional’. Under-discriminating items, however, are more of a concern as they tend neither to stratify nor to measure [30]; as such, we focused on items with values substantially under 1.0 as candidates for deletion.

Local item dependency

Rasch measurement requires items that approximate local independence. That is, there should not be any correlation between two items after the effect of the underlying construct is conditioned out [31]. Local item dependency (LID) between two items was deemed present if the correlation of residuals was > 0.2. To ensure we had LID-free item calibrations, we generated and then anchored LID-free person measures to all other person measures within a specific IB. This process forces item difficulties and rating-scale structures within the item bank to conform with the LID-free person measures and prevents LID from impacting item difficulties [13].

Targeting

Targeting was inspected via the person-item map (Online Resources 12–20). Poor targeting is evident if there are gaps in item coverage leaving certain levels of patient ability unmeasured and/or when there is a difference of > 1.0 logits between the mean item difficulty and person ability [22].

Differential item functioning

Differential item functioning (DIF) indicates if item bias is present for certain participant characteristics [32]. We examined uniform DIF for gender, age group (< 65 vs. ≥ 65 years), glaucoma type (POAG vs. PACG/normal tension glaucoma (NTG)), and language of interview (English vs. Mandarin). A DIF contrast of > 1.0 logits with a corresponding Rasch–Welch probability of P < 0.05 indicated notable DIF.

Measurement range

Measurement range was calculated as the difference in logits between highest and lowest item locations. A larger measurement range indicates a greater spectrum of measurement of the latent construct.

Test information function

The test information function (TIF) represents the sum of the information provided by all items in the IB and identifies where the test has highest/lowest SE (Online Resources 12–20). A higher level of information indicates greater measurement precision (i.e., low SE) at that point along the scale [33]. Generally, a TIF ≥ 10 is considered excellent [34].

Item Separation Index (ISI)

Item separation is used to verify the item hierarchy or construct validity of the instrument. Low item separation (< 3.00) can mean that the difficulty range of items is low and/or that the person sample is not large enough to confirm the item difficulty hierarchy [35].

Level of dependence between different IBs

We applied the Pearson correlation coefficient to individual person measures from each IB to determine the level of dependence between them, with r < 0.8 suggesting that they were measuring independent HRQoL constructs [13].

CAT simulations

We performed simulations to assess the efficiency of our Winsteps threshold calibrations (JMLE; Joint Maximum Likelihood Estimation method) and associated CAT algorithm [36] in 1000 simulated respondents using R Statistical Computing Environment [37]. Individual packages were loaded in R to conduct IRT including CAT simulations (“catR” [38]). Simulations were based on a standard normal distribution (M = 0, SD = 1) and used the Rating-Scale Model (RSM), the ML (maximum likelihood) estimator and the Maximum Fisher Information (MFI) item selection criteria [39]. We determined the average number of items required based on two different stopping rules: SEM of 0.30 representing “high precision” and 0.387 representing “moderate precision” (approximating to a reliability of 0.91 and 0.85, respectively) [13]. Model fit was assessed using the root mean square error (RSME) and level of bias between true and estimated ability levels (low values are desirable). We also calculated the Pearson correlation coefficient between the IBs and CAT simulated person measure estimates. We hypothesized high (r ≥ 0.85) and moderate-high (0.75 ≥ r < 0.85) correlations for simulations with the high and moderate precision stopping rules, respectively. Results are summarized both overall (Table 3), and in deciles (D1––D10 n = 100 each, where D1 and D10 includes simulees at the lowest and highest ‘ability’ levels, respectively) across the latent trait ranging from − 4 to 4 (Online Resources 21–27).

Results

Sociodemographic and clinical characteristics

Of the 300 patients (mean ± SD age 67.2 ± 9.2 years; 62.3% male, 87.3% Chinese), 60 (20.0%), 114 (38.0%), 72 (24.0%), 34 (11.3%), and 20 (6.7%) had no, mild, moderate, severe, advanced/end-stage glaucoma in the better eye, respectively (Table 1). Of the 300 patients, 208 (69.3%), 45 (15.0%), and 47 (15.7%) had POAG/NTG, PACG, or different glaucoma types in each eye, respectively. Most (n = 275; 91.7%) patients had received topical treatment in at least one eye, with 121 (20.2%) and 188 (31.1%) receiving laser or surgery in at least one eye, respectively.

Table 1 Sociodemographic and clinical characteristics of the 300 participants

Psychometric properties of the IBs

A summary of the modifications made to the GlauCAT™-Asian IBs is provided in Fig. 1, while a detailed tabulation of the initial and final psychometric properties is found in Online Resources 5–11).

Fig. 1
figure 1

Flowchart explaining the iterative process of item bank modification following psychometric assessment. The left hand side shows there were seven original item banks, of which six were retained and one new item bank was created (right, green boxes). One item bank had suboptimal psychometric properties and was not further pursued in CAT testing (right, red box)

In brief, the LT, MB, and GM domains required few modifications to achieve acceptable psychometric properties. These involved collapsing categories to resolve disordered thresholds and giving unexpected respondents a weightage of 0 (LT: N = 11 [Online Resource 7]; GM: N = 12 [Online Resource 9]; none needed for MB).

Initial analysis of the OS IB showed suboptimal precision (PSI < 2.0), five misfitting items (infit/outfit MnSq > 1.5) and possible measurement “noise” (raw variance explained by measures < 50%; Online Resource 5). Upon iterative removal of three highly misfitting items, precision level and measurement “noise” improved. With ordered thresholds (Online Resource 12), no item misfit, strong evidence of unidimensionality, lack of DIF, high applicability (100% response rate), lack of ceiling effect (< 20% extreme high responders), high ISI (6.60), good measurement range (3.28), and high clinical importance rating, the OS IB was retained despite lower than desirable measurement precision (PSI = 1.75) and TIF (~ 8.0).

The AL domain initially had > 10 misfitting items, five items that displayed DIF for age, gender, glaucoma type or language, and > 10 items with a high amount of missing data (> 20%) due to use of the ‘non-appliable’ response (Online Resource 6). High item misfit for eight items was resolved by giving unpredictable responders (total n = 10) a weighting of 0. However, misfit for several items could not be remedied by this process and these items were deleted after considering the other performance criteria outlined in Online Resource 2. For example, item AL33 “sewing” displayed substantial misfit (infit MnSq 1.78), DIF for gender, had 50% missing data, poor discrimination (0.62) and was deemed to have low clinical importance; as such, this item was deleted. After these amendments, a high eigenvalue for the first contrast (3.90) was suggestive of possible multidimensionality; however, the variance explained by the 1st factor was > 50%, the Rasch dimension was bigger than the secondary dimension by 3.5-fold, and the high disattenuated correlations between primary/secondary dimensions suggested that any secondary strands within the AL domain were contributing to the same underlying construct.

Initial evaluation of the PSY domain revealed strong evidence of multidimensionality, with seven misfitting items, a high eigenvalue (5.35), items relating to “concern” aspects loading together, and borderline disattenuated correlation values (Online Resource 10), suggesting that the PSY domain comprised > 1 construct. As such, we separated the “concern” items into a new domain (“CN”; n = 18) and conducted a new Rasch analysis on the CN domain and remaining items in PSY.

The new PSY domain initially still displayed evidence of multidimensionality; however, upon removal of five items with misfit, evidence overall suggested the scale was unidimensional notwithstanding slight measurement noise (raw variance explained < 50%). While one item (PS18 “Passing glaucoma onto your children”) had an infit MnSq value > 1.5, it was retained due to its high applicability (missing data < 20%), adequate discrimination (0.91), and high perceived importance by patients during qualitative sessions. Disordered thresholds were observed in both the new PSY and CN IBs (Online Resources 18–19), which were remedied by collapsing categories ‘Moderately’ and ‘A little bit’. After category collapse, DIF for gender for CN17 “Losing your driver’s license” and DIF for glaucoma type CN18 “The appearance of your eyes” emerged. CN17 had very high missing data (56%) and both items had poor discrimination (< 0.7); as such, both items were removed after which psychometric properties improved. While the eigenvalue for the first contrast was > 3, the variance explained by the 1st factor was > 50% and high disattenuated correlations suggested that the potential secondary clusters were not measuring separate constructs.

The Work domain initially displayed disordered thresholds, poor precision (PSI = 1.55), one misfitting item and four items with DIF for age/language (Online Resource 11). Although fit statistics improved after deletion of three items and measurement range was high (6.91), precision remained suboptimal (PSI = 1.92) and the IB had low applicability (41% response rate), low test information (< 6), a low item separation index (3.02), high ceiling effect (39.8%), and a low clinical importance rating. Taken together, evidence suggested that the WK IB was not valid in this population and testing for this domain ceased.

Overall, targeting was suboptimal for all IBs (Online Resources 5–11), with items, on average, too easy to endorse for participants’ level of the QoL construct (Online Resources 12–20). Test information was good overall (excluding WK), ranging from 8.1 for OS to 44.2 for AL (Online Resources 12–20). Finally, correlation coefficients of the person measures for all final IBs were < 0.8 (range 0.27–0.76), supporting the independence of each QoL domain assessed by the individual IBs (Table 2).

Table 2 Correlation coefficients between the final seven domains of the GlauCAT™-Asian item banks

Following psychometric analyses (Fig. 1), there were seven IBs comprising 182 items: AL (n = 56); LT (n = 15); MB (n = 19); PSY (n = 32); OS (n = 16); GM (n = 28); and CN (n = 16).

CAT simulations

Overall, good model fit (low RMSE and bias values) was achieved for all simulations (Table 3). With a 0.3 SEM stopping rule, the mean number of items administered ranged between 12.55 for AL and 19 for MB (overall mean 15.7 items; Table 3). For the smaller IBs (i.e., those with < 20 items: OS, LT, MB and CN), all items were administered and the stopping rule target was not reached. For AL, GM and PSY, the target stopping rule was met 100% of the time except for at the very ‘unable’ (D1) and very ‘able’ (D9-10) ends of the spectrum (Online Resources 21–27).

Table 3 CAT simulation results for the seven GlauCAT-Asian quality of life item banks

For ‘moderate’ precision (0.387 SEM), the mean number of items administered ranged between 7.87 for AL and 14.5 for OS (overall mean 12.1 items; Table 3). Except for the extreme deciles (D1 and D10), the proportion of simulees reaching the stopping rule target was > 75% (Online Resources 21–27).

Correlations between the CAT simulated person measures and the IBs were moderate to high (0.77–0.89 for SEM 0.3, and 0.76–0.87 for SEM 0.387).

In simulations in which test length was capped at 10 items, the SEM achieved ranged between 0.309 for AL and 0.466 for CN, corresponding to a reliability of ~ 0.91 and ~ 0.78, respectively (Table 4).

Table 4 CAT simulation results using a combined stopping rule based on test length and precision

Discussion

Following a comprehensive assessment of seven initial glaucoma-specific HRQoL IBs designed specifically for an Asian population (GlauCAT™-Asian), LT, MB and GM required minimal optimization of their psychometric properties. AL and PSY required more extensive remediations, including category collapse, deletion of poorly performing items, and/or splitting scales to resolve multidimensionality resulting in a new IB, “CN”. Due to unresolvable psychometric issues, the WK IB was not pursued further. In CAT simulations of the seven final IBs (n = 182 items in total), two IBs achieved moderate precision with ≤ 10 items administered, and 5/7 achieved moderate precision with ≤ 15 items. Researchers and clinicians can utilize GlauCAT™-Asian to obtain a comprehensive yet low burden understanding of the HRQoL impact of glaucoma and associated treatments.

Sixteen items in the AL domain were deleted due to high missing data (> 20%) or other associated psychometric issues (e.g., misfit). Items relating to social participation (e.g., Participating in community groups) were particularly susceptible to missing data. This may have been because these activities were not relevant to our elderly participants or because data collection occurred during the COVID-19 pandemic, during which Singapore spent ~ 100 days under strict lockdown [40] with many community activities suspended. As items that are likely to elicit the ‘non-applicable’ response are a threat to CAT efficiency, six ‘social participation’ items with high missing data were removed. While several ‘social participation’ items remain in the IB, users with a focus on measuring activity limitation in a social context may wish to include additional related questionnaires.

Despite iterative remedial steps, the OS IB displayed suboptimal precision (PSI 1.75), which may be due to a lack of variance in our sample population. Indeed, few participants had VI in the better eye (11.6%) resulting in a sample skewed towards the higher end of the ‘ability’ spectrum. However, because the PSI/PR values fall well within a useful range for a measurement tool [22], we retained the OS scale due to high perceived importance in our qualitative participant responses [19], and input from our team’s research clinicians that changes in OS consequent to glaucoma and its associated interventions are integral to clinical care outcomes.

The PSY domain needed to be split to resolve multidimensionality, which is not unexpected given that it contained items relating to ‘concerns’ (e.g., falling), social life (e.g., becoming socially isolated), and emotional reactions (e.g., feel frustrated). Indeed, sister instruments for diabetic retinopathy (RetCAT™ [13, 41]) and GlauCAT™-Western [17] both have > 1 domain measuring the psychosocial aspects of HRQoL. In a head-to-head comparison of the two GlauCAT instruments, the newly developed Asian version is more compact, comprising only seven domains compared to 12. While 5/7 GlauCAT™-Asian domains (namely OS, AL, MB, LT, CN) are like those in the original, the remaining two domains differ. The PSY domain in GlauCAT™-Asian covers content found in two domains of GlauCAT™-Western (Emotional and Social) and, while the GM domain in GlauCAT™-Asian has some similarities to the Convenience-Treatment domain in GlauCAT™-Western, it has twice the number of items and covers a broader range of treatment-related issues.

Despite iterative remedial steps, the WK IB displayed suboptimal psychometric properties including low test information and item separation. These issues may be due to the small sample size, with only 41% of participants providing responses for this domain and, of these, 40% recording an ‘extreme’ high score and thus removed a priori from analysis. This was unexpected as ‘work’ issues were important to our working focus group participants (25%) [19]. However, with only 17% of our current sample aged < 60 years (median age 68), the WK IB lacked applicability and was not pursued. Further work, including additional data collection in a younger glaucoma sample, are warranted to optimize the WK IB.

Overall, targeting was suboptimal for all our IBs, with items too “easy” relative to the average ability level of our participants. This affected CAT efficiency in the highest ability ‘bins’ (D9-10), with more items needed on average to provide a score estimate. Similar targeting issues were observed in the GlauCAT™-Western instrument, where 11/12 IBs lacked items targeted to very high performers [17]. The poor targeting may be due to the lack of VI in our sample, with most people with glaucoma having good VA until late-stage disease [42, 43]. Consultation with more early-stage patients to create ‘harder’ items is warranted. Addition of novel items to IBs is possible by estimating the calibration of new items relative to existing ones using Rasch analysis [44]. While users should be aware of the limitations of the GlauCAT™-Asian in providing stable ability estimates in patients at the very able end of the spectrum, poor targeting in the upper score range is not necessarily problematic, as such patients may not be the focus of clinician monitoring in healthcare or drug development in pharma companies.

Overall, our simulation results were promising, particularly for our moderate precision stopping rule target where CAT administration of our GlauCAT™-Asian IBs could reduce the number of items needed by 10–86% (depending on the domain) compared to the full IB. CAT efficiency was reduced with the high precision stopping rule, where the smaller IBs needed all items to be administered. Correlations between the CAT and full IB thetas (person measures) was somewhat lower than expected (~ 80% compared to the expected ≥ 0.85) and also lower than that obtained by Patient-Reported Outcomes Measurement Information System CATs in other health conditions (> 0.90) [45, 46], suggesting that the reduction in response burden offered by GlauCAT™-Asian may come with some potential loss of accuracy in scores. This may be particularly important for individual level comparisons, where measurements may require reliabilities ≥ 0.90 to be considered appropriate. Future work will conduct a formal comparison between CAT and full item bank theta with real data, where a subset of patients (n ~ 50) will answer both the CAT tests and associated full item banks. This will provide a better understanding of the effectiveness of GlauCAT in reducing response burden while replicating full IB scores.

Overall, our results demonstrate that IB and CAT approaches are advantageous compared to fixed-length questionnaire administration [11, 47]. Moreover, as the GlauCAT™-Asian IBs function independently, users can select which to administer depending on their measurement goals, which is useful in clinical settings, where time for PROM administration is limited yet increasingly valued in value-based care [10]. Indeed, using our online CAT testing platform (PROMinsight), six QoL domains of GlauCAT™-Western (taking 8 min on average) were recently administered to > 200 glaucoma patients while waiting to see their treating doctor at Mass. Eye and Ear Institute, with the HRQoL scores discussed during the patients’ consultations [18, 48]. Analysis of the group data found that VA was associated with greater activity limitation and worse mobility, while poorer VF mean deviation was associated with worse emotional well-being. These pilot results provide real-world evidence of the implementation feasibility of GlauCAT™ in routine clinical care enabling real-time discussion of results [49] and subsequent analysis of group data to better understand the factors associated with HRQoL outcomes in glaucoma patients [48].

Strengths of our study include our rigorous psychometric assessment and wealth of reliability and validity data provided. We also employed techniques to save items with minor misfit rather than deleting them outright, thus retaining as many issues important to glaucoma patients as possible. Our sample included patients with bilateral POAG (n = 202) and PACG (n = 45), ensuring that our calibrations are robust in Asian populations where the prevalence of PACG is high (0.7%) compared to Western populations (0.2–0.4%) [50]. Our recruitment strategy using clinical and demographic target quotas ensured we recruited participants across a range of glaucoma subtypes and severities enabling us to capture diverse responses. However, several limitations of our work must be acknowledged. First, we did not explore how IB score estimates differed across levels of glaucoma severity and associated VI, and different glaucoma subtypes. Future work using the final GlauCAT-Asian™ instrument will assess the HRQoL impact of glaucoma types and severity in another large sample of Asian patients with glaucoma. Second, while our new instrument has been developed in Asian patients, our sample comprised relatively few Indians or Malays meaning that we may have missed issues pertinent to these minority groups. Third, while Singapore comprises a multi-ethnic population of Chinese, Malays and Indians, who form some of the largest ethnic groups in Asia, further studies to test the cross-cultural validity of GlauCAT™-Asian beyond Singapore are warranted. For example, item calibrations between our Singaporean population and glaucoma patients from China, Malaysia and India could be assessed for agreement, and DIF for culture could be explored. If found, cultural DIF could be managed using the item anchoring method described by Gibbons and colleagues, where parameters of items with DIF are allowed to vary, while item calibrations for the other items which displayed DIF remain constant across each country [51]. Fourth, some correlations between our measures were relatively high (e.g., PSY and CN) which may support a multidimensional latent structure underlying glaucoma-specific QoL and the application of multidimensional IRT and bifactor models [52,53,54]. However, our aim was to produce unidimensional measurement tools that provide users with the ability to administer selected scales for a given purpose. Finally, our use of the ML estimator in our CAT simulations may have resulted in inaccurate estimates for participants at the extreme ends of the ability spectrum due to lack of variability in responses, and other estimators such as EAP (expected a posteriori) or BM (Bayesian Method) could be used instead to overcome this issue. Similarly, we used the standard normal distribution (M = 0, SD = 1) to run our simulations. Given that our item bank calibrations were based on a relatively able participant sample with high mean person measures, simulating latent traits from a normal distribution with these parameters may have affected the accuracy of our results.

In conclusion, following reengineering using Rasch analysis to optimize their psychometric properties, our seven IBs (OS, AL, LT, MB, GM, PSY, and CN) allow comprehensive assessment of glaucoma-specific HRQoL relevant to glaucoma populations in Asia. CAT simulations suggest that relatively few items within each IB will be required, particularly in low-stakes situations. Future work aims to assess the ability of GlauCAT™-Asian to provide efficient, valid, and reliable assessment of glaucoma-specific HRQoL in a clinical sample of glaucoma patients across the range of glaucoma severity and types; and the feasibility and acceptability of its implementation into routine clinical care.