Patient-reported outcomes (PROs), captured with PRO measures (PROMs), play important roles in clinical research and healthcare by permitting the standardized evaluation of health constructs from the perspective of the patient and other stakeholders [1, 2]. PROMs are especially valuable for measuring health constructs that are difficult for a clinician to fully observe, for example, pain, anxiety, depression, social connectedness, and other constructs with a subjective component [3]. PROMs are being used for clinical research and clinical practice with participants/patients of nearly all ages and health conditions, including individuals with cognitive and language challenges [3, 4].

PROMs may be more or less cognitively difficult for patients to complete depending on the health construct(s) assessed, as well as the design and phrasing of the PROM. For example, respondents may respond more easily (quickly) to items about concrete physical symptoms than about abstract emotional symptoms, about symptoms within the past hour than within the past week, and about the frequency of a symptom than about the degree of impairment caused by that symptom, as well as to items with three response options rather than nine and to items with common rather than uncommon words. Test developers may follow best practice guidelines and/or qualitative feedback from pilot respondents when deciding on PROM design characteristics and item phrasing [5, 6], but there is often not a strong empirical research literature informing those kinds of design decisions [7]. Nevertheless, the cognitive demands of PROM completion are important to consider and study—for all respondents, not just those with cognitive or language limitations—because they may have meaningful implications. For example, a deeper understanding of PROM response processes may lead to a more complete understanding of response error [8, 9], more valid interpretations of PROM scores, and development of PROMs that are more accessible to people with cognitive and language challenges [10].

Completing a PROM requires the respondent to (1) receive (e.g., aurally, visually) and comprehend the question, (2) search for and retrieve relevant information from memory, (3) apply appraisal processes towards a response (e.g., expectations of health, standards of comparison [11, 12]), and (4) select and communicate (orally or manually) a response [13]. At a minimum, these stages draw on the respondent’s attention, processing speed, auditory-verbal working memory, episodic memory, language comprehension, and executive functioning [3, 14]. Indeed, there is a reasonably sized literature on psychological variables that may be associated with survey responses [6, 15, 16], but with less emphasis on health constructs in particular, such as those assessed by PROMs, or on cognitive science methods such as examinations of psychophysiological variables or response time [17]. Because they can quantify cognitive effort, these methodologies are potentially useful as a complement to cognitive interviewing, which relies strongly on the patient’s language and metacognitive abilities and is only able to capture psychological factors within the patient’s awareness.

In the work presented here, we report on a secondary analysis of PROM item response times, a proxy variable for psychological effort or difficulty, to test whether the time/effort required to complete a PROM is associated with psycholinguistic factors, design attributes, and/or the cognitive ability of the respondent [7, 18]. Consistent with findings from the more experimental cognitive science literatures (but, to our knowledge, not yet extended to PROMs), we hypothesize that longer response times to PROM items (i.e., indicative of more cognitive effort) will be associated with respondents who report lower cognitive function and with PROM items that assess abstract (e.g., stigma) rather than concrete (e.g., mobility) health constructs [19], contain low-frequency words [7], or contain words that are acquired later in development [20]. We further hypothesize that there will be a response time difference associated with whether the PROM item is phrased as a perception-based question (e.g., asking about frequency of symptoms) or an evaluation-based question (e.g., asking about difficulty completing a task), because these may be associated with different cognitive appraisal processes [21]; however, we do not hypothesize a direction for that difference. To be clear, we do not think that response time necessarily reflects the validity of the response. Rather, we interpret response time as a reflection of the amount of processing, or effort, required for that response.

Methods

Participants and procedures

The present study used Wave 2 data from the initial development of the Quality of Life in Neurological Disorders (Neuro-QoL) measurement system [22], obtained from a public repository [23]. Whereas the Neuro-QoL Wave 1a (clinical sample) and 1b (general population sample) cohorts were drawn from online respondent pools, this sample consisted of 581 adults with neurological disorders (multiple sclerosis [MS], Parkinson’s disease [PD], epilepsy, stroke, and amyotrophic lateral sclerosis [ALS]) who were recruited from and tested at 1 of 12 academic medical centers in the United States [22]. As described in greater detail in the main Neuro-QoL development papers [22, 24, 25], participants were eligible for study inclusion if they (1) were at least 18 years of age, (2) were English speaking, (3) had been diagnosed with one of the five neurological conditions studied, and (4) entered the study with a proxy reporter (individuals with stroke only; e.g., family member, caregiver). Participants were excluded if (1) cognitive impairment precluded their ability to provide informed consent or understand and complete test items, (2) their seizures were non-epileptic (participants with epilepsy only), or (3) they were not community-dwelling (participants with stroke only). Participants were assessed up to three times; for the present analyses, baseline data were used. Participants completed items within the Assessment Center platform, which recorded their response time [26]. Additional information regarding the Wave 2 sample can be found in Gershon et al. [22] and the Neuro-QoL Technical Report [25] available at www.healthmeasures.net.

Study measures and variables

Outcome variable: Neuro-QoL item response times

The outcome variable for analyses was the response time associated with completing each Neuro-QoL item, between the presentation of the PROM item and the participant’s selection of a response. This time was captured by the Assessment Center platform [26], rounded to the nearest second, and contained in the public Wave 2 dataset [23]. Neuro-QoL comprises 11 item response theory [IRT]-calibrated item banks that assess domains of physical, mental, and social health in the context of neurological conditions. Items use a five-category response format and ask either what Schwartz and Rapkin [21] would call perception-based questions about how often specific symptoms occur (Never/Rarely/Sometimes/Often/Always) or evaluation-based questions about perceived difficulty performing a task (None/A Little/Somewhat/A lot/Cannot do).

Person-level predictors: patient-reported cognitive function and neurological condition

The Neuro-QoL Item Bank v2.0—Cognitive Function measure consists of 28 items addressing respondents’ perceived difficulty in completing various cognitive tasks. Scores are expressed on a T-score metric (M = 50, SD = 10), with higher scores reflecting higher levels of perceived cognitive function. In addition to cognitive function, neurological condition was included as a person-level covariate in all models as four dummy-coded variables (reference condition: epilepsy).

Item-level predictors: psycholinguistic characteristics

Item-level predictors included psycholinguistic variables extracted from the 185 Neuro-QoL items that were administered in Wave 2 [22, 24]. A full list of item IDs can be found in the online supplementary material. For each Neuro-QoL item, the following variables were coded independently by one of three research assistants. Each research assistant reviewed each psycholinguistic variable and associated database (below), explored example coded items, and practiced coding items until criterion was achieved (100% accuracy on 15 training items). The following psycholinguistic variables were selected because experimental studies from the cognitive science literature have indicated their association with word-processing time.

Mean word frequency

The frequency with which a word occurs in a given language is associated with how richly that word is represented within a reader’s lexical network. Infrequent words in English (e.g., platypus) are less strongly represented and take longer to recognize and respond to than more frequent words (e.g., dog) [27]. The mean word frequency was computed for each Neuro-QoL item by entering each word in the item into the SubtlexUS database, adding the frequencies, and dividing by the number of words. The SubtlexUS database is based on subtitles from television and film and contains over 51 million word instances from over 8000 sources [28].

Mean age of word acquisition

Age of word acquisition (AoA) is often operationalized as the age of development at which a person is typically exposed to that word/concept, and is closely tied to but distinct from word frequency [29]. Words that are acquired later in development are processed and responded to more slowly [20]. Mean AoA was computed for each Neuro-QoL item by extracting the AoA for each word from the normative data reported by Kuperman et al. [30], adding the values, and dividing by the number of words in the item.

Mean word imageability

Imageability is the “extent to which the referent of a word evokes a mental image [p. 824]” [31]. For example, “dog” evokes a mental image more readily than “justice.” Highly imageable words are associated with shorter processing times than less imageable words [32]. Mean word imageability was computed for each Neuro-QoL item by extracting the imageability rating for each word from the MRC Psycholinguistic Database [33], adding the values, and dividing by the number of words in the item.
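The three lexical predictors described above (word frequency, AoA, and imageability) all follow the same word-level lookup-and-average pattern. As a concrete illustration, that pattern could be reproduced in R along the following lines; the file name, column names, and example items are hypothetical placeholders, and the SUBTLEX-US, Kuperman et al., and MRC norms must be obtained separately in their own formats.

# A minimal sketch of the per-item averaging used for the three lexical predictors.
# "lexical_norms.csv" and its columns (word, freq, aoa, imageability) are hypothetical
# placeholders for a merged table of the SUBTLEX-US, Kuperman et al., and MRC norms.
library(dplyr)
library(tidyr)
library(stringr)

items <- data.frame(
  item_id = c("ITEM01", "ITEM02"),   # hypothetical item IDs and wording
  text    = c("I had trouble concentrating",
              "How much difficulty do you have walking up stairs?"),
  stringsAsFactors = FALSE
)

norms <- read.csv("lexical_norms.csv")   # columns: word, freq, aoa, imageability

item_lexical <- items %>%
  mutate(word = str_split(str_to_lower(text), "\\s+")) %>%   # tokenize each item
  unnest(word) %>%
  mutate(word = str_remove_all(word, "[[:punct:]]")) %>%     # strip punctuation
  left_join(norms, by = "word") %>%                          # attach word-level norms
  group_by(item_id) %>%
  summarise(
    n_words           = n(),
    mean_freq         = mean(freq, na.rm = TRUE),
    mean_aoa          = mean(aoa, na.rm = TRUE),
    mean_imageability = mean(imageability, na.rm = TRUE)
  )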

Response format

As mentioned above, Neuro-QoL items, like the items from many other modern PRO measurement systems, are generally perception-based questions about frequency of symptoms or evaluation-based questions about perceived difficulty completing a task. These kinds of questions may involve different cognitive appraisal processes and be associated with different kinds of error [21]. We coded whether each Neuro-QoL item asked a question about frequency or difficulty.

Numbers of words and syllables in the Neuro-QoL item

To minimize the influence of PROM item length on other predictors of response times, we counted the number of words and syllables of each PROM item and included these data in analyses.
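As a rough illustration, this kind of item-length coding can be done in base R; the syllable counter below uses a simple vowel-cluster heuristic and is only an approximation, since the study’s actual counting procedure is not specified here.

# Rough sketch of item-length coding in base R. The syllable count relies on a
# vowel-cluster heuristic and will miscount some words (e.g., those with a silent "e").
count_words <- function(text) {
  lengths(strsplit(trimws(text), "\\s+"))
}

count_syllables <- function(text) {
  words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+"))
  per_word <- vapply(words, function(w) {
    hits <- gregexpr("[aeiouy]+", w)[[1]]   # runs of consecutive vowels
    max(sum(hits > 0), 1L)                  # at least one syllable per word
  }, integer(1))
  sum(per_word)
}

count_words("How much difficulty do you have walking?")      # 7
count_syllables("How much difficulty do you have walking?")  # 12 with this heuristic (true count is 11)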

Data analysis

Approach

Effects of the item- and person-level predictors on Neuro-QoL item response times were analyzed using mixed-effects models [34, 35] to account for two grouping factors in the dataset: persons and items. In the models, persons and items were specified as cross-classified factors because every participant completed every item, which precludes a strictly hierarchical arrangement of the two factors; items could be considered nested within individuals just as easily as individuals could be considered nested within items.
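For concreteness, a cross-classified specification of this kind places random intercepts for persons and items side by side rather than nesting one within the other. The sketch below uses the lme4 package named later in this section, but the variable names are illustrative rather than the study’s actual column names.

# Minimal sketch of the cross-classified structure: crossed random intercepts for
# persons and items, with example person- and item-level fixed effects.
library(lme4)

m_crossed <- lmer(
  rt ~ cog_fx + syllables + format +     # illustrative person- and item-level predictors
    (1 | person_id) + (1 | item_id),     # crossed (not nested) random intercepts
  data = rt_long,                        # one row per person-by-item administration
  REML = FALSE
)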

Non-normal response time distribution and outliers

As is common in chronometric research, the distribution of response times consisted of nonnegative values, exhibited severe positive skew, and potentially contained outliers (Fig. 1).

Fig. 1

Neuro-QoL item response times. Response times for 177 Neuro-QoL items in the Wave 2 clinical sample (N = 577). Approximately 0.25% of recorded response time observations were > 120 s and not shown in the figure

Given the exploratory nature of this study and the need to analyze the response time distribution appropriately, we examined multiple mixed-effects model specifications. Our overarching goal was to build a mixed-effects model that satisfied statistical assumptions to the greatest extent possible while retaining interpretability; in doing so, we looked in particular for consistency of results across different specifications. These included (a) a linear mixed-effects model (LMM), (b) an LMM with log-transformed response times, and (c) six generalized linear mixed-effects models (GLMMs). More information about the selection, specification, and estimation of each model can be found in the supplementary material.
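A sketch of how such parallel specifications might look in lme4 appears below. The exact six GLMM family/link combinations are detailed in the supplementary material, so the Gamma and inverse-Gaussian examples here are illustrative only, as are the variable names and the reduced predictor set.

# Parallel specifications for the nonnegative, skewed response times:
# (a) LMM on raw RT, (b) LMM on log-transformed RT, and (c) two example GLMMs
# with families suited to response time data.
library(lme4)

m_lmm <- lmer(rt ~ cog_fx + syllables + format + (1 | person_id) + (1 | item_id),
              data = rt_long, REML = FALSE)                                      # (a)

m_log <- lmer(log(rt) ~ cog_fx + syllables + format + (1 | person_id) + (1 | item_id),
              data = rt_long, REML = FALSE)                                      # (b)

m_gamma  <- glmer(rt ~ cog_fx + syllables + format + (1 | person_id) + (1 | item_id),
                  data = rt_long, family = Gamma(link = "log"))                  # (c)
m_invgau <- glmer(rt ~ cog_fx + syllables + format + (1 | person_id) + (1 | item_id),
                  data = rt_long, family = inverse.gaussian(link = "log"))       # (c)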

With regard to outliers, approximately 0.25% of the response time observations exceeded the 2-min range of the x-axis in Fig. 1 and are not shown. The maximum response time was 1106 s (~ 18 min), an implausibly high value. A likely explanation for such high values is that the participant was distracted or took a break during data collection. Unfortunately, it was neither possible to confirm the validity of such data points nor to determine a single cutoff value that separated plausible from implausible values. As a result, multiple cut points were used to determine whether model results were sensitive to trimming a certain portion of the upper tail of the response time distribution prior to model estimation. Specifically, models were estimated using an untrimmed outcome as well as three outcomes trimmed above the 95th, 97th, and 99th percentiles, which correspond to 24, 31, and 52 s, respectively. Cook’s distance, a measure of influence, was also computed for upper-level units to determine whether any single person or item had a considerable impact on model estimates. At the low end of the response time distribution, we removed observations of less than half a second (n = 11, 0.01%) from the dataset.
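The trimming and influence checks could be sketched as follows; the cut-offs and variable names are illustrative, and influence.ME is just one of several packages that compute group-level Cook’s distances for lme4 models.

# Sketch of the lower/upper trimming and group-level influence checks.
library(lme4)

rt_long <- subset(rt_long, rt >= 0.5)                # drop implausibly fast responses

cuts  <- quantile(rt_long$rt, c(0.95, 0.97, 0.99))   # about 24, 31, and 52 s in this study
rt_99 <- subset(rt_long, rt <= cuts["99%"])          # 99th percentile-trimmed outcome

m <- lmer(rt ~ cog_fx + syllables + format + (1 | person_id) + (1 | item_id),
          data = rt_99, REML = FALSE)

# Cook's distance for upper-level units: refit the model leaving out one person
# (or one item) at a time and summarize the change in the fixed effects
infl_person <- influence.ME::influence(m, group = "person_id")
infl_item   <- influence.ME::influence(m, group = "item_id")
cooks.distance(infl_person)
cooks.distance(infl_item)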

Model specification, estimation, and evaluation

All models were estimated in R (version 3.6.2 [36]) using the lme4 package (version 1.1-21). For each combination of model type and trimming level, model building proceeded as follows. First, a main effects model was estimated in which all item-level and person-level predictors were included. Statistical significance of main effects was determined by inspecting 95% profile likelihood confidence intervals. To test for potential moderation effects of cognitive function on item-level characteristics, interaction terms were entered into the main effects model and tested separately. A final “full” model was then estimated, including all significant terms retained from the previous step. Maximum likelihood estimation was used for all analyses. Due to missing data, the analyzed dataset contained 577 individuals, 177 items, and 100,660 individual item administrations.
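The model-building sequence could be sketched as follows, again with illustrative variable names; the actual predictor coding is described in the supplementary material.

# Sketch of the model-building sequence: main-effects model, profile-likelihood CIs,
# each cognitive-function interaction tested separately, then a final model that
# retains the significant interaction terms.
library(lme4)

m_main <- lmer(rt ~ cog_fx + condition + syllables + word_freq + aoa + imageability +
                 format + (1 | person_id) + (1 | item_id),
               data = rt_99, REML = FALSE)

confint(m_main, method = "profile", level = 0.95)   # significance judged via profile CIs

# test each moderation term separately
m_int_syl <- update(m_main, . ~ . + cog_fx:syllables)
m_int_img <- update(m_main, . ~ . + cog_fx:imageability)
m_int_fmt <- update(m_main, . ~ . + cog_fx:format)

# final "full" model with the interaction terms retained from the previous step
m_full <- update(m_main,
                 . ~ . + cog_fx:syllables + cog_fx:imageability + cog_fx:format)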

Results

Data description

Demographic characteristics of the Neuro-QoL Wave 2 clinical sample are provided in Table 1. Results from mixed model analyses were similar across different model specifications and trimming levels; therefore, unless otherwise noted, for the remainder of the article, we report results from the LMM and 99th percentile-trimmed data, which represents a range of plausible response times (1–52 s per PROM item). Results from other analyses can be found in the online supplement.

Table 1 Demographic characteristics of Neuro-QoL Wave 2 clinical sample (N = 581)

Univariate descriptive statistics for item response times, item psycholinguistic characteristics, and cognitive function are provided in Table 2. Bivariate descriptive statistics (correlations) for the predictor variables are provided in Table 3. As expected, the correlation between the number of words and syllables was high (r = .91); therefore, only number of syllables was included in the analysis models. All other variables correlated in the small-to-moderate range (r = − .44 to .30).

Table 2 Descriptive statistics for 99th percentile-trimmed dataset
Table 3 Bivariate correlations between item-level predictor variables

Main effects model results

Results for the main effects LMM based on the 99th percentile-trimmed dataset are shown in Table 4. Neuro-QoL v2.0—Cognitive Function scores were negatively associated with item response times such that after controlling for item characteristics and neurological condition, respondents reporting higher levels of cognitive impairment required more time to complete items. As for the item characteristic variables, the number of syllables per item and response format (difficulty vs. frequency) were positively associated with item response times, that is, respondents took more time responding to items that contained more syllables and used a difficulty (vs. frequency) response format, after controlling for other item characteristics, cognitive function, and condition.

Table 4 LMM results for 99th percentile-trimmed sample

Interaction effects model results

Significant interaction effects were found between self-reported cognitive function and three item-level predictors: number of syllables, imageability, and response format. Coefficient estimates for these terms are provided in Table 4. Plots of conditional effects, which show how coefficients for the item-level predictors vary as a function of cognitive function, are shown in the left-hand column of Fig. 2. Simple intercepts and slopes are shown in the right-hand column; the regression lines in these plots show the relation between the item-level predictors and response time at three levels of cognitive function (− 2 SD, mean, + 2 SD). There was a significant interaction between the number of syllables in an item and self-reported cognitive function. As shown in Fig. 2, the positive association between the number of syllables and response time was stronger (i.e., further from zero) for respondents with lower levels of self-reported cognitive function.
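For reference, the simple slopes plotted at the mean and ± 2 SD of cognitive function can be recovered directly from the fitted model: the slope of an item-level predictor at a given cognitive-function value is its main-effect coefficient plus the interaction coefficient multiplied by that value. The sketch below assumes the illustrative model and variable names used in the earlier sketches; the coefficient names depend on how the predictors were coded and centered.

# Sketch of simple slopes at the mean and +/- 2 SD of cognitive function, computed
# from the fixed effects of the final model. Coefficient names are illustrative.
library(lme4)

b        <- fixef(m_full)
cog_vals <- mean(rt_99$cog_fx) + c(-2, 0, 2) * sd(rt_99$cog_fx)

simple_slopes <- data.frame(
  cog_fx          = cog_vals,
  slope_syllables = b["syllables"] + b["cog_fx:syllables"] * cog_vals,
  slope_format    = b["format"]    + b["cog_fx:format"]    * cog_vals
)
simple_slopes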

Fig. 2

Conditional effects, simple intercepts and slopes. Estimated regression coefficients for three item-level predictors (number of syllables, imageability, response format), conditional on the moderator variable (cognitive function), are shown in the left-hand column. Increased cognitive function was associated with decreased (toward zero) effects for number of syllables and response format, whereas the effect of imageability was increased (away from zero). Associated simple intercepts and slopes—regression lines between each predictor and response time (RT) plotted at different values of cognitive function (mean and ± 2 standard deviations)—are shown in the right-hand column. These plots show how response times increase with increasing numbers of syllables and when using the difficulty vs. frequency response format, and how these increases are augmented at lower levels of cognitive function. Conversely, for imageability, the positive effect only emerges at higher levels of cognitive function, though the effect is confounded with number of syllables (see Fig. 3)

There was also a significant interaction effect between imageability and self-reported cognitive function, such that the effect of imageability on response time was larger for individuals with higher levels of cognitive function. This effect is apparent in Fig. 2, as the trend line of the coefficient becomes significantly different from zero in the positive direction at approximately 2 SDs above the mean. However, this moderation effect appears to be confounded with the number of syllables per item. When the number of syllables per item is removed from the model, the trend line representing the conditional effect is shifted downward (Fig. 3), such that it becomes significantly different from zero in the negative direction for individuals approximately 1 SD or more below the mean of cognitive function. This appears to be due to a negative statistical suppression effect, such that the direction and statistical significance of the association between imageability and response time—that is, the main effect—differ depending on whether the number of syllables is included in the model. When the number of syllables is removed from the main effects model, imageability has a negative and statistically significant effect on response time, such that items with lower imageability ratings are associated with slower response times (controlling for other predictors). Although the slope of the conditional effect line does not change, the downward shift that results from the suppression complicates interpretation of the interaction effect. This can be seen in the simple intercept/slope plots of Figs. 2 and 3, where the “fanning” pattern created by the interaction changes direction depending on whether the number of syllables in an item is also modeled. Inspection of partial correlations among the study variables and additional exploratory analyses suggested that other main and interaction effects were not similarly affected.

Fig. 3

Conditional effects and simple intercepts and slopes for imageability, when the number of syllables is omitted from the model. Estimated regression coefficients for imageability, conditional on the moderator variable (cognitive function) and without the number of syllables included in the model, are shown in the left-hand column. Associated simple intercepts and slopes, plotted at the mean and ± 2 standard deviations and without the number of syllables included in the model, are shown in the right-hand column.

Finally, there was a significant interaction effect between response format and self-reported cognitive function. As shown in Fig. 2, the conditional effect line, which represents the difference in response times between the two types of response formats, is further from zero in the positive direction for individuals with lower levels of self-reported cognitive function. That is, the effect of response format was larger for individuals with lower self-reported cognitive function scores.

Discussion

Completing a PROM draws on multiple cognitive skills as the respondent receives and comprehends the question, searches for and retrieves relevant information from memory, applies appraisal processes towards a response, and selects and communicates a response. To more fully understand the variables that contribute to the cognitive effort of completing a PROM, the work presented here was a secondary analysis of data that aimed to evaluate whether a respondent’s self-reported cognitive abilities and the attributes of a PROM item were associated with their response time to that item.

Consistent with our hypothesis, lower self-reported cognitive abilities were indeed associated with slower response times. However, this finding should be interpreted alongside several considerations. First, items that contributed to the Neuro-QoL v2.0 Cognition T score assessed multiple abilities (attention, memory, etc.) so we are not able to discern which abilities were most strongly related to response time. Second, previous research has indicated that self-reported cognition is only moderately correlated with performance-based cognitive test scores, and equally correlated with symptoms of depression [37]. Finally, we cannot form conclusions about the causes of cognitive limitations (e.g., neurological condition), or how they contribute to response time. Individuals in this sample with stroke or MS displayed the slowest response times to PROM items compared with individuals with ALS, PD, or epilepsy. However, these participants were not recruited to address condition-specific hypotheses. The finding that people with stroke and MS were slowest may have resulted from the particular people with those conditions who were recruited, and we are not interpreting group response times as generalizable findings. Future research should investigate at a more granular level and with performance-based cognitive tests how specific conditions and cognitive substrates relate to PROM appraisal processes and effort.

Respondents took longer to respond to PROM items that contained more syllables. While this main effect is a relatively superficial finding, a more interesting effect is that people with lower self-reported cognitive abilities were disproportionately slowed by longer items. This disproportionate slowing may result from limits in verbal working memory capacity, which can constrain sentence processing depending on the level of attention to the item or perceived item demands [38, 39]. Notably, we did not examine sublexical (e.g., syllable frequency) or syntactic complexity contributions to response time, which may have influenced our findings about item length and response time.

Whether items used perceptual phrasing (e.g., about symptom frequency) or evaluative phrasing (e.g., about difficulty completing a task) was also associated with response times. Although we did not have an a priori hypothesis about the direction of this effect, our findings were consistent with the model of Schwartz and Rapkin [21]: evaluation-based questions are often associated with longer response times because these items involve more idiosyncratic appraisal processes. Indeed, respondents who reported cognitive limitations were disproportionately slowed by evaluation-based questions, which further supports the hypothesis that evaluation-based questions require more processing. For example, unlike questions about frequency (e.g., “how often do you take your medications as prescribed?”), questions about difficulty (e.g., “how difficult is it for you to take your medications as prescribed?”) may require the respondent to generate ideas about what is meant by difficult (e.g., mental effort to remember medications, frequency of errors, and number of times corrected by a care partner), and to recall personal experiences over the response period. Our results indicate that people with perceived cognitive limitations are disproportionately slow when completing those mental processes.

Consistent with our hypothesis, word imageability was associated with response time such that PROM items with less imageable (more abstract) words were associated with slower responses. However, this main effect was only significant when the number of syllables was removed from the model. We suspect this was due to a negative statistical suppression effect; imageability and number of syllables were moderately correlated, and logically more abstract concepts may require more words/syllables to describe. When the number of syllables was removed from the model, imageability was a significant predictor of response time. Furthermore, word imageability significantly interacted with self-reported cognition (when the number of syllables was removed from the model), such that people reporting cognitive limitations were disproportionately slow responding to PROM items containing less imageable words. This coheres with other studies indicating that more abstract words/concepts are less robustly encoded within and between lexical networks and, therefore, require greater strength of activation for comprehension and subsequent response [40]. Our data indicate that people reporting cognitive limitations were disproportionately slow when completing items containing abstract concepts. However, more research is needed with more rigorous, experimental designs.

Against our hypotheses, word frequency and AoA were not associated with response time. Neuro-QoL items were, by design, written below a sixth-grade reading level as defined by the Lexile Analyzer, which is based on word frequency and syntactic complexity [41]. This may have restricted the range of word frequencies contained in Neuro-QoL items, as well as the range of AoA, which is strongly correlated with word frequency [42].

Overall, our data indicate that lower self-reported cognitive abilities and some attributes of PROM design are associated with response time, which we interpret as reflecting the amount of processing required for that response. To be clear, we do not think that response time necessarily reflects the validity of the response; within the range of plausible responses analyzed here, we assume all PROM item responses are equally valid. Rather, we interpret response time as a reflection of the amount of processing or effort required for that response. Nevertheless, questions and concerns have been raised about the cognitive burden of some PROMs and the reliability of PROM responses from people with cognitive and language disorders [3, 14]. For example, Carlozzi et al. reported that lower cognitive abilities were associated with lower internal reliability of PROM items completed by people with Huntington’s disease [43]. PROMs that are too cognitively or linguistically difficult for a respondent could result in unreliable score estimates (and, therefore, more difficulty detecting change), invalid score interpretation, subsequent inappropriate clinical action, wasted time, and poor rapport.

The work presented here indicates that the cognitive effort required to complete PROMs may be reduced by relatively simple design changes, for example, using concise phrasing, using more highly imageable (e.g., concrete rather than abstract) words, and composing items to focus on frequency rather than difficulty. A good example of this is the work of Hunting Pompon et al. [10], who modified the Perceived Stress Scale [44] to be more accessible to people with post-stroke aphasia by simplifying the phrasing of the scale’s instructions and items, altering the scale’s format to include more white space to lessen visual distraction, and adding a graphic element to the Likert-scale response options. There are likely other design factors associated with response time that deserve future study, for example, syntactic complexity, phonotactic probability, emotional evocativeness, and graphic response options. Future studies of PROM response times should take place under controlled, experimental conditions and capture more granular differences in response time. Ultimately, research on the cognitive demands of PROM completion is potentially important for all respondents, so that the field may have a more complete understanding of response error [8, 9], more valid interpretations of PROM scores, and development of PROMs that are more accessible to people with cognitive and language challenges [10].