Introduction

Patient-reported outcome (PRO) measures are increasingly used to assess symptoms and functioning, domains that rely heavily on patient input to rate the presence and severity of problems [1,2,3,4]. PROs typically require persons to have the cognitive and communicative capacity to assess what they are experiencing and communicate it to others. There are situations, however, where assessment by a caregiver or other proxy may either be necessary (e.g., substituted ratings in patients with substantial cognitive impairment) or of added benefit (e.g., if the proxy reporter has a different perspective that complements ratings provided by the patient) [5,6,7,8]. This raises the essential question of how accurately a proxy can evaluate a domain that is experienced solely or predominantly by the patient. Answering it is important for deciding if and when clinicians should gather proxy reports and ultimately act upon the information.

Pain, anxiety, and depression (the PAD triad) are among the most common and disabling symptoms in both general and clinical populations and are frequently under-recognized and suboptimally treated [9]. They commonly co-occur, adversely affect one another's treatment responsiveness, and account for an enormous amount of disability as well as direct and indirect medical and societal costs [8]. The PHQ-9 depression, GAD-7 anxiety, and PEG pain scales are among the most widely used PROs for assessing PAD symptoms in both research and clinical practice [10,11,12], and the PHQ-9 and GAD-7 have been recommended as core outcome measures in older adults [13]. Additionally, aging is often accompanied by multimorbidity, i.e., the co-occurrence of several diseases in the same individual. To address this issue, SymTrak was recently developed and validated as a multi-dimensional scale that measures common symptoms and impairments in older adults [14, 15].

In this paper, we analyze PRO data from older adults and caregivers who participated in a cohort study to develop the SymTrak scale. Specifically, we focus on SymTrak and PAD scale scores as the PROs of interest. Our objectives are to:

  1. Assess the internal consistency and test–retest reliability of patient and caregiver PRO report;

  2. Determine item-level and scale-level agreement (concordance) between patient and caregiver PRO report;

  3. Examine the association between patient-caregiver concordance and symptom severity, while adjusting for potentially confounding covariates.

Methods

Study participants

A group of 576 participants (188 patient-caregiver dyads and 200 non-dyadic patients without an identified caregiver) recruited from an academic-affiliated primary care network of clinics constituted the sample for this study. Patient inclusion criteria were: (1) age ≥ 65 years, (2) at least one primary care visit in the past 12 months, (3) at least one chronic condition according to medical records, and (4) for participants with an informal caregiver, a caregiver who was ≥ 21 years of age and willing to participate in the study. Patients with a serious mental illness such as bipolar disorder or schizophrenia, or who resided in a long-term care facility, were excluded. The study was approved by the Indiana University Institutional Review Board (IRB #1308983443), and all participants provided written informed consent. Further details of the study procedures are described elsewhere [14, 15].

Measures

Participants completed by telephone interview a brief survey that included the scales assessed in this study. A random third of the patients and caregivers were re-interviewed 24 h later to assess test–retest reliability. Caregiver and patient versions of the scales had identical item wording and response options except that “your loved one” was substituted for “you” in the caregiver version items to ensure proxies were reporting from the patient’s perspective.

SymTrak is a 23-item multi-morbidity scale that focuses on clinically actionable symptoms and impairments common in older adults. Response options for each item are: 0 = Never, 1 = Sometimes, 2 = Often, 3 = Always. Thus, SymTrak scores range from 0 to 69. The construct and factorial validity as well as the sensitivity to change of SymTrak have been demonstrated [14, 15].

The PEG is a 3-item pain scale that assesses average pain intensity as well as pain interference with enjoyment of life and general activities in the past week. Each item is scored on a 0 to 10 scale, with the PEG score being the mean of the 3 items and higher scores representing worse pain. The validity and responsiveness of the PEG are comparable to those of longer legacy pain measures [16, 17].

The Patient Health Questionnaire 9-item depression scale (PHQ-9) is one of the most widely used measures for assessing depression in both clinical practice and research [10, 12]. In this study, we used the PHQ-8, which is identical to the PHQ-9 except that it omits the ninth item, which assesses suicidal ideation. The PHQ-8 is often used in clinical research settings where depression is not the primary outcome and endorsement of the ninth item is most often a false-positive response for active suicidal ideation. Because the ninth item is the least frequently endorsed, multiple studies have shown that group mean scores are nearly identical for the PHQ-8 and PHQ-9, as is the optimal screening cutpoint of ≥ 10 [18].

The Generalized Anxiety Disorder 7-item anxiety scale (GAD-7) is a measure for anxiety screening and severity assessment [10]. Although initially developed as a measure for generalized anxiety disorder, the GAD-7 also has good operating characteristics as a screener for panic disorder, social anxiety disorder, and posttraumatic stress disorder [19]. It is one of the most widely-used brief anxiety measures [20].

The difficulty subscale from the Oberst Caregiver Burden Scale was used to measure caregiver perceptions of difficulty for 15 different types of caregiving tasks [21]. Each of the 15 items is rated on a 5-point scale ranging from 1 (not difficult) to 5 (extremely difficult). Thus, scores range from 15 to 75, with higher scores representing greater caregiver difficulty.

Analysis

Cronbach’s alpha was used to assess internal consistency reliability. The intra-class correlation coefficient (ICC) was used for two types of analyses. The absolute-agreement version of the ICC was used to assess test–retest reliability for scale scores, with occasions specified as a random effect. Agreement between patients and caregivers on scale scores was assessed using the absolute-agreement ICC, with a fixed effect for rater (patient, caregiver). Cronbach’s alpha and the ICC were considered acceptable if ≥ 0.70. Agreement on patient-caregiver item-level ordinal responses was assessed with the weighted kappa, using the commonly applied Fleiss-Cohen quadratic weights [22], as the primary statistic and the Spearman correlation coefficient as a secondary index. Notably, the formula for the quadratically weighted kappa is nearly identical to the formula for the ICC specified above [23]. As a sensitivity analysis, we also calculated weighted kappas using linear weights. Agreement was considered substantial if kappa was 0.61 to 0.80, moderate if 0.41 to 0.60, fair if 0.21 to 0.40, and slight if 0.01 to 0.20 [24]. Scatter plots of patient versus caregiver total scores were used to examine whether agreement varied with symptom severity.
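To make the item-level agreement statistic concrete, the following minimal Python sketch computes a weighted kappa for one item's paired patient and caregiver responses. The function name and example data are illustrative only; the study's analyses were presumably run in standard statistical software.

```python
def weighted_kappa(x, y, n_categories, weights="quadratic"):
    """Weighted kappa for two raters' ordinal responses to one item.

    x, y: lists of integer category codes 0..n_categories-1, e.g.,
    paired patient and caregiver answers to a single SymTrak item.
    weights: "quadratic" (Fleiss-Cohen) or "linear".
    """
    n = len(x)
    k = n_categories
    # Observed joint proportions and the row/column marginals.
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(x, y):
        obs[a][b] += 1.0 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    # Disagreement weight for cell (i, j): 0 on the diagonal,
    # growing linearly or quadratically with category distance.
    def w(i, j):
        d = abs(i - j) / (k - 1)
        return d * d if weights == "quadratic" else d

    # Kappa = 1 - (weighted observed disagreement) / (weighted
    # disagreement expected under independent raters).
    wo = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    we = sum(w(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    return 1.0 - wo / we
```

Because quadratic weights penalize adjacent-category disagreements relatively lightly, quadratic kappas tend to exceed their linear counterparts when most disagreements are small, consistent with the linear-weight sensitivity analysis described above.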

Measurement error bars were based upon the standard error of measurement (SEM), calculated as SD × √(1 − α), where α = Cronbach’s alpha. Error bars were set at ± 2 SEM since differences larger than this are considered by some to represent minimally important differences [25]. Multinomial logistic regression modeling was done to explore patient and caregiver characteristics associated with caregiver overreporting and underreporting (i.e., caregiver-reported score > 2 SEM higher or lower than the patient-reported score, respectively). Odds ratios (ORs) and 95% confidence intervals were reported.
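The SEM calculation and the ± 2 SEM discordance classification can be sketched as follows; the numeric values in the comments and tests are hypothetical, not the study's actual scale SDs or alphas.

```python
import math

def sem(sd, alpha):
    """Standard error of measurement: SD * sqrt(1 - Cronbach's alpha)."""
    return sd * math.sqrt(1.0 - alpha)

def classify_dyad(patient_score, caregiver_score, sem_value):
    """Classify one dyad against the +/- 2 SEM concordance band.

    The difference is caregiver minus patient, matching the study's
    definitions of over- and underreporting.
    """
    diff = caregiver_score - patient_score
    if diff > 2.0 * sem_value:
        return "overreporting"
    if diff < -2.0 * sem_value:
        return "underreporting"
    return "concordant"

# Illustrative numbers (hypothetical): a scale with SD = 10 and
# alpha = 0.91 has SEM = 3.0, so caregiver-minus-patient differences
# beyond +/- 6 points would count as discordant.
```

In the regression models, the three labels returned here would form the 3-level dependent variable, with "concordant" as the reference group.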

Results

Participant characteristics

Dyads. Patients and caregivers were diverse with respect to race, education, income, and marital status (Table 1). Of the 203 recruited patient-caregiver dyads, some patients or caregivers subsequently declined to participate, yielding 188 dyads available for concordance analyses. Most caregivers were either a child (42%) or a spouse/partner (37%) of the dyadic patient (Table 1). The mean (SD) Oberst caregiver task difficulty score was 20.7 (8.6), with a range of 15 to 65.

Table 1 Characteristics of patient and caregiver dyads

Non-dyadic patients. Patients without a participating caregiver were included in the scale reliability analyses and, compared to dyadic patients, were significantly younger by an average of 2 years (p = 0.01), less likely to be married or living with a partner (p < 0.0001), and had lower household income (p < 0.0001).

Reliability

Internal consistency reliability was high for all four scales and comparable across non-dyadic and dyadic patients as well as caregivers (Table 2). Of the 16 Cronbach’s alpha calculations, 13 ranged from 0.83 to 0.94. The remaining three ranged from 0.75 to 0.78 and all related to depression.

Table 2 Internal Consistency and Test–Retest Reliability of the Four Scales

Test–retest reliability also revealed high agreement for all scales and was generally similar across patients and caregivers (Table 2). Of the 16 test–retest calculations, 12 were 0.80 or greater, 3 were 0.73 to 0.74, and one was 0.63. The four test–retest results less than 0.80 were either depression (n = 2) or anxiety (n = 2).

Concordance for SymTrak

As shown in Table 3, the SymTrak mean total scores were similar for patient-reported and caregiver-reported proxy scores (17.5 vs. 18.1). Concordance for the total score was in the poor to moderate range (ICC = 0.48; Spearman’s correlation = 0.49). Item mean scores were quite similar, with no patient-proxy item mean differing by more than 0.2. In addition, 18 of the 23 items showed either moderate (n = 8) or fair (n = 10) patient-caregiver concordance, as reflected by a weighted (quadratic) kappa of ≥ 0.40 and ≥ 0.20, respectively. Spearman correlation results generally paralleled weighted kappa findings. Linear weighted kappas were typically somewhat lower than quadratic weighted kappas. Of the five items with poor concordance, two were psychological (items 14 and 19), two were cognitive (items 20 and 22), and one was trouble with urination.

Table 3 Patient-Caregiver Concordance for SymTrak Scale

Concordance for pain, anxiety and depression

Table 4 summarizes patient-proxy agreement for PEG pain, GAD-7 anxiety, and PHQ-8 depression scores. In general, concordance was higher for pain than the two psychological scores. PEG total and item mean scores were similar between patients and caregivers, and both the ICC and Spearman’s correlation were 0.58.

Table 4 Patient-Caregiver Concordance for Pain, Anxiety, and Depression Scales

Conversely, agreement regarding anxiety and depression was lower. The total score ICC was only 0.28 for anxiety and 0.25 for depression. In addition, the highest weighted kappa at the item level was 0.31, and the kappa was < 0.20 for 4 of the 8 depression items and 2 of the 7 anxiety items.

Comparison of concordance for items shared by SymTrak and legacy scales

There are 10 items in common between SymTrak and the legacy scales for which the conceptual content is the same and the item wording is either identical or relatively similar, including 6 PHQ-8 items, 3 GAD-7 items, and 1 PEG item. Table 5 shows that patient-proxy concordance was generally comparable for items measured by two different scales.

Table 5 Comparison of Patient-Caregiver Concordance on Items Common to SymTrak and Legacy Scales

Concordance related to symptom severity

Figure 1 displays the scatter plots showing the association between patient- and caregiver-reported scores. For multi-morbidity and pain (Figs. 1a, b), concordance is generally linear and stronger (steeper slope) at lower scores and tends to decrease or plateau at higher scores. In contrast, concordance for the two psychological scores (Figs. 1c, d) is generally weaker and bidirectional, with a positive slope at lower scores and a flat or slightly negative slope at higher scores. For all 4 scales, most caregiver reports outside the 2-SEM concordance bars exceed patient reports at lower scores and are less than patient reports at higher scores. Additionally, the underestimate by proxies at higher scores is greater for psychological symptoms.

Fig. 1
figure 1

Scatter plot of Caregiver (Proxy) Report versus Patient Self-Report for Patient Symptoms on 4 Scales: a SymTrak multimorbidity scale; b PEG pain scale; c PHQ-8 depression scale; d GAD-7 anxiety scale. Higher scores on all 4 scales indicate greater (worse) symptom severity. The solid straight line represents theoretical perfect agreement, and the dotted lines are the measurement error bars representing ± 2 SEM (standard error of measurement). The solid curvilinear line represents the fitted actual agreement derived from linear regression, and the shaded area represents the confidence limits around the actual agreement

Predictors of discordance

Table 6 summarizes the results of multinomial logistic regression modeling conducted to explore patient and caregiver characteristics associated with discordance defined as caregiver scores > 2 SEM higher or lower than patient scores (i.e., caregiver overreporting or underreporting, respectively). Discordance was common for the 4 scales, ranging from 33.7% for PHQ-8 depression to 62.4% for GAD-7 anxiety. The severity of patient-reported scores and caregiver task difficulty were associated with discordance. Specifically, higher (worse) patient-reported scale scores were associated with caregiver underreporting and higher caregiver task difficulty was associated with overreporting. Each 1-point increase in the patient-reported scale score increased caregiver underreporting by an OR of 1.17 to 1.60 across the 4 scales, and each 1-point increase in the Oberst caregiver difficulty score increased caregiver overreporting by an OR of 1.07 to 1.10. For a few scales (and complementing these findings), higher patient-reported scores decreased overreporting, and higher caregiver task difficulty decreased underreporting. Other patient and caregiver characteristics were not associated with discordance but were retained in the models to control for their potential effects.

Table 6 Multinomial Logistic Regression Models: Correlates of Caregiver Overreporting and Underreporting of Patient-Reported Scale Scores. (Four models were run, one for each scale score. Each model has a 3-level dependent variable, with concordance as the reference group; the odds ratios (ORs) for over- and underreporting are relative to the reference group. Differences are calculated as (caregiver-estimated score) − (patient-reported score). Concordance was defined as a difference within ± 2 standard errors of measurement (SEM); overreporting as a difference > 2 SEM higher; underreporting as a difference > 2 SEM lower. Percentages in column headings are the proportions of caregivers that over- and underreported for each scale.)

Discussion

Our study has several important findings. First, both patient and caregiver PRO reports had excellent internal consistency and test–retest reliability. Second, caregiver reports tended to approximate patient reports when total scale and item-level scores were averaged at the group level (i.e., with respect to mean scores), whereas there was substantial variability at the level of the individual patient-caregiver dyad, as reflected in scatterplots and ICC/kappa coefficients. Third, higher patient-reported scale scores were associated with caregiver underreporting, whereas higher caregiver task difficulty was associated with overreporting. Fourth, caregiver underestimates of symptom burden at higher levels of severity were more prominent for depression and anxiety than for the pain and multi-morbidity measures.

Other studies have also shown reasonable agreement when comparing patient versus proxy mean scores but greater differences when comparing individual patient-proxy scores. Indeed, discordance rates ranged from 34 to 62% for the 4 scales in our study. The greater heterogeneity in concordance at the individual level requires greater caution when interpreting proxy reports in the clinical setting. Whereas some studies have found no directionality in dyad differences (i.e., a similar proportion of patient scores are over- and under-estimated by the proxy) [26,27,28], research has more commonly shown that proxies tend to overestimate patient-reported symptom severity and impairment [5, 6, 8, 29,30,31,32,33,34,35,36,37,38,39]. Unlike our study, previous studies generally did not evaluate how concordance varies with severity, nor did they adequately control for other potentially confounding patient and caregiver characteristics. Our finding that proxies tend to overestimate impairment at lower levels of symptom severity and underestimate at higher levels therefore warrants further study.

Greater discordance for psychological/internal symptoms than more observable domains such as physical functioning and impairment has also been reported in previous studies [5, 26, 29,30,31, 33,34,35, 38]. Even among the performance-based domains, proxy and patient reports may diverge more for higher level functioning than basic functioning (e.g., instrumental vs. basic activities of daily living) [6, 40,41,42].

Proxy psychological distress and caregiving burden may increase discordance between proxy and patient report, most commonly in the direction of worse ratings of patient PROs [7, 29, 34, 42,43,44,45,46]. Similarly, we found greater caregiver task difficulty led to overestimates of patient symptom severity. Whereas the mechanism for the relationship between caregiving burden and discordance has not been delineated, it is conceivable that caregivers overestimate the patient’s symptom severity as a consequence of their own distress or, alternatively, that patients react to their caregiver’s burden by underestimating self-reported severity. Conversely, neither caregiver sex nor relationship with the patient influenced concordance. Although some studies found that proxies who live with the patient tend to have better concordance, their control for other confounders was less complete than in our study [30, 33, 47].

There is a body of salient pediatric literature comparing proxy reporting (typically by the parent) to child and adolescent self-report. As in the research on adults, several common themes emerge: greater concordance for group-aggregated scores versus individual dyad-level scores; a tendency for parents to overestimate impairment compared to child self-report; better agreement for observable compared to internal domains (i.e., physical compared to psychological); and an adverse influence of parental distress on agreement [27, 43,44,45,46, 48,49,50,51,52,53,54]. Whereas some findings from pediatric studies may be generalizable to proxy reporting in adults, population differences should also be acknowledged. In children, limitations of self-report are more commonly related to developmental factors than to cognitive impairment. Moreover, parents may have a greater actual or perceived responsibility in overseeing treatment and monitoring response. Further, factors such as authority, autonomy, and attachment are not identical for parents and caregivers of adults and thereby may influence the salience and interpretation of proxy report.

Some have made the distinction between what proxies believe the patient would report versus what they think the degree of severity or impairment actually is from their own independent perspective as observers or caregivers [55]. Although we presume that in patients with the capacity to report for themselves, patient report is foremost, it is also possible that the perspectives of persons close to the patient may complement rather than substitute for or replace self-report. Although patients should in most instances still have the primary voice in articulating their level of suffering, distress, or impairment (hence the term patient-reported outcomes), this does not preclude proxies (who know and observe the patient) from having a valuable vantage point that might further inform evaluation and treatment. Whereas information from proxies is essential for patients with impaired capacity to report for themselves (in which case it serves as a necessary substitute for symptom assessment), proxy report may nonetheless augment scores provided by patients able to self-report. Indeed, investigating the agreement between patient and proxy reports represents a comparison of two perspectives rather than a reliability or validity assessment; patients and caregivers are rating different experiences and perspectives (self vs. observer). This contrasts with the typical inter-rater reliability setting where raters are independent observers of the same information. Thus, the generally fair to moderate (instead of substantial) agreement observed here is a point of interest but not an adverse reflection on the psychometrics of the scales. In the absence of an objective criterion standard for symptoms and other predominantly internally experienced (i.e., subjective) phenomena, optimal assessment and therapy might triangulate patient, proxy, and provider/professional perspectives [7, 30, 52, 56].

Because our sample only included cognitively intact adults 65 and older, generalizability to individuals less than 65 as well as individuals with mild to moderate cognitive impairment needs to be further investigated. Moreover, disease progression and functional decline that occur with aging may result in response shift whereby coping, social comparison and other psychological accommodations attenuate the self-evaluation of symptom severity and impairment relative to proxy report [57]. Also, only a few studies have triangulated proxy and patient report with clinician ratings of PROs or performance-based measures relevant to some domains [7, 30, 32, 56, 58]. It would be interesting to do so with the domains assessed by our PAD scales and SymTrak. Additionally, only recently have longitudinal studies compared whether patient and proxy PRO report are similarly responsive over time [59, 60]. This sensitivity to change would be useful to demonstrate for the measures and domains evaluated in our study. Strengths of our study include the size of our sample as well as its racial, educational and income diversity.

In conclusion, proxy PRO reports may be a reasonable alternative in clinical research when patient self-report cannot be obtained and when group mean scores are averaged across individuals. However, the mean scale scores in our sample were relatively low; it is possible that in clinical populations with more severe symptoms or impairment, proxy mean scores may be a less accurate surrogate for patient self-report due to proxy underreporting at the higher score range of scales. In practice, the clinician should realize that proxies tend to underestimate at the higher range of patient-reported depression and anxiety, and this should be considered when making treatment decisions. When both proxy and patient reports are available and clinically significant discordance exists, reconciling the dual perspectives may be preferable to an either-or approach (i.e., one perspective is wrong and the other one is right).