Background

Increases in health care costs have led to increases in patient cost-sharing arrangements such as high deductibles. The percentage of Americans insured by high-deductible health plans (HDHPs) - also known as consumer-directed health plans (CDHPs) when combined with a health savings account or health reimbursement account - increased from 8% in 2006 to 17% in 2009 [1, 2]. The effects of cost-sharing arrangements typically have been evaluated in two ways: through analyses of claims data and through patient surveys that ask about changes in health care decision making and use.

Some of the most important data on the effects of increases in cost sharing for health care in the United States come from surveys [3-5]. Reed and colleagues [5], for example, asked respondents how often they changed their care-seeking behavior because of their out-of-pocket costs. The investigators compared self-reported and claims-based cost-sharing levels to assess consumers' knowledge of their cost-sharing plan. Fronstin [2] also asked health plan enrollees about changes in care seeking. He found that CDHP enrollees were more likely than traditional-plan enrollees to report that they would change doctors if cost sharing were lower for doctors who used health information technology. Collins and colleagues asked whether, because of cost, respondents did not go to a doctor or clinic when sick; had not filled a prescription; skipped a medical test, treatment, or follow-up visit recommended by a doctor; or did not see a specialist when a doctor or the respondent thought it was needed [6]. The proportion of individuals foregoing these types of care increased from 29 percent in 2001 to 45 percent in 2007 [6].

Despite the widespread use of survey questions to measure changes in health care use and related behaviors, scant data exist on the reliability of such questions. If these questions do not produce reliable responses, studies will have a blunted ability to identify true changes in health care use associated with different health insurance cost-sharing arrangements. A better understanding of the reliability of patients' responses to questions regarding their experiences delaying care and/or their hypothetical care-seeking behavior would improve our ability to predict cost-related changes in health care use.

We sought to measure the reliability of survey questions assessing changes in health care utilization and delayed or foregone care among a group of enrollees in HDHPs. While there is a vast body of literature reporting on the reliability and validity of various psychometric instruments [7-10] and health-related quality of life questionnaires [11-13], similar work has not been done regarding delayed or foregone health care due in part to cost. To our knowledge, this is the first study to examine the test-retest reliability of questions regarding self-reported information use, health care utilization, and delayed or foregone care among HDHP enrollees using a survey instrument administered at two points in time to the same individuals.

In addition to evaluating the test-retest reliability of select survey questions, we also report aggregate differences in information use, health care utilization, and delayed or foregone care in order to evaluate how potential changes, due in part to unreliability (which may be related to question design, survey administration, patient recall, the time interval between responses, data entry, etc.), might affect the findings of an analysis using a single cross-section of survey data.

Methods

Study Population

The study population consisted of enrollees in HDHPs offered by Harvard Pilgrim Health Care, a New England-based non-profit health insurer. A description of the population and plan characteristics has been published previously [14]. The target sample included adults 18 years of age and older who, as of November 2008, were subscribers enrolled in a Harvard Pilgrim Health Care HDHP with an individual deductible of at least $1000 and a family deductible of at least $2000. All non-preventive inpatient, outpatient, and emergency department care was subject to the deductible under these plans. Some preventive care was exempt from the deductible and subject only to a minimal copayment. Some diagnostic testing was also exempt from the deductible (e.g., fecal occult blood tests were exempt but colonoscopy was not). We focused our survey questions regarding hypothetical utilization on diagnostic tests to which the deductible applied (blood tests, colonoscopy, magnetic resonance imaging).

The inclusion criteria also required: (1) continuous enrollment in an employer-sponsored HDHP for at least the previous 6 months; (2) at least one child < 18 years of age also enrolled in the plan; and (3) annualized family out-of-pocket (OOP) costs (defined as outpatient visit and prescription drug co-payments) of at least $500 in an HDHP. This threshold of annualized OOP expenses included 54% of all families who met the other inclusion criteria. We required at least $500 in OOP expenses to ensure that the questions concerning delayed or foregone care would be relevant to potential respondents. We reasoned that individuals who had consumed some care would have some experience with their own plan and with recent decisions to delay or forego care, and would be better able to judge hypothetical decision making at different cost levels.

We oversampled households living in low-income areas by stratifying families into two groups based on geocoded address information: (1) residence in a census block group with a median household income in the lowest quartile of the sample frame and (2) residence in a census block group with a median household income in the second, third, or fourth quartile of the sample frame (i.e., lowest quartile versus other). Low-income individuals were oversampled because nominal increases in cost sharing are disproportionately large relative to this population's income; thus, these enrollees may be more likely to delay or forego care due to cost. Random sampling was performed in each of the two strata until surveys from approximately 200 families in each group were completed, as sketched below.
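To make the two-stratum design concrete, the following minimal Python sketch shows one way to assign strata and draw within them. The data frame and column names are hypothetical, and the study's own sampling code is not reproduced here; in particular, the study continued sampling until roughly 200 surveys per stratum were completed, whereas the sketch shows a single fixed draw.

```python
import pandas as pd

def assign_strata(frame: pd.DataFrame) -> pd.DataFrame:
    """Flag families in the lowest quartile of geocoded block group income."""
    out = frame.copy()
    cutoff = out["median_hh_income"].quantile(0.25)  # lowest-quartile cutoff
    out["stratum"] = out["median_hh_income"].le(cutoff).map(
        {True: "lowest_quartile", False: "upper_three_quartiles"}
    )
    return out

def draw_stratified_sample(frame: pd.DataFrame, n_per_stratum: int = 200,
                           seed: int = 0) -> pd.DataFrame:
    """Random draw within each stratum (fixed draw shown for simplicity)."""
    return (frame.groupby("stratum", group_keys=False)
                 .apply(lambda g: g.sample(min(n_per_stratum, len(g)),
                                           random_state=seed)))
```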

Analyses were stratified by self-reported household income on the survey. The lower-income subgroup was defined by a self-reported household income less than 300 percent of the Federal Poverty Limit (FPL < 300%). Families with incomes at or above this threshold were classified as higher-income.

Survey Design

The survey included several domains: health plan characteristics, attitudes towards health care utilization, unexpected costs, information-seeking behaviors, delayed care, and demographic data. Survey domains and questions were developed based on a previous focus group study in this population and were in some cases drawn from existing national surveys [15]. The draft survey underwent cognitive pre-testing and piloting with a total of 60 respondents. The study was approved by the Harvard Pilgrim Health Care Institutional Review Board.

Survey Dissemination and Re-sampling

The original (i.e., the "test") surveys were mailed between January and March 2009. The cover letter asked the adult in the family who was responsible for the family's health care decisions to complete the survey. We sent two mail waves followed by attempts at telephone administration. Respondents received a $30 gift card to their choice of one of three major retailers. Of the 750 surveys mailed, 229 (30.5%) were completed in the first wave, 130 (17.3%) were completed in the second wave, and 75 (10.0%) were completed by phone, for a total of 434 completed surveys and an overall response rate of 58.1%.

We attempted to administer the follow-up (i.e., the "retest") survey to all 434 families who completed the original survey. The follow-up survey cover letter explained that "the reason we are asking you to do these questions a second time is to test how reliable the questions are - whether they result in the same answers or different answers if asked at different times."

Follow-up surveys were distributed (mail) or administered (telephone) within two weeks of each family responding to the original survey; however, individual response times varied. The follow-up surveys were all completed between February and July of 2009. We attempted to administer the follow-up survey in the same format in which respondents completed the original survey (i.e., mail or phone), using the same incentives and number of mailing waves. The questions were identical between the original and follow-up surveys, except that five questions regarding demographic information (race, ethnicity, language spoken at home, education, income) and two administrative questions (suggestions for improvement, permission to obtain administrative claims data) were omitted from the follow-up survey.

Survey Themes and Questions

The survey covered five broad themes: beneficiary knowledge of their health plan, information seeking, changes in behavior associated with having a deductible, experiences in delayed or foregone care due in part to cost, and hypothetical delays in care due in part to cost. Figure 1 contains the 17 questions tested for reliability under each theme.

Figure 1. Test-retest questions.

Statistical Analysis

Our statistical analysis involved three components: (1) agreement between individual-level responses on the original and follow-up surveys, measured via percent agreement, absolute retraction, absolute initiation, tetrachoric correlation (binary responses) or polychoric correlation (ordinal categorical responses), and kappa statistics; (2) differences in responses between income groups, assessed separately in the original and follow-up surveys; and (3) differences in responses between the original and follow-up surveys, stratified by income group.

We calculated the percentage agreement [16] for each of the 17 questions in order to capture the proportion of respondents who gave identical answers on the original and follow-up surveys. We also calculated absolute retraction and initiation. Absolute retraction is the proportion of individuals who responded positively to a behavior on the original survey (e.g., responded yes to having a health savings account) and negatively on the follow-up survey. Conversely, absolute initiation is the proportion of respondents who responded negatively on the original survey and positively on the follow-up survey (e.g., unlikely to delay care on the original and likely to delay care on follow-up). For questions involving ordinal responses, absolute retraction and initiation were calculated by summing the off-diagonal counts and dividing by the total responses. In the case of absolute retraction, this method counts any reduction in the level of endorsement (e.g., very likely changed to somewhat likely, somewhat unlikely, or very unlikely). Similarly, absolute initiation counts any increase in the level of endorsement (very unlikely changed to somewhat unlikely, somewhat likely, or very likely).
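These measures reduce to simple comparisons of paired responses. The following is a minimal Python sketch (the study's analyses used SAS; the data and coding scheme here are made up for illustration), with responses coded as ordered integers (0/1 for binary items, or e.g. 1 = very unlikely through 4 = very likely for ordinal items):

```python
import numpy as np

def agreement_measures(test, retest):
    """Percent agreement and absolute retraction/initiation for one question."""
    test, retest = np.asarray(test), np.asarray(retest)
    percent_agreement = np.mean(test == retest)  # identical answers on both waves
    retraction = np.mean(retest < test)          # any drop in endorsement level
    initiation = np.mean(retest > test)          # any rise in endorsement level
    return percent_agreement, retraction, initiation

# Illustration with made-up binary data (1 = yes, 0 = no):
orig   = [1, 1, 0, 0, 1, 0]
follow = [1, 0, 0, 1, 1, 0]
pa, retr, init = agreement_measures(orig, follow)
print(f"agreement={pa:.2f}, retraction={retr:.2f}, initiation={init:.2f}")
# -> agreement=0.67, retraction=0.17, initiation=0.17
```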

We also calculated simple (for binary responses) and weighted (for ordinal responses) kappa statistics for each question [17, 18]. We used Landis and Koch's standard criteria for interpreting the strength of kappa: 0.0 - 0.2 (slight); 0.21 - 0.40 (fair); 0.41 - 0.60 (moderate); 0.61 - 0.80 (substantial); and 0.81 - 1.0 (almost perfect) [19, 20]. While kappa is the most common statistic used to compare categorical survey responses at two points in time and has been used to evaluate national surveys in the United States [21], kappa has several weaknesses, including sensitivity to the prevalence of a behavior/response [22] and to the number of categories, and the assumption of independent raters [23-28]. We therefore calculated tetrachoric correlations (TCC) and polychoric correlations (PCC) to provide measures of agreement between the original and follow-up surveys for binary response questions (TCC) and ordinal categorical response questions (PCC). Both the TCC and PCC are independent of prevalence, so low agreement cannot be attributed to low prevalence or a change in prevalence between survey waves [29-32].
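For readers who want to reproduce these statistics, the following Python sketch shows one way to compute them (again, the study itself used SAS). The linear weighting for the ordinal kappa is an assumption, since the text does not specify the weighting scheme, and the tetrachoric correlation is shown via the classical cosine approximation rather than the full maximum-likelihood estimate that dedicated tetrachoric/polychoric routines compute:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa(test, retest, ordinal=False):
    # Linearly weighted kappa for ordinal items, simple kappa for binary items
    return cohen_kappa_score(test, retest,
                             weights="linear" if ordinal else None)

def tetrachoric_approx(test, retest):
    """Cosine approximation cos(pi / (1 + sqrt(ad/bc))) to the TCC."""
    t, r = np.asarray(test, dtype=bool), np.asarray(retest, dtype=bool)
    a, b = np.sum(t & r), np.sum(t & ~r)    # 2x2 cell counts:
    c, d = np.sum(~t & r), np.sum(~t & ~r)  # a = yes/yes ... d = no/no
    if b == 0 or c == 0:
        return 1.0  # degenerate table: perfect-agreement limit
    return np.cos(np.pi / (1 + np.sqrt((a * d) / (b * c))))

def landis_koch(k):
    """Map a non-negative kappa to the Landis-Koch band cited in the text."""
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    return next(label for cut, label in bands if k <= cut)
```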

In a previous study [14], we used χ2 tests to investigate whether lower-income families with out-of-pocket expenditures in HDHPs were more likely than higher-income families to delay or forego health care services, have difficulties understanding and using their plans, or avoid discussing costly services with physicians. We repeated these analyses for the present study but restricted the original sample to individuals who responded to the follow-up survey in order to evaluate the extent to which our initial findings would be reproduced.

We compared the responses in the lower-income group (FPL < 300%) between the original and follow-up surveys using McNemar's test [33, 34]. The same analysis was used to compare responses in the higher-income group (FPL ≥ 300%) between the original and follow-up surveys.
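As an illustration of this within-group comparison, the following Python sketch applies McNemar's test to a paired yes/no response (e.g., delayed or foregone care) across the two survey waves, using statsmodels and made-up data; the study's own analyses were performed in SAS:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_change(test, retest):
    """McNemar's test for a change in a paired yes/no response between waves."""
    t, r = np.asarray(test, dtype=bool), np.asarray(retest, dtype=bool)
    table = np.array([[np.sum(t & r),  np.sum(t & ~r)],
                      [np.sum(~t & r), np.sum(~t & ~r)]])
    # Exact binomial test on the discordant cells; exact=False would give
    # the chi-square version for larger samples.
    return mcnemar(table, exact=True)

# Made-up example: 10 respondents' yes/no answers on both waves
orig   = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
follow = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
print(mcnemar_change(orig, follow))  # prints the statistic and p-value
```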

All statistical analyses were performed using SAS version 9.2 [35].

Results

Of the 434 families who completed the original survey, 387 completed the follow-up survey (310 by mail, 77 by phone) and 47 did not respond, giving a follow-up response rate of 89%. Twenty-four (6%) of the 387 follow-up surveys were completed in a different format than the original survey (23 of the 24 were completed by phone in the follow-up and on paper in the original survey; 1 was completed on paper in the follow-up and by phone in the original).

Although follow-up surveys were fielded within two weeks of receipt of the initial survey, the time between receipt of the initial survey and receipt of the follow-up survey varied. The mean time between initial and follow-up survey receipt was 29.7 days (95% CI: 28.3, 31.0) and the median time was 26 days. Of the 387 follow-up surveys, 75% were returned within 31 days and 95% were returned within 56 days. A little less than 5% (n = 19) of follow-up surveys were returned between 57 and 142 days after the initial survey. Several studies have reported reliable estimates of retrospective and contemporaneous behaviors over 2- to 10-week test-retest periods [8, 9, 36-38].

We tested for differences between follow-up survey respondents and non-respondents in characteristics such as age, sex, minority status, education, family size, income, deductible level, enrollment time, choice of plan, and chronic illness among family members. The two groups differed only with respect to the proportion of respondents with any adult chronic condition in the family (including allergies). Of the 47 non-respondents, 37 (78.7%) had an adult in the family with a chronic condition, compared to 57.8% of follow-up respondents (χ2 p = 0.028).

Question Test-Retest Reliability

Table 1 summarizes the agreement statistics describing the reproducibility of respondents' answers on the original and follow-up surveys. The question on whether the respondent had a special account for health care expenses, such as a health savings account, had the highest test-retest reliability, with a percent agreement of 93%, a TCC of 0.97 (95% CI: 0.95, 0.99), and a kappa of 0.84 (95% CI: 0.78, 0.90). The other items concerning beneficiary knowledge of their plan had moderate kappa values. Although the test-retest kappa statistic concerning choice of plans was 0.61 (substantial), the confidence interval for kappa (0.52, 0.69) overlapped those of the items concerning the ease/difficulty of understanding the plan (95% CI: 0.41, 0.56) and how protected beneficiaries felt from out-of-pocket costs (95% CI: 0.43, 0.57). Note that these confidence intervals are based on the assumption of independent raters; however, the individual answers on the original and follow-up surveys are clearly dependent. As such, the confidence intervals are conservative (wide) and favor finding no difference between kappa values. Comparing overlapping confidence intervals is generally conservative relative to comparisons based on standard error intervals [39, 40]. Nevertheless, the magnitude of the differences in kappa is small compared to the Landis and Koch [19] criteria (0.41 - 0.60 = moderate); thus, comparisons via narrower confidence intervals might be statistically significant but unimportant when the values fall within the same categorical band (i.e., moderate).

Table 1 Agreement statistics for the families and health care costs survey: comparison of responses between the original and follow-up surveys

The results comparing agreement via the TCC or PCC are similar to those using kappa. While the level of agreement reported for each question is generally higher using TCC/PCC, the confidence intervals for these estimates overlap except with respect to having a health savings account. The reliability of the question regarding delays in ED care (PCC = 0.88, 95% CI: 0.82, 0.94) may also be higher than that for delayed or foregone care at "any other time" (PCC = 0.70, 95% CI: 0.58, 0.82).

The test-retest reliability as measured by kappa was moderate for the majority of questions (0.41 - 0.60). Similarly, 14 of the 18 correlations (TCC/PCC) were between 0.62 and 0.76, which could also be interpreted as moderate agreement. Interestingly, the level of test-retest reliability for questions concerning the experience of delays in care did not differ substantially from that for questions regarding hypothetical delays in care. The kappa statistic for delayed or foregone emergency department care was 0.67 (95% CI: 0.58, 0.75), significantly higher than those for most questions related to behavior change associated with having a deductible (i.e., more likely to call/email, more likely to adopt healthy habits, more likely to change the timing of visits). These results remain statistically significant using either kappa or the TCC/PCC.

Reproducibility of Comparisons Regarding Information Seeking and Delayed or Foregone Care

Table 2 shows the results of bivariate analyses comparing income groups based on responses to the original and the follow-up survey. The observed proportions of respondents with delayed or foregone pediatric, adult, or any family care were similar when comparing the original and follow-up surveys. All of the tests comparing income groups in the follow-up survey produced the same result as in the original survey: respondents in the lower-income group were more likely to delay or forego pediatric care, adult care, or any family care in both surveys.

Table 2 Comparison of delayed or foregone care between federal poverty limit < 300% and federal poverty limit ≥ 300% in the original and follow-up surveys

The results of difference-of-proportion tests comparing income groups were also largely similar. However, two tests differed in significance between waves. The test comparing delayed or foregone ED care was statistically significant in the follow-up survey but not in the original survey. Conversely, the test comparing delayed or foregone operations/procedures was statistically significant in the original survey but not in the follow-up survey.

Table 2 also shows the results of our analyses comparing the proportion of respondents with delayed or foregone care between the original and follow-up surveys, stratified by income level. With one exception, none of the difference-of-proportion tests comparing the original and follow-up surveys within income groups was statistically significant: the proportion of respondents in the higher-income group who delayed or went without prescription medications decreased from 15.4 percent in the original survey to 5.2 percent in the follow-up survey (p = 0.016).

Table 3 shows the results of analyses comparing the proportion of respondents who answered that their plan was difficult to understand, that they had unexpected costs, that they felt unprotected from out-of-pocket (OOP) expenses, and the types of information seeking they conducted (i.e., whether required to pay and/or how much they were required to pay). In both the original and follow-up surveys, lower-income respondents were more likely to report that they felt unprotected from OOP expenses.

Table 3 Comparison of information seeking between federal poverty limit < 300% and federal poverty limit ≥ 300% in the original and follow-up surveys (%)

In the original survey, 55.8% of respondents in the lower-income group reported feeling unprotected from OOP expenses versus 44.0% of those in the higher-income group (p = 0.043). The difference between income groups widened in the follow-up survey where 60.8% of lower-income respondents reported feeling unprotected versus 40.2% in the higher-income group (p < 0.0001). None of the other income group comparisons were significant and none of the proportions changed significantly between the original and follow-up surveys.

Discussion

Although surveys are widely used to measure self-reported and hypothetical use of health care by enrollees with cost sharing arrangements, the test-retest reliability of these surveys has not been adequately studied. A better understanding of the reliability of patients' responses to questions on health care seeking behavior is important for improving our ability to identify true changes in health care use associated with different health insurance cost-sharing arrangements.

The test-retest reliability of self-reported plan knowledge, information seeking, and delayed or foregone care reported in this study can generally be characterized as "moderate". None of the questions with kappa statistics in the "substantial" range had confidence intervals that were completely within the substantial range. The apparent consistency of kappa statistics across the various domains is interesting because respondents were as reliable in answering questions about their experiences in delaying care as they were in answering questions about hypothetical delays in care. However, readers should exercise care in interpreting differences in these kappa statistics, as the confidence intervals are conservatively wide.

We found that most of the proportions of respondents reporting delayed or foregone care did not change significantly between the original and follow-up surveys. Only the proportion of higher-income respondents reporting delayed or foregone prescription medications changed. In comparing the lower-income and higher-income groups, only the results concerning delayed or foregone ED care and delayed or foregone operations/procedures changed between the original and follow-up surveys. These results suggest that we can be reasonably confident in our initial analyses of the propensity to delay or forego care in this population of beneficiaries.

Plan Use and Information Seeking

Results across the original and follow-up surveys show that 40 to 50 percent of respondents experienced unexpected costs or felt unprotected from OOP costs. Similarly, only 40 to 50 percent of respondents reported trying to discover whether they would have to pay for a service, and/or how much they would have to pay, since joining their health plan. Although the proportions describing plan use and information seeking in each income group changed slightly between the original and follow-up surveys, none of these changes was statistically significant. Our analysis of the follow-up sample confirmed the results reported for the original study: a higher proportion of lower-income individuals reported feeling unprotected from OOP expenses. We also confirmed our previous finding that none of the other comprehension/information proportions differed between the lower-income and higher-income groups. These results suggest that we can be confident in our initial results comparing plan knowledge and information seeking.

Strengths and Limitations

This study has two main strengths. First, it is the first test-retest reliability study of questions concerning beneficiaries' self-reported understanding and use of their HDHP benefits. Second, we repeated the analyses comparing lower-income and higher-income individuals in addition to calculating a variety of agreement statistics for the individual questions. Repeating the analyses by income allowed us to determine whether there were significant changes in the proportion of people reporting delayed or foregone care between the original and follow-up surveys. We found that we can be confident in our initial results using the original survey data because only one proportion out of eighteen changed, despite the moderate level of reliability captured by the agreement statistics.

One limitation of our study is that the results may be biased by individuals' recall of their initial responses when completing the follow-up survey. Given the moderate level of test-retest reliability, any such learning effect does not appear to have been strong. However, the absolute initiation among respondents with respect to experienced delays (Table 1, Behavior Change) is consistently greater than the absolute retraction. Another limitation is that we do not know the extent to which respondents provided different answers due to events that occurred between completing the original and follow-up surveys. Almost 5% (n = 19) of follow-up surveys were returned between 57 and 142 days (approximately 2 - 5 months) after the initial survey; thus, the possibility that new health care utilization affected follow-up survey responses cannot be ignored. On the other hand, several studies have reported reliable estimates of retrospective and contemporaneous behaviors over 2- to 10-week test-retest periods [8, 9, 36-38].

These data may not be representative of all HDHP populations because our sample was limited to enrollees in one New England health plan. We focused on families who had experienced high costs, who may have experienced more salient events than others or have been more likely to recall events reliably. Survey questions regarding hypothetical utilization centered on diagnostic tests to which the deductible applied; however, confusion over which diagnostic services are subject to the deductible could be an important source of variability in patient responses regarding experiences of delayed or foregone care between the initial and follow-up surveys. Further, our inclusion criterion of $500 in annualized visit and prescription drug co-payments may have excluded families who faced access barriers so significant that they never reached this level of out-of-pocket costs, and it limits our ability to generalize these findings to individuals with either much lower or much higher out-of-pocket spending. The reliability of the survey instrument for HDHP enrollees with no OOP expenses is likely to be lower because they would have little or no experience with their plan and recent decision making. As such, the reliability estimates presented here may be optimistic for a survey fielded to all HDHP enrollees regardless of their plan experience.

As in other studies of HDHPs, families who were enrolled in these plans may differ in important and often unobservable ways from those who were not, whether they actively chose the plan or had no other option. Our measures gauging respondents' willingness to discuss hypothetical recommended services also may not be predictive of their actual behavior, although we found similar reliability for questions regarding experienced and hypothetical delays in care. Finally, the lack of a non-HDHP comparison group limits the degree to which our observed income group differences and similarities can be contrasted with health plans that have small or no deductibles.

Conclusions

In this population of HDHP beneficiaries, we found that self-reported information concerning plan knowledge, information seeking, and delayed or foregone care was moderately reliable. Our results offer reassurance for researchers using self-reported information to study the effects of changes in cost sharing on health care utilization.

Payers and policy makers are increasingly interested in benefit structures that maximize the use of appropriate health care and minimize the use of inappropriate care. The results presented here complement studies using retrospective administrative data to evaluate changes in the use of health care associated with deductible levels. Our results suggest that beneficiary surveys concerning hypothetical changes in deductibles could be reliably used to better understand potential changes in utilization that might occur under different deductible levels and plan designs (i.e., services subject to and exempt from the deductible). As the proportion of Americans with cost-sharing arrangements for health care continues to increase, reliable self-reported information about their health care decision processes and use will become increasingly important.