Background

Psychological distress in terms of depression and anxiety is a growing problem among older people, with a prevalence of depression in European countries of 12% among people aged 65 years or over [1]. For anxiety, the corresponding prevalence varies from 1% to 14% in North America and Europe [2]. Late-life depression can have serious consequences such as increased comorbidity with physical illness, reduced function and increased risk of suicide [3]. Anxiety in this age group can also lead to considerable distress and functional impairment [4]. As a considerable number of older people suffer from symptoms of anxiety and/or depression, there is a need for a brief and feasible instrument, both to identify people at risk and to evaluate interventions and support research.

The Hospital Anxiety and Depression Scale (HADS) is a frequently used self-rating scale developed to assess psychological distress in non-psychiatric patients. It consists of two subscales, Anxiety and Depression [5]. Overall, it has demonstrated satisfactory psychometric properties in different groups: in primary care patients [6], cognitively intact nursing home patients [7], cancer inpatients [8] and in general populations [6, 9]. However, previous studies have suggested different factor structures of the HADS. The hypothesized two-factor structure is most often confirmed [8, 10, 11], but studies have also suggested one factor [12], three factors [13] and four factors [14]. In addition, few studies have evaluated the factor structure in older populations specifically, and the existing studies also show divergent factor structures. A study by Helvik et al. [15] supported a two-factor model in a sample of hospitalized patients 65 years and older. However, three items in their study did not load on the factor expected by the constructors [5]. One study in a community-dwelling population aged 60–80 years confirmed the two-factor structure as the most plausible and also more clinically relevant in comparison with a three-factor model [16]. Gale et al. [10] also found a two-factor model to be more appropriate than a three-factor model in non-clinical populations of older men and women. In contrast, a three-factor model was suggested in a study of older veterans (> 65 years) with limb amputation [13]. As knowledge about the latent structure of the HADS in older populations is limited and the results are contradictory, there is a need to further evaluate the factor structure in a general population of older people.

Previous research has commonly shown that psychological distress is more prevalent among women than men [2, 17]. However, some studies have shown no statistically significant differences between women and men regarding prevalence of depression [15, 18], whereas Martin et al. [19] showed a higher prevalence for men. A possible explanation for these diverging results could be heterogeneity regarding study design, population and measurements. As previous research using the HADS has also shown ambiguous results [20, 21], there is a need to examine whether the HADS is an invariant measure of psychological distress for women and men.

Differential item functioning (DIF) is an often overlooked aspect of validity and occurs when different subgroups respond differently to specific items within a scale, after matching on the underlying latent construct that the item is intended to measure [22]. If DIF is present, it implies that the scale is not measuring the same thing for all respondents, which might lead to incorrect conclusions.

The aim of the study was to evaluate the factor structure of the Hospital Anxiety and Depression Scale (HADS) in a general population 65–80 years old and to examine the possible presence of differential item functioning (DIF) with respect to sex.

Method

This validation study was based on a cross-sectional survey including a random sample of individuals (n = 9968) selected from the total Swedish population aged 65–80 years (N = 1,276,307). The main aim of the survey was to investigate the prevalence of, and association between, depressive symptoms and loneliness in relation to age and sex [21]. Ethical approval was obtained from the Regional Ethical Review Board in Stockholm, Sweden (No. 2010/823–314/4).

Sample and procedures

Participants were randomly selected from a national register of the total population, which includes all persons registered as residents in Sweden. The inclusion criterion was being in the age group 65–80 years. The study was based on postal questionnaires. Statistics Sweden performed the randomization and distributed the questionnaires together with information about the study emphasizing voluntariness to participate and anonymity in relation to researchers. Questionnaires and follow-up letters were sent to non-respondents after three weeks and resulted in a response rate of 67.0% for the total sample (66.6% for women and 67.1% for men). For this psychometric evaluation study, 37 questionnaires had missing data in all items regarding HADS and were therefore excluded, leaving a final sample of 6622 participants.

The questionnaire

The questionnaire was divided into two parts: one part covered, for example, demographics, morbidity and pharmacological treatment, while the other part concerned psychological distress (symptoms of anxiety and depression) measured with the HADS.

The HADS aims to measure symptoms of anxiety and depression and consists of 14 items, seven items for the anxiety subscale (HADS Anxiety) and seven for the depression subscale (HADS Depression). HADS Anxiety focuses mainly on symptoms of generalized anxiety disorder and HADS Depression focuses on anhedonia, the main symptom of depression [23]. Each item is scored on a response scale with four alternatives ranging between 0 and 3. After adjusting for six items that are reverse scored, the responses within each subscale are summed to obtain the two subscale scores. Recommended cut-off scores according to Zigmond & Snaith [5] are 8–10 for doubtful cases and ≥11 for definite cases. An optimal balance between sensitivity and specificity was found using a cut-off score of 8 or above for both HADS Anxiety and HADS Depression [6].
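The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' scoring routine: the function names, the `reversed_positions` parameter and the label "below cut-off" are assumptions introduced here, and the positions of the reverse-scored items must be supplied by the user because they are not listed in this section.

```python
from typing import Iterable, Sequence


def score_subscale(responses: Sequence[int],
                   reversed_positions: Iterable[int] = ()) -> int:
    """Sum seven item responses (each 0-3) into one subscale score.

    reversed_positions holds the 0-based positions of reverse-scored
    items (hypothetical parameter; the actual positions are not given
    in the text and must come from the scale documentation).
    """
    if len(responses) != 7:
        raise ValueError("each HADS subscale has seven items")
    flip = set(reversed_positions)
    total = 0
    for position, value in enumerate(responses):
        if value not in (0, 1, 2, 3):
            raise ValueError("responses must be 0, 1, 2 or 3")
        # A reverse-scored item maps 0->3, 1->2, 2->1, 3->0.
        total += (3 - value) if position in flip else value
    return total


def classify(score: int) -> str:
    """Apply the cut-offs quoted in the text: 8-10 doubtful, >=11 definite."""
    if score >= 11:
        return "definite"
    if score >= 8:
        return "doubtful"
    return "below cut-off"  # label chosen here for scores 0-7


example = score_subscale([0, 1, 2, 3, 0, 1, 2])
print(example, classify(example))
```

A subscale score ranges from 0 to 21, so the cut-off of 8 used in this study separates the lower third of the range from the rest.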

Data analysis

Demographic characteristics are presented with descriptive statistics (frequencies, means, standard deviations, medians and interquartile ranges) and differences between sexes were analyzed with independent sample t-test and chi-square test.
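For readers unfamiliar with the chi-square test used for the comparisons between sexes, the following sketch computes a Pearson chi-square statistic for a 2 × 2 table using only the Python standard library. The study itself used SPSS; the function name and the cell counts in the usage line are hypothetical and do not come from the study data. No continuity correction is applied.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple:
    """Pearson chi-square test for the 2 x 2 table [[a, b], [c, d]].

    Returns (statistic, p_value). With 1 degree of freedom the
    upper-tail probability equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts: rows = women / men, columns = score >= 8 / score < 8.
stat, p = chi_square_2x2(30, 70, 60, 40)
print(round(stat, 2), p < 0.001)
```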

An item analysis was conducted to evaluate score distributions, floor/ceiling effects, and missing data patterns. Analyses of distribution were based on descriptive statistics for ordinal data. However, mean and standard deviations were also calculated for comparisons with previous studies. The D’Agostino test was conducted to evaluate if item and scale scores deviated significantly from a normal distribution. Floor and ceiling effects, which refer to the proportions of participants with the lowest (floor) and highest (ceiling) possible scores, were evaluated using frequency distributions. Up to 20% floor/ceiling effects were considered acceptable in the present study. To test if the data were completely missing at random (MCAR), Little’s chi-squared test for MCAR was conducted for each scale separately. Homogeneity was evaluated with inter-item correlations based on polychoric correlations.
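The floor and ceiling evaluation described above amounts to two proportions per item. A minimal sketch follows, assuming item responses coded 0–3 with `None` for missing answers; the function name and the returned dictionary keys are illustrative choices, and the 20% limit is the one stated in the text.

```python
from collections import Counter


def floor_ceiling(scores, lowest=0, highest=3, limit=0.20):
    """Return floor/ceiling proportions and whether each is within the limit.

    Missing responses (None) are ignored, so proportions are based on
    the valid answers only.
    """
    valid = [s for s in scores if s is not None]
    if not valid:
        raise ValueError("no valid responses")
    counts = Counter(valid)
    floor = counts[lowest] / len(valid)
    ceiling = counts[highest] / len(valid)
    return {
        "floor": floor,
        "ceiling": ceiling,
        "floor_ok": floor <= limit,
        "ceiling_ok": ceiling <= limit,
    }


# Hypothetical responses to one item from ten participants plus one missing.
print(floor_ceiling([0, 0, 0, 0, 0, 0, 1, 1, 2, 3, None]))
```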

A confirmatory factor analysis (CFA) was conducted to evaluate the hypothesized two-factor structure of the HADS (model I): seven items measuring anxiety and seven items measuring depression, without any other modifications. As the model did not perfectly fit the data, a second model was evaluated (model II): a two-factor model with cross-loadings for items 7 and 8. As the item score distributions were highly skewed with pronounced floor effects, a third model was evaluated to identify the impact of this problem on the factor structure (model III): a two-factor model with cross-loadings for items 7 and 8 together with collapsed response categories (categories 2 and 3 for all items). The items were treated as ordered categorical indicators, and consequently a diagonally weighted least squares estimator (WLSMV), based on a polychoric correlation matrix, was used to estimate the parameters of the models. Different goodness-of-fit statistics were used to evaluate the CFA models. A non-significant chi-square test indicates a perfect fit between model and data. However, since this test is highly sensitive to large sample sizes, it should be interpreted with caution. Therefore, we used the following goodness-of-fit criteria: root mean square error of approximation (RMSEA) ≤ 0.06, comparative fit index (CFI) ≥ 0.95 and Tucker-Lewis index (TLI) ≥ 0.95 [24].
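The three fit criteria can be written as a simple check. The sketch below only compares already estimated indices with the thresholds quoted above; it does not fit a CFA model, and the function name is an assumption. The values in the usage line are the rounded indices reported for model I in the Results.

```python
def evaluate_fit(rmsea: float, cfi: float, tli: float) -> dict:
    """Compare fit indices with the criteria used in this study.

    RMSEA <= 0.06, CFI >= 0.95 and TLI >= 0.95; each entry is True
    when the corresponding criterion is met.
    """
    return {
        "rmsea": rmsea <= 0.06,
        "cfi": cfi >= 0.95,
        "tli": tli >= 0.95,
    }


# Model I as reported in the Results: RMSEA = 0.07, CFI = 0.97, TLI = 0.96.
print(evaluate_fit(0.07, 0.97, 0.96))
```

Keeping the three criteria separate, rather than returning one overall verdict, makes it visible when indices disagree, as they do for model I.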

To evaluate internal consistency reliability, an ordinal variant of Cronbach’s alpha was calculated [25]. This calculation is based on polychoric correlations rather than Pearson correlations, but is interpreted in the same way as the traditional Cronbach’s alpha. Thus, alpha values above 0.7 indicate sufficient internal consistency reliability [26]. For comparison, the traditional Cronbach’s alpha was also calculated.
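As a minimal sketch of the computation, and assuming that ordinal alpha is obtained by applying the usual alpha formula to a correlation matrix of the items, the function below takes a square correlation matrix as a list of lists. Estimating the polychoric correlations themselves is not shown; the function name and the example matrix are hypothetical.

```python
def alpha_from_correlations(corr) -> float:
    """Alpha computed from a k x k item correlation matrix.

    alpha = k / (k - 1) * (1 - trace / sum of all elements).
    Passing a polychoric matrix gives an ordinal alpha under the
    assumption stated in the lead-in; a Pearson matrix gives the
    standardized Cronbach's alpha.
    """
    k = len(corr)
    if k < 2 or any(len(row) != k for row in corr):
        raise ValueError("corr must be a square matrix with at least 2 items")
    total = sum(sum(row) for row in corr)
    trace = sum(corr[i][i] for i in range(k))
    return k / (k - 1) * (1.0 - trace / total)


# Hypothetical three-item matrix with all inter-item correlations at 0.5.
example = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
print(alpha_from_correlations(example))
```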

Examination of differential item functioning (DIF) for sex was conducted for each item using ordinal regression analyses. This method makes it possible to test for both uniform DIF (effects of group differences) and non-uniform DIF (effects of differences in group ability) [22, 27]. In the first step (Block I), the item responses were treated as outcome variables predicted by the conditional variable (i.e. total score for HADS Anxiety and HADS Depression respectively). In the second step (Block II), the grouping variable (i.e. sex) was added as a covariate to detect uniform DIF. In the third step (Block III), the interaction term between the conditional variable and the group variable (i.e., sex × HADS Anxiety and sex × HADS Depression) was added as a covariate to test for non-uniform DIF [22]. The change in McFadden R2 between the three models was used to evaluate the effect size of DIF. For an item to be classified as showing DIF, the two-degree-of-freedom chi-squared test in logistic regression must have a p-value < 0.01 and the effect size (change in R2) must be at least 0.13 [22].
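The decision rule for one item can be sketched as follows, assuming the log-likelihoods of the intercept-only model and of the three nested ordinal regression models (Blocks I to III) have already been obtained from statistical software. Fitting the regressions is not shown. The function names and the log-likelihood values in the usage line are hypothetical; the two-degree-of-freedom test compares Block III with Block I, and for 2 degrees of freedom the chi-square upper-tail probability is exp(−χ2/2).

```python
import math


def mcfadden_r2(ll_model: float, ll_null: float) -> float:
    """McFadden pseudo R2 = 1 - LL(model) / LL(intercept-only model)."""
    return 1.0 - ll_model / ll_null


def dif_screen(ll_null, ll_block1, ll_block2, ll_block3,
               alpha=0.01, min_delta_r2=0.13):
    """Screen one item for DIF from four log-likelihoods.

    Block I: total score only; Block II: + group; Block III: + interaction.
    """
    r2_1, r2_2, r2_3 = (mcfadden_r2(ll, ll_null)
                        for ll in (ll_block1, ll_block2, ll_block3))
    # Likelihood-ratio test of Block III against Block I (2 df).
    chi2 = 2.0 * (ll_block3 - ll_block1)
    p_value = math.exp(-chi2 / 2.0) if chi2 > 0 else 1.0
    total_delta = r2_3 - r2_1
    return {
        "uniform_delta_r2": r2_2 - r2_1,
        "nonuniform_delta_r2": r2_3 - r2_2,
        "total_delta_r2": total_delta,
        "chi2_2df": chi2,
        "p_value": p_value,
        "dif": p_value < alpha and total_delta >= min_delta_r2,
    }


# Hypothetical log-likelihoods: significant test, negligible effect size.
print(dif_screen(-1000.0, -700.0, -695.0, -694.0))
```

With these illustrative numbers the chi-squared test is significant at the 0.01 level, yet the change in R2 is far below 0.13, so the item is not flagged. Requiring both conditions is what protects the rule against trivial effects in very large samples.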

The analyses were conducted with SPSS Statistics 20.0 (IBM Corp, Armonk, NY, USA), Mplus 7.4 (Muthén & Muthén, Los Angeles, CA, USA) and R 3.3.0 (the R Foundation for Statistical Computing, Vienna, Austria).

Results

Sample characteristics

The overall mean age was 71.2 years (SD = 4.5). The sample consisted of almost as many men as women, 48.4% and 51.6% respectively. A majority were married/cohabiting (70.5%), were retired (80.2%) and reported primary school as the highest education level (49.1%). The proportion of participants scoring ≥8 on HADS Anxiety was 10.7% for the entire sample, and this was significantly more common among women than men (14.1% vs. 7.0%, p < 0.001). The corresponding proportion for HADS Depression ≥8 was 9.8% and, in contrast to anxiety, this was significantly more common among men than women (10.6% vs. 9.1%, p < 0.05). Antidepressant medication was prescribed for 8.0% and anxiolytic medication had nearly the same prescription rate (7.0%). Less than 1% had visited a psychologist or welfare officer during the last three months (Table 1).

Table 1 Demographic characteristics of the study population in relation to sex

Item score statistics

Item scores for both HADS Anxiety and HADS Depression deviated significantly from a normal distribution, both graphically (normal probability plot) and statistically (D’Agostino test, p < 0.001). No ceiling effects were present, but all items showed floor effects. The proportion of responses in the lowest response alternative ranged between 52.8% and 77.7% for the items in HADS Anxiety and between 36.1% and 76.4% for HADS Depression (Table 2).

Table 2 Item and scale score statistics (n = 6622)

The proportion of missing data was low and ranged between 0.6% and 1.4% for items in HADS Anxiety and between 0.2% and 0.9% for items in HADS Depression (Table 2). However, according to the Little MCAR test, data were not missing completely at random for either HADS Anxiety (χ2(120) = 278.6, p < 0.001) or HADS Depression (χ2(149) = 278.6, p < 0.001).

The homogeneity was satisfactory: the mean inter-item correlations were 0.61 (SD = 0.07, range = 0.51–0.75) for HADS Anxiety and 0.51 (SD = 0.10, range = 0.35–0.68) for HADS Depression (Table 3).

Table 3 Inter-item correlation matrix based on polychoric correlations, pairwise deletion (n = 6622)

Factor structure

The two-factor model (I) without modifications showed a reasonable but not perfect fit between model and data. The factor loadings ranged between 0.73 and 0.84 for HADS Anxiety and between 0.54 and 0.82 for HADS Depression. The RMSEA was close to but still above the critical value of ≤0.06 (RMSEA = 0.07). In contrast, both CFI and TLI indicated a good model fit according to the critical value of ≥0.95 (CFI = 0.97, TLI = 0.96). Based on the modification index and recent research [15, 28], items 7 and 8 were allowed to cross-load on both factors (i.e., HADS Anxiety and HADS Depression) in model II. This model demonstrated an excellent fit according to all goodness-of-fit indices (RMSEA = 0.05, CFI = 0.98, TLI = 0.98), and the modification index indicated no further need for revision. However, the factor loadings as well as the cross-loadings for items 7 and 8 decreased to below 0.5 for both HADS Anxiety and HADS Depression. Furthermore, the residual variance increased for both items compared with model I. All other factor loadings were >0.5 and ranged between 0.74 and 0.85 for HADS Anxiety and between 0.55 and 0.84 for HADS Depression (Table 4, Fig. 1).

Table 4 Goodness-of-fit indices for the confirmatory factor analyses models (n = 6622)
Fig. 1 Parameter estimates (i.e., factor correlations, factor loadings, cross-loadings and residual variances) from model I (outside brackets) and model II (inside brackets)

As the item score distributions were highly skewed, a third model was evaluated to address this problem, in which categories 2 and 3 were collapsed. This third model demonstrated model fit at the same level as model II (RMSEA = 0.05, CFI = 0.98, TLI = 0.98) (Table 4). In this model, factor loadings varied between 0.33 and 0.83 for HADS Anxiety and between 0.40 and 0.84 for HADS Depression. As in model II, only the factor loadings and cross-loadings for items 7 and 8 were < 0.5.
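Collapsing the two highest response categories is a simple recoding step before the model is re-estimated. A minimal sketch, assuming responses coded 0–3 with `None` for missing values; the function name is hypothetical, and coding the merged category as 2 is a choice made here for illustration.

```python
def collapse_top_categories(responses):
    """Merge response categories 2 and 3 into a single category coded 2.

    Missing values (None) are passed through unchanged.
    """
    return [None if r is None else min(r, 2) for r in responses]


print(collapse_top_categories([0, 1, 2, 3, None]))
```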

Internal consistency reliability

The internal consistency reliability, assessed with ordinal alpha, was 0.92 for HADS Anxiety and 0.88 for HADS Depression. The corresponding internal consistency measured with traditional Cronbach’s alpha was 0.87 and 0.81 respectively.
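As a rough consistency check, and assuming that ordinal alpha corresponds to the alpha formula applied to the polychoric correlation matrix, alpha can be approximated from the number of items k = 7 and the rounded mean inter-item correlations reported above (0.61 for HADS Anxiety and 0.51 for HADS Depression). The symbols k and r̄ are introduced here for illustration.

```latex
\alpha = \frac{k\,\bar{r}}{1 + (k - 1)\,\bar{r}}
\qquad
\alpha_{\text{Anxiety}} = \frac{7 \times 0.61}{1 + 6 \times 0.61} = \frac{4.27}{4.66} \approx 0.92
\qquad
\alpha_{\text{Depression}} = \frac{7 \times 0.51}{1 + 6 \times 0.51} = \frac{3.57}{4.06} \approx 0.88
```

Both approximations agree with the reported ordinal alpha values to two decimals.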

Differential item functioning

The results from the ordinal regression analyses are presented in Tables 5 and 6. The conditional variable (i.e. HADS scale scores) was significantly associated with all item responses for both HADS Anxiety and HADS Depression in Block I. The group variable (i.e. sex) was also significantly associated with all items in both HADS Anxiety and HADS Depression (p < 0.001). The same findings were demonstrated when the interaction term (i.e. sex × HADS scale scores) was included in Block III. However, the explained variance according to the McFadden pseudo R2 did not increase by more than 0.01 for the items in HADS Anxiety and 0.02 for the items in HADS Depression across the three models (Blocks I–III). Based on this, no indication of meaningful DIF for sex was detected.

Table 5 Detection of uniform and non-uniform differential item functioning for sex in Hospital Anxiety and Depression Scale - Anxiety (HADS-A), based on ordinal regression
Table 6 Detection of uniform and non-uniform differential item functioning for sex in Hospital Anxiety and Depression Scale - Depression (HADS-D), based on ordinal regression

Discussion

In the present study, the psychometric properties of the HADS have been evaluated in a large general population of older people. Overall, the HADS proved to be a valid instrument for measuring psychological distress in the current population. The original two-factor structure was confirmed, internal consistency was satisfactory and no DIF for sex was detected. Problems with floor effects were shown for all items.

The distribution of item responses was highly skewed towards lower scores and floor effects were shown for all items. However, all of the item response alternatives were endorsed, which indicates that all response categories are relevant. A potential problem with this skewed distribution and floor/ceiling effects could be a negative impact on sensitivity and responsiveness [29]. Skewed distributions have been seen in other studies using the HADS [30, 31] and could therefore be expected. Further, this study was based on data from a general population, in which only a limited proportion has been shown to have symptoms of anxiety and depression. Thus, this problem is probably related to the sample rather than the instrument.

According to the Little MCAR test, missing data were not missing completely at random, which indicates systematic dropout. However, the number of missing responses was very low. Additionally, like many other statistical tests, the Little MCAR test is sensitive to large sample sizes, and a statistically significant result does not necessarily imply that it is clinically important [32]. The low rate of missing data indicates that the items are easy to understand and that the instrument is not too extensive or burdensome for respondents to complete.

The CFA in the present study supported the hypothesized two-factor structure with two latent variables, anxiety and depression, which has also been demonstrated in previous research on community-based healthy older people [10, 16]. However, our results identified problems with cross-loadings for items 7 and 8. This problem has been addressed in previous studies [15, 20]. After these items were allowed to cross-load on both factors (model II), the model fit was excellent according to all indices. It has been suggested that item 8 (“I feel as if I have slowed down”) could be interpreted as age-related slowing down [15] and that item 7 (“I can sit at ease and feel relaxed”) refers both to psychomotor agitation and to the anhedonia domain of the depression subscale and therefore loads on both the anxiety and the depression factor [28]. This may explain why these two items seem to be indicators of both anxiety and depression. Even though the model fit improved in model II, the cross-loadings resulted in poor factor loadings below 0.5 and increased residual variances for both items 7 and 8. These findings indicate that the original two-factor model should be preferred even though its RMSEA is above 0.06. Using the hypothesized two-factor model would also facilitate comparisons between studies. However, this problem needs to be addressed in further studies and users should be aware of this limitation of the HADS.

Because of the skewed distribution, with few responses in the third (2) and fourth (3) categories, these two categories were collapsed in order to examine whether this would improve the model fit further. This third model resulted in excellent fit, very close to the findings from model II. This finding indicates that the skewed distributions, with pronounced floor effects, did not have any serious effect on the factor structure. Note that this third model was evaluated for statistical reasons only and should not be applied in clinical use.

Although anxiety and depression are known to represent two different constructs, they are highly correlated. In our study the correlation between the two latent factors was strong, which is consistent with the understanding that there are symptomatic overlaps between anxiety and depression [33].

The internal consistency is well supported by both the ordinal and the traditional Cronbach’s alpha values for both HADS Anxiety and HADS Depression. This is similar to findings from studies in the same age group [2, 16] and thus supports the robustness of the scale for older people.

Our results show that the HADS can be used to make invariant comparisons between men and women, even though the group variable and the interaction term were significantly associated with item responses for all items. With a large sample, even small and meaningless associations will be highly significant. Therefore, pseudo R2 changes rather than statistical significance should be used to examine DIF. The effect measured with McFadden R2 was low, which implies that no meaningful DIF was present. According to Zumbo [22], R2 changes above 0.130 are required to determine the presence of DIF. This criterion has in later research been criticized for being too liberal [34], but even when the more conservative criterion (R2 ≥ 0.035) suggested by Jodoin & Gierl [35] was applied, no meaningful DIF was present. A few studies have previously evaluated the measurement invariance of the HADS in relation to sex, with diverging results. The HADS was shown to be a valid tool for comparisons between sexes in a population of cardiac patients [36] and in a population of outpatients attending a musculoskeletal rehabilitation program [37]. Yet, when the HADS was evaluated in a population of patients who had undergone heart surgery, DIF for sex was found [38]. Further, DIF for age and sex was found for those 55 years and over in a primary care setting [39]. Even though our study showed absence of DIF for sex in an older general population, there is a further need to evaluate DIF with respect to other grouping variables, such as age and ethnicity.

Methodological considerations

The large random sample of a general population is a strength of the present study, though one consequence of high statistical power is the increased risk of detecting statistically significant results of minor importance. We have therefore combined the use of p-values with other methods to evaluate the psychometric properties, for example graphs and effect size measures. One potential limitation is that the upper age was limited to 80 years. No strong conclusions about the HADS can therefore be drawn for the oldest age groups, and our findings need to be confirmed in age groups above 80 years. The dropout rate of 33% is in line with what could be expected in this type of survey [40]. In fact, the population in this study was in the age group of 65–80 years, where disability and poor health are more common than among younger people. Therefore, the dropout rate can be considered low. A large dropout may have serious consequences for external validity. However, in psychometric studies a large dropout is seldom a problem as long as it does not affect the variation in the data, for example by leaving some response categories unused. According to the score distributions, this was not a problem in the present study. Another strength of this study is that we have used appropriate statistical methods for ordinal-level data, which strengthens the statistical validity.

Conclusions

This study showed that the hypothesized two-factor structure, measuring anxiety and symptoms of depression respectively, is also adequate for a general older population. In addition, internal consistency was satisfactory and no DIF for sex was detected. Problems with floor effects were shown for all items. Even if the floor effects did not have any serious impact on the factor structure in the present study, users should be aware of this problem as it may have negative consequences for both sensitivity and responsiveness. Despite this, HADS can be recommended to assess psychological distress among a general population 65–80 years old.