Introduction

Systemic sclerosis (SSc), or scleroderma, is a multisystem autoimmune inflammatory disorder. The disease is characterized by microvascular damage and increased deposition of collagen and other matrix molecules in skin and organ systems [1]. SSc is more prevalent among women, with a female-to-male ratio of 4.7:1 [2]. The clinical course can vary from limited skin thickening to severe organ damage such as pulmonary fibrosis or pulmonary arterial hypertension. The disease can be divided into two subtypes depending on the extent of fibrotic skin involvement: limited cutaneous SSc (lcSSc) and diffuse cutaneous SSc (dcSSc) [1]. There is currently no cure for SSc; thus, treatments focus on reducing disease manifestations and improving health-related quality of life (HRQL) [3].

Depression is more common in SSc patients than in patients with other rheumatic diseases [4, 5]. Symptoms of depression occur in approximately one-third to two-thirds of patients with SSc [6], depending on which questionnaire is used and whether or not the prevalence of depression is based on valid interview methods [7]. Disease-specific symptoms such as reflux, constipation, dyspnea, digital ulcers, pain, fatigue, and changes in physical appearance are associated with negative emotions in SSc [4]. Depressive symptoms are associated with poor HRQL [8]. Patients with SSc who have depressive symptoms are also reported to be less physically active than those without depressive symptoms [9]. Further, depressive symptoms are associated with lower self-efficacy and reduced likelihood of adopting health-promoting behaviors [6, 10].

Depressive symptoms are important to detect and address in patients with SSc. One possible way to capture symptoms of depression is to use patient-reported outcome measures (PROMs) [4]. The Patient Health Questionnaire-9 (PHQ-9) is one such PROM that has been shown to be reliable and valid in SSc, for example in English [11, 12]. The basis of the items in the PHQ-9 is also equivalent to the criteria for depression [13]. A slightly shorter version, the PHQ-8, also exists, wherein the ninth item (thoughts of self-harm and death) is omitted [13]. A high correlation between the PHQ-9 and the PHQ-8 has been found in patients with SSc, and the PHQ-8 is preferred in SSc [14]. Thus, a Swedish version of the PHQ-8 for patients with SSc was of interest for the present study. Health professionals may use the PHQ-8 to detect and facilitate communication about symptoms of depression [15,16,17], support self-management of these symptoms, and refer patients to the appropriate healthcare provider.

The PHQ-8 in Swedish has not been psychometrically evaluated in SSc. However, a linguistically validated version of the PHQ-9 in Swedish can be found on the Pfizer website [18]. A Swedish version of the PHQ-9 has support for internal consistency and concurrent validity among patients with affective disorder diagnoses; meanwhile, high internal consistency and structural validity among patients with self-reported depression have been reported [19, 20]. To determine the quality of PROMs, their measurement properties in the target population need to be studied, and the ability of patients to properly understand questions about symptoms of depression is important [21]. The present study aimed to investigate different aspects of the validity and reliability of the PHQ-8 in Swedish for individuals with SSc.

Methods

This psychometric study included content validity, construct validity (structural validity and hypotheses testing), internal consistency, test–retest reliability, and floor and ceiling effects [21].

Participants

Study participants were recruited from three rheumatology centers in Sweden. The inclusion was based on: diagnosis meeting the 2013 ACR/EULAR criteria for SSc [22], being ≥18 years of age, disease duration of ≥1 year, and ability to understand and speak Swedish. To evaluate content validity, 11 patients from one center agreed to participate, and fulfilled the inclusion criteria (Table 1), while 10 health professionals (HPs) with various occupational backgrounds from two centers were invited and agreed to participate (Table 2). To evaluate construct validity, internal consistency, test–retest reliability, and floor and ceiling effects, 90 patients who fulfilled the criteria for the study from two centers (n = 35, n = 55, respectively) consented to participate (Tables 1, 3). According to the consensus-based standards for the selection of health status measurement instruments (COSMIN) checklist, a sample size of ≥7 individuals (patients or HPs) is considered to be very good for purposes of assessing content validity by qualitative method [21]. Further, a sample size of 50–99 individuals is found to be adequate for the employed assessment of construct validity, internal consistency, and test–retest reliability [21].

Table 1 Characteristics of patients with systemic sclerosis (SSc)
Table 2 Characteristics of health professionals (n = 10)
Table 3 Scores of the patient-reported outcome measures used to assess construct validity (n = 90)

Disease severity variables

Disease severity variables were collected by a rheumatologist. The modified Rodnan skin score (mRSS) evaluates skin involvement. Skin thickness is scored by palpation of the skin in 17 body areas, each of which is scored from 0 (uninvolved) to 3 (severe thickening). The mRSS is reliable and valid in SSc [23].

The Medsger severity scale (MSS) assesses the severity of the disease in nine organ systems. Each organ system is scored separately according to the following: 0 (normal), 1 (mild), 2 (moderate), 3 (severe), and 4 (end-stage). Aspects of validity have been confirmed for MSS in SSc [23]. The following organ systems were used in our study: peripheral vascular, lung, heart, and kidney.

Patient-reported outcome measures

Patients completed PROMs. The PHQ-8 assesses the frequency of depressive symptoms over the past two weeks (Table 4). The items’ response options are scored from 0 to 3 and are summed to generate a total score with a range of 0–24. A score interval of 0–4 indicates no significant depressive symptoms, 5–9 indicates mild depressive symptoms, 10–14 is moderate, 15–19 is moderately severe, and 20–24 is severe [13, 24]. A final item is not included in the total score and is addressed to patients who indicated any problems among the responses; it asks how difficult these problems have made it for the patients to function in different daily life situations. In our study, the response options in that question were scored as follows: not difficult at all = 0, somewhat difficult = 1, very difficult = 2, and extremely difficult = 3. This final item was used in our study to assess if patients had changed during the test–retest period. The PHQ-8 was completed as self-report by paper and pencil.
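The scoring rule described above can be illustrated with a minimal Python sketch. The function and constant names (`phq8_total`, `phq8_severity`, `SEVERITY_BANDS`) are hypothetical and were not part of the study's analysis; the sketch only assumes eight fully answered items, each scored 0 to 3, and the severity intervals stated in the text.

```python
from typing import Sequence

# Severity intervals as stated in the text: (lower bound, upper bound, label).
SEVERITY_BANDS = [
    (0, 4, "no significant depressive symptoms"),
    (5, 9, "mild"),
    (10, 14, "moderate"),
    (15, 19, "moderately severe"),
    (20, 24, "severe"),
]


def phq8_total(items: Sequence[int]) -> int:
    """Sum eight item scores (each 0-3) into a total score of 0-24."""
    if len(items) != 8:
        raise ValueError("the PHQ-8 has exactly eight scored items")
    if any(score not in (0, 1, 2, 3) for score in items):
        raise ValueError("each item must be scored 0, 1, 2, or 3")
    return sum(items)


def phq8_severity(total: int) -> str:
    """Map a total score to the severity label of its interval."""
    for lower, upper, label in SEVERITY_BANDS:
        if lower <= total <= upper:
            return label
    raise ValueError("total score must lie between 0 and 24")


# Illustrative responses only, not study data.
example_total = phq8_total([1, 1, 2, 2, 1, 0, 0, 0])
example_label = phq8_severity(example_total)
```

The final functioning item is deliberately left out of the sketch because, as noted above, it does not contribute to the total score.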

Table 4 Test–retest reliability of the Patient Health Questionnaire-8 in Swedish for individuals with systemic sclerosis (n = 90)

The Scleroderma Health Assessment Questionnaire (SHAQ) was used to assess disability, pain, and disease interference with daily activities. The SHAQ comprises the HAQ-Disability Index (HAQ-DI) with 20 items covering daily activities that result in a total score ranging from 0 (no disability) to 3 (severe disability) and one visual analogue scale (VAS) to assess pain (0–15 cm). The SHAQ includes five additional VAS items that assess gastrointestinal symptoms, lung symptoms, Raynaud’s phenomenon, digital ulcers, and overall disease severity interference with daily activities [25]. The value of the VAS is multiplied by 0.2 to attain a score of 0 to 3. Aspects of reliability and validity have been established for the Swedish SHAQ among patients with SSc [26].

The Multidimensional Assessment of Fatigue (MAF) assesses fatigue with 16 items resulting in a cumulative score ranging from 1 to 50. Elevated scores indicate greater fatigue. Aspects of reliability and validity of the Swedish MAF have been confirmed in patients with SSc [27].

The RAND 36-Item (RAND-36) Health Survey was used to assess HRQL. The RAND-36 contains 36 items divided into the following subscales: physical function, physical role function, bodily pain, general health, vitality, social function, emotional role function, and mental health. The subscale total scores range from 0 to 100, where a higher score indicates a better HRQL. The RAND-36 is comparable with the Medical Outcomes Study 36-item Short-Form Health Survey (SF-36) [28], which has been validated in SSc [29].

Procedures

Evaluation of the PHQ-8 in Swedish (PHQ-8 Swe) for individuals with SSc proceeded in four steps (see the Appendix). (1) The PHQ-8 Swe was developed from an existing Swedish version of the PHQ-9 [18] by omitting item nine. Adjustment of the Swedish translation to individuals with SSc was approved by Professor Kurt Kroenke (personal communication, 2017). (2) A semi-structured interview guide was developed and used for interviews with patients with SSc and HPs within SSc care to evaluate the content validity of the PHQ-8 Swe (Table 5). The interviews were audio-recorded, transcribed verbatim, and analyzed with content analysis. (3) The research team undertook some linguistic adjustments following the analysis of content validity. In addition, two patient research partners reviewed and commented on the PHQ-8 Swe. The PHQ-8 Swe was then back-translated into English for a comparison with the original; no significant changes were found. (4) The PHQ-8 Swe was further evaluated for construct validity, internal consistency, test–retest reliability, and floor and ceiling effects.

Table 5 Interview guide to evaluate content validity of the Patient Health Questionnaire-8 in Swedish

The vast majority of patients completed PROMs and answered questions about sociodemographic data in conjunction with their visits to the hospital (first occasion in the test–retest procedure). The PHQ-8 Swe was completed a second time at the patient’s home and returned by mail in a pre-stamped envelope (retest occasion). The average time interval between the test and retest occasions was 11 (SD 7.4) days.

Ethics approval

The regional committee at Umeå University (No. 2017/149-31) approved the study. Informed written consent was obtained from all patients and HPs who participated in the study in accordance with the Helsinki Declaration. All patients had access to a social worker at the clinic. In the case of severe symptoms of depression, they could be referred to either a psychologist, psychiatrist, or general practitioner.

Statistical analysis

Construct validity by structural validity was analyzed by confirmatory factor analysis (CFA). Single- and two-factor models (cognitive/affective factor items 1, 2, 6, and 7; somatic factor items 3, 4, 5, and 8) were tested based on previous findings of the PHQ-9 in SSc [11]. Maximum likelihood estimation was used to fit the CFA models. The indexes used to assess model fit were the comparative fit index (CFI), the root-mean-square error of approximation (RMSEA), and the Chi-square/degree of freedom (CMIN/DF). Fit was considered acceptable when the CFI was ≥0.90 [30], the RMSEA was ≤0.08 [31], and the CMIN/DF was <3 [32]. The Akaike information criterion (AIC) was used to compare the factor models, with the lowest AIC indicating a more favorable trade-off between fit and complexity [33].
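As a minimal sketch of how these cutoffs and the AIC comparison could be applied once fit indices have been obtained from CFA software, the following Python helpers may be useful. The names `fit_flags` and `model_with_lowest_aic` are hypothetical, and the numbers in the usage lines are illustrative placeholders, not results from this study.

```python
from typing import Dict


def fit_flags(cfi: float, rmsea: float, cmin_df: float) -> Dict[str, bool]:
    """Check each fit index against the cutoff stated in the text."""
    return {
        "CFI": cfi >= 0.90,       # comparative fit index, acceptable if >= 0.90
        "RMSEA": rmsea <= 0.08,   # acceptable if <= 0.08
        "CMIN/DF": cmin_df < 3,   # chi-square / degrees of freedom, acceptable if < 3
    }


def model_with_lowest_aic(aic_by_model: Dict[str, float]) -> str:
    """Return the name of the model with the lowest AIC."""
    if not aic_by_model:
        raise ValueError("at least one model is required")
    return min(aic_by_model, key=aic_by_model.get)


# Hypothetical index values, for illustration only.
flags = fit_flags(cfi=0.95, rmsea=0.06, cmin_df=1.8)
preferred = model_with_lowest_aic({"one-factor": 100.0, "two-factor": 90.0})
```

Reporting one flag per index, rather than a single pass/fail verdict, keeps visible which criterion a model meets or misses.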

Hypotheses testing for construct validity evaluated the associations between the total score of the PHQ-8 Swe and the other outcome measures: SHAQ (HAQ-DI and VAS scales), MAF, RAND-36, mRSS, and MSS. Based on previously reported results of the PHQ-9 in SSc [12, 34], for convergent validity [21], we expected that the PHQ-8 Swe would have at least a moderate correlation with SHAQ (HAQ-DI and VAS scales), MAF, and RAND-36. For divergent validity [35], weak correlations were expected with mRSS, MSS, and disease duration. Spearman’s rank correlation coefficient (rs) was used, as most of our data are ordinal. Correlation interpretations were as follows: 0 = no association; 0.1–0.3 = weak; 0.4–0.6 = moderate; 0.7–0.9 = strong; and 1.0 = perfect [36]. The calculated correlation coefficient values were rounded to one decimal.
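The interpretation scheme above can be sketched in Python as follows. The function name `interpret_rs` is hypothetical. Two assumptions are made that the text does not state: the sign of the coefficient is ignored, so negative correlations are classified by magnitude, and Python's built-in `round` is used, whose handling of exact ties may differ from the rounding applied in the study.

```python
def interpret_rs(rs: float) -> str:
    """Classify a Spearman coefficient after rounding to one decimal.

    Bands follow the text: 0 = no association, 0.1-0.3 = weak,
    0.4-0.6 = moderate, 0.7-0.9 = strong, 1.0 = perfect.
    """
    if not -1.0 <= rs <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    # Work in whole tenths to avoid floating-point comparisons.
    tenths = round(abs(round(rs, 1)) * 10)
    if tenths == 0:
        return "no association"
    if tenths <= 3:
        return "weak"
    if tenths <= 6:
        return "moderate"
    if tenths <= 9:
        return "strong"
    # Values of 0.95 and above round to 1.0 under this scheme.
    return "perfect"
```

For example, a coefficient of -0.62 rounds to a magnitude of 0.6 and would be classified as moderate under these assumptions.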

Internal consistency was determined with Cronbach’s alpha coefficient; an alpha coefficient of ≥0.70 was suggested to be sufficient [37]. The corrected item-to-total correlation was also analyzed and item correlations of >0.30 were interpreted as good [38]. Test–retest reliability was assessed by having patients with SSc complete the PHQ-8 Swe on two occasions. The sign test was used to evaluate whether any statistically significant differences were found between test occasions for the total score and each item. The total score was assessed using an intraclass correlation coefficient (ICC), in a two-way mixed model, and absolute agreement [39]. An ICC of ≥0.70 is considered sufficient when evaluating test–retest reliability [37]. Weighted kappa with quadratic weights was calculated to analyze the agreement between test occasions for each item [40]. Kappa was interpreted as follows: kappa <0.00 = poor; 0.00–0.20 = slight; 0.21–0.40 = fair; 0.41–0.60 = moderate; 0.61–0.80 = substantial; and 0.81–1.00 = almost perfect [41].
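For readers who want to see the arithmetic behind two of these statistics, the following Python sketch implements the standard textbook formulas for Cronbach's alpha and for Cohen's kappa with quadratic weights, together with the kappa interpretation bands listed above. It is not the software used in the study (SPSS and VassarStats); all function names are hypothetical, and item responses are assumed to be coded 0 to 3.

```python
from statistics import variance
from typing import List, Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha; each row holds one respondent's item scores."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def quadratic_weighted_kappa(first: Sequence[int], second: Sequence[int],
                             n_categories: int = 4) -> float:
    """Cohen's kappa with quadratic disagreement weights for paired ratings."""
    n = len(first)
    if n == 0 or n != len(second):
        raise ValueError("ratings must be paired and non-empty")
    k = n_categories
    # Observed joint proportions of (first occasion, second occasion).
    observed: List[List[float]] = [[0.0] * k for _ in range(k)]
    for a, b in zip(first, second):
        observed[a][b] += 1 / n
    row_marg = [sum(observed[i]) for i in range(k)]
    col_marg = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_marg[i] * col_marg[j]
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return 1 - weighted_observed / weighted_expected


def interpret_kappa(kappa: float) -> str:
    """Apply the interpretation bands listed in the text."""
    if kappa < 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"
```

Quadratic weights penalize a disagreement between response options 0 and 3 nine times as heavily as a disagreement between adjacent options, which suits ordinal response scales.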

Floor and ceiling effects were defined as >15% of patients obtaining the lowest or highest possible total score [40].
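A minimal Python sketch of this definition is given below, assuming total scores on the 0–24 range of the PHQ-8. The function name `floor_ceiling_effects` and the example scores are hypothetical.

```python
from typing import Dict, Sequence, Union


def floor_ceiling_effects(totals: Sequence[float], lowest: float = 0,
                          highest: float = 24,
                          threshold: float = 0.15) -> Dict[str, Union[float, bool]]:
    """Share of respondents at the lowest/highest possible total score.

    An effect is flagged when the share exceeds the threshold (>15%).
    """
    if not totals:
        raise ValueError("at least one total score is required")
    n = len(totals)
    floor_share = sum(1 for t in totals if t == lowest) / n
    ceiling_share = sum(1 for t in totals if t == highest) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


# Illustrative scores only: 2 of 10 respondents at the minimum (20%).
example = floor_ceiling_effects([0, 0, 5, 5, 5, 5, 5, 5, 5, 5])
```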

Missing items on the PHQ-8 Swe were handled as follows: no total score was calculated if two items were missing. If one item was missing, the missing score was replaced by the mean of the completed items [13]. Missing items were not replaced when individual items in the PHQ-8 Swe were investigated. Missing items on the other PROMs were treated as described by the developer of the respective PROMs. The choice of statistical tests was supported by the COSMIN checklist [21, 40]. The level of significance was specified at p ≤ 0.05. Statistical analyses were performed using SPSS 25, CFA was performed using Amos 25, and weighted kappa was calculated using VassarStats: Website for Statistical Computation.
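The missing-item rule for the PHQ-8 Swe total score can be sketched in Python as follows. The function name `phq8_total_with_missing` is hypothetical, and missing responses are represented by `None`. One assumption goes beyond the text: the rule for two missing items is read as "two or more", so that no total score is returned in either case.

```python
from typing import Optional, Sequence


def phq8_total_with_missing(items: Sequence[Optional[int]]) -> Optional[float]:
    """Total score with mean replacement of a single missing item."""
    if len(items) != 8:
        raise ValueError("the PHQ-8 has exactly eight scored items")
    answered = [score for score in items if score is not None]
    n_missing = len(items) - len(answered)
    if n_missing >= 2:
        # Assumption: "two items missing" is interpreted as two or more.
        return None
    if n_missing == 1:
        # Replace the missing item with the mean of the seven completed items.
        return sum(answered) + sum(answered) / len(answered)
    return float(sum(answered))
```

With seven items each scored 1 and one item missing, the imputed item is 1 and the total becomes 8.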

Results

Of the total of 101 patients, most had lcSSc and, at the median, mild disease severity in their peripheral vascular system, as well as normal heart and kidney systems (Table 1). The patients (n = 90) participating in the testing of aspects of construct validity and reliability had, at the median, mild disease severity of the lung system. The patients (n = 11) in the assessment of content validity had, at the median, moderate disease severity of the lung system and, at the median, greater skin involvement than those in the assessment of construct validity and reliability (Table 1).

The median PHQ-8 Swe total score was 6 (interquartile range (IQR): 2–12; n = 11); for the PHQ-8 Swe total score of the n = 90 patients, see Aspects of reliability. Of the patients (n = 90) who completed the PHQ-8 Swe, 53% had no significant depressive symptoms, 30% had mild symptoms, 15% had moderate symptoms, 1% had moderately severe symptoms, and 1% had severe symptoms. The final item in the PHQ-8 (not included in the total score), assessing the difficulties caused by symptoms of depression in different daily life situations, had a median of 1 on the first measurement occasion (i.e., “somewhat difficult”; min–max: 0–3; n = 81) and a median of 1 at retest (i.e., “somewhat difficult”; min–max: 0–2; n = 72). There were no statistically significant changes over time (p = 0.84).

Content validity and linguistic adjustments

The results of the evaluation of the content validity of the PHQ-8 Swe are presented in the domains of comprehensibility, relevance, and comprehensiveness [21], with illustrative quotations in Table 6. Overall, the PHQ-8 Swe was experienced as being easy to understand, relevant in item content, and covering important aspects of depression in SSc. However, the following main changes to the PHQ-8 Swe were carried out to boost understanding: The tenses in items 1, 3, 5, 6, and 7 were altered to maintain the same tense throughout all items. Further, item 1 was changed from “little interest” to “felt less interest” and in item 8 the words “could have” were added to prevent misunderstandings in the Swedish language. This change in item 8 is in line with the English original [18] and information from Kurt Kroenke (personal communication, 2017). Finally, the last item (not included in the total score) was clarified. Table 4 contains the PHQ-8 Swe for individuals with SSc.

Table 6 Content validity of the Patient Health Questionnaire-8 in Swedish for individuals with systemic sclerosis

Construct validity

Structural validity: The CFA for the single factor had a near “reasonable” fit with fit indicators: AIC 97.3, CFI 0.891, RMSEA 0.128, and CMIN/DF 2.47. The two-factor model provided a better fit for the data, revealing an “acceptable” fit and AIC 81.5, CFI 0.953, RMSEA 0.086, and CMIN/DF 1.66.

Hypotheses testing for construct validity: Convergent validity was supported by strong correlations between the PHQ-8 Swe and the assessment of pain (HAQ-DI VAS); fatigue (MAF); and physical role function, bodily pain, vitality, social function, and mental health (RAND-36). Moderate correlations were found between the PHQ-8 Swe and disability (HAQ-DI); gastrointestinal symptoms, lung symptoms, Raynaud’s phenomenon, digital ulcers, and overall disease severity interference with daily activities (SHAQ VAS); physical function, general health, and emotional role function (RAND-36); and disease severity of the lung system (MSS) (Table 7). Divergent validity was obtained with weak correlations between the PHQ-8 Swe and skin involvement (mRSS); disease severity of peripheral vascular, heart, and kidney systems (MSS); and disease duration (Table 7).

Table 7 Construct validity (correlations) of the Patient Health Questionnaire-8 in Swedish for individuals with systemic sclerosis

Aspects of reliability

In terms of internal consistency, the Cronbach’s alpha was 0.85 and corrected item-to-total correlation had a median of 0.61 (min–max: 0.41–0.76; n = 87). Of the 90 patients who completed the PHQ-8 Swe on the first test occasion, 81 of them responded to it on the retest occasion. The median of the total score of the PHQ-8 Swe was 4 on the test occasion (IQR: 2–9; n = 89) and also 4 on the retest (IQR: 1–7; n = 81). The ICC was 0.83 (n = 81) for the total score, and the weighted kappa coefficient had a median of 0.72 for the items (min–max: 0.60–0.79). There were no significant differences between test occasions in the total score (p = 0.15) or in seven of the eight items (Table 4).

The total score had no floor or ceiling effects (n = 89).

Discussion

This study evaluates aspects of validity and reliability of the PHQ-8 Swe for individuals with SSc. The results indicate that content validity was satisfactory overall; however, some items could be interpreted as not only related to a depressive symptom but also covering somatic symptoms related to SSc. Further, based on the interviews, some linguistic adjustments were performed. The CFA revealed a better fit for the two-factor model than the one-factor model. The PHQ-8 Swe for individuals with SSc correlates more strongly with self-reported disability, pain, disease interference with daily activities, fatigue, and HRQL than with disease severity assessments, except for a moderate association with lung disease severity. Internal consistency and the test–retest reliability of the PHQ-8 Swe total score were sufficient and there were no floor or ceiling effects.

In terms of content validity, items were expressed as generally relevant and easy to understand, though some linguistic adjustments were made to the PHQ-8 Swe to increase the understanding for individuals with SSc. Further, some HPs experienced a fear of upsetting patients due to the potentially emotionally demanding items. In general, this consideration among HPs is probably unnecessary because individuals with SSc are likely to exhibit depressive symptoms during the disease course [42], which are important to capture.

Some items in the PHQ-8 Swe were found to cover symptoms or problems possibly attributed to the somatic symptoms of SSc. When patients (n = 90) completed the PHQ-8 Swe, items 1 (interest/pleasure), 3 (sleep), and 4 (tired/little energy) held the highest median scores, while items related to sleeping problems and tiredness cover symptoms that can be attributed to somatic symptoms of SSc [43]. Strong correlations were found between the PHQ-8 Swe and fatigue and vitality; others have found similar results [12]. Fatigue could be related to somatic symptoms but on the other hand, fatigue is also a part of the core criteria for depression.

Our interviews indicated that it was difficult to estimate the verbal response options in the PHQ-8. In some previous studies, the verbal response options were changed to the exact number of days, but the original verbal setting has stronger validation data [13]. Items that were suggested for inclusion in the PHQ-8 Swe involved meaning of life, demanding situations for mental health, and self-management strategies. Psychosocial support [42] and support for self-management strategies such as physical exercise are important to these patients [44]. However, the PHQ-8 consists of criteria for depression, and it would be problematic to include items beyond these criteria.

Construct validity by structural validity of the PHQ-8 Swe showed a nearly “reasonable” fit with a one-factor solution, while the two-factor model was considered to have an “acceptable” fitting model and provide a better fit to the data. The authors have not found any results in terms of structural validity regarding the PHQ-8 in patients with SSc, though the PHQ-9 previously confirmed both a single- and two-factor model without substantive differences between them [11].

Hypotheses testing for construct validity revealed weak correlations between the PHQ-8 Swe and skin involvement as well as the objectively assessed disease severity of peripheral vascular, heart, and kidney systems. This suggests divergent validity and that the PHQ-8 Swe does not capture these somatic aspects of the disease in our sample. Similar results, including physician-rated disease severity, have been described in previous studies on the PHQ-9 in SSc [12, 34]. One reason for the low correlations between the PHQ-8 Swe and disease severity could be that the disease severity of the assessed organ systems in the included sample was, at the median, mild or normal. However, the moderate association between the PHQ-8 Swe and the lung system indicates that lungs were more affected than the other assessed organ systems in our sample. Different associations between disease manifestation of the lung system and depressive symptoms have been described in SSc [4, 8], but to our knowledge, no strong associations have been presented [8]. Findings supporting convergent validity revealed moderate-to-strong correlations between the PHQ-8 Swe and disability, pain, disease interference with daily activities, fatigue, and HRQL; these results were comparable to those of previous studies on the PHQ-9 in SSc [12, 34]. There were strong correlations between the PHQ-8 Swe and pain/bodily pain, fatigue/vitality, physical role function, social function, and mental health, indicating that the PHQ-8 Swe for individuals with SSc reflects both physical and mental aspects. The medical treatment indicates that pain was a problem in our sample, which the strong correlation between pain and PHQ-8 Swe also implies, an association in agreement with previous reports [45]. Our results align with those of other studies indicating that symptoms of depression are associated with decreased HRQL more than with organ manifestations that may be life-threatening [8]. However, the relationships are complex; living with depressive symptoms can influence the person’s experienced life situation and, thus, may influence the completion of PROMs.

Although the results (n = 90) of the assessments of the MSS and the medical treatment may indicate severe disease for a number of patients in our sample, and although patients could also have other rheumatic diseases and comorbidities such as cardiovascular diseases, only 17% of the patients had at least moderate symptoms of depression on the PHQ-8 Swe. An earlier study of the PHQ-8 in SSc showed that 26% had at least moderate symptoms of depression, which is somewhat higher than in our sample [14]. However, the percentage of patients with at least moderate symptoms of depression was higher in our study than in the general population in Sweden and the USA [24, 46]. Nevertheless, due to the risk of overestimation when using PROMs, a diagnosis of depression must be confirmed by validated diagnostic interviews [7].

A sufficient internal consistency was found, and these results are comparable with those from the PHQ-9 in patients with SSc [11, 12]. The ICC confirmed sufficient test–retest reliability. To the best of our knowledge, the test–retest reliability of the PHQ-8 has not been assessed previously in SSc. Kroenke et al. [13] assessed the PHQ-9 for test–retest reliability in a primary care setting and found a strong association between the test occasions. The agreement between the items in our study in the test–retest was moderate to substantial [41], though a significant difference was obtained in item 3 in the test–retest procedure. The latter might be the result of fluctuations in trouble falling or staying asleep. However, the difficulties in the symptoms of depression manifesting in daily life did not differ between test occasions, implying stability in the consequences during the testing period. Thus, the test–retest reliability of the PHQ-8 Swe for individuals with SSc is satisfactory for the total score.

One limitation of our study is that convergent validity was not tested with another instrument assessing depression, such as the Center for Epidemiologic Studies Depression Scale [12]. This was not feasible because no questionnaires assessing depression have been psychometrically tested in Swedish among individuals with SSc. However, we found a strong association between the PHQ-8 Swe and mental health in RAND-36. Another limitation is that we did not evaluate the associations between the PHQ-8 Swe and all organ systems in the MSS, or between the PHQ-8 Swe and comorbidities such as cardiovascular diseases. Nevertheless, among patients with comorbidities, almost as many reported no significant depressive symptoms (PHQ-8 scores 0–4) as reported depressive symptoms of different severity (PHQ-8 scores 5–24) (data not shown). A further limitation is that approximately half of the patients scored no significant symptoms of depression during the past two weeks, though they could have experienced depressive symptoms earlier [42]. On the other hand, one-third of the patients scored mild symptoms of depression, while one-sixth had moderate-to-severe symptoms of depression.

In conclusion, in this psychometric study, in which the majority of individuals had lcSSc, the content validity was satisfactory, the reliability was sufficient, and there were no floor or ceiling effects. The PHQ-8 Swe was more strongly associated with self-reported disability, pain, disease interference with daily activities, fatigue, and HRQL than with disease severity assessments, except for a moderate association with lung disease severity. As health professionals struggle to support patients with SSc in self-management, identifying symptoms of depression by the PHQ-8 Swe could be one of several means. Future studies in SSc on other aspects of validity, such as investigating the PHQ-8 Swe’s ability to discriminate between patients with a confirmed diagnosis of depression by validated interviews and those without a diagnosis, are needed.