Introduction

A comprehensive understanding of health requires considering the health status of people based on a bio-psycho-social model [1]. Accordingly, the construct of health-related quality of life (HRQoL) has been established as the third central outcome parameter in health research, in addition to mortality and morbidity. HRQoL is understood as a multidimensional construct that reflects subjectively reported aspects of the physical and mental health of individuals and the impact of health status on quality of life (QoL) [2,3,4].

The Short-Form-36 (SF-36 [5,6,7]) is one of the most frequently used instruments for HRQoL assessment in international health research. With 36 items, the instrument records aspects of physical, mental and social health from the subjective perspective of the respondents. Based on the answers to the 36 single items, the values on the 8 underlying single constructs Physical Functioning (PF), Physical Role Functioning (PR), Bodily Pain (BP), General Health (GH), Vitality (VT), Social Functioning (SF), Emotional Role Functioning (RE) and Mental Health (MH) can be determined. Additionally, the values on these 8 dimensions can be aggregated to a Physical Component Summary (PCS) value and a Mental Component Summary (MCS).

Original factorial SF-8 structure proposed by Ware et al. [8]

To provide a time-efficient screening of physical and mental aspects of HRQoL the SF-8 has been developed. In the SF-8 each of the 8 SF-36 dimensions is represented by a single item [6]. In their original study Ware et al. [8] applied a principal component analysis (PCA) to identify the factorial structure of the SF-8 (see Fig. 1; full model). Factor loadings were allowed for all 8 single items on each of the two uncorrelated constructs PCS and MCS. Nevertheless, both constructs proved to be mainly represented by 6 items. The physical component PCS reflects Physical Functioning, Physical Role Functioning, Bodily Pain, General Health and Vitality. The mental component MCS mainly represents the facets Social Functioning, Mental Health, Emotional Role Functioning, General Health and Vitality [8]. Accordingly, General Health and Vitality proved to be germane indicators of both underlying constructs PCS and MCS (see Fig. 1; restricted model structure).

Fig. 1

Structure of the full and restricted (without loadings marked with dashed lines) WIM-models according to Ware et al. [8]. PF Physical Functioning, PR Physical Role Functioning, BP Bodily Pain, GH General Health, VT Vitality, SF Social Functioning, RE Role Functioning Emotional, PH Physical Health, MH Mental Health

Confirmatory factor analyses of the SF-8 structure

Wang et al. [9] as well as Lang et al. [10] used a confirmatory factor analysis (CFA) approach to investigate the underlying latent structure of the SF-8. In CFA models, the latent variable to which each item is assigned is specified in advance on theoretical grounds. CFA models assuming between-item-multidimensionality (BIM) require that each item loads on only one factor. Wang et al. [9] as well as Lang et al. [10] identified a three-dimensional BIM structure as the best fitting model in Chinese samples. The third factor, Overall Health, is reflected by the item pair General Health and Vitality (see Fig. 2; 3-DIM). Lang et al. [10] emphasize that this result for the SF-8 is consistent with studies on the SF-36, which have shown a third component of General Well-Being to be relevant besides Physical and Mental Health [11,12,13,14,15].

Fig. 2

Factorial model structures of the SF-8 according to Lang et al. [10]. PF Physical Functioning, PR Physical Role Functioning, BP Bodily Pain, GH General Health, VT Vitality, SF Social Functioning, MH Mental Health, RE Role Functioning Emotional, PH Physical Health, OH Overall Health

Furthermore, Lang et al. [10] found a two-dimensional CFA model to be acceptable (see Fig. 2; 2-DIM). Nevertheless, the item Vitality showed a noticeably weak item-total correlation. On closer examination of the data reported by Lang et al. [10], this seems quite reasonable: the item Vitality is closely related to the item General Health, which is clearly assigned to Physical Health in the two-factor model. Accordingly, Vitality should be considered as an indicator of Physical Health rather than Mental Health. This model structure corresponds exactly to the structure that Hann and Reeves [16] found valid for the SF-36. In Fig. 2 the 2-DIM-Modified model represents the corresponding latent model structure.

For the SF-36 [16, 17], the SF-12 [18, 19] and the SF-8 [9, 10] the underlying constructs proved to be highly correlated. Nevertheless, the assumption that a general component is reflected in all SF-items could not be confirmed (unidimensional model; see Fig. 2: 1-DIM), because multifactorial models provided a better data fit.

Bi-factor models of the SF-8 structure

Bi-factor models consider the answer to each item to be determined by two information components simultaneously (within-item-multidimensionality; WIM; [20, 21]). Regarding the construct HRQoL, each item has to be assigned to a general (i.e. general HRQoL) and a specific latent variable (i.e. physical, mental or overall). As shown in Fig. 2, three bi-factor models can be defined for the SF-8 by combining the single factor model (1-DIM; left) with one of the three multi-dimensional models (2-DIM, 2-DIMMOD, 3-DIM; right). Accordingly, the response to each item reflects the general HRQoL on the one hand and a physical, mental or overall aspect on the other [22,23,24]. Chen, West and Sousa [25] pointed out that bi-factor models generally provide a reasonable alternative modelling approach if highly related domains comprise the general multifaceted construct of interest. The assumption that general HRQoL is reflected in the answers to each item of an HRQoL scale is in concordance with the underlying theoretical assumptions regarding HRQoL [2].
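The model structures discussed so far can be written down compactly as item-to-factor assignments. The following Python sketch is purely illustrative: the dictionary layout and the helper names `is_bim`, `to_bifactor` and `loadings_per_item` are assumptions made for this example, and the assignments are read from the descriptions in the text rather than taken from any analysis syntax.

```python
# Item abbreviations as used in the captions of Fig. 1 and Fig. 2.
ITEMS = ["PF", "PR", "BP", "GH", "VT", "SF", "MH", "RE"]

# BIM structures: every item is assigned to exactly one factor.
# Factor labels (PH, MH, OH) live in a different namespace than item labels.
BIM_MODELS = {
    "1-DIM": {"HRQoL": list(ITEMS)},
    "2-DIM": {"PH": ["PF", "PR", "BP", "GH"],
              "MH": ["VT", "SF", "MH", "RE"]},
    "2-DIMMOD": {"PH": ["PF", "PR", "BP", "GH", "VT"],
                 "MH": ["SF", "MH", "RE"]},
    "3-DIM": {"PH": ["PF", "PR", "BP"],
              "OH": ["GH", "VT"],
              "MH": ["SF", "MH", "RE"]},
}


def is_bim(model):
    """Return True if each of the 8 items is assigned to exactly one factor."""
    assigned = [item for items in model.values() for item in items]
    return sorted(assigned) == sorted(ITEMS)


def to_bifactor(model):
    """Add a general factor on which all items load (WIM specification)."""
    bifactor = {"General": list(ITEMS)}
    bifactor.update(model)
    return bifactor


def loadings_per_item(model):
    """Count on how many factors each item loads."""
    counts = {item: 0 for item in ITEMS}
    for items in model.values():
        for item in items:
            counts[item] += 1
    return counts
```

In this representation a bi-factor model is simply a BIM structure plus one general factor, so every item carries exactly two loadings.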

Knowing the underlying model structure is a prerequisite to validly interpret and use the information of the SF-8 items for diagnostic and evaluative purposes. Hence, the central aim of the present study was to comparatively evaluate the factor structures underlying the SF-8 items. The specific aims were:

  1. To determine the fit of existing SF-8 models for a German general population sample.

     CFA models assuming both WIM (see Fig. 1) and BIM (see Fig. 2: 2-DIM, 2-DIMMOD and 3-DIM) have been evaluated.

  2. To determine the fit of bi-factor models which assume a general factor HRQoL as an additional source of information.

     The three WIM models combine each of the model structures in Fig. 2 on the right (2-DIM, 2-DIMMOD and 3-DIM) with the general 1-DIM model (Fig. 2 on the left).

Methods

Data collection

The SF-8 data were collected in a multi-topic survey commissioned by the University of Leipzig and conducted by the research institute USUMA Berlin in autumn 2004. The aim of the survey was to obtain a representative sample of people living in private households in Germany aged 14 and over. In order to ensure the representativeness of the sample for the German population, a random selection of households was first made using the random route method [26]. The person to be interviewed was then selected randomly within each household. The response rate of the survey was 62.3%. A total of N = 2591 persons between the ages of 14 and 99 were interviewed on the basis of voluntary participation.

The research institute USUMA provided weighting factors (γi) for each participant. These weighting factors (γi) can be used to correct violations of representativeness with regard to central socio-demographic characteristics (i.e. state, gender and age). The weights correct the increased selection probability of individuals in small households and the distortions due to the lack of participation of randomly selected individuals. Members of groups that are underrepresented (vs. overrepresented) in the sample receive a weight greater (vs. smaller) than 1, ensuring that the corrected actual values correspond to the target values in the population. These corrective weighting factors were used to determine the univariate distributions and correlation statistics.
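How such weights enter descriptive statistics can be illustrated with a short Python sketch. It is a generic example under stated assumptions: the function names `weighted_mean` and `weighted_corr` are hypothetical, the formulas are the usual weighted moment estimators, and the exact weighting routine applied to the survey data is not documented here.

```python
import math


def weighted_mean(x, w):
    """Mean of x in which case i counts with weight w[i]."""
    return sum(wi * xi for xi, wi in zip(x, w)) / sum(w)


def weighted_corr(x, y, w):
    """Pearson correlation of x and y using case weights w."""
    mx, my = weighted_mean(x, w), weighted_mean(y, w)
    sw = sum(w)
    cov = sum(wi * (xi - mx) * (yi - my) for xi, yi, wi in zip(x, y, w)) / sw
    var_x = sum(wi * (xi - mx) ** 2 for xi, wi in zip(x, w)) / sw
    var_y = sum(wi * (yi - my) ** 2 for yi, wi in zip(y, w)) / sw
    return cov / math.sqrt(var_x * var_y)
```

A case with a weight above 1 pulls the statistic towards its own values, which is how underrepresented groups regain their population share; with all weights equal to 1 the functions reduce to the unweighted mean and correlation.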

Statistical analysis

The SF-8 models were estimated using CFA (BIM and WIM). The CFA determines the model parameters ensuring the best fit of (a) the model-based and (b) the empirical item associations (variance–covariance matrix). The χ² value makes it possible to determine the significance of the differences between the empirical and the model-based information. However, the validity of this test is considerably limited due to its overly high statistical power in large samples. Alternative measures focus on the empirical relevance of the differences [27]: According to the root mean square error of approximation, a model is considered as good fitting if less than 5% (RMSEA < 0.05) of the information in the empirical variance–covariance matrix remains unexplained (acceptable model fit: RMSEA < 0.08). The incremental fit measures Comparative Fit Index (CFI) and Tucker–Lewis Index (TLI) exhibit higher values the more information a model can explain compared to a baseline model that assumes uncorrelated items (good model fit: CFI, TLI > 0.97; acceptable model fit: CFI, TLI > 0.95, [27]).
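The fit measures named above can all be derived from the χ² values and degrees of freedom of the tested model and the baseline model. The following Python sketch uses the common textbook formulas; it is an illustration, not the software implementation used in the study, and the function names are assumptions. Note that some programs divide by N rather than N − 1 in the RMSEA.

```python
import math


def rmsea(chi2, df, n):
    """Root mean square error of approximation (textbook N - 1 variant)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative Fit Index of model (m) relative to baseline model (b)."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, d_m)  # equals max(chi2_b - df_b, chi2_m - df_m, 0)
    return 1.0 if d_b == 0 else 1.0 - d_m / d_b


def tli(chi2_m, df_m, chi2_b, df_b):
    """Tucker-Lewis Index based on the chi-square/df ratios."""
    ratio_b = chi2_b / df_b
    ratio_m = chi2_m / df_m
    return (ratio_b - ratio_m) / (ratio_b - 1.0)
```

As a plausibility check, `rmsea(622.50, 19, 2545)` gives about 0.112 and `rmsea(71.073, 12, 2545)` about 0.044, in line with the values reported in the Results for the 2-DIMMOD BIM model and the first bi-factor model. Baseline χ² values are not given in the text, so CFI and TLI cannot be recomputed in the same way.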

The maximum likelihood (ML) approach assumes normally distributed data and allows the most comprehensive determination of model fit criteria [27, 28]. This procedure proves to be robust to moderate violations of the normal distribution [29]. The ML approach is generally used for the analysis of the SF-8 in the literature [9, 10]. However, in the present norm data, distribution problems caused by considerable ceiling and floor effects prevail (see Table 2). The WLSMV algorithm (weighted least squares with mean- and variance-adjusted test statistics) requires only ordinally scaled data. It has proven advantageous over alternative distribution-free estimation methods (e.g. MLR; robust ML estimation) for sufficiently large samples (N > 1000) [29,30,31]. The statistical model assumes that the categorically measured data are based on a multivariate latent normal distribution. This is a generally plausible assumption for ordinally scaled questionnaire data [32,33,34]. When using the WLSMV algorithm, the Standardized Root Mean Square Residual (SRMR; good fit: < 0.05) proved to be a more valid fit indicator than RMSEA, especially in large samples [35]. Despite the superiority of the WLSMV algorithm for the present data set, the findings for the ML estimates are also reported to ensure comparability with existing analyses. For all model estimates, the item loadings are freely estimated in case of more than two indicators (tau-equivalent modelling). The model estimates are performed using the software Mplus 8.0 [36].
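The latent normal assumption can be made concrete with a small Python sketch. Under this assumption, each observed response category corresponds to an interval of an underlying standard normal variable, and the interval boundaries (thresholds) follow from the cumulative response proportions. The function name `thresholds` and the example counts are hypothetical; the sketch covers only this first step and does not implement the WLSMV estimator itself.

```python
from statistics import NormalDist


def thresholds(category_counts):
    """Thresholds of a latent standard normal variable for one ordinal item.

    category_counts: frequencies of the ordered response categories.
    Assumes every cumulative proportion lies strictly between 0 and 1,
    i.e. the lowest and the highest category are both non-empty.
    """
    total = sum(category_counts)
    standard_normal = NormalDist()
    cumulative = 0
    result = []
    for count in category_counts[:-1]:  # k categories give k - 1 thresholds
        cumulative += count
        result.append(standard_normal.inv_cdf(cumulative / total))
    return result


# Hypothetical item with a ceiling effect: most answers in the top category.
example = thresholds([20, 60, 120, 300, 500])
```

Strong ceiling or floor effects show up as thresholds that are shifted far to one side, which is the situation in which treating the item scores as normally distributed becomes questionable.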

In addition to global quality criteria, it must also be ensured at the local item level that each item is sufficiently closely associated with the factor to which it is assigned: factor loadings > 0.63 or indicator reliabilities > 0.4 indicate a sufficiently clear item-construct assignment [28].
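The two cutoffs are equivalent because, for an item that loads on a single factor, the indicator reliability is the squared standardized loading (0.63² ≈ 0.40). A minimal Python sketch of this local check is given below; the function names and the example loadings are hypothetical.

```python
def indicator_reliability(loading):
    """Share of item variance explained by its factor (standardized loading)."""
    return loading ** 2


def weak_items(loadings, min_loading=0.63):
    """Return the items whose standardized loading does not exceed the cutoff."""
    return [item for item, lam in loadings.items() if lam <= min_loading]


# Hypothetical standardized loadings for three items of one factor.
example_loadings = {"item_a": 0.85, "item_b": 0.70, "item_c": 0.55}
flagged = weak_items(example_loadings)
```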

Results

Sociodemographic characteristics

Sociodemographic characteristics of the N = 2545 people in the sample are depicted in Table 1. The weighting factors γi indicate that distortions could not be avoided despite the elaborate procedure for ensuring representativeness. The last column shows the correlation between γ and the socio-demographic characteristics. The significant correlations were due to the fact that women, people in the higher age groups, people with lower income and lower education, workers and people living alone are overrepresented in the study (γi < 1). The SF-8 items were positively correlated with γi (see Table 2; column r): People with lower HRQoL were overrepresented in the study sample.

Table 1 Sociodemographic characteristics of the study sample
Table 2 Descriptive values and results of the scale analysis for the values of the SF-8 single items or the scale values

All SF-8 items were answered almost completely (maximum missing data rate: 1.3%). To avoid biases caused by missing data, the very few missing responses were imputed using the EM algorithm [37].

SF-8 structure according to Ware et al. [8]

Table 3 shows the results for the model specification according to the original model proposed by Ware et al. ([8]; see Fig. 1). All global fit measures identify the Full Model (assigning each item to both factors) as better fitting than the Restricted Model. As in the original study applying PCA [8], the items Physical Functioning, Physical Role Functioning, and Bodily Pain were most strongly associated with the Physical Health component. The items Social Functioning, Mental Health and Emotional Role Functioning reflected the construct Mental Health most distinctly. In accordance with the results reported by Ware et al. [8], the items General Health Perception and Vitality showed a clear double loading in the present study. However, both items were more strongly associated with the Mental Health factor in the full model. Note that the variance of the items General Health Perception (R² = 0.370–0.470) and Vitality (R² = 0.462–0.590) was explained most weakly for both model definitions.

Table 3 Factor loadings and global fit measures for the ML- and WLSMV-estimates of the Full 2-DIM WIM-model and the Restricted 2-DIM WIM-model according to Ware et al. [8]

Confirmatory factor analyses of the SF-8 structure

Table 4 shows the results for the CFA model structures assuming BIM. For both estimation methods, similar differences in the model quality criteria were found. In the following, we refer primarily to the WLSMV estimates, which are based on more valid distributional assumptions. In accordance with the study results reported by Lang et al. [10], the best data fit was found for the three-dimensional CFA model (χ²(df = 19) = 248.68; CFI = 0.995; RMSEA = 0.073; SRMR = 0.014). The modified two-dimensional model (2-DIMMOD; χ²(df = 19) = 622.50; CFI = 0.987; RMSEA = 0.112; SRMR = 0.024), which assumes Vitality to be an indicator of Physical Health, allowed for a better data fit than the two-dimensional BIM model (2-DIM; χ²(df = 19) = 730.83; CFI = 0.985; RMSEA = 0.112; SRMR = 0.026).

Table 4 Factor loadings and measures of global fit for the ML- and WLSMV-estimates of the models assuming between-item-multidimensionality (BIM); 1-, 2-, 3-DIM = one-, two- and three-dimensional model specification

Bi-factor models of the SF-8 structure

For the bi-factor models, both the two-dimensional modified model (2-DIMMOD) and the three-dimensional model (3-DIM) show a considerably better data fit than the BIM models. In particular, the χ² values (χ²(df = 12) = 71.073; χ²(df = 13) = 92.38), the RMSEA values (0.044, 0.049) and the SRMR (0.001) are considerably lower than for the BIM models (Table 4). The BIC, which can only be determined for the ML estimation, also identified the bi-factor models as best fitting.

In the bi-factor models, all SF-8 items are associated with the general factor (loadings: 0.698–0.873) to a much higher degree than with the specific factors Physical, Mental and, if applicable, Overall Health (loadings: 0.216–0.582). The general factor, which can be interpreted in terms of the general HRQoL, thus proves to be the dominant source of the item variances (Table 5).
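Because the general and the specific factors are assumed to be uncorrelated, the variance of an item explained by a bi-factor model is the sum of its squared standardized loadings, and the contribution of the general factor can be expressed as a share of this sum. The Python sketch below illustrates the arithmetic; the function name and the loading pair (0.80, 0.40) are hypothetical values chosen from within the reported ranges, not results from Table 5.

```python
def bifactor_item_variance(general_loading, specific_loading):
    """Decompose the explained variance of one item in a bi-factor model.

    Assumes standardized loadings and orthogonal (uncorrelated) factors.
    Returns (explained variance, share attributable to the general factor).
    """
    general_part = general_loading ** 2
    specific_part = specific_loading ** 2
    explained = general_part + specific_part
    return explained, general_part / explained


# Hypothetical item: general loading 0.80, specific loading 0.40.
explained, general_share = bifactor_item_variance(0.80, 0.40)
```

For this hypothetical item, 80% of the item variance is explained in total, and four fifths of the explained variance stem from the general factor.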

Table 5 Factor loadings and global fit measures for the ML- and WLSMV-estimates of the bi-factor models assuming within-item-multidimensionality (WIM; all factors assumed to be uncorrelated)

Calculation of the SF-8-scale scores

According to these results, five scale scores (T-values: M = 50; SD = 10) can be calculated representing the information of the SF-8 items according to the 2-DIMMOD model and the 3-DIM model (BIM models) as well as the bi-factor specification. The syntax for calculating these scale scores is attached in the Additional file 1. The Mental Health Score (MHS) (α = 0.854) aggregates the item group identified as homogeneous in both the 2-DIMMOD and 3-DIM models: Social Functioning, Mental Health and Emotional Role Functioning.
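The T-value metric can be illustrated with a minimal Python sketch. It is not the syntax from Additional file 1: the function name `t_scores` is hypothetical, and standardizing against the sample's own mean and standard deviation when no reference values are supplied is an assumption made for this example.

```python
from statistics import mean, pstdev


def t_scores(raw_scores, ref_mean=None, ref_sd=None):
    """Convert raw scale scores to T-values (M = 50, SD = 10).

    ref_mean and ref_sd are the reference (norm) mean and standard deviation;
    if omitted, the values of the supplied sample itself are used.
    """
    m = mean(raw_scores) if ref_mean is None else ref_mean
    s = pstdev(raw_scores) if ref_sd is None else ref_sd
    return [50 + 10 * (x - m) / s for x in raw_scores]


# Hypothetical raw scale scores of five respondents.
example_t = t_scores([1, 2, 3, 4, 5])
```

A T-value of 60 then denotes a score one standard deviation above the reference mean, regardless of how many items the scale aggregates.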

The Physical Health Score (PHS2) (α = 0.898) represents the information of the items Physical Functioning, Physical Role Functioning, Bodily Pain, General Health Perception and Vitality according to the 2-DIMMOD model. According to the 3-DIM model, the Physical Health Score (PHS3) (α = 0.892) aggregates the information of the items Physical Functioning, Physical Role Functioning and Bodily Pain. The Overall Health Score (OHS) (α = 0.779) represents General Health Perception and Vitality. The SF-8 total score (α = 0.918) combines the information of all 8 items into a general indicator of HRQoL. Table 2 shows the item-total correlation for each scale definition.
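The α coefficients reported for the scale scores are Cronbach's alpha values. A minimal Python sketch of the standard formula is given below for orientation; the function name and the example data are hypothetical, and the sketch ignores the survey weights.

```python
from statistics import pvariance


def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    items: list of item-score lists, one list per item, all of equal length
    (one entry per respondent).
    """
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # sum score per respondent
    sum_item_variances = sum(pvariance(column) for column in items)
    return k / (k - 1) * (1 - sum_item_variances / pvariance(totals))


# Hypothetical responses of four persons to a three-item scale.
alpha = cronbach_alpha([[1, 2, 3, 4], [2, 2, 3, 5], [1, 3, 3, 4]])
```

Alpha increases with the number of items and with their average intercorrelation, which should be kept in mind when comparing the two-item Overall Health scale with the eight-item total score.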

Table 6 displays the correlation of these scale scores and the scale scores Physical Component Summary (PCS) and Mental Component Summary (MCS) according to Ware et al. [8]. As expected PCS was very strongly associated with the physical scores PHS2 (r = 0.960) and PHS3 (r = 0.973), respectively. MCS values corresponded very highly with the mental score MHS (r = 0.939). OHS and SF-8-Total were more strongly correlated with PCS than with MCS. Generally, all scale scores were highly intercorrelated (r ≥ 0.633), which underlines the high commonality of the HRQoL-related information collected by the SF-8 items.

Table 6 Correlation of the SF-8 scale scores PCS and MCS proposed by Ware et al. [8] and the scale scores based on the Bi-factor models

Discussion

In this study, a satisfactory fit of the SF-8 to different model specifications could be confirmed by means of CFA in a German general population sample. Ware et al. ([8]; see Fig. 1) suggest that the SF-8 data can be summarized as a Physical and a Mental Health component score. The corresponding Full 2-DIM WIM-model assuming double loadings for all 8 items (Table 3) exhibited a slightly better model fit than the restricted model definition. The explained proportion of variance is weakest for the items Vitality and General Health Perception (0.469–0.590).

Alternatively, models assuming each item to be indicative for only one of the underlying latent factors (BIM) also showed a good data fit (Fig. 2, Table 4). Assuming BIM, the best CFA model fit has been identified for the three-factor model structure (3-DIM) reported by Lang et al. [10]. The third factor, Overall Health, is formed by the two items General Health Perception and Vitality.

For the two-dimensional BIM definition, the assignment of the item Vitality to the physical factor in the model 2-DIM-Modified led to an improved data fit. This is in accordance with the results of Lang et al. [10] in a representative Chinese population. Lang et al. [10] discuss these results as particularly characteristic for the Asian region (see also: [9, 11,12,13, 15]) in comparison to European and US-American data. The findings reported in the present paper provide evidence that cultural differences should not be assumed to be the main cause of differences in the reported findings. The well-founded CFA approach of Lang et al. [10] yields very similar results in the Chinese population as the CFA approach in the data presented here for Germany. Differences from earlier analyses in the United States [8, 38] thus seem to be due to the CFA approach.

For the short versions SF-12 and SF-8, high correlations of the Physical (PCS) and Mental Component Summary (MCS) are reported in the literature [8,9,10, 19]. Despite this high correlation of Physical Health and Mental Health, a general factor HRQoL has not yet been considered when evaluating the SF-8. The underlying assumption of the BIM is that each item exclusively covers either a physical or a mental aspect. If the bi-factor approach (WIM assumption) is applied, a fundamentally different model is assumed. Bi-factor models allow the information of the SF-8 items to be determined by general HRQoL. Our findings showed a clearly better data fit for the bi-factor models (Table 5). Note that in these models Physical and Mental as well as Overall Health are assumed to be uncorrelated components. The correlation of the single items assigned to different facets is completely modeled by the general factor HRQoL. In the bi-factor models, our results showed that the general HRQoL dominantly determines the variance of all items. WIM thus represents a plausible and statistically superior model assumption, which opens a completely new view on the structure of the SF-8 [22, 23, 25]: The SF-8 primarily measures a general HRQoL component. Assuming a dominant principal component HRQoL for the items of the SF-8 is further supported by the results of a PCA: Only the eigenvalue 5.11 of the first component is greater than 1. This first component explains a very high amount of the item variances: 63.40%.

Accordingly, an SF-8 overall score can be determined, which represents HRQoL across physical and mental facets. This approach thus represents a psychometrically well-founded alternative to existing evaluation approaches for scale variants of the SF family. Its suitability should also be tested for the SF-12 and SF-36.

At the level of global model fit measures, the two-factor model (Bi-factor 2-DIMMOD) allows for a better data fit than the three-factor model (Bi-factor 3-DIM) (Table 5). However, this superiority is not supported by the item loading structure. In contrast to the items Physical Functioning, Physical Role Functioning and Bodily Pain (loadings: 0.289, 0.371, 0.155), the two items General Health Perception and Vitality were associated negatively (loadings: −0.093, −0.206) with the factor Physical Health. Accordingly, these two items proved to be indicators of the general health factor rather than specific health factors.

The model estimates were calculated using both the ML algorithm and the WLSMV algorithm. Generally, the global fit measures (especially χ², CFI and SRMR) indicated a better model fit for the WLSMV estimates. The poorer fit for the ML estimates was expected because of strong violations of the normal distribution in the analyzed norm data set. The WLSMV algorithm is methodologically superior to alternative modeling approaches when the underlying latent correlation structure is analyzed. WLSMV prevents underestimation of correlations due to asymmetric data distributions and categorical data format [30, 33, 34]. Accordingly, applying the WLSMV algorithm leads to higher factor loadings and explained item variances. The validity of all modeling results is systematically attenuated when the ML approach is used in the case of clearly non-normally distributed data [27, 28].

Some limitations of the study must be considered in order to correctly assess the study results. We focused on the dimensional structure of the SF-8, without analyzing further clinimetric characteristics of the instrument [39]. Clinimetrics emphasizes that each assessment has to be evaluated comprehensively regarding its suitability for specific purposes in clinical practice. In addition to our study results, it would be particularly important to find out to what extent the individual items as well as the scale scores of the SF-8 are able to validly reflect clinically relevant changes in health status over time. In addition, future research should focus on how the SF-8 can be embedded in an overall assessment to address individual patient needs in treatment planning and to sensitively evaluate clinically significant changes [40, 41].

Conclusions

For the SF-8, the fit to two-factorial (Physical and Mental Health) and three-factorial latent structure models (in addition: Overall Health) could be substantiated in a German general population sample. Furthermore, a good model fit was achieved using bi-factor models, in which the generic construct HRQoL is shown to be a major source of variance in each of the SF-8 items. Accordingly, the SF-8 Total Score may be a valid way of summarizing the SF-8 data indicating general HRQoL. Future studies should evaluate the usefulness of the SF-8 Total Score in quantifying disease burden and evaluating clinically significant changes.