Background

In healthcare, the extent of patient harm was first publicized in the 1990s [1]. After the release of the book To Err Is Human: Building a Safer Health System [2], 474 papers indexed in Medline matched the keyword “patient safety culture” (PSC) as of June 3, 2017. Safety experts [3,4,5] have argued that patient safety begins with enforcing system safety in healthcare organizations, and that this culture is a fundamental factor influencing healthcare system safety [4].

Many studies [6,7,8] have used the Safety Attitudes Questionnaire [9] as a tool to verify PSC reliability and validity [6, 10]. In practice, however, comparisons are commonly made between departments within a hospital, and few studies have examined the reliability and validity of PSC on a department-unit basis. Person misfit indicators are commonly used in the literature [11] to identify possibly aberrant (e.g., careless) response behavior. If data are not purified or polished, any comparison or analysis is meaningless, because a sound scale must be one-dimensional; that is, all items should load on the same factor for a summed score to make sense. The first research question is how to easily examine unit-based construct validity.

Psychological characteristics are defined as abstract or latent rather than tangible and observable entities [12,13,14]. A summed score across items in a domain is meaningless if the items measure different features. The top priority of such an analysis is “determining the number of factors to retain”. Several criteria are used in exploratory factor analysis (EFA) [15,16,17], including Kaiser’s rule (factors with eigenvalue >1), the scree plot criterion, and the variance-explained criterion. Horn’s parallel analysis (PA) [18] has been reported to be the best method [19,20,21]. Under PA, factors are retained as long as the eigenvalue from the actual data exceeds the corresponding eigenvalue from random data (Fig. 1) [22, 23]. However, PA is not implemented in commonly used statistical software (e.g., SPSS and SAS). A user-friendly website [24, 25] and an author-made macro for SAS [26] have been recommended. To our knowledge, PA has not been made available in Microsoft Excel.

Fig. 1 View of the scree parallel plot and scree simulation for determining the number of factors to retain
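The retention rule of parallel analysis can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the website, SAS macro, or Excel module referred to above: the function name `parallel_analysis`, the use of NumPy, the 200 random replications, and the 95th-percentile threshold are all choices made for this sketch.

```python
import numpy as np

def parallel_analysis(data, n_sims=200, percentile=95, seed=0):
    """Return (n_factors, observed_eigenvalues, random_threshold_eigenvalues)."""
    rng = np.random.default_rng(seed)
    n, k = data.shape
    # Eigenvalues of the observed correlation matrix, largest first.
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    # Eigenvalues of uncorrelated normal data with the same shape.
    random_eigs = np.empty((n_sims, k))
    for s in range(n_sims):
        noise = rng.standard_normal((n, k))
        eigs = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))
        random_eigs[s] = np.sort(eigs)[::-1]
    threshold = np.percentile(random_eigs, percentile, axis=0)
    # Count leading factors whose observed eigenvalue exceeds the threshold.
    exceeds = observed > threshold
    n_factors = k if exceeds.all() else int(np.argmin(exceeds))
    return n_factors, observed, threshold

# Hypothetical example: 500 respondents, 8 items driven by one common trait.
rng = np.random.default_rng(1)
trait = rng.standard_normal((500, 1))
items = trait + 0.8 * rng.standard_normal((500, 8))
n_factors, observed, threshold = parallel_analysis(items)
```

Plotting `observed` and `threshold` against the component number would give a scree parallel plot of the kind shown in Fig. 1.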

A dimension coefficient (DC) based on the Rasch model [28] has also been proposed in a previous study [27]. We were interested in comparing PA and DC and in examining whether the two methods can be combined for use in practice.

The present study aimed to identify (1) why Cronbach’s α (i.e., internal consistency reliability) is not a sufficient condition of validity [29, 30]; (2) whether DC can be combined with PA to precisely inspect whether a scale is one-dimensional; and (3) how a module in Microsoft Excel can be used to examine the validity of a PSC domain on a unit basis by checking the DC.

Methods

Simulative datasets

Using Microsoft Excel Visual Basic for Applications, we designed a computer module to generate 2000 datasets (5 × 20 × 20 scenarios) fitting the Rasch rating scale model (i.e., resembling a 5-point Likert-type scale) [31]. The scenarios consisted of (i) five correlation coefficients (correl. = 0.3, 0.5, 0.7, 0.9, and 1.0) between two normally distributed latent traits (i.e., true scores), which governed 1/3 and 2/3 of the items, respectively; (ii) 20 item lengths from 5 to 100; and (iii) 20 sample sizes from 50 to 1000. Each item had 5-point polytomous responses (similar to the PSC format), and item difficulties were uniformly distributed across a ±2 logit range. Three methods (i.e., dimension interrelation ≥0.7, Horn’s PA 95% CI, and individual random eigenvalues) were used to determine whether one factor should be retained.
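To make the data-generating step concrete, the following Python sketch simulates 5-point responses under the Andrich rating scale formulation of the Rasch model. It is an assumption-laden illustration rather than the VBA module used in this study: the function name `simulate_rating_scale`, the step thresholds in `tau`, and the evenly spaced item difficulties are hypothetical choices.

```python
import numpy as np

def simulate_rating_scale(theta, delta, tau, seed=0):
    """Simulate responses coded 0..len(tau) under the rating scale model."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)[:, None, None]  # person true scores
    delta = np.asarray(delta, dtype=float)[None, :, None]  # item difficulties
    tau = np.asarray(tau, dtype=float)[None, None, :]      # step thresholds
    # Log-numerator of category x is the sum over steps j <= x of (theta - delta - tau_j).
    steps = np.cumsum(theta - delta - tau, axis=2)
    logits = np.concatenate([np.zeros(steps.shape[:2] + (1,)), steps], axis=2)
    logits -= logits.max(axis=2, keepdims=True)            # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=2, keepdims=True)
    # Inverse-CDF sampling: count how many cumulative probabilities fall below u.
    u = rng.random(probs.shape[:2] + (1,))
    categories = (u > np.cumsum(probs, axis=2)).sum(axis=2)
    return np.minimum(categories, tau.shape[2])

rng = np.random.default_rng(42)
theta = rng.standard_normal(200)           # 200 simulated persons
delta = np.linspace(-2, 2, 10)             # 10 items spread over +/- 2 logits
tau = np.array([-1.5, -0.5, 0.5, 1.5])     # hypothetical thresholds (5 categories)
responses = simulate_rating_scale(theta, delta, tau)
```

Adding 1 to `responses` gives the familiar 1 to 5 Likert coding.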

Simulation process

When person true scores and item difficulties are known, Rasch data can be simulated [32]. The reliability and DC of each scale were calculated simultaneously for each simulated dataset, where reliability is defined as Cronbach’s α. The DC is defined in previous studies [27, 33, 34].
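For reference, Cronbach’s α can be computed from a persons-by-items score matrix as sketched below. The function name `cronbach_alpha` and the simplified continuous data generator are assumptions for illustration only; the example uses two uncorrelated traits, governing 10 and 20 items, to show how α can remain high for a scale that is clearly two-dimensional.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (persons x items) matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Hypothetical two-dimensional scale: the traits are uncorrelated by construction.
rng = np.random.default_rng(0)
trait_a = rng.standard_normal((1000, 1))
trait_b = rng.standard_normal((1000, 1))
items_a = trait_a + rng.standard_normal((1000, 10))   # 1/3 of the items
items_b = trait_b + rng.standard_normal((1000, 20))   # 2/3 of the items
alpha = cronbach_alpha(np.hstack([items_a, items_b]))
```

In this setup `alpha` comes out well above 0.80 even though the two traits are unrelated, which is the kind of pattern the box plots are meant to reveal.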

The upper 95% CIs of the eigenvalues in the random simulated datasets can be used to determine the number of factors [23] using the website link [25] or Vista software [35]. In this study, the eigenvalues of PA 95% CI were extracted from the website [25] (see Additional file 1 [extracting data from website] and Additional file 2 [simulation process]). Box plots were used to show the dispersion of Cronbach’s α and DC (on y-axis) across five scenarios (on x-axis) of dimension correlation (correl. = 0.3, 0.5, 0.7, 0.9, and 1.0). The first research question (i.e., Cronbach’s α is not sufficient to establish a scale’s validity) could be verified by comparing the plots.

In the literature, Tennant and Pallant [36] reported that the Rasch fit statistics performed poorly (i.e., in identifying two domains) where dimensions were interlaced with equal item length and where the correlation between factors was near 0.7. If two dimensions (corr. ≈ 0.7) are interlaced with only 1/3 of the items, we can reasonably consider the scale as one dimension.

The accuracy of the three methods (i.e., dimension interrelation ≥0.7, PA 95% CIs, and individual random eigenvalues) was compared in determining the number of factors (i.e., 1 denotes one factor and 0 represents many factors, marked with dots in Additional file 3). Indicators included sensitivity, specificity, and area under the receiver operating characteristic curve (AUC). The second question (i.e., DC can be combined with PA for inspecting the number of factors for a scale) can then be answered.
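The three indicators can be sketched as follows. This is an illustrative Python version, not the MedCalc procedure used in the study; the function names and the five example cases are hypothetical, and the AUC is computed as the probability that a truly one-dimensional case receives a higher score than a multidimensional one (ties counted as one half).

```python
import numpy as np

def sensitivity_specificity(truth, predicted):
    """truth/predicted: 1 = one factor, 0 = many factors."""
    truth = np.asarray(truth, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    tp = np.sum(truth & predicted)
    fn = np.sum(truth & ~predicted)
    tn = np.sum(~truth & ~predicted)
    fp = np.sum(~truth & predicted)
    return tp / (tp + fn), tn / (tn + fp)

def auc(truth, score):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    truth = np.asarray(truth, dtype=bool)
    score = np.asarray(score, dtype=float)
    pos = score[truth][:, None]
    neg = score[~truth][None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

# Hypothetical example: five datasets with known dimensionality and their DC values.
truth = [1, 1, 1, 0, 0]
dc = np.array([0.9, 0.8, 0.6, 0.65, 0.3])
sens, spec = sensitivity_specificity(truth, dc >= 0.7)
area = auc(truth, dc)
```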

Demonstrations of actual PSC data

To answer the third research question, we collected data from previous studies regarding a PSC survey in a hospital [6, 7, 37]. The sample comprised 2237 respondents from 97 departments. The PSC questionnaire comprised six domains with 30 items. The domains were teamwork climate (D1, 6 items), safety climate (D2, 7 items), job satisfaction (D3, 5 items), recognition of depression (D4, 4 items), perception of management (D5, items), and working conditions (D6, items). An author-made MS Excel module was applied to examine the validity of PSC on a unit basis. All computations of the unit DC and Cronbach’s α were restricted to units with a sample size ≥10. The study flowchart is shown in Fig. 2.

Fig. 2 Study flowchart

Statistical analysis

SPSS 18.0 for Windows (SPSS Inc., Chicago, IL, USA) and MedCalc 9.5.0.0 for Windows (MedCalc Software, Mariakerke, Belgium) were used to draw box plots and ROC curves with their AUCs. In the author-made MS Excel module, two methods, DC and PA, were used to determine the number of factors. In addition, a scree plot was drawn.

Results

Task 1: Cronbach’s α is insufficient to determine a scale’s validity

Figure 3 (left) shows that most values of Cronbach’s α were greater than 0.80 regardless of the degree of dimension interrelation (on x-axis). Longer item length increased the value of Cronbach’s α, in accordance with the Spearman–Brown prediction formula [38]. Values below the criterion of 0.70 (on y-axis), which represents acceptable scale quality, were rare. Some data in which 1/3 of the items had a low dimension interrelation (=0.3) with the remaining items still yielded a high Cronbach’s α, as shown in the first bin of Fig. 3 (left).
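The effect of item length can be illustrated with the Spearman–Brown prediction formula. The notation is an assumption of this note: \( \rho \) denotes the reliability of a single item (or the mean inter-item correlation) and \( k \) the number of items.

\[ \rho_k = \frac{k\,\rho}{1 + (k - 1)\,\rho} \]

As a hypothetical example with \( \rho = 0.3 \), the predicted reliability is \( 1.5/2.2 \approx 0.68 \) for \( k = 5 \), \( 6/6.7 \approx 0.90 \) for \( k = 20 \), and \( 30/30.7 \approx 0.98 \) for \( k = 100 \). A scale with only weakly related items can therefore exceed 0.80 simply by being long.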

Fig. 3 Cronbach’s alpha and DC related to the dimension interrelation

In Fig. 3 (right), 59 (2.95%) cases were located in the top left (false positive) part, 42 (2.1%) in the bottom right (false negative) part, 1159 (57.95%) in the top right (true positive) part, and 740 (37%) in the bottom left (true negative) part. This pattern shows that a higher dimension interrelation was associated with a higher DC. A cut-off point of 0.7 (on y-axis) suggested that DC could accurately separate the two domains (i.e., top and bottom), with 101 of 2000 cases misclassified in our simulation data.
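The four counts above can be turned into accuracy indicators by simple arithmetic. The abbreviations TP, FN, TN, and FP (true positive, false negative, true negative, false positive) are introduced here for illustration.

\[ \text{sensitivity} = \frac{TP}{TP + FN} = \frac{1159}{1159 + 42} \approx 0.965, \qquad \text{specificity} = \frac{TN}{TN + FP} = \frac{740}{740 + 59} \approx 0.926 \]

\[ \text{accuracy} = \frac{TP + TN}{2000} = \frac{1159 + 740}{2000} \approx 0.95 \]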

Task 2: DC combined with PA to inspect the number of factors for a scale

Continuous DC values (y-axis) were combined with a binary variable classified by checking the PA 95% CIs and individual random eigenvalues, along with dimension interrelations (x-axis) (Fig. 4).

Fig. 4 Comparison of AUCs of 0.71 (left) and 0.68 (right) by two PA methods

The two methods demonstrated similar sensitivity (90%:91%), specificity (45%:43%), and AUC (0.71:0.68) with a common DC criterion at 0.68 (Fig. 4). Both performed worse than the counterpart based on known information (i.e., dimension interrelation ≥0.7 defined as one-dimensional) in Fig. 5 (left), where a DC criterion of 0.70 yielded 96% sensitivity, 92% specificity, and AUC = 0.98.

Fig. 5 DC (AUC = 0.98) combined with PA (AUC = 0.71) for inspecting the number of factors for a scale with AUC = 0.99

After combining PA and DC with a transformation rule [i.e., the MS Excel formula =IF(PA determination = 1, 1, IF(DC ≥ 0.7, 1, 0)), where 1 denotes one factor and 0 represents many factors, marked with dots in Additional file 3], the specificity and AUC could be improved (up to 100% and 0.99, respectively), but the sensitivity was unchanged. The DC criterion was still located at 0.70 (Fig. 5, right).
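The transformation rule is a logical OR of the two determinations. A Python equivalent of the Excel formula might look like the sketch below; the function and parameter names are hypothetical.

```python
def one_factor(pa_says_one: bool, dc: float, dc_cutoff: float = 0.7) -> int:
    """Return 1 (one factor) if PA indicates one factor or DC >= cutoff, else 0."""
    # Mirrors =IF(PA determination = 1, 1, IF(DC >= 0.7, 1, 0)).
    return 1 if pa_says_one or dc >= dc_cutoff else 0

# PA flags many factors, but DC is high enough to treat the scale as one-dimensional.
rescued = one_factor(False, 0.82)
# Neither PA nor DC supports a single factor.
rejected = one_factor(False, 0.55)
```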

A comparison of the results in Additional file 3 revealed that the two PA methods led to misclassification in two situations: (i) short item length (=5 items) and small sample size (=50) yielded false positives (i.e., two factors classified as one factor); and (ii) long item length (>40 items) and large sample size (>200) were prone to false negatives (i.e., one factor split into many factors).

Task 3: MS Excel module is an easy way to examine the validity of PSC

All DCs and Cronbach’s α values for hospital units could be computed within 30 s for a domain (see Additional file 4). Table 1 shows that some DCs and Cronbach’s α values were less than 0.7, indicating that data entries should be purified or polished prior to analysis. Such data may result from cheating, careless behaviors, or other problems in responses. The global subscales with low construct validity (<0.70, see the first row in Table 1) were teamwork climate (0.58) and working conditions (0.66), indicating that some misfit items existed in the datasets. Whether a differential item functioning (DIF) [39] phenomenon is emerging among hospitals requires further investigation. DIF refers to the extent to which a specific item might be measuring different features for members of separate subgroups, such as members from different types of hospital in this study.

Table 1 Unit-based DC and Cronbach’s alpha computed by the author-made module

Discussion

Principal findings

Our most important findings were that (1) Cronbach’s α was not a sufficient condition of validity; (2) DC is an essential complement to PA for inspecting whether a scale measures one single construct; and (3) a module in Microsoft Excel is presented as an easy approach for examining the validity of PSC on a unit basis by assessing DC.

What this adds to what was known

Our finding in Task 1 (Cronbach’s α was insufficient to establish a scale’s validity) was consistent with the literature [40, 41]: internal consistency is a necessary but not sufficient condition for measuring homogeneity or unidimensionality in a sample of test items. Thus, an extremely low Cronbach’s α indicates poor validity (i.e., necessary condition). By contrast, a high Cronbach’s α does not always indicate good validity (i.e., it is not a sufficient condition; see Fig. 3). Given the limitations of Cronbach’s α, a different but more conservative measure of internal consistency reliability (i.e., the composite reliability) should be applied because it considers the different outer loadings of the indicator variables [42].

Reports [43,44,45] about the acceptable values of Cronbach’s α (from 0.70 to 0.95) are inconsistent. The number of items, item interrelatedness, and dimensionality affect the value of alpha [46]. We simulated scenarios with item lengths from 5 to 100 and found that (i) a low alpha could be due to a low number of items (=5), poor interrelatedness (=0.3) between items, or heterogeneous constructs [47] (Fig. 3, left); and (ii) a high alpha (>0.9), as seen with long scales (=100 items), implies that some items were redundant because they were testing the same question in a different guise [47]. A maximum alpha of 0.90 has been recommended [48].

What it implies and what should be changed?

Our finding in Task 2 (DC combined with PA to inspect the number of factors for a scale) is congruent with a previous study [17] that suggested combining Cronbach’s α with DC to jointly assess a scale’s quality. We also responded to the argument [49] that the use of Cronbach’s α is often tied to the PCA approach in practical test construction, especially when factor loadings are not easily obtained in MS Excel.

According to the literature [50], composite reliability (CR) is defined as \( \mathrm{CR}=\frac{{\left(\sum {\lambda}_i\right)}^2}{{\left(\sum {\lambda}_i\right)}^2+\left(\sum {\varepsilon}_i\right)} \), where \( {\lambda}_i \) (lambda) is the standardized factor loading for item i, and \( {\varepsilon}_i \) is the respective error variance for item i. However, it is difficult to obtain the CR of a scale using MS Excel. We suggest that DC is comparable to CR (≥0.70) as a criterion to measure rigorous internal consistency reliability [42]. Combining DC with PA to examine a scale’s unidimensionality markedly improved classification precision (Fig. 5).
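Once factor loadings are available from other software, the CR formula is straightforward to evaluate. The Python sketch below is illustrative only; the function name is hypothetical, and when no error variances are supplied it assumes standardized loadings so that each error variance equals \( 1 - {\lambda}_i^2 \), which is a common convention rather than something stated in this paper.

```python
def composite_reliability(loadings, error_variances=None):
    """Composite reliability from factor loadings and error variances."""
    if error_variances is None:
        # Assumption: standardized loadings, so error variance = 1 - lambda^2.
        error_variances = [1.0 - lam * lam for lam in loadings]
    squared_sum = sum(loadings) ** 2
    return squared_sum / (squared_sum + sum(error_variances))

# Hypothetical three-item scale with standardized loadings 0.7, 0.8, and 0.9.
cr = composite_reliability([0.7, 0.8, 0.9])
```

For these hypothetical loadings the numerator is \( 2.4^2 = 5.76 \) and the summed error variance is 1.06, so CR is about 0.84.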

Strengths of this study

In Task 3, we applied an author-made MS Excel module to examine the validity of a PSC domain on a unit basis. Such a tool has rarely been reported in previous papers. We also provided five videos in Additional files to help interested readers understand (i) how to extract PA 95% CI eigenvalues from the internet; (ii) how to manipulate scenarios under the requirements of the Rasch model and generate a dual dimension interrelation for true scores based on the literature [51] using a rotation of axes with a trigonometric function: true score * cos(RADIANS(angle)) + random dataset * sin(RADIANS(angle)); (iii) how to quickly complete the computation for 100 hospital units at a single time to examine the validity of PSC domains using the DC detection technique; and (iv) how to draw ROC curves with the MedCalc statistical package (see Additional file 5). In addition, the MS Excel module we programmed complements the article: future researchers can follow our methodology to simulate their own data and verify their own results.
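The rotation-of-axes formula can be illustrated in Python as follows. This sketch is not the Excel module itself; the function name is hypothetical, and choosing the angle as the arccosine of the target correlation is an assumption that relies on the true scores and the random dataset being independent standard normal variables.

```python
import numpy as np

def correlated_true_scores(n, target_corr, seed=0):
    """Generate two sets of true scores whose correlation is about target_corr."""
    rng = np.random.default_rng(seed)
    # Assumption: the rotation angle is the one whose cosine is the target correlation.
    angle = np.degrees(np.arccos(target_corr))
    true_score = rng.standard_normal(n)
    random_dataset = rng.standard_normal(n)
    second = (true_score * np.cos(np.radians(angle))
              + random_dataset * np.sin(np.radians(angle)))
    return true_score, second

first, second = correlated_true_scores(100_000, 0.7)
observed_corr = np.corrcoef(first, second)[0, 1]
```

With a large sample, `observed_corr` should be close to the requested 0.7; smaller samples scatter more widely around the target.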

Limitations and future study

Our study had some limitations. First, only two PA methods (95% PA random data eigenvalues and individual random eigenvalues) were compared for accuracy. Results showed that PA was accurate only at a medium item length and sample size (see Additional file 3). Readers are encouraged to replace the 95% PA random data eigenvalues in the spreadsheet [eigen] in Additional file 6 with other alternatives for further investigation, using the extraction technique shown in Additional file 1.

Second, we compared the observed eigenvalues with those obtained from uncorrelated normal variables. The result might be different if random eigenvalues were obtained from data of known factorial structure [52]. Our finding that PA was accurate only at a medium item length and sample size cannot refute the argument from various studies [21, 53] that PA is the most accurate method, showing the least variability and sensitivity to different factors. However, such results should be further verified in the future.

Third, we demonstrated PSC validity on a hospital-unit basis and found that the global subscales with low construct validity (<0.70, first row in Table 1) were teamwork climate (DC = 0.58) and working conditions (DC = 0.66). We could not infer whether the misfit items emerging in the data resulted from data entry errors, serious cheating or careless behaviors, or a DIF [39] phenomenon possibly occurring among hospitals, none of which were investigated in the previously published paper [6].

Conclusion

PA is not well known among researchers partly because it is not included as an analysis option in most professional statistical packages. This study provides an alternative user-friendly application (a Microsoft Excel tool) that can determine whether a scale is one-dimensional once DC has been applied to examine scale unidimensionality. Such findings support the idea of combining PA and DC to jointly assess a scale’s quality in healthcare settings.