Psychological constructs are unobservable and often indirectly measured by scales with multiple items. For example, the Center for Epidemiologic Studies Depression Scale (CES-D Scale; Radloff, 1977) measures the construct of depression using 20 items that assess depressive symptoms (e.g., how often one had a poor appetite during the past week). Social and behavioral researchers commonly use the sum or mean scores of scale items to compare a psychological construct across groups. Using CES-D as an example, past research has examined gender differences in depressive symptoms among adolescents using the sum scores of the scale (Avison & McAlpine, 1992) and depression levels among U.S. adults during the COVID-19 pandemic using the mean scores of the scale (Fitzpatrick et al., 2020).

Group comparisons with scale scores are valid only when the observed items measure the same latent construct equivalently across groups, a condition known as measurement invariance (MI; Mellenbergh, 1989; Meredith, 1993; Millsap, 2011). MI is an important measurement property that has been widely assessed in psychological and behavioral sciences. In a quick search on PsycINFO, 1,664 peer-reviewed articles published in 2019–2021 contained the keyword “measurement invariance” or “measurement equivalence” in the abstract. If MI does not hold, a condition known as measurement noninvariance, differences in scale scores may reflect not only differences in the latent construct of interest but also incomparable measurement across groups, leading to biased estimates of group differences and erroneous inferences. Therefore, MI is a prerequisite for the valid use of scale scores, particularly when evaluating group differences.

Traditionally, popular approaches to MI testing often involve four sequential stages (Widaman & Reise, 1997): configural (equality of model structure), metric (equality of loadings), scalar (equality of loadings and intercepts), and strict (equality of loadings, intercepts, and unique factor variances) invariance. In practice, however, researchers often omit the test of strict invariance, because scalar invariance supports group comparisons of the observed means or factor means with continuous items (Meredith & Teresi, 2006; Putnick & Bornstein, 2016; Vandenberg, 2002). When one or more items are not scalar invariant, group comparisons based on observed means may be biased (Schmitt & Kuljanin, 2008), even though group comparisons of factor means may still be permissible when researchers fit a partial scalar invariance model that correctly adjusts for the noninvariant parameters (Byrne et al., 1989).

As many psychological scale items are not continuous but categorical, researchers have adapted the above multistage procedure to evaluating MI for ordered-categorical items (e.g., Likert-scale questionnaire items; Millsap & Tein, 2004). For example, unlike a continuous measure that can take on an unlimited number of values, a Likert-scale item on “how often one had a poor appetite during the past week” in CES-D often consists of four response categories: rarely, sometimes, occasionally, and most of the time (Radloff, 1977). Under the item factor model (Birnbaum, 1968; Wirth & Edwards, 2007), latent responses to ordered-categorical items are continuous but discretized into observed categories by a set of thresholds. As such, modeling ordered-categorical items requires a set of threshold parameters in addition to loadings, intercepts, and unique variances. The intercepts denote the conditional means of the latent response distributions when the latent factor mean is zero and are usually set to zero to define the scale of the latent responses (Wu & Estabrook, 2016), whereas the thresholds indicate the positions on the latent response continuum at which a respondent transitions from a lower to a higher category and are often freely estimated (Bovaird & Koziol, 2012). The MI testing procedure for ordered-categorical items parallels the one used for continuous items but with some differences. In particular, while the equality constraints for the configural and metric models are the same, the scalar model often evaluates equality of loadings and thresholds, and the strict model tests equality of loadings, thresholds, and unique variances, with intercepts fixed at zero in all models (Millsap & Tein, 2004).

With a different distribution, however, ordered-categorical items often involve different MI testing practices than continuous items, including estimation methods (Millsap, 2011; B. O. Muthén, 1984), identification conditions (Millsap & Tein, 2004; Wu & Estabrook, 2016), and parameterization (B. O. Muthén, 2002). In addition, dichotomous items (i.e., with two categories) have different properties and, therefore, involve different practices than ordered-polytomous items (i.e., with three or more ordered categories). One example of this difference is that dichotomous items require more constraints for identification than ordered-polytomous items do (Millsap & Tein, 2004; Wu & Estabrook, 2016).

Does scalar invariance support mean comparisons with ordered-categorical items?

While scalar invariance allows factor mean and observed mean comparisons with continuous items, the question remains as to whether the same practice generalizes to both dichotomous and ordered-polytomous items. Many methodological guidelines have suggested that scalar invariance supports factor mean comparisons with ordered-categorical items (e.g., Bauer, 2017; Bovaird & Koziol, 2012; Bowen & Masa, 2015; Kite et al., 2018; Putnick & Bornstein, 2016), and some studies have further advised that scalar invariance allows observed mean comparisons with such items (e.g., Svetina et al., 2019). A general belief is that “scalar invariance supports cross-group comparisons of manifest (or latent) variable means on the latent variable of interest” (Svetina et al., 2019, p. 2). As such, strict invariance, the most stringent invariance condition, is often considered “optional” (Pendergast et al., 2017, p. 71) and is “rarely pursued” (Svetina et al., 2019, p. 2).

For these reasons, tutorials on MI testing with ordered-categorical items often include only tests of configural, metric, and scalar invariance, but not strict invariance (e.g., Bowen & Masa, 2015; Pendergast et al., 2017; Svetina et al., 2019). Moreover, in Mplus, a popular program for latent variable modeling, the convenient MODEL option for MI testing supports only up to scalar invariance for both dichotomous and ordered-polytomous items (L. K. Muthén & Muthén, 1998–2017, 2013). Such an option may encourage users to stop invariance testing at the scalar invariance stage for ordered-categorical items; however, researchers can still manually define a strict invariance model in Mplus.

Whereas the common presumption is that scalar invariance supports both factor and observed mean comparisons with ordered-categorical items, opposing arguments have maintained that strict invariance is required for some forms of mean comparisons. Liu et al. (2017) proved that strict invariance is necessary to ensure that differences in the observed means of ordered-categorical items are attributable only to differences in the latent construct. In other words, valid comparisons of observed means require invariance of loadings, thresholds, intercepts, and unique variances for both dichotomous and ordered-polytomous items. On the other hand, Wu and Estabrook (2016) noted that scalar invariance supports factor mean comparisons specifically for ordered-polytomous items, although they did not discuss the dichotomous case. For dichotomous items, little is known about whether factor mean comparisons are valid in the scalar invariance model.

Given inconsistent guidelines and limited research on the invariance condition required for observed and factor mean comparisons with ordered-categorical items (Pendergast et al., 2017), there is a need to clarify whether strict invariance is a prerequisite for factor mean and observed mean comparisons. Moreover, dichotomous and ordered-polytomous items are often grouped together, implicitly or explicitly, under the broader label of “ordered-categorical” items, yet whether the same practices apply to both types of items also remains an open question.

The current study

To fill that gap in the literature, the current paper discusses and evaluates the MI conditions necessary for valid observed and factor mean comparisons with dichotomous and ordered-polytomous items. As we will illustrate, unlike the case for continuous items, strict invariance is necessary when the goal is to compare observed means of both dichotomous and ordered-polytomous items. Moreover, factor mean comparisons are valid in the scalar or partial scalar model with ordered-polytomous items but not with dichotomous items; for the latter, the strict or partial strict model is needed for valid factor mean comparisons, as demonstrated in the simulation results.

We begin with a brief review of MI testing practices as reported in the literature and present an illustrative example showing that observed mean and factor mean comparisons can provide diverging results. We then define the stages of invariance testing for ordered-categorical items. Next, we perform a Monte Carlo simulation study to systematically evaluate the impact of strict noninvariance on observed and factor mean comparisons; as the simulation shows, for dichotomous items, even when all items are strict invariant, using only a scalar invariance model can bias the estimation of factor mean differences. Lastly, we provide a tutorial on MI testing with ordered-categorical items, including a demonstration of how to establish partial invariance when needed and how to perform factor mean comparisons when strict invariance fails.

Strict invariance was not commonly tested in the literature

We performed a brief review of MI testing practices with ordered-categorical items in psychology-related research, with a focus on studies that evaluated MI using multigroup confirmatory factor analysis (MG-CFA) with weighted least squares (WLS). From a search of the PsycINFO database using the following keywords: (“measurement invarian*” OR “factorial invarian*” OR “differential item function*”) AND (WLS* OR “diagonally weighted” OR DWLS OR Categorical OR Ordinal OR binary OR Likert), we identified 74 peer-reviewed articles published in 2017 and 2018. Fifteen of them were excluded because they (a) were not written in English (\(n\) = 3), (b) did not test MI using empirical data (\(n\) = 10), (c) did not treat scale items as ordered-categorical (\(n\) = 1), or (d) were a corrigendum of a previously published article (\(n\) = 1). Thirty-one of the remaining articles tested MI using MG-CFA, and the rest evaluated MI within the item response theory (IRT) framework or used other approaches (i.e., bootstrap, moderated nonlinear factor analysis, or multiple indicator multiple cause modeling).

Among the 31 articles that used MG-CFA, three involved dichotomous items, and 28 included ordered-polytomous items with more than three response categories. These articles used a variant of diagonally weighted least squares estimation (DWLS; \(n = 24\)) or a variant of maximum likelihood estimation (ML; \(n = 3\)); the remaining four did not specify the estimation method. Whereas some (41.94%) of the articles evaluated strict invariance, the majority (58.06%) tested only up to the configural (\(n = 1\)), metric (\(n = 2\)), or scalar invariance model (\(n = 15\)). Finally, a handful of the articles further compared observed means (\(n = 7\)) or factor means (\(n = 10\)) across groups. Two of these articles compared observed means of ordered-polytomous items without establishing strict invariance, and one compared factor means of dichotomous items in the scalar invariance model.

This brief review shows that the test of strict invariance was often missed when testing MI for scales with ordered-categorical items. In addition, we found instances of observed mean comparisons with ordered-categorical items without the support of strict invariance and an instance of factor mean comparison with dichotomous items in the scalar invariance model.

Table 1 Observed and factor mean comparison in the illustrative example

An illustrative example

The following example shows how inferences from observed mean and factor mean comparisons can diverge due to noninvariance in unique variances. To illustrate, we simulated data based on an empirical study by Sharman et al. (2019), who developed the Beliefs About Crying Scale (BACS), a psychological scale that measures beliefs about whether crying is a helpful or unhelpful behavior in individual and social contexts. For simplicity, we focus here on the one-dimensional Helpful subscale of the BACS and compare means between males and females to examine the role of unique factor invariance in mean comparisons.

The Helpful subscale has seven ordered-polytomous items, each with five response categories (1–5). To create an example for dichotomous items, the response categories below 3 were collapsed into 0, and those at or above 3 were collapsed into 1. As will be illustrated in the tutorial of this paper, the Helpful subscale achieves partial strict invariance with ordered-polytomous items and achieves strict invariance with dichotomous items.

We used the parameter estimates from Sharman et al. (2019) to simulate two toy datasets, one for dichotomous items and another for ordered-polytomous items with five categories. We simulated the datasets to have invariant loadings and thresholds but noninvariant unique variances in the last three items between the two groups. In other words, the simulated datasets achieve scalar invariance but not strict invariance. Each dataset had a sample size of 1000, and the goal was to detect an assumed population mean difference of 0.2. The full R script for the simulation is available in the supplemental materials.

We evaluated the observed mean difference by performing an independent sample t test on the mean scores of the seven items between males and females. Furthermore, we examined the factor mean difference estimate, \(\hat{\alpha }_f\), in two models: (a) the scalar model, which allows the unique variances to freely vary, and (b) the partial strict model, which constrains the unique variances to be equal except for the noninvariant items. To allow for factor mean comparison, we fixed the factor mean of the male group at 0; thus, the factor mean of the female group indicated the difference between the two groups. Note that neither the t test nor the scalar invariance model accounted for the noninvariance of unique variances, whereas the partial strict model did.

As shown in Table 1, the result of the observed mean comparison did not agree with that of the factor mean comparison. For dichotomous items, the independent sample t test failed to detect a difference in observed means between the two groups, t(998) = 1.40, p = 0.16; similarly, the Wald test in the scalar model also failed to detect a factor mean difference between the two groups, z = 0.69, p = 0.49. However, the Wald test in the partial strict model detected a factor mean difference between the two groups, z = 2.19, p < .05. For ordered-polytomous items, whereas the independent sample t test failed to detect an observed mean difference, t(998) = 1.23, p = 0.22, the Wald test in both the scalar, z = 2.28, p < .05, and partial strict models, z = 2.20, p < .05, detected a factor mean difference.

The above example illustrates a case where the conclusions of mean comparisons diverged across models even though the data were scalar invariant. While the population mean difference is unknown in practice, the question is which conclusion is valid when some items have noninvariant unique variances. In the following, we review MI testing with ordered-categorical items and then systematically evaluate the impact of unique factor noninvariance on mean comparisons with a simulation study.

Measurement invariance testing

MI testing typically involves a multistage procedure that sequentially evaluates nested models, each of which adds equality constraints across groups. This procedure was originally developed for continuous items under the multivariate normality assumption within a common factor model (Horn & McArdle, 1992; Meredith, 1993; Vandenberg, 2002; Widaman & Reise, 1997). Because ordered-categorical items do not fulfill such distributional assumptions, alternative multistage procedures were established within the item factor model framework (Liu et al., 2017; Millsap & Tein, 2004; Svetina et al., 2019; Wirth & Edwards, 2007). In this section, we begin by defining the common factor model and the item factor model and then discuss the MI testing procedures for ordered-categorical items.

Common factor model with continuous items

Let \(Y_{ij}\) (\(i\) = 1, 2, \(\ldots \), \(N\); \(j\) = 1, 2, \(\ldots \), \(p\)) be the response of the \(i\)th person on the \(j\)th item in a scale of \(p\) items measuring a latent common factor \(\eta \). A measurement model links \(Y\) and \(\eta \) probabilistically with a set of parameters, expressed as \(P(Y_{ij} | \eta )\). Formally, MI holds when the conditional distribution of the observed items is the same across subgroups, such as gender and ethnicity (Mellenbergh, 1989; Meredith, 1993). That is, for a subgroup membership variable \(G\),

$$\begin{aligned} P(Y_{ij} | \eta _i, G_i = g) = P(Y_{ij} | \eta _i), \forall j, g. \end{aligned}$$
(1)

In other words, responses to scale items depend solely on the common factor and not on group membership. For example, two people with the same beliefs about crying should have the same propensity to respond to the scale items, regardless of their group membership.

For continuous items, the common factor model (Thurstone, 1947) is usually used, represented as

$$\begin{aligned} Y_{ij} = \nu _j + \lambda _j \eta _{i} + \varepsilon _{ij}, \end{aligned}$$
(2)

where \(\nu _j\) is the measurement intercept, \(\lambda _j\) is the factor loading, and \(\varepsilon _{ij}\) is the realized value of the unique factor. It is commonly assumed that \(\varepsilon _j\) is normally distributed with constant variance \(\theta _{j}\), so that \(Y_{ij}\) is also normally distributed conditioned on \(\eta _i\). In addition, the local independence assumption is usually applied such that, when conditioned on \(\eta _i\), \({{\,\mathrm{\textrm{Cov}}\,}}(Y_{ij}, Y_{ij'} | \eta _i) = 0\) for \(j \ne j'\). When there are \(K\) groups, the model is

$$\begin{aligned} Y_{ijk} = \nu _{jk} + \lambda _{jk} \eta _{ik} + \varepsilon _{ijk}, \end{aligned}$$
(3)

where \(k\) = 1, 2, \(\ldots \), \(K\), and \({{\,\mathrm{\textrm{Var}}\,}}(\varepsilon _{ijk})\) = \(\theta _{jk}\).

When a common factor model holds, MI requires that the measurement parameters, for example \(\nu _j\), \(\lambda _j\), and \(\theta _{j}\) for the model in Eq. 2, are the same across groups (e.g., Meredith, 1993). For continuous variables, valid group comparisons do not require all measurement parameters to be equal across groups. Conventionally, researchers have distinguished among four stages of measurement invariance: (a) configural invariance, which requires the configuration of the factor loadings to be the same across groups (Horn & McArdle, 1992); (b) metric/weak invariance, which requires equal factor loadings (i.e., \(\lambda _{jk}\) = \(\lambda _j\) for all \(j\)s and \(k\)s) in addition to configural invariance; (c) scalar/strong invariance, which requires equal measurement intercepts (i.e., \(\nu _{jk}\) = \(\nu _j\) for all \(j\)s and \(k\)s) in addition to metric invariance; and (d) strict invariance, or strict factorial invariance, which requires all measurement parameters (\(\nu _j\), \(\lambda _j\), and \(\theta _j\) for all \(j\)s) to be equal across groups.
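As a concrete illustration of these four stages for continuous items (this sketch is not drawn from the original article; the data set dat, grouping variable g, and items x1 to x4 are hypothetical), the nested models can be fit in lavaan by adding equality constraints through the group.equal argument:

library(lavaan)

mod <- "f =~ x1 + x2 + x3 + x4"  # hypothetical one-factor model

# Configural: same factor structure, all measurement parameters free across groups
fit_config <- cfa(mod, data = dat, group = "g", meanstructure = TRUE)
# Metric/weak: equal loadings
fit_metric <- cfa(mod, data = dat, group = "g", meanstructure = TRUE,
                  group.equal = "loadings")
# Scalar/strong: equal loadings and intercepts
fit_scalar <- cfa(mod, data = dat, group = "g", meanstructure = TRUE,
                  group.equal = c("loadings", "intercepts"))
# Strict: equal loadings, intercepts, and unique (residual) variances
fit_strict <- cfa(mod, data = dat, group = "g", meanstructure = TRUE,
                  group.equal = c("loadings", "intercepts", "residuals"))

lavTestLRT(fit_config, fit_metric, fit_scalar, fit_strict)  # compare the nested models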

Item factor model with ordered-categorical items

Let \(Y_{ij}\) be the observed categorical response and \(Y^*_{ij}\) be the latent continuous response of the \(i\)th person for item \(j\), under the item factor model:

$$\begin{aligned} Y^*_{ij} = \nu _j + \lambda _j \eta _i + \varepsilon _{ij}, \end{aligned}$$
(4)

where \(\eta _i\) is the latent common factor, \(\nu _j\) is the latent intercept, \(\lambda _j\) is the factor loading, and \(\varepsilon _{ij}\) is the realized value of the unique factor. The equation is the same as the factor model for continuous variables. From here, however, it is assumed that \(Y^*_{ij}\) is mapped to \(Y_{ij}\), the observed variable with \(C - 1\) thresholds and \(C\) categories (0, 1, \(\ldots \), \(C - 1\)), by a cumulative link function such that

$$\begin{aligned} Y_{ij} = {\left\{ \begin{array}{ll} 0 &{} \text{ if } Y^*_{ij} \le \tau _{j}^{(1)} \\ c &{} \text{ if } \tau _{j}^{(c)} < Y^*_{ij} \le \tau _{j}^{(c + 1)} \\ C-1 &{} \text{ if } Y^*_{ij} > \tau _{j}^{(C-1)} \end{array}\right. }, \end{aligned}$$
(5)

where \(\tau _{j}^{(1)}\), \(\ldots \), \(\tau _{j}^{(C-1)}\) are the threshold parameters for the \(j\)th item. For example, consider that the latent responses, \(Y^*\), to the item “crying makes me feel better” follow a normal distribution. As shown in Fig. 1, the latent responses below \(\tau ^{(1)}\) fall in Category 0 (e.g., “Not true for me at all”), those between \(\tau ^{(1)}\) and \(\tau ^{(2)}\) are in Category 1 (e.g., “Moderately”), and those above \(\tau ^{(2)}\) are in Category 2 (e.g., “Extremely true for me”). The item factor analysis model thus assumes that an observed response is “Extremely true for me” if the latent response lies above \(\tau ^{(2)}\).
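As a small numerical illustration of Eq. 5 (the loading, intercept, and threshold values below are hypothetical), the following R snippet generates latent responses and discretizes them into three observed categories:

set.seed(1)
eta   <- rnorm(5)                          # latent common factor scores for five persons
ystar <- 0 + 0.8 * eta + rnorm(5)          # latent responses: nu = 0, lambda = 0.8, theta = 1
tau   <- c(-0.5, 1.2)                      # two thresholds -> three categories (0, 1, 2)
y     <- findInterval(ystar, tau, left.open = TRUE)  # number of thresholds strictly below each Y*
cbind(ystar, y)                            # category 0 if Y* <= -0.5, 2 if Y* > 1.2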

Fig. 1

An illustration of biases due to noninvariance in different parameters. All plots show the latent response distributions of an item with three response categories, scoring 0, 1, and 2. Both the reference group (red, solid line) and the focal group (blue, dashed line) share the same factor mean (\(\alpha = 0.2\)) and factor variance (\(\psi = 1\)). Regions below the first threshold, between the first and second thresholds, and above the second threshold indicate the probability of scoring 0, 1, and 2, respectively. E(Y) = observed mean. \(\lambda \) = factor loading. \(\tau ^{(1)}, \tau ^{(2)}\) denote the first and second thresholds. \(\theta \) = unique factor variance. \(\nu \) = intercept. Bias = observed mean difference due to noninvariance in different parameters while the population mean difference is 0

With a probit link, it is assumed that \(\varepsilon _{ij}\) follows a normal distribution,

$$\begin{aligned} \varepsilon _{ij} \sim \mathcal {N}(0, \theta _j), \end{aligned}$$
(6)

which implies that \(Y^*_{ij}\), conditioned on \(\eta _i\), is normally distributed:

$$\begin{aligned} Y^*_{ij} | \eta _i \sim \mathcal {N}(\nu _j + \lambda _j \eta _i, \theta _j). \end{aligned}$$
(7)

Measurement invariance testing with ordered-categorical items

Millsap and Tein (2004) identified four types of parameters, \(\nu _j\), \(\tau _j\), \(\lambda _j\), and \(\theta _j\), for MI testing with ordered-categorical items. As with continuous items, methodologists (Liu et al., 2017; Millsap, 2011; Millsap & Tein, 2004; Svetina et al., 2019) have proposed similar multistage procedures for ordered-categorical items. These procedures also compare nested models by adding equality constraints on parameters, but they differ in the identification conditions, parameterizations, and order of the invariance tests.

Unlike the procedure used with continuous items, MI testing with ordered-categorical items additionally involves \(\tau _j\). A typical way to identify an item factor model is to set \(\nu _j\) to zero (Liu et al., 2017; Millsap & Tein, 2004), which is the default in the popular statistical programs Mplus and lavaan in R. Alternative identification conditions allow estimation of \(\nu _j\) (e.g., Svetina et al., 2019). Interested readers are referred to Wu and Estabrook (2016) for a comprehensive discussion of identification conditions for item factor models with constraints on different types of parameters. Furthermore, B. O. Muthén (2002) discussed two parameterizations for defining the scales of ordered-categorical items: delta and theta. To allow the test of strict invariance, Millsap and Tein (2004) recommended the theta parameterization, under which unique variances are estimable parameters.

Millsap and Tein (2004) introduced a procedure that evaluates invariance models in the following order: (a) configural invariance, (b) invariance of loadings (metric/weak), (c) invariance of loadings and thresholds (scalar/strong), and (d) invariance of loadings, thresholds, and unique variances (strict). This order of invariance tests is also popular in the literature (e.g., Liu et al., 2017; B. O. Muthén, 2002; Pendergast et al., 2017), although the test of strict invariance is often considered optional (Bowen & Masa, 2015; Pendergast et al., 2017; Svetina et al., 2019). In an alternative order of tests, the test of threshold invariance comes before the test of loading invariance (Svetina et al., 2019; Wu & Estabrook, 2016). Moreover, for dichotomous items, because the metric model is equivalent to the configural model, the invariance of loadings and thresholds is usually tested in a single step (Millsap & Tein, 2004; Wu & Estabrook, 2016), resulting in only three stages: configural, scalar, and strict (B. O. Muthén, 2002; Putnick & Bornstein, 2016).

Observed mean comparison

Just as with continuous items, configural invariance and metric invariance do not support observed mean comparisons with ordered-categorical items. As shown in Fig. 1d, even with the same common factor mean (\(\alpha = 0.2\)), differences in loadings, thresholds, and unique variances yield different observed scores on an ordered-categorical item in the two groups. Such differences are not attributable to a group difference in the latent construct but are measurement artifacts that arise when the ordered-categorical item is measured differently between groups. Similarly, Fig. 1c shows that when the thresholds are unequal, the two groups can have different observed scores. If two persons have a latent response at around 1.25, the person from the reference group (red, solid line) would endorse Category 1, as 1.25 falls below \(\tau ^{(2)}_r = 1.5\), but the person from the focal group (blue, dashed line) would choose Category 2, as 1.25 falls above \(\tau ^{(2)}_f = 1\). Hence, threshold noninvariance results in different probabilities of item endorsement and thus different observed scores on an ordered-categorical item.

Although scalar invariance supports observed mean comparisons for continuous items, strict invariance is required for ordered-categorical items. Even when both loadings and thresholds are invariant across groups, differences in observed responses to ordered-categorical items do not necessarily reflect differences in the latent responses or the latent common factor (Liu et al., 2017). As shown in Fig. 1b, due to unequal unique variances, the probability of choosing each of the three response categories differs between groups. This can produce a difference in observed scores even when the two groups share the same factor mean. When strict invariance holds, differences in the observed means are entirely attributable to differences in the latent common factor (Liu et al., 2017). Figure 1a shows that the latent distributions of the two groups align when strict invariance holds. Only in this situation do the probabilities of endorsing each response category overlap between groups, accurately reflecting the fact that the two groups share the same standing on the latent construct. The unique variance parameter generally affects the distribution, and hence the expected value, of the observed responses, except when the distributions are symmetric for all groups. In Appendix A, we present and discuss the mathematical details that support these conclusions.
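To see the role of the unique variance numerically, consider a minimal sketch under the probit model (the parameter values below are hypothetical): marginally over the factor, \(Y^*\) is normal with mean \(\nu + \lambda \alpha \) and variance \(\lambda ^2\psi + \theta \), so the category probabilities, and hence the expected observed score, depend on \(\theta \) even when the factor means are equal.

# Expected observed score of a three-category item under the probit item factor model
expected_y <- function(lambda, nu, tau, theta, alpha = 0, psi = 1) {
  mu <- nu + lambda * alpha                 # marginal mean of the latent response Y*
  sd <- sqrt(lambda^2 * psi + theta)        # marginal SD of Y*
  p  <- diff(c(0, pnorm(tau, mu, sd), 1))   # P(Y = 0), P(Y = 1), P(Y = 2)
  sum((seq_along(p) - 1) * p)               # E(Y) = sum over categories of c * P(Y = c)
}
expected_y(lambda = 0.8, nu = 0, tau = c(-0.5, 1.2), theta = 1)      # reference group, roughly 0.83
expected_y(lambda = 0.8, nu = 0, tau = c(-0.5, 1.2), theta = 1.5^2)  # focal group, roughly 0.86:
# same factor mean, larger unique variance, yet a different expected observed score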

To summarize, observed mean comparisons with ordered-categorical items require full invariance in loadings, thresholds, intercepts, and unique variances to accurately infer differences in the latent common factor. If any of the items are not strict invariant (i.e., under partial strict invariance), observed means can differ across groups even when the groups share the same common factor mean. Without full strict invariance, one should consider comparing the factor means instead.

Factor mean comparison

To allow factor mean comparisons, it is important to first ensure that the identification condition does not involve fixing the factor means of all groups to zero (Wu & Estabrook, 2016). One way to identify the model is to fix the factor mean of one group (i.e., the reference group) to zero and freely estimate the factor means of the other groups (i.e., the focal groups). The estimated factor mean of a focal group then reflects the factor mean difference between the focal group and the reference group.

Factor mean comparisons are permissible in the scalar or partial scalar invariance model for ordered-polytomous items, but only in the strict or partial strict invariance model for dichotomous items. For ordered-polytomous items, as scalar invariance equates the scales of the latent responses, group differences in the factor means reflect group differences in the latent common factor (Wu & Estabrook, 2016). If some thresholds are invariant but some are not, factor means can be compared in the partial scalar invariance model that correctly frees the noninvariant thresholds and constrains the invariant thresholds to be equal across groups.

Table 2 Practices for valid mean comparisons

For dichotomous items, however, using the scalar or partial scalar invariance model does not ensure valid factor mean comparisons. When unique variances are allowed to vary freely across groups, the scalar or partial scalar invariance model fails to uniquely identify the factor means of the focal groups, even if the model correctly constrains invariant loadings, intercepts, and thresholds to be equal across groups. By contrast, with additional equality constraints on unique variances, the strict or partial strict invariance model uniquely identifies the factor means of the focal groups. Therefore, valid factor mean comparisons require correct equality constraints on the invariant unique variances in addition to the loadings, intercepts, and thresholds. Appendix B shows the mathematical support for factor mean comparisons with ordered-polytomous and dichotomous items.

Table 2 summarizes the practices required for valid mean comparisons with ordered-polytomous and dichotomous items. If the goal is to compare observed means, the data must establish strict invariance for both ordered-polytomous and dichotomous items. Whereas factor mean comparisons with ordered-polytomous items are permissible in the scalar or partial scalar invariance model, such comparisons with dichotomous items are valid only in the strict or partial strict invariance model.

Table 3 Parameter values for data generation

Simulation study

We conducted a Monte Carlo simulation study to evaluate the observed and factor mean differences when scale items demonstrate unique factor noninvariance. The goal was to address the following two main questions: (a) How does the lack of strict invariance impact the statistical inference and estimation of the observed mean difference between groups with ordered-categorical items? and (b) Does the scalar model give an accurate estimate and inference of the factor mean difference for both dichotomous and ordered-polytomous items?

In this simulation study, we examined the impact of unique factor noninvariance on items with two, five, or seven response categories, which are common item types in psychological scales. We used three sets of parameter values to generate observed data (a) with negatively skewed distributions, (b) with positively skewed distributions, and (c) based on an empirical example. Table 3 summarizes the parameter values for data generation, as well as the skewness of the observed response distribution and the proportion of respondents endorsing each response category. For (a) and (b), we adapted parameter values from Sass et al. (2014) to simulate data for ten items. For ease of comparison, we maintained a constant skewness in the observed response distribution across item types. For (c), as a follow-up to the illustrative example, we simulated data for seven items based on the parameter estimates from the Helpful subscale of the BACS to systematically evaluate the impact of unique factor noninvariance on empirical data. As most BACS items have a negatively skewed distribution, the result patterns for the simulated BACS data were expected to be similar to those for the negatively skewed data.

To isolate the effect of unique factor noninvariance, we simulated data that are scalar invariant but noninvariant in unique variances between groups. Specifically, the focal group had a larger unique variance than the reference group. We defined the mean difference as the mean of the focal group minus the mean of the reference group. Based on these definitions and the analytic results discussed above, we expect the following:

  1. In the conditions with unique factor noninvariance, the observed mean difference will be underestimated for the simulated data with negatively skewed distributions and overestimated for the simulated data with positively skewed distributions.

  2. For dichotomous items, as the scalar invariance model is unidentified, the factor mean estimate will be biased in this model, whether or not the data achieve strict invariance. However, using the correctly specified strict (or partial strict) invariance model will give an unbiased estimate of the factor mean difference.

  3. For ordered-polytomous items (items with five or seven categories), using either the correctly specified scalar or the strict (or partial strict) invariance model will produce unbiased estimates of factor mean differences.

Simulation design factors

We manipulated five design factors: group size, number of noninvariant items (\(p_{ni}\)), degree of noninvariance (\(d_{ni}\)), population factor mean difference (\(\alpha _f\)), and number of response categories (C). Similar to previous simulation studies (Hsiao & Lai, 2018; Yoon & Lai, 2018), we set the group size (\(n_k\)) to 100, 200, and 500. With two groups, therefore, the total sample sizes were 200, 400, and 1000, indicating relatively small, medium, and large sample sizes, respectively.

With reference to the simulation design in Liu and West (2018), we simulated data to have zero, one, or three items that demonstrated unique factor noninvariance in the same direction. These numbers of noninvariant items (\(p_{ni}\)) corresponded to 0, 10, and 30% of the ten items (first and second sets of parameter values) and 0, 14, and 43% of the seven items (third set of parameter values) with larger unique variances in the focal group than in the reference group, reflecting an absence, a small amount, and a large amount of noninvariance, respectively. Similar to the design in Liu and West (2018), in the conditions with noninvariant items, the focal group had \(1.25^2\) or \(1.5^2\) times larger unique variance(s) than the reference group, indicating a small or a large degree of noninvariance (\(d_{ni}\)).

Following conventional practices with model identification, we fixed the factor mean of the reference group at 0. As such, the factor mean difference between the two groups was equivalent to the factor mean of the focal group (\(\alpha _{f}\)). The population factor mean of the focal group was set at 0, 0.2, and 0.5, similar to the design in Lai et al. (2021). We simulated items that have two, five, and seven response categories (\(C\) = 2, 5, 7).

Data generation

We used the SimDesign package (Chalmers & Adkins, 2020) in R (R Core Team, 2022; version 4.1.3) to structure the simulation. For each design condition, we generated 2500 data sets for analysis.

Assuming a single underlying factor, we simulated the latent common factor scores (\(\eta _{ik}\)) from a normal distribution with a variance of one for both groups. The common factor mean was set at 0 for the reference group (\(\alpha _r = 0\)) and varied depending on the design conditions for the focal group (\(\alpha _f\)). The continuous latent responses for each item were generated based on Eq. 4, where both groups shared the same intercepts of 0 (\(\nu _{jr} = \nu _{jf} = 0\)) and the same loadings (\(\lambda _{jr} = \lambda _{jf}\)). The unique factors (\(\varepsilon _{ijk}\)) were simulated from a normal distribution with a mean of zero for both groups. For the reference group, the unique factor variances were 1 (\(\theta _{jr} = 1\)); for the focal group, the unique factor variances (\(\theta _{jf}\)) varied according to the design conditions. Lastly, we used the same set of thresholds (\(\tau ^{(c)}_{jr} = \tau ^{(c)}_{jf}\)) for both groups to convert the latent responses into observed responses with two, five, or seven categories based on Eq. 5.
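A minimal sketch of this data-generating scheme is shown below; the specific sample size, loading, threshold, and unique variance values are hypothetical and serve only to show the structure of Eqs. 4 and 5.

set.seed(123)
n_k     <- 200                              # group size
lambda  <- rep(0.8, 7)                      # invariant loadings
tau     <- c(-0.5, 1.2)                     # invariant thresholds (three categories)
theta_r <- rep(1, 7)                        # unique variances, reference group
theta_f <- c(rep(1, 4), rep(1.5^2, 3))      # focal group: last three items noninvariant
alpha_f <- 0.2                              # population factor mean of the focal group

gen_group <- function(n, alpha, theta) {
  eta   <- rnorm(n, alpha, 1)                               # common factor, variance 1
  ystar <- outer(eta, lambda) +                             # Eq. 4 with nu = 0
           sapply(sqrt(theta), function(s) rnorm(n, 0, s))  # unique factors
  apply(ystar, 2, findInterval, vec = tau, left.open = TRUE)  # Eq. 5: discretize
}
dat_r <- gen_group(n_k, 0,       theta_r)   # reference group
dat_f <- gen_group(n_k, alpha_f, theta_f)   # focal group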

Data analysis

For each generated data set, we compared the observed means and factor means between groups. For the observed mean comparison, we computed the average item score across the \(P\) items per individual, \(\bar{Y}_{ik} = \frac{1}{P}\sum _{j = 1}^P Y_{ijk}\), and the observed means across individuals in each group, \(\bar{Y}_k = \frac{1}{n_k}\sum _{i=1}^{n_k} \bar{Y}_{ik}\). We then performed an independent sample t test in R against the null hypothesis that the population observed mean difference was zero at \(\alpha = .05\). For factor mean comparisons, we used lavaan (Rosseel, 2012) to analyze the data with (a) a correctly specified scalar invariance model and (b) either a correctly specified strict or partial strict invariance model, depending on the design conditions. All models were identified with the default identification conditions in lavaan and the theta parameterization to allow free estimation of unique variances. In all models, we examined the factor mean estimate and its statistical significance for the focal group, which denoted the factor mean difference because the factor mean of the reference group was fixed at 0.
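For instance, the observed mean comparison for one replication can be sketched as follows, using the dat_r and dat_f matrices generated in the sketch above; the scalar and (partial) strict models were then specified in lavaan following the same logic as in the tutorial below.

# Observed mean comparison: independent sample t test on per-person scale means
ybar_r <- rowMeans(dat_r)          # average item score per person, reference group
ybar_f <- rowMeans(dat_f)          # average item score per person, focal group
t.test(ybar_f, ybar_r)             # H0: zero population observed mean difference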

Fig. 2

Type I error rate of the observed mean comparisons. \(n_k\) = group size. C = number of response categories. \(p_{ni}\) = number of unique factor noninvariant items. \(d_{ni}\) = degree of unique factor noninvariance. \(\alpha _{f}\) = population factor mean of the focal group. The shaded area is the acceptable range of type I error rates, 4.13–5.87%, in this study

We summarized the simulations in terms of rejection rate and raw bias of the observed or factor mean differences. Rejection rate denotes the type I error rate or power when the population factor mean difference was zero or nonzero, respectively. The expected standard error of the current simulation was .44%, calculated using \(\sqrt{(1 - \alpha )\alpha / R}\) (Sass et al., 2014) with \(R = 2,500\) and \(\alpha = 5\%\). Therefore, we determined that the acceptable range for type I error rates was 4.13%-5.87%—two standard errors away from the nominal 5% alpha level. As power is a function of sample size, we compared other conditions with noninvariant items against the baseline conditions that had zero noninvariant items to evaluate the impact of unique factor noninvariance. The raw bias was the average deviation of the sample observed or factor mean difference from the population mean difference across the replications.

Simulation results

The result patterns for the simulated BACS data, most items of which have a negatively skewed distribution, were the same as those for the negatively skewed data. The result patterns for negatively skewed and positively skewed data were also highly similar, except that the directions of biases in the observed mean difference differed. As expected, although the magnitudes of biases were similar, the observed mean difference was underestimated for the negatively skewed data and overestimated for the positively skewed data. For the observed mean comparison, the result patterns of type I error rate were the same for both types of data, but power decreased for the negatively skewed data and “increased” for the positively skewed data. The increase in power, however, was due to an overestimated mean difference between the focal and reference groups and should not be considered desirable. For the factor mean comparison, the result patterns were consistent across all three types of data. Because the result patterns were highly similar, we report the simulation results for the negatively skewed data in the following and provide the details for the positively skewed data and the simulated BACS data in the supplemental materials.

Fig. 3

Statistical power of the observed mean comparisons. \(n_k\) = group size. C = number of response categories. \(p_{ni}\) = number of unique factor noninvariant items. \(d_{ni}\) = degree of unique factor noninvariance. \(\alpha _{f}\) = population factor mean of the focal group. The dashed line indicates 80\(\%\) power

Observed mean comparison

Overall, the effect of unique factor noninvariance on observed mean comparisons was similar for all types of items. As shown in Fig. 2, comparing observed means from data without noninvariant items controlled the type I error rate at the 5% level. However, when unique factor noninvariance was present, observed mean comparisons resulted in an inflated type I error rate. The type I error rate increased with more noninvariant items, a larger degree of unique factor noninvariance, and a larger group size. The type I error rate was similar across item types and was as large as 18.16% for dichotomous items, 18.56% for items with five categories, and 19.52% for items with seven categories when more items (\(p_{ni} = 3\)) demonstrated a large degree of unique factor noninvariance (\(d_{ni} = 1.5^2\)).

Fig. 4

Type I error rate of the factor mean comparisons. \(n_k\) = group size. C = number of response categories. \(p_{ni}\) = number of unique factor noninvariant items. \(d_{ni}\) = degree of unique factor noninvariance. \(\alpha _{f}\) = population factor mean of the focal group. Scalar = the scalar invariance model. (Partial) Strict = the strict invariance model if all items are invariant or the partial strict invariance model if some items demonstrate unique factor noninvariance. The shaded area is the acceptable range of type I error rates, 4.13–5.87%, in this study

Fig. 5

Statistical power of the factor mean comparisons. \(n_k\) = group size. C = number of response categories. \(p_{ni}\) = number of unique factor noninvariant items. \(d_{ni}\) = degree of unique factor noninvariance. \(\alpha _{f}\) = population factor mean of the focal group. Scalar = the scalar invariance model. (Partial) Strict = the strict invariance model if all items are invariant or the partial strict invariance model if some items demonstrate unique factor noninvariance. The dashed line indicates 80\(\%\) power

As shown in Fig. 3, compared to the level in the baseline conditions, power dropped as the degree of noninvariance and the number of noninvariant items increased. From the level in the baseline conditions, power decreased from 73% to 28.32%, from 79.52% to 33.92%, and from 80.60% to 33.80% for items with two, five, and seven categories, when the degree of noninvariance and number of noninvariant items were large and the population mean difference was small (\(p_{ni} = 3\), \(d_{ni} = 1.5^2\), and \(\alpha _f = 0.2\)).

The raw bias of the observed mean difference is summarized in the supplemental materials. The sample mean difference of observed items underestimated the population mean difference when the data contained noninvariant items. The magnitude of the raw bias increased with more noninvariant items and a larger degree of unique factor noninvariance (e.g., the magnitude of the bias went up to 0.07 when three items had a large degree of noninvariance).

Factor mean comparison

Figures 4 and 5 show the results of factor mean comparisons in the scalar and strict/partial strict models. If the simulated data were strict noninvariant, we freely estimated the unique variances of the noninvariant item(s) in the correctly specified partial strict model; otherwise, we evaluated the factor mean difference in the strict model. The model convergence rates were high for items with five response categories (> 99%). For items with seven response categories, the convergence rates were lower (< 80%) when the group size was small (\(n_k = 100\)) but reached high levels (> 99%) when the group size was sufficiently large (\(n_k = 200\)). Models failed to converge when there were empty categories in the simulated data, which occurred more often when the group size was small and the number of response categories was large. For dichotomous items, although the convergence rates of the strict/partial strict models were high (> 99%), the convergence rates of the scalar models were low (< 51%) due to model identification issues, regardless of the sample size condition.

For dichotomous items, using the strict/partial strict model, factor mean comparisons resulted in type I error rates within the acceptable range. Power was low in conditions with a small group size and a small population factor mean difference (\(n_k = 100\), \(\alpha _{f} = 0.2\)) but increased as the group size and/or the mean difference were larger. By contrast, using the scalar model, the type I error rate was substantially outside the acceptable range and was highest (30.12%) when the group size was small and more items demonstrated a large degree of unique factor noninvariance (\(n_k = 100\), \(p_{ni} = 3\), \(d_{ni} = 1.5^2\)). Power in the scalar model was low even when the group size and the population factor mean difference were large (\(n_k = 500\), \(\alpha _{f} = 0.5\)). Regardless of whether the simulated data were strict invariant, using the scalar model consistently led to inflated type I error rates and reduced power. On the other hand, for ordered-polytomous items with five or seven categories, using either the scalar or the partial strict invariance model resulted in a type I error rate within the acceptable range and similar power for all conditions.

The supplemental materials include a summary table that shows the raw bias and standard error of the factor mean difference. For dichotomous items, the raw biases and the standard errors were substantially larger in the scalar model than in the strict/partial strict model. Although both the raw biases and the standard errors decreased as the group size increased, they converged to zero more slowly in the scalar model than in the strict/partial strict model. This finding explains the high type I error rate and low power issues in the scalar model for dichotomous items. For ordered-polytomous items, the raw biases were close to zero in both the scalar and strict/partial strict models, and the standard errors were the same between the two models across all conditions.

Summary

The results of our simulation study show that different levels of invariance were required for comparing the observed means or factor means with dichotomous or ordered-polytomous items. For all types of ordered-categorical items, valid observed mean comparisons required full strict invariance. Unique factor noninvariance led to biases and erroneous inferences in the observed mean differences between groups for all types of simulated data. For factor mean comparisons, using both scalar and strict/partial strict models yielded similar results for ordered-polytomous items. However, for dichotomous items, comparing factor means in the scalar model consistently resulted in a higher type I error rate, lower power, higher bias, and higher standard error than the strict/partial strict model across conditions.

Tutorial on measurement invariance testing for ordered-categorical items

In the following tutorial, we aim to demonstrate the MI testing procedure with ordered-categorical items and illustrate mean comparisons when a subset of the items fails the invariance assumptions (i.e., partial invariance). Although the previous literature has discussed the procedure for testing configural, metric, and scalar invariance with ordered-categorical items (e.g., Bowen & Masa, 2015; Svetina et al., 2019), we extend the demonstration to the test of strict invariance and the search for partial invariance when a few items exhibit threshold or unique factor noninvariance.

The tutorial follows the identification conditions proposed by Millsap and Tein (2004) and Liu et al. (2017). Although there are alternative procedures for MI testing with ordered-categorical items (e.g., Svetina et al., 2019; Wu & Estabrook, 2016), the central idea remains the same regardless of the identification conditions: researchers should ensure strict invariance before comparing observed means of ordered-categorical items and should adjust for strict noninvariance to make valid factor mean comparisons with dichotomous items. We provide the lavaan syntax for MI testing with ordered-polytomous items in the following. The supplemental materials include the complete R script of this tutorial with both ordered-polytomous and dichotomous items.

We used the same example as in the previous sections: the seven-item Helpful subscale of the BACS developed by Sharman et al. (2019). The data were collected from a sample of 210 college students aged between 17 and 48 (\(71.4\%\) female; \(M_{age} = 20.18\), \(SD = 4.79\)). The reliability of the subscale is high, with Cronbach’s \(\alpha =\) 0.91. Our goal was to examine whether there is a gender difference in helpful beliefs about crying. For replicability, we use the following syntax to import the data provided by Sharman et al. (2019) and select only the relevant variables, namely the grouping variable and the seven items of the Helpful subscale:

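A minimal sketch of this step is given below; the file name is hypothetical, and the exact import code is in the supplemental materials.

# Import the BACS data and keep the grouping variable and the seven Helpful items
bacs_raw <- read.csv("sharman_2019_bacs.csv")   # hypothetical file name
helpful_items <- c("BACS_1", "BACS_4", "BACS_26", "BACS_29",
                   "BACS_30", "BACS_31", "BACS_38")
dat <- bacs_raw[, c("Gender", helpful_items)]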

We begin this tutorial by testing the unidimensionality assumption, which is a prerequisite for the use of the observed mean of a psychological scale (McNeish & Wolf, 2020; Widaman & Revelle, 2022) and for fitting a one-factor model. Unidimensionality denotes that a single dimension underlies a set of items; it can be evaluated with statistical methods such as the scree plot (Cattell, 1966), parallel analysis (Horn, 1965; Humphreys & Montanelli, 1975; Velicer, 1976), and the Hull method (Lorenzo-Seva et al., 2011). We briefly illustrate the test of the unidimensionality assumption with parallel analysis and refer interested readers to Bandalos (2018) for a comprehensive discussion of other methods. To perform parallel analysis on the Helpful subscale, we utilize the fa.parallel() function in the psych package (Revelle, 2022).

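A sketch of this call is shown below, using the dat object from the import step; treating the items with polychoric correlations is our assumption here.

library(psych)
# Parallel analysis on the seven Helpful items
fa.parallel(dat[, helpful_items], cor = "poly")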

The first three eigenvalues from the parallel analysis were 4.24, 0.16, and 0.09; the first eigenvalue was substantially larger than the subsequent eigenvalues. The result supports the unidimensionality assumption that a single factor underlies the seven items of the Helpful subscale.

In this tutorial, we follow the MI testing procedure discussed in Liu et al. (2017) and sequentially evaluate configural, metric, scalar, and strict invariance. The configural model is identified by fixing the common factor variance to 1 for the reference group and freely estimating all loadings (Wu & Estabrook, 2016). For ordered-polytomous items, the configural model has the following additional identification constraints (Liu et al., 2017, p. 494); a sketch of lavaan model syntax implementing these constraints is shown after the list:

  1. Fix the latent intercepts \(\nu _j\) to 0 across groups.

  2. For each of \(m\) common factors, select an observed item as the marker variable, and fix the loading of this marker variable to equality across all groups.

  3. In one group (i.e., the reference group), fix the common factor mean \(\alpha _k\) to 0 and the unique factor variances \(\theta _k\) to 1. For the remainder of the groups, freely estimate the unique factor variances.

  4. Fix one threshold for each item across groups. For the marker variable, additionally fix a second threshold.
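The lavaan model syntax below is a sketch of how these constraints can be written (the exact syntax used in this tutorial is in the supplemental materials); the parameter labels such as l38 and t1_38 are arbitrary, BACS_38 serves as the marker variable, and the focal group's factor mean and variance are freely estimated.

mod_config <- "
  # (2) marker variable BACS_38: loading constrained equal across groups
  f =~ c(l38, l38) * BACS_38 + BACS_31 + BACS_29 + BACS_1 + BACS_26 + BACS_4 + BACS_30

  # (3) reference group: factor mean fixed to 0 and factor variance fixed to 1;
  #     both freely estimated in the focal group
  f ~ c(0, NA) * 1
  f ~~ c(1, NA) * f

  # (3) unique variances fixed to 1 in the reference group, free in the focal group
  BACS_38 ~~ c(1, NA) * BACS_38
  BACS_31 ~~ c(1, NA) * BACS_31
  BACS_29 ~~ c(1, NA) * BACS_29
  BACS_1  ~~ c(1, NA) * BACS_1
  BACS_26 ~~ c(1, NA) * BACS_26
  BACS_4  ~~ c(1, NA) * BACS_4
  BACS_30 ~~ c(1, NA) * BACS_30

  # (4) first threshold of every item equal across groups;
  #     for the marker variable, the second threshold is also constrained
  BACS_38 | c(t1_38, t1_38) * t1 + c(t2_38, t2_38) * t2
  BACS_31 | c(t1_31, t1_31) * t1
  BACS_29 | c(t1_29, t1_29) * t1
  BACS_1  | c(t1_1, t1_1) * t1
  BACS_26 | c(t1_26, t1_26) * t1
  BACS_4  | c(t1_4, t1_4) * t1
  BACS_30 | c(t1_30, t1_30) * t1
"
# (1) latent intercepts are fixed to 0 for ordered items by default in lavaan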

We start by identifying the marker variable, which should have an invariant loading between groups, at least two invariant thresholds, and a meaningful metric or a high factor loading (Liu et al., 2017). We fit a single-group one-factor model to the data and identify BACS_38, the item with the highest factor loading, as a candidate. With BACS_38 as the marker variable, we continue the MI testing procedure and examine whether this item has invariant loadings and/or thresholds. If invariance fails for this item, we return to the beginning and select another candidate item as the marker variable. This process is repeated until a marker variable has been identified.

To identify the set of thresholds to constrain, we initially fix the first threshold of all items to be equal between groups and then examine whether the selected thresholds are invariant in the metric model. If invariance holds for these thresholds, we will proceed to the next stage of invariance testing; otherwise, we return to the beginning and repeat the process with another set of thresholds (Liu et al., 2017).

In all invariance models, we use the cfa() function to perform MG-CFA, specifying the grouping variable with group = "Gender". To account for the ordered-categorical nature of the data, we specify the items as ordered and the estimation method as "WLSMV" with "theta" parameterization. Depending on the specific model, mod refers to the corresponding lavaan model syntax, which is available in the supplemental materials. The syntax for fitting the configural model is as follows:

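A sketch of the fitting call is shown below; the object names are assumptions, and mod_config is the configural model syntax sketched above.

library(lavaan)
fit_config <- cfa(mod_config, data = dat, group = "Gender",
                  ordered = helpful_items,
                  estimator = "WLSMV", parameterization = "theta")
summary(fit_config, fit.measures = TRUE)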

This configural model has an acceptable fit, \(\chi ^2(28) = 56.56\), \(p = .001\), RMSEA = 0.10, 95% CI [0.06, 0.14], CFI = 0.99, SRMR = 0.04.

We then move on to assess metric invariance, which has the same identification constraints as the configural model, except that it includes additional equality constraints on the loadings across groups. This metric model has an acceptable fit, \(\chi ^2(34) = 61.25\), \(p = .003\), RMSEA = 0.09, 95% CI [0.05, 0.12], CFI = 0.99, SRMR = 0.04. The modification indices (see syntax below) suggest that the loadings and thresholds of item BACS_38, as well as the first thresholds of all items, are invariant.

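One way to obtain this information is sketched below; whether the original analysis used score tests, modification indices, or both is an assumption here, and fit_metric denotes the fitted metric model.

# Score tests for releasing each equality constraint in the metric model
lavTestScore(fit_metric)
# Modification indices for parameters that are currently fixed
modindices(fit_metric)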

Thus, we confirm the initial identification constraints for the configural model, that is, fixing the first threshold of all items to equality between groups and using BACS_38 as the marker variable. The chi-square difference test between the configural and metric models is statistically nonsignificant (syntax provided below), scaled \(\Delta \chi ^2(6) = 9.13\), \(p = .166\), suggesting insufficient evidence that the loadings are noninvariant.

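A sketch of this nested model comparison is shown below; the fitted-model object names are assumptions.

# Scaled chi-square difference test between the configural and metric models
lavTestLRT(fit_config, fit_metric)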

Next, we move on to the scalar model, which further constrains all thresholds to be equal between groups in addition to the constraints of the metric model. The scalar model has an acceptable fit, \(\chi ^2(54) = 101.91\), p < .001, RMSEA = 0.09, 95% CI [0.06, 0.12], CFI = 0.99, SRMR = 0.04, but fits significantly worse than the metric model, scaled \(\Delta \chi ^2(20) = 43.41\), \(p = .002\), indicating that some thresholds are noninvariant.

Since full threshold invariance failed, the unconstrained thresholds for all items must be tested sequentially to identify the noninvariant threshold(s). This sequential specification search has been found to perform well in controlling false positive rates (Yoon & Kim, 2014). As the modification index suggests that the first threshold of BACS_30 is noninvariant, we free this threshold and use the resulting model as the partial scalar model. The partial scalar model has an acceptable fit, \(\chi ^2(53) = 91.11\), \(p = .001\), RMSEA = 0.08, 95% CI [0.05, 0.11], CFI = 0.99, SRMR = 0.04, and does not fit worse than the metric model, scaled \(\Delta \chi ^2(19) = 32.15\), \(p = .030\). Thus, we proceed to the partial strict invariance model, which constrains the unique factor variances to be equal for the items that have invariant thresholds in the partial scalar model.

The fit of the partial strict model is similar to that of the partial scalar model, scaled \(\Delta \chi ^2(6) = 15.76\), \(p = .015\). Therefore, the final model is a partial strict invariance model with the first threshold and the unique factor variance of item BACS_30 freed. In other words, items BACS_38, BACS_31, BACS_29, BACS_1, BACS_26, and BACS_4 are strict invariant, whereas item BACS_30 is metric invariant. This final model has an acceptable fit, \(\chi ^2(59) = 106.03\), \(p <.001\), RMSEA = 0.09, 95% CI [0.06, 0.11], CFI = 0.99, SRMR = 0.05.

As only partial strict invariance holds, we recommend against comparing the observed means of this subscale between groups. Instead, we can compare the factor means in the partial scalar or partial strict model; for example, the following command examines the factor mean difference in the partial strict model:

[figure f: lavaan syntax for examining the factor mean difference]
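A minimal sketch of how this estimate and its confidence interval can be inspected, assuming (as in the supplemental syntax) that the factor mean is fixed to zero in the reference group so that the other group's estimated factor mean equals the group difference:

# The latent mean in the non-reference group gives the factor mean
# difference, along with its 95% confidence interval.
parameterEstimates(fit_pstrict, ci = TRUE)

# Alternatively, request confidence intervals in the model summary.
summary(fit_pstrict, ci = TRUE)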

The factor mean difference is statistically significant in both the partial scalar model, -0.63, 95% CI [-0.99, -0.29], and the partial strict model, -0.64, 95% CI [-0.99, -0.28]. For ordered-polytomous items, factor mean comparisons are valid in both the scalar/partial scalar and strict/partial strict models. For dichotomous items, however, we recommend that researchers compare factor means only in the strict/partial strict model, as suggested by the simulation results.

One thing to note is that we used the sequential testing approach proposed by Yoon and Millsap (2007), which is not guaranteed to recover the true model when a large number of items violate the MI assumption (Yoon & Kim, 2014). Since we found no evidence of noninvariance for any item other than BACS_30, we believe the results given by this sequential approach are valid. Further details on comparisons of the sequential and nonsequential approaches can be found in Yoon and Kim (2014) and Pohl et al. (2021).

Discussion

The literature lacks consensus on the conditions necessary for valid mean comparisons with ordered-categorical items (Pendergast et al., 2017). On one hand, generalizing from the literature on continuous items, some authors have assumed that strict invariance is optional for ordered-categorical items when comparing factor means (e.g., Bauer, 2017; Bovaird & Koziol, 2012; Putnick & Bornstein, 2016), as well as when comparing observed means across groups (e.g., Svetina et al., 2019). Therefore, strict invariance has rarely been tested in published research (Svetina et al., 2019), as observed in the present paper's brief review of MI testing with ordered-categorical items. On the other hand, Liu et al. (2017) argued that strict invariance is needed for valid observed mean comparisons with ordered-categorical items. Given the inconsistent recommendations in the literature, the aim of the present paper was to revisit the question: Is strict invariance a prerequisite for valid group comparisons of observed means and factor means with dichotomous and ordered-polytomous items?

For observed mean comparisons, the present study echoes the argument of Liu et al. (2017) that valid group comparisons require ordered-categorical items to achieve full strict invariance: invariance of loadings, thresholds, intercepts, and unique factor variances. In the simulation study, we found that the observed mean difference showed increasing bias and inflated type I error rates as the number of unique factor noninvariant items and the degree of noninvariance increased. We note that the impact of unique factor noninvariance could be worse than what is shown in the simulation study. For example, if the items were simulated with a stronger skewness of -2, the type I error rate would increase to more than 40% and power would decrease by more than 60 percentage points for all item types. Furthermore, the type I error rate could also exceed 50% for all item types when we increased the group size to 2,000. We report details of these additional analyses in the supplemental materials.

Relatedly, the distributions of the observed responses also affect the magnitude of the bias. As shown in Appendix A, the impact of unique factor noninvariance decreases with a less skewed (i.e., more symmetric) distribution. Stated differently, ordered-categorical items with a more symmetric distribution are less influenced by unique factor noninvariance in the observed mean difference and behave more similarly to continuous items. This is in line with previous studies showing that continuous methodology can outperform categorical methodology for ordered-categorical items with a symmetric distribution (e.g., Rhemtulla et al., 2012; Sass et al., 2014).

The simulation study showed that for dichotomous items, factor mean comparisons are valid only in the correctly specified strict/partial strict invariance model; for ordered-polytomous items, such comparisons are valid in both the correctly specified scalar/partial scalar and strict/partial strict invariance models. Consistent with the past literature (e.g., Wu & Estabrook, 2016), scalar invariance (i.e., invariance of loadings and thresholds) effectively equates the scale of the latent responses with the latent common factor for ordered-polytomous items. As such, the factor mean difference in the scalar model accurately reflects the group difference in the latent common factor. By contrast, dichotomous items contain fewer response categories and provide less information than ordered-polytomous items, resulting in unidentified parameters when the unique variances are allowed to vary freely between groups. As confirmed in the simulation study, the factor mean difference in the scalar model consistently had inflated type I error rates, lower power, higher biases, and higher standard errors than in the strict/partial strict model. Such biases are present in the scalar model even when the data are strict invariant.

In summary, for dichotomous items, we strongly advise testing strict invariance prior to any form of mean comparison. If full strict invariance holds, one can compare the observed means or compare the factor means in the strict model, but not in the scalar model. If strict invariance fails, one should establish a partial strict invariance model to compare the factor means with dichotomous items. For ordered-polytomous items, if the goal is to compare observed means, we suggest first establishing full strict invariance. Otherwise, factor mean comparisons are valid in either the correctly specified scalar or strict model.

Limitations and future research directions

The current paper fills a gap in the literature regarding the invariance conditions necessary for factor and observed mean comparisons with ordered-categorical items. However, we have not examined the impact of unique factor noninvariance on more complex analyses, such as regression and longitudinal analyses, with ordered-categorical items. It is relatively straightforward to generalize mean comparison to a regression model with group membership as the only predictor and the latent common factor of interest, measured by a set of ordered-categorical items, as the outcome variable, because these are equivalent analyses. However, when the regression model includes an additional latent predictor measured by another set of ordered-categorical items, it is unclear whether unique factor noninvariance would bias the regression slopes. While past research has investigated the impact of loading and intercept noninvariance on regression slopes (Chen, 2008; Hsiao & Lai, 2018), future research should similarly examine the effect of unique factor noninvariance on regression slopes.

Furthermore, the current paper focused on the invariance of unique factor variances; whether the results generalize to the invariance of unique factor covariances requires further investigation. Liu et al. (2017) mathematically showed that observed mean comparisons require invariance of unique factor variances as well as covariances. Future research could evaluate the impact of noninvariance in unique factor covariances on statistical inference and parameter estimation for observed and factor mean comparisons.

Open Practices Statement

Simulation code and data are openly available on the project’s GitHub page (https://github.com/winniewytse/micat-supp).