In the social sciences, comparing the means of two or more groups is one of the most frequently encountered research tasks. A special case arises when one of those means is a group mean and the other is the total mean. For instance, one could ask whether the mean performance of companies from a specific sector differs from the mean performance of all companies, or whether students in the United States differ from all students worldwide with respect to their mean educational outcome. For such group to total mean comparisons, the classical t-test (e.g., Kalpic, Hlupic, & Lovric, 2011) and the z-test (e.g., Salkind, 2010) are not appropriate because the groups are not independent of each other (see, for example, OECD, 2005, p. 132).

To construct appropriate group to total mean difference tests, (1) the group to total mean difference \( {\overline{y}}_{group}-\overline{y} \) and (2) the standard error for this difference need to be obtained. Whereas the first calculation is rather trivial, the computation of standard errors requires more effort. A straightforward way is to capitalize on linear regression methods (Searle, 1971). If the contrasts in a linear regression analysis are not explicitly specified, the reference coding used by default in most software packages causes each group to be compared to the reference group (\( {\overline{y}}_{group}-{\overline{y}}_{ref} \)). Changing the contrasts in the regression model to weighted effect coding (WEC; te Grotenhuis et al., 2017) yields regression parameters that correspond to \( {\overline{y}}_{group}-\overline{y} \), that is, the group to total mean difference. WEC simply requires redefining the contrasts in a linear regression analysis. The intercept then represents the total mean, and the regression coefficients represent deviations of the group means from the total mean. See Appendix 1 for an illustration of the differences between various coding schemes with minimal example data.
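
To make these differences concrete, the following minimal sketch in base R mimics the kind of toy example given in Appendix 1 (all values and group labels are invented for illustration; eatRep performs the recoding internally):

# Toy data with two groups of unequal size (all values are illustrative)
dat <- data.frame(y = c(2, 4, 6, 8, 10, 12, 14),
                  g = factor(c("A", "A", "A", "A", "B", "B", "B")))

# Default reference (treatment) coding: the intercept is the mean of group "A"
coef(lm(y ~ g, data = dat))

# Effect coding (EC): group "A" gets -1; the intercept is the unweighted
# ("synthetic") mean of the two group means
contrasts(dat$g) <- matrix(c(-1, 1), nrow = 2,
                           dimnames = list(levels(dat$g), "B"))
coef(lm(y ~ g, data = dat))

# Weighted effect coding (WEC): group "A" gets -nB/nA; the intercept is the
# total mean and the coefficient of "B" is mean(B) minus the total mean
nA <- sum(dat$g == "A"); nB <- sum(dat$g == "B")
contrasts(dat$g) <- matrix(c(-nB / nA, 1), nrow = 2,
                           dimnames = list(levels(dat$g), "B"))
coef(lm(y ~ g, data = dat))
mean(dat$y)   # equals the WEC intercept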

The procedure described above yields analytical standard errors when results from linear regression theory are applied. One disadvantage of WEC is that most software packages have not implemented this procedure or use different coding schemes by default. However, we think that using WEC regression comes with several desirable features: WEC regression analysis can be adequately adapted when design and/or data characteristics are more complicated. Consider, for example, a sampling scheme that involves unequally weighted cases because the individuals included in the sample are not equally representative of the whole population. In this case, we propose using a coding scheme that we call “weighted effect coding for weighted samples” (WECW; see Appendix 1 for details). WECW simply adjusts the WEC contrasts according to the individual sampling weights. Moreover, by employing the regression approach, we can rely on well-studied and established methods and extensions that come into play when design and/or data characteristics are more complicated, for example, when clustered or multi-stage sampling is applied or when imputed variables are part of the analyses. For these scenarios, various extensions building on linear regression exist and are also promising for the estimation of group to total mean differences with WEC or WECW.

Instead of using analytical methods, standard errors for the group to total mean difference can also be estimated by employing bootstrap methods (Davison & Hinkley, 1997; Efron & Tibshirani, 1986). As pointed out by Efron and Tibshirani (1986), bootstrap methods are an alternative when the analytical computation of standard errors becomes increasingly complicated, or in certain situations such as a small number of clusters (e.g., Cameron, Gelbach, & Miller, 2008).
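
As a simple illustration of the bootstrap alternative, the following sketch (with simulated data and invented group labels) estimates the standard error of one group-to-total difference by resampling cases; it is not the replication machinery used in eatRep, just the basic idea:

# Nonparametric bootstrap for the standard error of a group-to-total mean difference
set.seed(1)
dat <- data.frame(y = rnorm(200, mean = 50, sd = 10),
                  g = sample(c("A", "B", "C"), 200, replace = TRUE))

diff_A <- function(d) mean(d$y[d$g == "A"]) - mean(d$y)   # statistic of interest

boot_est <- replicate(2000, {
  d_star <- dat[sample(nrow(dat), replace = TRUE), ]      # resample cases with replacement
  diff_A(d_star)
})

c(estimate = diff_A(dat), boot_se = sd(boot_est))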

Scope and objectives

This article aims to reach practitioners who are confronted with group to total mean comparisons. We assembled and combined trusted statistical techniques for conducting such comparisons in frequently encountered scenarios into one easily accessible software solution. In the present article, we (1) discuss frequently encountered scenarios in social science research, surveys, and large-scale assessments, (2) describe how various types of effect coding and statistical routines can be used to target research questions that imply a comparison of a group mean with the total mean within these scenarios, (3) present the R (R Core Team, 2019) package eatRep (Weirich, Hecht, & Becker, 2020), which facilitates such comparisons, and (4) give two empirical examples for illustration.

In the following, we present some typical scenarios with specific characteristics that researchers are confronted with in experimental studies and survey analyses, and we describe how WEC can be adjusted and extended. For the sake of clarity, analytical and implementation details are given in the appendices. All supported scenarios are summarized in Table 1, along with practical guidance on how to use the function repMean() from the eatRep package. Annotated R code with runnable examples is provided in the Supplementary Material.

Table 1 Frequently encountered scenarios for group to total mean comparisons, two exemplary combinations, and arguments for function repMean()

Commonly encountered scenarios

Scenario 1: Random samples

To compare group means with the total mean in random samples, we suggest employing linear regression with weighted effect coding, a coding scheme that defines the contrasts in such a way that the regression coefficients represent deviations of the group means from the total mean (see, e.g., Sweeney & Ulveling, 1972). In contrast to effect coding (EC), WEC takes into account that the groups may be of unequal size in the population. The WEC intercept represents the total mean, whereas the EC intercept represents the “synthetic” total mean (i.e., the mean of the equally weighted group means). Both approaches yield identical results if the groups are of equal size in the population, which is, however, rarely the case. For an illustration of the differences between EC and WEC, see Table 5 in Appendix 1. The contrasts for group 1 are –1 for EC and –n2/n1 for WEC, where n1 is the number of observations in group 1 and n2 is the number of observations in group 2. As illustrated in Appendix 1, applying linear regression with WEC yields regression estimates that represent point estimates for the differences of the group means from the total mean. Moreover, the corresponding standard errors of the parameter estimates represent standard errors for these differences. As shown in Table 1, for scenario 1, the argument crossDiffSE of the repMean() function needs to be set to “wec” (the default).
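
As a brief check of this property, the following sketch (again with an invented toy data set) reads the group-to-total difference and its standard error directly off the summary of a WEC regression fit:

# WEC fit on toy data: the coefficient of "B" is the group-to-total difference
dat <- data.frame(y = c(2, 4, 6, 8, 10, 12, 14),
                  g = factor(c("A", "A", "A", "A", "B", "B", "B")))
nA <- sum(dat$g == "A"); nB <- sum(dat$g == "B")
contrasts(dat$g) <- matrix(c(-nB / nA, 1), nrow = 2,
                           dimnames = list(levels(dat$g), "B"))

fit <- lm(y ~ g, data = dat)
summary(fit)$coefficients["gB", ]          # estimate, standard error, t value, p value
mean(dat$y[dat$g == "B"]) - mean(dat$y)    # same point estimate, computed directly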

In the following section, we describe how WEC can be applied to designs and situations that are typically encountered in the context of surveys and large-scale assessments. Samples from such studies differ from common random samples in several respects (for a more detailed description of survey samples, see, e.g., Rutkowski, Gonzalez, Joncas, & Von Davier, 2010).

Scenario 2: Weighted samples

Often, sampling designs include over- and/or under-sampled groups. One common reason for this is that groups that are only marginally represented in the population should be represented more strongly in the sample to ensure sufficient power for between-group comparisons (Schofield, 2006). Hence, group-level weights are necessary to ensure that the estimates represent population parameters if the proportions of the groups in the sample do not match the proportions of the groups in the population. Moreover, individual weights are needed to adjust for nonresponse (Rust, 2014).

To apply weighted effect coding for weighted samples (WECW), the contrasts are defined in a different manner. More specifically, the contrasts must now additionally take into account that the relative group sizes in the sample differ from the relative group sizes in the population. Picking up the example given in Appendix 1, the contrast for group 2 is now calculated as \( -\left(\sum_{i=n_1+1}^{n_1+n_2} w_i \,/\, \sum_{i=1}^{n_1} w_i\right) \), where wi are the individual sampling weights, and n1 and n2 are the numbers of examinees in the corresponding groups (see Appendix Table 5). Hence, the number of observations in the corresponding group is replaced by the sum of the weights of all individuals in that group. To use WECW instead of WEC, simply supply the name of the weighting variable to the wgt argument of the repMean() function.
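
The following minimal sketch shows the WECW idea with invented toy data and weights: the group counts in the contrast are replaced by sums of weights, and the regression is estimated with the weights. Note that lm() treats these as precision weights; for design-based standard errors, eatRep combines this coding with the replication or robust estimators described in the following scenarios.

# Weighted effect coding for weighted samples (WECW) on toy data
dat <- data.frame(y = c(2, 4, 6, 8, 10, 12, 14),
                  g = factor(c("A", "A", "A", "A", "B", "B", "B")),
                  w = c(1.0, 1.5, 0.5, 1.0, 2.0, 1.0, 1.0))

wA <- sum(dat$w[dat$g == "A"]); wB <- sum(dat$w[dat$g == "B"])  # sums of weights per group
contrasts(dat$g) <- matrix(c(-wB / wA, 1), nrow = 2,
                           dimnames = list(levels(dat$g), "B"))

fit <- lm(y ~ g, data = dat, weights = w)
coef(fit)[1]                    # weighted total mean
weighted.mean(dat$y, dat$w)     # identical value
coef(fit)[2]                    # weighted mean of "B" minus weighted total mean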

Scenario 3: Clustered samples

If the sampling design is hierarchical, the primary sampling unit is often some kind of higher-level entity, for example, school classes instead of individuals. It is well known that analyzing clustered samples with methods that are based on the assumption of random sampling yields biased standard errors (Lumley, 2004; Wolter, 1985). Alternatively, so-called sandwich estimators (Freedman, 2006; Skinner & Wakefield, 2017) can provide consistent standard errors even if there is heteroscedasticity or clustered sampling (Rogers, 1993). However, in large-scale assessments, which use complex designs that sample clusters with unequal selection probabilities (probability-proportional-to-size (PPS) selection; see Rust, 2014), sandwich estimators are seldom used (Gonzalez, 2014). One possible reason might be that sandwich estimators can yield biased results if the response variable is dichotomous or when cluster sizes are small (Rabe-Hesketh & Skrondal, 2006). Moreover, Efron and Tibshirani (1986) noted that the analytical computation of standard errors becomes increasingly complicated for complex sampling designs. Hence, a common approach is to use resampling techniques such as the bootstrap (Davison & Hinkley, 1997), the jackknife (Rust, 2014; Rust & Rao, 1996; Wolter, 1985), or balanced repeated replication (BRR; Rao & Wu, 1985). Resampling methods like the bootstrap might be superior to analytical methods such as the sandwich estimator (e.g., Harden, 2011), particularly when the number of clusters is small (e.g., Cameron et al., 2008; Feng, McLerran, & Grizzle, 1996; Sherman & le Cessie, 1997). Resampling techniques are implemented in various software programs (Westat, 2000) as well as in R packages such as survey (Lumley, 2019) or BIFIEsurvey (Robitzsch & Oberwimmer, 2019). Which of these methods is appropriate depends on the specific sampling procedure used in the study. For example, when the aim is to re-analyze the PISA 2015 data, the sampling procedure used by PISA 2015 should be taken into account: PISA 2015 used a BRR variance estimator that is adjusted for sparse population subgroups by Fay’s method (Judkins, 1990; OECD, 2017, p. 123). When re-analyzing data from TIMSS 2007 (Mullis et al., 2008), however, the jackknife estimator should be used, as described in the TIMSS 2007 technical report (Foy, Galia, & Li, 2008). The R package eatRep includes both methods to yield standard errors for group mean comparisons. For technical details, see Appendix 2. In repMean(), the replication method can be specified via the type argument; valid options are “JK1”, “JK2”, “BRR”, and “Fay”. Depending on the method, some additional arguments need to be specified, which are described in detail in the help file of repMean().
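
For readers who want to see the replication logic outside of eatRep, the following rough sketch uses the survey package with a delete-one-cluster jackknife; all variable, cluster, and weight names are invented, and the statistic is computed directly from weighted means (which is equivalent to the corresponding WECW regression coefficient):

# Jackknife (JK1) standard error for a group-to-total difference in a clustered sample
library(survey)

set.seed(2)
dat <- data.frame(cluster = rep(1:30, each = 10),
                  g = sample(c("A", "B"), 300, replace = TRUE),
                  w = runif(300, 0.5, 1.5))
dat$y <- 50 + 5 * (dat$g == "B") + rnorm(30)[dat$cluster] + rnorm(300)

des  <- svydesign(ids = ~cluster, weights = ~w, data = dat)
rdes <- as.svrepdesign(des, type = "JK1")     # delete-one-cluster replicate weights

theta <- function(wts, data) {                # weighted mean of "A" minus weighted total mean
  sum(wts[data$g == "A"] * data$y[data$g == "A"]) / sum(wts[data$g == "A"]) -
    sum(wts * data$y) / sum(wts)
}
withReplicates(rdes, theta)                   # point estimate with jackknife standard error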

Furthermore, a common type of clustered data is repeated measurements (i.e., longitudinal data). Here, the groups (level 2 units) are the persons, and the level 1 units are the observations that are nested within persons. As longitudinal data are just a special case of two-level clustering, the presented methods are suitable for longitudinal data as well.

Scenario 4: Imputed variables

When missing values occur in surveys or large-scale assessments, multiple imputation is a common method to provide complete data for secondary analyses. A special case in which imputed values occur is latent variable models (for an introduction to latent variable modeling, see, e.g., Beaujean, 2014), where individual values on the latent constructs (for example, mathematical or reading literacy) must be inferred from observed indicators, for instance, from the items of a competence test or from additional background information from a questionnaire. For missing values as well as for latent constructs, imputation techniques (Little & Rubin, 1987; Rubin, 1987; van Buuren, 2007) such as plausible values (PV) imputation (Mislevy, Beaton, Kaplan, & Sheehan, 1992; von Davier, Gonzalez, & Mislevy, 2009) are often applied. It is not uncommon to replace each single missing value with multiple imputed values, a procedure that results in multiple (imputed) data sets. The analysis of this kind of data requires applying specific routines for pooling the results (Rubin, 1987). These pooling routines are also applicable to linear regression with WEC. Technical details are given in Appendix 3. When using multiply imputed data in eatRep, the data need to be in long format, with a variable indicating the number of the imputation. The name of this variable needs to be passed to the imp argument of the repMean() function.
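
For M imputed data sets, the WEC regression is fitted to each data set, and the results are combined according to Rubin's (1987) rules: the pooled point estimate is the average of the M estimates, and the total variance is the within-imputation variance plus (1 + 1/M) times the between-imputation variance. The following hedged sketch spells out these rules for the coefficient of one group; the helper name pool_wec and the structure of imputed_list (a list of completed data frames with a numeric y and a factor g) are hypothetical, and eatRep itself delegates this step to the pool() function from the mice package (see the section "The R package eatRep").

# Rubin's rules for the WEC coefficient of group "B" across M imputed data sets
pool_wec <- function(imputed_list) {
  fits <- lapply(imputed_list, function(d) {
    nA <- sum(d$g == "A"); nB <- sum(d$g == "B")
    contrasts(d$g) <- matrix(c(-nB / nA, 1), nrow = 2,
                             dimnames = list(levels(d$g), "B"))
    summary(lm(y ~ g, data = d))$coefficients["gB", 1:2]   # estimate and standard error
  })
  est <- sapply(fits, `[`, 1)          # group-to-total differences, one per imputation
  se  <- sapply(fits, `[`, 2)          # their standard errors
  M   <- length(est)
  qbar <- mean(est)                    # pooled point estimate
  W    <- mean(se^2)                   # within-imputation variance
  B    <- var(est)                     # between-imputation variance
  c(estimate = qbar, se = sqrt(W + (1 + 1 / M) * B))
}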

Scenario 5: Heterogeneous group variances

Linear regression with weighted effect coding relies on certain distributional assumptions, one of which is homoscedastic residuals. Especially in survey analyses, this assumption is frequently violated, which can also lead to biased standard errors (White, 1980). To compute standard errors that are robust with respect to heteroscedasticity, various methods have been proposed (Bell & McCaffrey, 2002; MacKinnon & White, 1985; Smyth, 2002; Zeileis, 2004). We adopt these methods for comparisons of group means with the total mean to obtain heteroscedasticity-robust standard errors. Within the R package eatRep, the function lm_robust() from the estimatr package (Blair, Cooper, Coppock, Humphreys, & Sonnet, 2020) is called, which provides a variety of heteroscedasticity-robust variance estimators. In repMean(), the argument hetero defines whether the group variances should be treated as heterogeneous or homogeneous. For heterogeneous variances, simply set the argument hetero to TRUE (the default). The additional argument se_type selects the method for handling heterogeneous variances; valid options are “HC3” (the default), “HC0”, “HC1”, and “HC2”, which are exactly the same labels as those used by the lm_robust() function from the estimatr package.
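
As a hedged sketch (with the same invented toy data as above), heteroscedasticity-robust standard errors for the WEC coefficients can be obtained by swapping lm() for lm_robust():

# Heteroscedasticity-robust (HC3) standard errors for the WEC regression
library(estimatr)

dat <- data.frame(y = c(2, 4, 6, 8, 10, 12, 14),
                  g = factor(c("A", "A", "A", "A", "B", "B", "B")))
nA <- sum(dat$g == "A"); nB <- sum(dat$g == "B")
contrasts(dat$g) <- matrix(c(-nB / nA, 1), nrow = 2,
                           dimnames = list(levels(dat$g), "B"))

summary(lm_robust(y ~ g, data = dat, se_type = "HC3"))   # same estimates, robust SEs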

Scenario 6: Stochastic group sizes

Mayer and Thoemmes (2019) emphasize the distinction between fixed and stochastic group sizes. Group sizes are fixed when the researcher determines the number of persons in each group in advance of the sampling. This might be the case, for instance, in experiments in which the experimenter determines how many persons are assigned to each experimental group, or in surveys/large-scale assessments in which the number of sampled units is determined by the sampling design. However, when population group sizes are unknown, they need to be estimated from the group sizes in the sample. As group sizes vary over samples, they are “stochastic” or “random”, and their estimation is accompanied by uncertainty. This uncertainty should be taken into account to avoid flawed inferences (e.g., Mayer & Thoemmes, 2019). For the estimation of group to total mean differences, we adapted and implemented a multigroup structural equation model with stochastic group sizes as proposed by Mayer, Dietzfelbinger, Rosseel, and Steyer (2016), using the R package lavaan (Rosseel, 2012). Thus, the uncertainty associated with stochastic group sizes enters into the standard errors of the mean differences.
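
The following small simulation is not the Mayer et al. (2016) model itself; it merely illustrates, with invented population values, why treating estimated group proportions as fixed understates the uncertainty of a group-to-total difference:

# Sampling variability of a group-to-total difference with fixed vs. stochastic group sizes
set.seed(3)
mu  <- c(A = 30, B = 50, C = 70)      # population group means (invented)
pi_ <- c(A = 0.2, B = 0.3, C = 0.5)   # population group proportions (invented)
n   <- 500

one_draw <- function(stochastic) {
  if (stochastic) {
    g <- sample(names(mu), n, replace = TRUE, prob = pi_)   # group sizes vary across samples
  } else {
    g <- rep(names(mu), times = round(n * pi_))             # group sizes fixed by design
  }
  y <- rnorm(n, mean = mu[g], sd = 5)
  mean(y[g == "A"]) - mean(y)                               # group-to-total difference
}

c(sd_fixed      = sd(replicate(5000, one_draw(FALSE))),     # smaller spread
  sd_stochastic = sd(replicate(5000, one_draw(TRUE))))      # larger spread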

The R package eatRep

When group to total mean differences are to be estimated, eatRep employs linear regression. If no weights are specified (scenario 1), the contrasts are defined according to WEC. If weights are specified (scenario 2), the contrasts are defined according to WECW (see Appendix 1 for details). In clustered samples (scenario 3), eatRep uses lm() in combination with the withReplicates() function from the survey package (Lumley, 2019) to provide cluster-robust standard errors using replication techniques (see Appendix 2). When imputed variables are part of the analysis (scenario 4), the results of the regression analyses are pooled using the pool() function from the mice package (van Buuren & Groothuis-Oudshoorn, 2011; see also Appendix 3). Heterogeneous group variances (scenario 5) are taken into account by calling the lm_robust() function (instead of the lm() function) from the estimatr package (Blair et al., 2020). If group sizes are to be considered stochastic (scenario 6), a multigroup SEM approach (Mayer & Thoemmes, 2019) is used instead of the lm() function, relying on the R package lavaan (Rosseel, 2012). These methods can also be combined. For example, the multigroup SEM approach (scenario 6) can be used with or without imputed data, WECW can be used with or without clustered data, and so on.

Empirical examples

The abovementioned scenarios are prototypical. In practice, however, researchers are often confronted with combinations of such scenarios, for example, missing values in weighted clustered samples. The R package eatRep (Weirich et al., 2020) offers easy-to-use functionality to compute group to total mean differences for the six presented prototypical scenarios and combinations thereof. In the following, two empirical examples with annotated R code (see Supplementary Material) are provided to illustrate how these comparisons can be conducted with data from survey and large-scale assessment studies.

Empirical example 1: MIDUS 1

In this example, we investigate whether the mean tobacco use in several industry sectors differs from the mean tobacco use in the population. To this end, we use data from the “Midlife in the United States (MIDUS 1), 1995-1996” project (Brim et al., 2019). This example can be seen as a combination of scenarios 2, 5, and 6: we have a weighted sample with heterogeneous group variances, and the group sizes need to be treated as stochastic because the population sizes of the industry sectors are estimated from the sector sizes in the sample. From the total sample of 7108 participants, we chose current smokers from the main sample who completed the phone interview and the self-administered questionnaire and who had non-missing values on the variables “cigarettes per day” (A1PA44) and “current industry” (A1PINMJ). Moreover, industries with sample sizes below 30 were discarded. This yielded an analysis sample of 451 participants. As weights, we used the values of the provided weighting variable (A1WGHT2). The analysis was conducted with the repMean() function from the R package eatRep, using linear regression with weighted effect coding for weighted samples (WECW) to test for differences between the group means and the total mean.

Results are presented in Table 2. The estimated total mean was 27.61 cigarettes per day. The group means of the seven industries ranged from 24.30 (“Professional and related services”) to 34.25 (“Construction”). The results indicate that in the industries “Construction” (M = 34.25) and “Transportation, communications, and public utility” (M = 31.58), significantly more cigarettes are smoked each day than in the total population (p = .003 and p = .011, respectively). In “Professional and related services”, the average tobacco use is significantly lower than in the population (M = 24.30, p = .018). The group means of the other industries do not differ significantly from the total mean. Annotated R code and an example data set generated based on these results are provided in the Supplementary Material.

Table 2 Mean number of cigarettes per day by industry (MIDUS 1 data)

Empirical example 2: PISA 2015

We used data from the 2015 PISA study (OECD, 2016) to compare the OECD countries’ performance in science. The purpose was to test which countries’ mean performance differs from the OECD average. This example can be seen as a combination of scenarios 2, 3, 4, and 5, as we have a weighted, clustered sample with imputed values (PVs) and heterogeneous group variances. The group sizes, however, are considered to be fixed. The total sample consisted of N = 248,620 students in 35 countries. As the dependent variable, we used 10 plausible values (variables PV1SCIE to PV10SCIE from the publicly available PISA 2015 data set). We employed the senate weight variable (SENWT), which scales the weights “to sum up to the target sample size of 5000 within each country” (OECD, 2017, p. 292). Again, the analysis was conducted with the repMean() function from the R package eatRep.

Results are summarized in Table 3. In line with the results reported in the PISA 2015 report (OECD, 2016, p. 67), the means of seven OECD countries (United States, Austria, France, Sweden, Czech Republic, Spain, and Latvia) did not differ significantly from the 2015 OECD average of 493, whereas 18 countries ranged above the OECD average and 10 below. Annotated R code to reproduce these results using the freely available data from the OECD homepage is provided in the Supplementary Material.

Table 3 Mean science performance by country (PISA 2015 data)

Discussion

Research questions aiming at group to total mean comparisons are frequently encountered in the social sciences. To address such comparison problems analytically, different methods can be used. A straightforward method is linear regression with weighted effect coding. To facilitate and promote usage of this approach, we developed the R package eatRep, in which routines for various situations that are typical in survey and large-scale assessment studies (e.g., heterogeneous variances, weighted samples, clustered samples, and multiple imputations) are implemented. To illustrate the usage of eatRep, we conducted two empirical example analyses in which we compared mean tobacco consumption in certain industries to the total mean (MIDUS 1 data) and the mean science competence of students in the OECD countries to the total OECD mean (PISA 2015 data).

Several issues and limitations need to be taken into consideration: (1) Weighted effect coding (WEC) presupposes that there is only a single grouping variable. With more than one variable (e.g., country and gender), the crossed groups (e.g., Japanese girls) need to be technically mapped onto one grouping variable. (2) eatRep incorporates functionality to handle stochastic group sizes. To this end, we used the Mayer et al. (2016) approach, which takes the additional uncertainty due to the stochasticity of the group sizes into account. Mayer and Thoemmes (2019) note that alternative model-based approaches are also feasible, for example, a multinomial model that could be estimated using the KNOWNCLASS option in Mplus (Muthén & Muthén, 1998–2017). (3) To date, the multigroup SEM implemented in eatRep does not account for clustered data. Hence, for complex samples, the standard errors of the group to total mean differences are determined using resampling approaches. To account for stochastic group sizes in clustered and/or complex data, we believe that a suitable resampling procedure needs to be chosen. For example, we assume that resampling approaches are needed in which the group sizes vary over replicates (e.g., the classic bootstrap with case-wise resampling). However, as this is a topic for future research, eatRep currently treats group sizes as fixed when resampling methods are applied. (4) The implementation of weighted effect coding for clustered samples and/or for imputed data is based on replication methods. In contrast to alternative methods for cluster-robust standard errors such as sandwich estimators, replication methods like BRR or the jackknife are also suitable for nonlinear statistics (Krewski & Rao, 1981; Rao & Wu, 1985) and are therefore more flexible. They come, however, with substantially more computational effort. Following the PISA example, 80 replication analyses are conducted according to the 80 replicate weights, and the whole procedure is then repeated 10 times, once for each of the 10 plausible values. Overall, 80 × 10 = 800 replications are necessary, which is computationally very demanding. In the future, the currently implemented routines in eatRep might be improved, for instance, by employing computationally more efficient C++ routines or computational optimizations such as suitable time-saving shortcuts for replication methods (e.g., Magnussen, McRoberts, & Tomppo, 2010; Westfall, 2011). (5) Most data in the context of large-scale assessments provided by institutions such as the OECD are presented in wide format; that is, each line in the data set represents one person, and imputed variables, if present, occur in different columns. However, the package eatRep requires that the data are in long format. Thus, as illustrated in the supplementary R code, the user needs to reshape the data manually. Amongst others, the R packages reshape2 (Wickham, 2007) and tidyr (Wickham & Henry, 2020) provide convenient and efficient functionality for this task. (6) Although we used trusted statistical approaches and routines, the complexity encountered in survey and large-scale assessment studies calls for further validation of the proposed methods. In future research, simulation studies should examine their performance and estimation quality.

In conclusion, we have compiled trusted methods into a versatile software solution for the common problem of comparing group means with the total mean, and we hope that it will help researchers conduct such mean comparisons in the future.

Open practice statement

We did not preregister the presented work because we do not test substantive hypotheses. The data we used are already publicly available (see the MIDUS 1 and PISA 2015 citations). We provide annotated R code to reproduce the reported analyses as supplementary material.