Introduction

Worldwide, children spend a large part of their early childhoods in some form of out-of-home child care (UNICEF 2008). A large body of research has shown that high-quality child care has advantages for child development (e.g., Belsky et al. 2007). Child care quality is operationalized in many ways, but generally refers to the broad range of environmental features and interactions in non-parental care and education settings that have been linked to children's development (Zaslow et al. 2011). Internationally, there is substantial agreement about what is considered essential in providing for children's developmental needs in center-based child care (e.g., Lamb and Ahnert 2006). Core elements that are universally recognized as required for children's positive development are: safe and healthy care settings, developmentally appropriate stimulation and opportunities for learning, positive interactions with adults, and the promotion of individual emotional growth and positive relationships with other children (Cryer et al. 2002). Despite this consensus regarding the components that contribute to child care quality, little is known about levels of child care quality across different countries. By covering a wide range of regulatory systems, a cross-country comparison of quality may provide more insight into possible determinants of high-quality care. The current meta-analysis focuses on the most widely used instruments to measure quality: the Early Childhood Environment Rating Scale (ECERS; Harms et al. 1980), the Infant/Toddler Environment Rating Scale (ITERS; Harms et al. 1990), and their revisions (ECERS-R; Harms et al. 1998; ITERS-R; Harms et al. 2003). Whilst these measures have been used extensively to compare levels of center-based child care quality, or to assess change over time within countries, this study is the first systematic examination of quality assessments around the globe.

The Environment Rating Scales (ERS) were developed to evaluate process quality in early child care settings. Process quality refers to the experiences of children within the child care environment, including their interactions with others, materials, and activities (Phillipsen et al. 1997). Process quality is assessed primarily through observation and has been found to be more predictive of child outcomes than structural indicators of quality such as staff-to-child ratio, group size, cost of care, and type of care (Whitebook et al. 1989).

The ERS have a long history of use worldwide; the first use of the original ECERS in the USA dates back more than 30 years (Harms and Clifford 1983). Outside the USA, the first studies using the ERS were published almost 20 years ago (e.g., Kärrby and Giota 1994). The long-term international use of these scales, which in itself is suggestive of cross-cultural validity or at least feasibility, provides an excellent opportunity for an international comparison of child care quality. Despite the extensive use of the ERS, however, psychometric research on the scales is limited. Gordon et al. (2013, 2015) recently tested the scales' criterion validity in a US sample and found little evidence of validity with respect to child developmental outcomes and moderate evidence of validity against alternative observational measures of quality. In the present paper, we add to this work by examining associations between ERS process quality and observed proximal aspects of child care quality (caregiver sensitivity) in a wide range of international studies. We also consider possible differences in ERS associations arising from scale characteristics (infant vs early childhood version, original vs revised scale, full vs shortened version). A further goal is to examine associations between ERS scores and structural features (group size, caregiver–child ratio). In the following sections, these goals are further explained.

Process Quality and Structural Features

Government regulation and other forms of quality assurance within countries impact a range of structural features that have been shown to affect process measures of quality (NICHD ECCRN 2002). There are major differences across countries in policy focus, regulations, and subsidies with respect to child care, which are likely to have an effect on levels of ERS quality. This diversity is evident in the UNICEF Innocenti Research Center (2008) Report Card 8, which compared 25 OECD countries on ten benchmarks representing minimum standards for protecting the rights of children (e.g., regulation and accreditation of early childhood services, training of early childhood staff, minimum staff-to-child ratios). This comparison showed, for example, that in 2004 not even half of the countries met the minimum staff-to-child ratio of 1:15 for preschool education.

Two structural features, group size and child–caregiver ratio, are easily quantifiable and are operationalized in the same way across studies, thereby allowing comparisons across countries. In within-country studies, the range of these structural indicators may be limited because of government legislation, and across-country studies are therefore expected to include a wider range of group sizes and child–caregiver ratios. Previous cross-country studies that encompassed a wider range of quality than typically seen in one country (Love et al. 2003) have demonstrated that this approach can uncover stronger associations than is possible in within-country studies. Drawing on earlier work (Phillipsen et al. 1997), our hypothesis is that process quality will be negatively associated with group size and child–caregiver ratio; that is, it will be higher when group sizes are smaller and/or when fewer children are cared for by one caregiver.

Process Quality and Caregiver Sensitivity

The ERS are usually considered instruments for measuring process quality, targeting children's interactions with caregivers and peers and their participation in different activities (Vandell and Wolfe 2000). The ERS emphasize features of the physical environment (personal care, space, furniture, and physical safety), which has drawn criticism for a perceived underrepresentation of caregiver–child interactions (e.g., Cassidy et al. 2005a). Because of this, researchers have often used additional instruments beyond the ERS to capture caregiver–child interactions, such as the Arnett Caregiver Interaction Scale (CIS; Arnett 1989) or, more recently, the Classroom Assessment Scoring System Pre-K (CLASS pre-K; Pianta et al. 2008). The inclusion of the CIS measure in this meta-analysis not only provides a cross-cultural comparison of caregiver sensitivity in group settings around the world, but also contributes to the evidence for the validity of the ERS as a measure of process quality. We hypothesize that ERS scores will be positively associated with caregiver sensitivity as assessed by alternative measures.

Process Quality and Scale Characteristics

Because different variants of the ERS have been widely used in studies with both comparative and longitudinal designs, we investigate whether the different versions yield different results. For instance, in longitudinal studies in which children make the transition from infant to toddler to preschool groups, different types of ERS must be used; similarly, after the publication of revised editions, researchers must decide whether to retain the original version or adopt the revised version of the scale. In this study, we distinguish between type of scale (ITERS vs ECERS), original versus revised scales (ITERS vs ITERS-R; ECERS vs ECERS-R), and full versus shortened versions of the ERS.

Although the revised versions of the ERS share the same rationale and underlying constructs as the original scales, they differ slightly from the original versions. Items have been added (e.g., cultural diversity, inclusion of children with disabilities), the scoring of the items is somewhat different than in the original scales, and more information with respect to scoring has become available (i.e., via a Web site and through handbooks). Although a previous study by Sakai et al. (2003) showed that ECERS and ECERS-R scores were comparable for the same sample, researchers in Germany (Tietze et al. 2001) have reported a difference of half a scale point in quality estimates in favor of the original ECERS. Because these findings diverge and are based on a limited number of studies, it is currently unclear whether observed differences in quality, as measured with the original and the revised versions, represent measurement differences or real differences. The current meta-analytic study examines possible differences in mean ERS scores between the original and revised versions. In addition, we test whether quality ratings have increased or decreased over time.

Shortened versions of the ERS are used for different reasons. First, many researchers do not include the scores of the subscale Parents and Staff (or Adult Needs, depending on the type of scale) when calculating a mean score, because the items from this subscale do not reflect children's everyday experiences and thus do not cover process quality. Second, in some countries, concerns have been raised about the applicability of particular items, and one response has been to omit such items. In Sweden, for instance, some items are excluded because they do not reflect Swedish preschool practice (Kärrby and Giota 1994). Third, researchers may decide to limit the number of items to reduce the time needed for training and administration of the full ERS. Previous analyses of US studies (e.g., Cassidy et al. 2005b; Perlman et al. 2004; Scarr et al. 1994) have shown that a shorter version of the ECERS-R is a good proxy for scores on the full scale. The current study examines whether shortened versions of the ERS yield the same results as the full scale in a larger international sample of studies.

Research Aims

In summary, this study serves three goals. First, we provide an international perspective on child care quality by reviewing a wide range of studies in which the ERS were used. Geographic region was included as a moderator in the analyses because it was assumed that there would be some cultural similarities by regional area. Second, we examine whether assessments with the ERS are influenced by scale characteristics. Third, we provide insight into the associations between ERS process quality, caregiver sensitivity, and structural features of child care (group size, caregiver–child ratio). Research questions are: (1) Does process quality of child care differ across geographic regions? (2) Does process quality of child care depend on ERS characteristics (type of scale, original versus revised versions, full versus shortened versions)? and (3) How are structural components of child care quality and caregiver sensitivity related to ERS ratings of process quality?

Method

Selection Procedure

We systematically searched the electronic databases ERIC, Current Contents, PsycINFO, and PubMed using single and combined search terms as follows: Early Childhood Environment Rating Scale*, Infant/Toddler Environment Rating Scale*, ECERS*, ITERS*, child care, day care, center/centre care, and preschool. Note that the ECERS Extension (ECERS-E; Sylva et al. 2003), which was developed to supplement the ECERS-R with respect to curricular aspects of quality, was not included in the meta-analysis. In addition, the references of the collected papers were searched for relevant studies. Studies were included if the following criteria were met: Studies (1) were carried out within child care centers for children up to 5 years, (2) provided descriptive statistics for the ERS, and (3) achieved pre-established levels of inter-rater reliability for these measures.

A further criterion for inclusion in this meta-analysis was the representativeness of the centers in each study. Two steps were taken. Studies that selected only high-quality centers (e.g., Perlman et al. 2004) and studies that targeted specific populations, such as Head Start settings (e.g., Ontai et al. 2002), were not included. For the remaining studies, we examined to what extent the authors demonstrated that the sampled centers were an adequate representation of that particular country, and based on the information provided, we distinguished between three levels of representativeness (high, moderate, and low). We considered studies to be highly representative if stratified random sampling was used at the country or state (US) level; moderately representative if stratified random sampling was used but only for part of the country or state (US), for example, metropolitan areas; and low either if no random sampling was reported or if random sampling was reported but only for part of the country or state (US) and without considering strata in the selection. If authors did not explicitly mention the term "stratified," but described random sampling that maximized diversity with respect to relevant features of a particular country, such as geography, program settings, and family SES, we considered this stratified random sampling.
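To make this decision rule concrete, the following minimal sketch (in Python; the function and argument names are ours, not part of any coding manual) encodes the three levels of representativeness described above.

```python
def code_representativeness(random_sampling: bool,
                            stratified: bool,
                            whole_country_or_state: bool) -> str:
    """Illustrative coding rule for study representativeness.

    high     - stratified random sampling at the country or state level
    moderate - stratified random sampling, but only for part of the
               country or state (e.g., metropolitan areas)
    low      - no random sampling, or random sampling without strata
    """
    if random_sampling and stratified and whole_country_or_state:
        return "high"
    if random_sampling and stratified:
        return "moderate"
    return "low"

# Example: stratified random sampling restricted to metropolitan areas
print(code_representativeness(True, True, False))  # -> "moderate"
```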

The number of groups or classrooms observed for each study was noted, but we did not establish a minimum number of centers within a study as a criterion for inclusion.

Further, if a study reported on the results of a quality improvement program or an intervention, only pretest scores for that sample were used. If more than one publication was found for the same study or dataset, only the results from the most recent publication were used in the meta-analysis, unless an earlier publication provided more relevant information than the latest publication. If a publication reported on both the ITERS(-R) and ECERS(-R) in different subsamples, these subsamples were considered as separate studies. Similarly, if a publication reported quality scores for more than one country, each country sample was considered as a separate study.

We finished the search in summer 2012. This procedure yielded 72 publications, published from 1989 to 2012, covering a total of 7737 child care groups or classrooms.

Quality Measures

Environment Rating Scales

The ITERS and its revision, the ITERS-R, were developed for groups with children under 2.5 years of age, whereas the ECERS and its revision, the ECERS-R, are intended for groups with children between 2.5 and 5 years of age. Each of these scales comprises seven subscales, namely: (a) Space and Furnishings (e.g., indoor space, room arrangement for play, child-related display), (b) Personal Care Routines (e.g., greeting/departing, nap/rest, health practices), (c) Language–Reasoning (e.g., books/pictures, informal use of language), (d) Activities (e.g., fine motor, dramatic play), (e) Interaction (e.g., supervision of children, staff–child interactions, interactions between children), (f) Program Structure (e.g., free play, group time), and (g) Parents and Staff (e.g., provisions for parents, staff interaction). Each scale contains between 35 (original ITERS) and 43 (ECERS-R) items, rated on a seven-point scale with detailed criteria for 1 (inadequate), 3 (minimal), 5 (good), and 7 (excellent). The overall process quality score is the average of the item scores, with each item contributing equally. Scoring is based on observation (a minimum of 2 h) as well as caregiver responses to questions about aspects of the program that are not directly observable.
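As an illustration of this scoring procedure, the sketch below (Python; item labels and ratings are hypothetical) computes an overall ERS score as the unweighted mean of item ratings on the 1–7 scale.

```python
# Hypothetical item ratings on the 1 (inadequate) to 7 (excellent) scale.
from statistics import mean

item_ratings = {
    "indoor_space": 5,
    "room_arrangement": 4,
    "greeting_departing": 6,
    "books_and_pictures": 3,
    "fine_motor": 4,
    "staff_child_interactions": 5,
    "free_play": 4,
    # ... in practice 35-43 items, depending on the scale version
}

def ers_mean(ratings):
    """Overall process quality: the unweighted mean of all item scores."""
    assert all(1 <= score <= 7 for score in ratings.values())
    return mean(ratings.values())

print(round(ers_mean(item_ratings), 2))  # 4.43 for the ratings above
```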

Arnett Caregiver Interaction Scale

The Arnett Caregiver Interaction Scale (CIS; Arnett 1989) consists of 26 descriptions of caregiver behavior that are scored on a four-point scale, with scores ranging from 1 (not at all) to 4 (very much). Ratings are based on how often (after a few hours of observation) a caregiver was observed to perform the behavior described in the item. Arnett (1989) originally distinguished four subscales: (a) Positive Interaction, (b) Punitiveness, (c) Permissiveness, and (d) Detachment; however, subsequent analysis by Whitebook et al. (2004) found a three-factor solution that included the factors Sensitivity, Harshness, and Detachment. This was replicated by other researchers (e.g., Tietze et al. 1996). A more recent validation study (Colwell et al. 2013) showed that the Arnett scale measures one substantive dimension (sensitive caregiver interaction) rather than four subscales. For the purposes of this meta-analysis, therefore, we used the reported score for caregiver sensitivity (although sometimes labeled differently), which is defined as caregiver behavior that is warm, attentive, and engaged.
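The sketch below (Python) illustrates one plausible way a single sensitivity score can be derived from the 26 four-point Arnett items: items describing negative behavior (e.g., harshness, detachment) are reverse-coded before averaging, so that higher scores always indicate more sensitive care. The split into positively and negatively worded items shown here is hypothetical; published factor solutions differ, as noted above.

```python
# Hypothetical ratings on the 1 (not at all) to 4 (very much) scale.
positive_items = [3, 4, 2, 4, 3, 3, 4, 2, 3, 4, 3, 4, 3, 3]  # 14 items, e.g., warmth/engagement
negative_items = [1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 1, 2]        # 12 items, e.g., harshness/detachment

# Reverse-code negative items (1 <-> 4, 2 <-> 3) and average all 26 items.
reversed_negative = [5 - score for score in negative_items]
all_items = positive_items + reversed_negative
sensitivity = sum(all_items) / len(all_items)
print(round(sensitivity, 2))  # 3.38 for the ratings above
```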

Data Extraction

Sample characteristics comprised publication year, the geographical area from which the sample originated (country, geographic region), sample size (number of groups or classrooms observed), and the assigned level of representativeness (high, moderate, low). The coded outcomes were the mean scores on process quality as measured with the ERS, mean scores on Arnett caregiver sensitivity, mean reported group size, child–caregiver ratio, and reported Pearson's correlations (rs) between process quality and caregiver sensitivity, group size, and child–caregiver ratio. Group size was defined as the mean number of children present during the observation. Because studies varied in how ratio was reported, that is, as child–caregiver ratios or caregiver–child ratios, we extracted a figure for mean child–caregiver ratio (number of children present divided by number of caregivers present; e.g., 4:1). If only the caregiver–child ratio was reported (e.g., 1:4), we calculated its inverse (child–caregiver ratio = 1/caregiver–child ratio). Studies that only provided correlations between caregiver–child ratio and process quality could not be included in the meta-analysis, because such correlations cannot be directly converted into correlations based on child–caregiver ratio. In order to examine the reliability of the ERS, the reported internal consistencies (Cronbach's alpha) of the scales were also coded.
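The ratio conversion can be made explicit with a small sketch (Python; the helper functions are ours).

```python
def child_caregiver_ratio(children: float, caregivers: float) -> float:
    """Children per caregiver, e.g., 8 children with 2 caregivers -> 4.0."""
    return children / caregivers

def invert_caregiver_child_ratio(caregiver_child: float) -> float:
    """Convert a caregiver-child ratio (1:4, i.e., 0.25) into the
    equivalent child-caregiver ratio (4.0)."""
    return 1.0 / caregiver_child

print(child_caregiver_ratio(8, 2))          # 4.0
print(invert_caregiver_child_ratio(1 / 4))  # 4.0
```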

Scale moderators included type of scale (ITERS[-R] vs ECERS[-R]), original versus revised versions, and full versus shortened versions. As for the latter, we coded whether subscales or individual items were excluded from the scales before analysis. If so, the scale was coded as a shortened version.

To assess inter-coder reliability, 21 publications (29 %) were coded by two coders. Agreement between the coders for outcome variables and moderators was satisfactory (mean kappa for categorical variables .77; percentage agreement between 76 and 100 %; mean intra-class correlations for continuous variables .99).
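For readers less familiar with these agreement statistics, the sketch below (Python, on made-up ratings) shows how Cohen's kappa and percentage agreement for a categorical moderator could be computed; a Pearson correlation is used here as a simple stand-in for the intra-class correlations reported above, and the actual double-coded data are not reproduced.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Categorical moderator coded by two coders (e.g., full vs shortened scale).
coder_a = ["full", "short", "full", "full", "short", "full"]
coder_b = ["full", "short", "full", "short", "short", "full"]
kappa = cohen_kappa_score(coder_a, coder_b)
agreement = np.mean([a == b for a, b in zip(coder_a, coder_b)])

# Continuous outcome (e.g., mean ERS score) extracted by both coders.
scores_a = np.array([3.9, 4.2, 5.1, 2.8, 4.6])
scores_b = np.array([3.9, 4.3, 5.1, 2.8, 4.5])
r = np.corrcoef(scores_a, scores_b)[0, 1]

print(f"kappa = {kappa:.2f}, agreement = {agreement:.0%}, r = {r:.2f}")
```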

Meta-analytic Procedures

The meta-analysis was performed using the Comprehensive Meta-Analysis Program (CMA; Borenstein et al. 2009). Tests for significance and moderator analyses were performed using random-effects models (Borenstein et al. 2007). A random-effects model allows for the possibility that there are random differences between studies, associated with variations in procedures, measures, and settings, that go beyond subject-level sampling error and thus point to different study populations. Q-statistics (Borenstein et al. 2009) were computed to test the homogeneity of the overall set and of specific sets of effect sizes. Contrasts were only tested if a subset consisted of at least four studies (k ≥ 4) (Bakermans-Kranenburg et al. 2003).
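We do not reproduce CMA's exact computations here, but the following sketch (Python, with hypothetical study means and sampling variances) illustrates a standard DerSimonian–Laird random-effects pooling together with the Q statistic for homogeneity, the general approach described above.

```python
import numpy as np
from scipy import stats

# Hypothetical study-level means (e.g., mean ERS scores) and their
# within-study sampling variances.
y = np.array([3.5, 4.2, 3.9, 4.8, 3.1])
v = np.array([0.04, 0.06, 0.03, 0.08, 0.05])
k = len(y)

w = 1 / v                                   # fixed-effect weights
y_fixed = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - y_fixed) ** 2)          # homogeneity statistic, df = k - 1
p_Q = stats.chi2.sf(Q, k - 1)

# DerSimonian-Laird estimate of the between-study variance.
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

w_star = 1 / (v + tau2)                     # random-effects weights
y_re = np.sum(w_star * y) / np.sum(w_star)  # pooled random-effects mean
se_re = np.sqrt(1 / np.sum(w_star))
ci = (y_re - 1.96 * se_re, y_re + 1.96 * se_re)

print(f"Q({k - 1}) = {Q:.2f}, p = {p_Q:.3f}, tau2 = {tau2:.3f}, "
      f"pooled mean = {y_re:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```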

To address possible publication bias, we (1) used the "trim and fill" method (Duval and Tweedie 2002a, b) to calculate the effect of potential publication bias on the outcomes of the meta-analysis and (2) computed the fail-safe N according to the method proposed by Orwin (1983), referring to the number of studies necessary to bring the effect size down to trivial levels (e.g., r < .10). In the "trim and fill" method, a funnel plot is constructed of each study's effect size against the sample size or the standard error (usually plotted as 1/SE or precision). These plots should be shaped like a funnel if no publication bias is present. However, because smaller and statistically nonsignificant studies are less likely to be published, studies in the bottom left-hand corner of the plot are often missing (Duval and Tweedie 2002a, b). With the "trim and fill" procedure, the k rightmost studies considered to be symmetrically unmatched are trimmed and their missing counterparts are imputed or "filled" as mirror images of the trimmed outcomes. This leads to an estimate of the combined effect size adjusted for potential publication bias.
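Orwin's fail-safe N has a simple closed form; the sketch below (Python, with purely illustrative numbers) shows the computation under the usual assumption that the missing studies have a mean effect of zero.

```python
def orwin_failsafe_n(k: int, es_combined: float, es_trivial: float,
                     es_missing: float = 0.0) -> float:
    """Number of additional studies with mean effect `es_missing` needed
    to reduce the combined effect size to the trivial criterion
    `es_trivial` (Orwin 1983)."""
    return k * (es_combined - es_trivial) / (es_trivial - es_missing)

# Illustration: 10 studies with a combined r of .50 and a trivial
# criterion of r = .10 would require 40 additional null studies.
print(orwin_failsafe_n(k=10, es_combined=0.50, es_trivial=0.10))  # 40.0
```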

Results

We first summarize the features of the studies identified in the systematic search of the literature. We then report findings on the research questions, examining how process quality of child care, measured by mean ERS scores, varies by geographic region, sampling characteristics, and scale characteristics. Next, we examine how structural components of child care quality (group size, child–caregiver ratio), and the associations between these structural components and process quality, vary by geographic region and scale characteristics. Finally, we examine how caregiver sensitivity, and the association between caregiver sensitivity and process quality, vary by geographic region and scale characteristics.

We found 72 studies (56 publications) with a total of 7737 child care groups (infant groups, toddler groups, preschool groups, mixed age groups) in 23 countries covering five geographic regions (Asia, Australia, Europe, North America, South America). Table 1 provides an overview of study characteristics and moderators. About one-third of the studies (k = 25) were conducted in the USA. Reported Cronbach’s alpha (k = 34) of the full ERS ranged from .66 to .97 with a mean alpha of .90.

Table 1 Study characteristics (country, authors, year of publication, sample size, additional quality measures) and moderators (geographic region, representativeness, type of scale, full scale)

Mean ERS score for the combined set of studies (k = 72, N = 7737) was 3.96 (CI 3.79–4.12), which is just below the midpoint of the seven-point scale (see Table 2). Mean scores ranged from 2.4 to 5.98. Duval and Tweedie’s (2002a, b) trim and fill approach revealed no asymmetry in the funnel plots (see Fig. 1); the absence of unmatched studies on the left side suggests that publication bias is unlikely. A cumulative meta-analysis confirmed the absence of an association between year of publication and mean ERS score: A trend toward higher or lower scores across time was not present.

Table 2 Results of moderator analyses: number of studies and classrooms and combined mean quality scores including 95 % confidence intervals (CI)
Fig. 1 Funnel plots showing each study's effect size against the standard error (plotted as precision). Mean ERS scores are plotted on the X-axis. Each dot represents a study

At the country level, average child care quality was lowest (mean scores < 3) in Bangladesh (Aboud 2006), the Netherlands Antilles (Meerdink and Schonenburg 2010), and South Korea (Sheridan et al. 2009) and highest (mean scores > 5) in Australia (Fenech et al. 2010; Skouteris et al. 2007).

How Does Process Quality of Child Care Vary by Geographic Region?

We conducted a moderator analysis contrasting mean ERS scores across five geographic regions, which gave statistically significant results, Q(4) = 24.75, p < .001 (see Table 2).

Pairwise post hoc contrasts indicated that child care quality in Australia and New Zealand was significantly higher than in all other geographic regions (see Fig. 2). Child care quality in North America (including the Netherlands Antilles) was significantly higher than in Europe, South America, and Asia (see Fig. 2). Other contrasts across geographic regions were not statistically significant.

Fig. 2 Mean process quality as observed with the ERS in five geographic regions. Mean ERS scores are plotted on the Y-axis and classified as low (mean score < 3), moderate (3 ≤ mean score < 5), and high (mean score ≥ 5). *p < .05; **p < .01

In addition to mean scores, we examined whether variances in ERS scores differed across geographic regions. For this purpose, an ANOVA (using SPSS 19) was applied to the data, with the mean SD on the ERS as the dependent variable and geographic region as the independent variable. Results showed a main effect of geographic region (F[4, 67] = 8.02; p < .001). Post hoc tests (Bonferroni) demonstrated that SDs were significantly lower in the European and Asian samples than in the North American samples (d = 1.35 and d = .97, respectively). This implies that, in general, there is more variability in ERS scores within North American countries than within Asian and European countries. Because variances in scores could be confounded with scale characteristics, we added type of scale, original versus revised scale, and full versus shortened scale as covariates in three subsequent analyses. The effects remained unchanged after including these covariates.
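A sketch of this secondary analysis, using Python with statsmodels rather than SPSS and hypothetical study-level data, is shown below; the Bonferroni post hoc comparisons are omitted for brevity.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical study-level data: SD of ERS scores, geographic region,
# and one scale characteristic used as a covariate.
studies = pd.DataFrame({
    "ers_sd": [0.9, 1.1, 0.6, 0.5, 0.8, 1.2, 0.4, 0.7],
    "region": ["NorthAmerica", "NorthAmerica", "Europe", "Europe",
               "Asia", "NorthAmerica", "Asia", "Europe"],
    "revised_scale": [1, 0, 1, 1, 0, 1, 0, 0],
})

# ANOVA with the mean SD as dependent variable and region as factor.
model = smf.ols("ers_sd ~ C(region)", data=studies).fit()
print(anova_lm(model, typ=2))

# The same model with a scale characteristic added as covariate.
model_cov = smf.ols("ers_sd ~ C(region) + revised_scale", data=studies).fit()
print(anova_lm(model_cov, typ=2))
```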

How Does Process Quality of Child Care Vary by Sampling Characteristics?

As for representativeness, 26 studies were designated as highly representative, 26 as moderately representative, and 20 as low in representativeness for a particular country. We conducted a moderator analysis contrasting mean ERS scores across the three levels, which was not statistically significant (see Table 2), indicating that the level of study representativeness for a particular country did not affect quality levels. We conducted a further test for differences in ERS scores across geographic regions by including only the 52 studies that were highly or moderately representative. Results were statistically significant, Q(4) = 22.88, p < .001. Pairwise post hoc contrasts confirmed our findings from the whole dataset: Child care quality in Australia and New Zealand was significantly higher than in all other geographic regions. Child care quality in North America was significantly higher than in South America and Asia. The difference in quality between North America and Europe disappeared, although a trend toward higher scores in North America was still visible.

How Does Process Quality of Child Care Vary by Scale Characteristics?

Type of Scale

Twenty-one studies reported on the ITERS or its revision (ITERS-R); 48 studies used the ECERS or ECERS-R, and three studies reported on combined outcomes for ITERS and ECERS (these studies were excluded from the moderator analysis). A moderator analysis contrasting studies reporting on the scale for infants and toddlers (ITERS[-R]) versus studies reporting on the scale for older children (ECERS[-R]) did not yield a statistically significant result.

Original Versus Revised Scale

The original ITERS or ECERS was used in 40 studies, whereas the revised version was used in 32 studies. A moderator analysis contrasting scores on the original scales versus the revised scales did not yield a statistically significant result: Scores on process quality did not depend on the version of the scale.

Full Scale Versus Shortened Scale

In 25 studies, a shortened version of the ERS was used, most often (74 %) excluding the subscale Parents and Staff or Adult Needs. A moderator analysis contrasting scores on the full scale versus shortened scale was not statistically significant.

Additionally, we examined whether interactions between geographic region and scale characteristics yielded different scores on process quality. We tested whether geographic region, in interaction with type of scale, original versus revised scale, and full versus shortened scale, respectively, yielded different results for mean ERS ratings. We found only one interaction effect: Within Europe, a moderator analysis contrasting studies using the infant–toddler scale (ITERS[-R]) with studies using the scale developed for older children (ECERS[-R]) yielded a statistically significant result, Q(1) = 7.70, p < .01, with higher scores for the early childhood version than for the infant–toddler version.

How do Structural Features and Associations Between Structural Features and Process Quality Vary by Geographic Region and Scale Characteristics?

Group Size

Mean group sizes and standard deviations were reported in 21 studies, of which 11 were European and 10 North American. Mean group size for the combined set of studies (k = 21, N = 2467) was approximately 15 children (M = 15.19; CI 13.25–17.13), with a range from 9.1 to 30.0. A moderator analysis showed that group size in Europe did not differ statistically significantly from group size in North America. Moderator analyses on group size including other geographic regions were not possible, because fewer than four studies (k < 4) were available.

A moderator analysis with scale characteristics showed that group size differed depending on the type of scale. As expected, group sizes were smaller in groups in which the ITERS(-R) was used than in groups in which the ECERS(-R) was used, Q(1) = 15.44, p < .001. No differences in group size were found when comparing groups assessed with the original versus the revised scales or with the full versus shortened scales.

Pearson's correlation coefficients (rs) between mean ERS scores and group size were reported in 17 studies (see Table 3) and ranged from −.40 to .29. Mean r for the combined set of studies (k = 17, N = 1710) was −.03 (CI −.11 to .06; p = .53), demonstrating no overall statistically significant association between process quality and group size. Moderator analyses showed no effect of geographic region or scale characteristics.
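Correlations such as these are conventionally pooled on Fisher's z scale and then transformed back to r; the sketch below (Python, with hypothetical correlations and sample sizes, and fixed-effect weights for brevity) illustrates that step. We assume, but cannot verify from the published reports, that this convention underlies the combined rs shown in Table 3.

```python
import numpy as np

# Hypothetical study-level correlations and numbers of classrooms.
r = np.array([-0.40, -0.10, 0.05, 0.29, -0.15])
n = np.array([120, 80, 150, 60, 200])

z = np.arctanh(r)              # Fisher z transform
v = 1 / (n - 3)                # approximate variance of z
w = 1 / v
z_pooled = np.sum(w * z) / np.sum(w)   # fixed-effect pooling for brevity;
                                       # a random-effects version would add
                                       # tau^2 to v as in the earlier sketch
se = np.sqrt(1 / np.sum(w))
ci_z = (z_pooled - 1.96 * se, z_pooled + 1.96 * se)

r_pooled = np.tanh(z_pooled)   # back-transform to the r metric
ci_r = np.tanh(ci_z)
print(f"pooled r = {r_pooled:.2f}, 95% CI = ({ci_r[0]:.2f}, {ci_r[1]:.2f})")
```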

Table 3 Pearson’s correlations between ERS scores, group size, ratio, and caregiver sensitivity: number of studies and classrooms, and combined r’s including 95 % confidence intervals (CI)

Child–Caregiver Ratio

In 21 studies (11 in Europe and 10 in North America), mean child–caregiver ratios and standard deviations were reported. Mean child–caregiver ratio for the combined set of studies (k = 21, N = 2638) was 8.60 (CI 7.21–9.51), with a range from 3.1 to 25. Across all studies, on average, eight to nine children were cared for by one caregiver. A moderator analysis showed that child–caregiver ratio in Europe did not differ statistically significantly from child–caregiver ratio in North America. Moderator analyses on ratio including other geographic regions were not possible, because fewer than four studies (k < 4) were available.

Further analyses showed that ratio differed depending on the type of scale. As expected, ratios were lower in groups in which the ITERS(-R) was used than in groups in which the ECERS(-R) was used, Q(1) = 33.74, p < .001. No differences were found for the original versus revised scales or for the full versus shortened scales.

Pearson's correlation coefficients (rs) between mean ERS scores and child–caregiver ratio were reported in 10 studies (6 in Europe, 3 in North America, 1 in South America) and ranged from −.33 to .22 (see Table 3). Mean r for the combined set of studies (k = 10, N = 963) was −.17 (CI −.27 to −.07; p = .001), indicating that process quality was significantly associated with child–caregiver ratio: Process quality was higher when fewer children were under the care of one caregiver. Moderator analyses either were not possible because k was too small (geographic region, type of scale) or did not yield statistically significant results (original vs revised scale, full vs shortened scale).

How do Caregiver Sensitivity and Associations Between Caregiver Sensitivity and Process Quality Vary by Geographic Region and Scale Characteristics?

Caregiver Sensitivity

In 19 studies (8 in Europe and 11 in North America), caregiver sensitivity, as measured with the Arnett scale, was reported. Mean caregiver sensitivity for the combined set of studies (k = 19, N = 2212) was 3.06 (CI 2.98–3.15). A moderator analysis showed a significant difference between caregiver sensitivity in Europe and North America, Q(1) = 5.85, p = .02. On average, caregivers in North America received higher ratings for sensitive behavior than caregivers in Europe. Moderator analyses on caregiver sensitivity including other geographic regions were not possible, because fewer than four studies (k < 4) were available.

Pearson's correlation coefficients (rs) between mean ERS scores and caregiver sensitivity were reported in 13 studies (see Table 3). Mean r for the combined set of studies (k = 13, N = 1127) was .62 (CI .56–.67; p < .01), demonstrating a statistically significant positive association between scores on the ERS and caregiver sensitivity. After correcting for attenuation based on the reliabilities of both scales, the correlation was .73, showing strong convergence. Effects were not moderated by scale characteristics. Moderator effects for geographic region and type of scale could not be tested because k < 4. As a further check, Duval and Tweedie's (2002a, b) trim and fill approach revealed no asymmetry in the funnel plots; the absence of unmatched studies on the left side suggests that publication bias is unlikely. Orwin's fail-safe method showed that, using a "trivial" correlation of .10, another 86 unpublished studies with null effects would be needed to bring the combined effect size for the association between mean ERS scores and caregiver sensitivity below r = .10.
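For reference, the attenuation correction follows the classical disattenuation formula,

$$r_{\text{corrected}} = \frac{r_{\text{observed}}}{\sqrt{r_{xx}\, r_{yy}}},$$

where $r_{xx}$ and $r_{yy}$ denote the reliabilities of the ERS and the Arnett scale, respectively. As an illustrative check only (the study-level reliabilities are not reproduced here), the reported values imply $\sqrt{r_{xx} r_{yy}} \approx .62/.73 \approx .85$, that is, a product of the two reliabilities of roughly .72.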

Discussion

Across five geographic regions and 23 countries, average center-based child care quality as measured by the ERS was moderate: nearly 4 on a seven-point rating scale. The lowest quality levels (mean scores < 3) were reported in Bangladesh (Aboud 2006), the Netherlands Antilles (Meerdink and Schonenburg 2010), and South Korea (Sheridan et al. 2009). Good-quality care (mean scores > 5) was reported in Australia (Fenech et al. 2010; Skouteris et al. 2007). Our results, based on over 7700 observations, showed that, on average, levels of child care quality were higher in Australia than in all other geographic regions and that child care quality in North America was higher than in Europe, South America, and Asia. These findings were largely confirmed in a further analysis excluding studies that were not representative of child care centers for a particular country.

Differences in Process Quality Across Geographic Regions

The results of this meta-analysis provide the first opportunity for researchers and policymakers to take a global view of center-based quality for children under 5 years of age. Whilst the results for ERS scores are compelling, accounting for the observed differences in quality between geographic regions and for the variations in quality within geographic regions and countries requires consideration of multiple factors that could not be addressed within our analyses. Because the design of the current meta-analysis does not allow a cause-effect analysis, we can only speculate on possible influences on quality.

There are major differences across the countries included in this meta-analysis in policy focus, regulations, and subsidies with respect to child care. As noted in the UNICEF (UNICEF Innocenti Research Center 2008) benchmarks for early childhood education and care, government regulations and accreditation systems are key factors influencing quality. These vary from one country to another, as well as between states or provinces, and even municipalities or organizations, within a country. We note that 14 of the countries included in the current meta-analysis were also reported on in the UNICEF (2008) comparison of 25 OECD countries. Surprisingly, the countries achieving the highest scores for child care quality in our meta-analysis did not achieve a high ranking in the UNICEF list: The USA met three out of ten benchmarks; Australia achieved only two benchmarks; and Canada achieved one benchmark. In contrast, European countries met more benchmarks, but Europe had a lower average ERS score in our meta-analysis. It seems unlikely, therefore, that the higher scores on the ERS that we found in the USA and Australia can be explained in full by broad government policy initiatives (at least those identified by UNICEF).

Four additional issues are of importance when seeking to explain the observed differences in quality across countries. First, the ERS place importance on features of the physical environment (individual care, hygiene, space, furniture, equipment, and physical safety) as indicators of basic quality of care. These features may be given greater weight in the USA and Australia/New Zealand than in other countries. Alternatively, it may be more difficult to meet these standards in less prosperous countries or in countries in which national legislation, regulation, or policies do not stress the importance of these indicators. Consequently, the hierarchical ordering of items within the ERS, which requires basic aspects of hygiene and the physical environment to be met before other higher-order indicators can be assessed, may have resulted in lower scores. For example, ratings for some items will not exceed a score of 1 (inadequate) because requirements for specific hygienic procedures or furnishings have not been met, even though indicators higher up the scale might otherwise be assessed positively.

Second, child care centers in the USA and Australia/New Zealand may be expected to provide environments that support children's educational outcomes to a greater extent than centers in other countries. Lower scores on some items or subscales (e.g., Activities) may be obtained because the specified amount or variety of equipment for stimulating children's development (e.g., blocks, fine motor equipment, supplies for pretend play, sand and water) is not evident in daycare centers in Europe or other parts of the world (e.g., Vermeer et al. 2008).

Third, in some parts of the USA, the ERS have been seen as important tools for the evaluation and improvement of quality of care (e.g., Park et al. 2012). Although studies that specifically targeted centers involved in quality improvement programs were not included in the meta-analysis, it is conceivable that the expectation that child care organizations meet the ERS criteria is more widespread in the USA than in other countries. It may also be the case that the attention given to quality improvement in general in the USA, through state-based Quality Rating and Improvement Systems, has raised awareness of the importance of caregiver sensitivity in interactions with children. In turn, this may account for the observed difference in caregiver sensitivity, which, on average, was higher in North America than in Europe, as measured with the Arnett Caregiver Interaction Scale.

Fourth, studies examining the internal and cross-cultural validity of the ERS are limited. Gordon et al. (2013, 2015) showed that, among other things, the category ordering of the ECERS-R (response process validity) assumed by the structure of the scale is not consistently evident; that is, indicators attached to higher rating categories do not necessarily reflect higher quality. Furthermore, few studies have assessed how well the ERS items represent indicators of child care quality in countries outside the USA. An exception is the detailed consideration of linguistic, functional, cultural, and metric equivalence of the ERS in Chile and Bangladesh (Limlingan 2011).

Differences in Process Quality Within Geographic Regions

Although average ERS scores for child care quality in North America exceeded those in three other geographic regions, our analyses revealed higher variance in scores in North America than in Europe and Asia. Thus, in North America, scores on the ERS were less tightly clustered around the mean than in Europe and Asia. This higher variance in scores may be partly explained by a greater diversity in regulatory standards for child care across states within the USA. For example, in states with the most stringent child care center standards, specific training is required for caregivers and child–caregiver ratios and group sizes are very small. In other states, there are no educational requirements, one adult may supervise as many as 10 or 12 toddlers, and group sizes are unregulated (Clarke-Stewart and Allhusen 2005).

Australia, on the other hand, has a universal quality assurance mechanism for center-based child care that requires centers to meet standards for accreditation and quality improvement in order to qualify for parent subsidies for the cost of care (National Childcare Accreditation Council [NCAC] 1993). Harrison (2010) has argued that, as well as providing greater consistency in regulatory standards, NCAC processes have resulted in a higher minimum standard in Australia, as evidenced by ERS scores that rarely fall below 3.

Cultural and country-specific interpretations are, however, less likely to explain the observed difference in caregiver sensitivity, which, on average, was higher in North America than in Europe, as measured with the Arnett scale. Evidence that beliefs about ideal maternal sensitivity do not differ across the globe (Mesman et al. 2015) and that contingent responsiveness, an important aspect of sensitivity, is a universal component of parenting (Kärtner et al. 2010) suggests that these differences may be genuine rather than cultural. It should be noted, however, that specific expressions of sensitive responsiveness (such as face-to-face interactions) may differ across cultures (Kärtner et al. 2010) and that the Arnett scale is a global rating instrument that is not well suited to detect these specific expressions. In the current study, differences in mean caregiver sensitivity were small, and caregiver sensitivity was reported in only 19 studies (26 %), of which eight were from the USA. The design of our study does not allow conclusions about the origin of these differences, so we can only speculate about possible causes, such as the attention given to quality improvement in the USA noted above, which may have raised awareness of the importance of caregiver sensitivity in interactions with children.

ERS Process Quality, Scale Characteristics, and Structural Features

The results of our meta-analyses suggest that ERS process quality scores were not significantly associated with scale characteristics. This is useful psychometric information for researchers, providing evidence that different versions of the scales (type of scale, original vs revised, full vs shortened) can be used in within-study and across-study comparisons without strongly affecting or distorting the outcomes.

Results of our hypothesis testing showed that child care quality was not associated with group size, but was negatively associated with child–caregiver ratio: Process quality was somewhat higher when fewer children were under the care of one caregiver. Thus, the role of structural characteristics of the care setting (in this study, child–caregiver ratio and group size) in explaining quality differences is mixed: Ratio matters more than group size. It is important to note, however, that the findings for group size were inconsistent across the combined set of studies; for example, the overall low correlation (r = −.06) between ERS process quality and group size reflects the fact that the reported outcomes included both negative correlations (in the expected direction) and positive correlations (unexpected).

ERS Process Quality and Caregiver Sensitivity

Our meta-analysis showed that process quality was positively associated with caregiver sensitivity, as measured by the Arnett Caregiver Interaction Scale (r = .62). After correcting for attenuation, the estimated r reached .73, explaining about half of the variance. This is an important finding, given that the ERS have often been criticized for a perceived omission of process items and/or indicators. For example, Cassidy et al. (2005a) performed a content analysis of the ECERS-R at the indicator level and concluded that over half of the ECERS-R indicators measured structural rather than process quality. A specific feature of the ERS, however, is that process and structural features are intertwined within the items and indicators; the ERS items therefore do not allow process quality and structural quality to be disentangled.

Although the Arnett scale has been widely used to measure caregiver–child interactions, validation studies are scarce. Colwell et al. (2013) showed that the Arnett scale is not well suited to distinguish between caregivers who are "highly" versus "moderately" positive in their interactions with children and that the scale measures one substantive dimension (sensitive caregiver interaction) rather than four subscales. For the purpose of our meta-analysis, which was to examine whether the ERS are associated with more "proximal" measures of child care quality, the Arnett scale is an appropriate measure. It should be noted, however, that in some studies, data collection for the ERS and the Arnett scale may have been carried out by the same observers, which may have led to an overestimation of the correlation between process quality and caregiver sensitivity. Of key importance is that our meta-analysis showed that the ERS cover a core element universally recognized as required for children's positive development, namely sensitive and responsive care by one or more adults.

Limitations

A limitation of the current meta-analysis concerns the variation of studies in terms of representativeness for a particular country. We approached this problem in two ways. First, we excluded studies that selected only high-quality centers and studies that targeted specific populations. Second, we controlled for representativeness in additional moderator analyses that largely confirmed our results, showing that the level of representativeness did not, or only marginally, affect quality ratings.

Another limitation is that we included only group size and child–caregiver ratio as structural indicators and that the number of studies that could be included in our hypothesis testing was relatively small (e.g., 10 for ratio). These two indicators were selected because of their importance, but also because they are objectively measured, which allowed a reliable comparison across countries. Whilst other indicators, such as caregiver qualifications, education, and experience, have often been cited as key structural features influencing process quality (see Burchinal et al. 2002), these differ worldwide and are not measured consistently. Therefore, it was not possible to include them in our meta-analysis.

The same holds for structural features at the country level: The countries included in the current meta-analyses vary widely in terms of beliefs (e.g., child care as a universal right or a social welfare provision) and policies, which in turn may influence the quality of care that is offered. Also, the distinction between public and private centers could not be included as a moderator, because this distinction does not exist in every country, and if it does, definitions concerning public and private vary across countries and may change over time.

Conclusion

Taken together, this international meta-analysis has shown that center-based group care, as measured by the ERS, is of moderate quality on average, with higher quality levels in Australia/New Zealand and North America. Our results emphasize the ongoing need for efforts by government policymakers, early childhood service providers, and educators to enhance, or maintain high levels of, child care quality. This is especially salient in Asia, South America, and Europe. Our results suggest that scale characteristics are not responsible for differences in scores and that the ERS are related to indicators of proximal quality of care (caregiver sensitivity) and, to a lesser degree, structural quality of care (child–caregiver ratio).

This paper not only presents a state-of-the-art analysis of the ERS around the world, but also shows the need for further research focusing on psychometrics and cross-cultural issues. We believe that this work aligns with the benefits, identified by Limlingan (2011, p. 45), of making "cross-cultural comparisons using a common instrument" which, when "composed and utilized in the right way, provides a good method to facilitate discussions which allow us to learn from one another."