## Abstract

This article examines whether there are gender differences in understanding the emotions evaluated by the Test of Emotion Comprehension (TEC). The TEC provides a global index of emotion comprehension in children 3–11 years of age, which is the sum of the nine components that constitute emotion comprehension: (1) recognition of facial expressions, (2) understanding of external causes of emotions, (3) understanding of desire-based emotions, (4) understanding of belief-based emotions, (5) understanding of the influence of a reminder on present emotional states, (6) understanding of the possibility to regulate emotional states, (7) understanding of the possibility of hiding emotional states, (8) understanding of mixed emotions, and (9) understanding of moral emotions. We used the answers to the TEC given by 172 English girls and 181 boys from 3 to 8 years of age. First, the nine components into which the TEC is subdivided were analysed for differential item functioning (DIF), taking gender as the grouping variable. To evaluate DIF, the Mantel–Haenszel method and logistic regression analysis were used, applying the Educational Testing Service DIF classification criteria. The results show that the TEC did not display gender DIF. Second, once the absence of DIF had been corroborated, we analysed differences between boys and girls in the total TEC score and its components, controlling for age. Our data are compatible with the hypothesis of independence between gender and level of comprehension in 8 of the 9 components of the TEC. Several hypotheses are discussed that could explain the differences found between boys and girls in the Belief component. Given that the Belief component is basically a false belief task, the differences found seem to support findings in the literature indicating that girls perform better on this task.

## Introduction

Emotion understanding is an ability that refers to the way in which individuals understand, predict, and explain the feelings of others and oneself (Denham 1998; Harris 1989; Saarni 1999). Children with a good level of emotion understanding are more popular among their peers, have more friends (Denham et al. 1990), do better academically (Izard et al. 2001), and show lower levels of psychological problems, such as depression, bipolar disorder, and schizophrenia (for a review see Cicchetti et al. 1995) than children who have lower levels of emotion understanding.

Children undergo three basic levels of cognitive emotion understanding (Pons et al. 2004). From the ages of 3–5 years, children gain an understanding of external aspects of emotions such as learning to recognize facial expressions of emotions. From the ages of 5–7 years, children acquire a mentalistic emotion understanding. For children to acquire a mentalistic emotion understanding, they must develop a theory of mind (ToM), which is the ability to understand that others have thoughts and beliefs that differ from one’s own. Mentalistic emotion understanding includes emotions resulting from beliefs and desires. Finally, between the ages of 7 and 9 years, children understand that we can reflect on a situation from different perspectives (Pons et al. 2004).

Although children’s development of emotion understanding undergoes a specific developmental pattern, there are individual differences in children’s emotion understanding using different tests, such as the Test of Emotion Comprehension (TEC; Pons and Harris 2005) and Denham’s Emotion Understanding Test (Denham 1986; Martin and Green 2005). There are a number of factors (e.g., mothers’ emotion talk, children’s language skills) that predict these individual differences. One such factor is children’s gender (Fivush et al. 2000).

Much research has been devoted to understanding whether there are gender differences in emotion understanding. Many studies have found that girls tend to have a better emotion understanding than boys (Bosacki and Moore 2004 with a puppet task based on Capps et al. 1992; Brown and Dunn 1996 and Denham and Kochanoff 2002, based on Denham’s (1986) Affect Knowledge Test (AKT); Garner and Waajid 2008, based on a vignette-based task designed by Michalson and Lewis 1985). A few studies have found that boys score higher than girls on emotion understanding (Laible and Thompson 2000 with measures based on Denham’s (1986) AKT). Even more studies do not find gender differences in emotion understanding (Albanese et al. 2006 with the TEC, Bennett et al. 2005 with vignettes based on Michalson and Lewis 1985; Denham et al. 2012 and Hughes and Dunn 1998 with measures based on Denham’s (1986) AKT; Pons et al. 2004 with the TEC).

Part of the reason differences may not be found is that aggregating measures across different aspects of emotion understanding may mask gender differences in specific areas. For example, Aznar and Tenenbaum (2013) found no gender differences between 4-year-old children in overall emotion understanding as assessed by the TEC. However, 6-year-old boys scored higher than 6-year-old girls in understanding the situational causes of emotion, whereas 6-year-old girls scored higher on understanding reflective emotions than did 6-year-old boys. Thus, it seems that girls and boys might differ from each other in different types of emotion understanding at particular ages.

The TEC provides a global index of emotion comprehension in children 3 to 11 years of age, which is the sum of the nine components that constitute emotion comprehension: (1) recognition of facial expressions, (2) understanding of external causes of emotions, (3) understanding of desire-based emotions, (4) understanding of belief-based emotions, (5) understanding of the influence of a reminder on present emotional states, (6) understanding of the possibility to regulate emotional states, (7) understanding of the possibility of hiding emotional states, (8) understanding of mixed emotions, and (9) understanding of moral emotions (for a detailed description of the test, see Pons et al. 2004).

From a psychometric viewpoint, the TEC is a reliable and valid instrument, as shown by studies conducted to date. Thus, Pons et al. (2002) report good test–retest reliability after 3 months (*r* (18) = .84) and Pons and Harris (2005) a good test–retest correlation after a 13-month delay (*r* (40) = .64 and *r* (32) = .54). When internal consistency was used as a measure of reliability, using Cronbach’s alpha, all the values are in the range of .61 to .97: Albanese and Molina (2008), *α* = .79; Farina and Belacchi (2014), *α* = .76; Karstad et al. (2014), *α* = .61.

It should be noted that when items are not strictly parallel, or are dichotomous, Cronbach’s alpha coefficient provides a lower-bound estimate of true reliability. For this reason, some authors have used the theta and phi coefficients to estimate internal consistency reliability. Both coefficients provide an estimate of the maximum value of Cronbach’s coefficient alpha (Gadermann et al. 2008; Sun et al. 2007). Thus, Karstad et al. (2015), using the theta coefficient to assess reliability, obtained values of .82 and .91, and Karstad et al. (2014) obtained a value of .95 using the phi coefficient. Previous studies have shown that the nine components of the TEC meet the requirements for a Guttman scale. This means that the components of the TEC form an ordinal scale which can be ordered hierarchically in such a way that correctly responding to one component also implies a correct response to lower-order components. The scale is usually considered valid when the coefficient of reproducibility is over 0.9 and the consistency index is over 0.5. Both indices show to what extent the items form a perfect scale (Green 1956). Pons et al. (2004) found values of 0.904 and 0.68 for the reproducibility coefficient and the consistency index, respectively. Mokken scale analysis of TEC components also yielded satisfactory results (*H* = 0.40, *Rho* = 0.79; Albanese and Molina 2008). Furthermore, evidence of the criterion validity of the TEC can be found in Albanese and Molina (2008) and Pons et al. (2014).

An important component of validity studies is testing the invariance of the measurement instrument with respect to the variables which may be relevant for theoretical, ethical, or legal reasons. For these reasons, gender is one of the variables most commonly studied. In the case of the TEC, it should be ensured that a boy and a girl with the same level of emotion comprehension have the same probability of answering the test items correctly. If the items of the test do not comply with said invariance, we say that there is differential item functioning. The existence of differences between groups, which technically is called impact, should not be confused with DIF. DIF indicates a difference in item performance between boys and girls who have the same level of emotion comprehension, whatever the distribution of the ability between the groups. To the extent that the total score on the test is usually the sum of the scores of the items which comprise it, a large number of items with DIF against one group lead to scores which systematically undervalue this group. If we use this test to compare groups, the differences found might not correspond to real differences in the distribution of ability among groups.

There is an extensive corpus of psychometric research on the best statistical procedures for detecting DIF (for a review, see Osterlind and Everson 2009; Penfield and Camilli 2007). When the response to items is dichotomous (right/wrong or pass/fail), the sample size is small (*N* < 250 per group), and the DIF is uniform (the item favours the same group on all levels of the construct measured), the method of reference is the Mantel–Haenszel (MH) procedure. A limitation of this procedure is its inability to detect some types of non-uniform DIF (the item favours a group on low ability levels and is detrimental at high levels, and the opposite with the other group). Thus, it is recommended that the analysis be complemented with logistic regression, which is sensitive to non-uniform DIF. Given that the majority of research on emotion comprehension in children has relied on small sample sizes, the techniques mentioned above are the methods of choice in this field.

Once the TEC has been analysed for DIF, we are then able to examine whether there are differences between boys and girls in the different measures of emotion understanding provided by the TEC. Some studies which have used other measures of emotion understanding have indeed found differences in favour of girls (Bajgar et al. 2005; Bosacki and Moore 2004). However, most of the studies that use the TEC have not found statistically significant differences between boys and girls (Aldrich et al. 2011; Aznar and Tenenbaum 2013; Belacchi and Farina 2010; Farina and Belacchi 2014; Grazzani and Ornaghi 2012; Molina et al. 2014; Morra et al. 2011; Pons et al. 2004; Pons et al. 2002; Pons and Harris 2005; Pons et al. 2003; Pons et al. 2014; Tenenbaum et al. 2004). The majority of the cited studies used the total TEC score as the dependent variable and model-based methods for testing statistical significance. In contrast, this study will use the TEC components as the units of analysis because gender differences at the component level could be masked when using the total score (which is the sum of all the components) as the dependent variable. Moreover, we will use a randomization-based method for testing statistical significance.

In sum, there are no studies evaluating whether tests used to evaluate emotion comprehension are invariant with respect to a child’s gender. To fill this gap in the literature, the present study examines whether there are gender differences in the different components of the TEC, one of the most popular tests assessing emotion understanding in children. More specifically, we use the Mantel–Haenszel procedure and logistic regression to examine whether the TEC components display gender DIF.

## Method

### Participants

The participants of the present study were 353 typically developing children (181 boys and 172 girls), ranging from 3 to 8 years (\(M_{\rm boys}\) = 5.17, SD = 1.65; \(M_{\rm girls}\) = 5.16, SD = 1.56), from a number of playgroups, nurseries, and primary schools in the greater London, UK area and surrounding counties. They all lived within 1 h by train (up to 60 miles) of London. They were of broadly middle-class backgrounds (lower to upper-middle class). Table 1 describes the sample in terms of gender and age groups.

Participants were recruited on a volunteer basis. All parents signed an informed consent form.

### Procedure

The TEC was administered in a quiet room in the schools and nurseries by a trained researcher. Its administration typically lasted 10 min.

### Measures

Participants’ responses to the TEC can be scored in at least three ways. First, they can be scored according to its nine components. A maximum of 1 point is provided for each component. Components I (recognition) and II (external cause) each comprise five questions. Children receive a 1 on these two components if they answer four items out of five correctly. Components III (desire) and IX (moral) each comprise two questions, and children must answer both questions correctly to receive a 1 on these components. All the other components are represented by one question that is scored as pass or fail. Second, the TEC can be scored according to its subscales. The score obtained in each subscale ranges from 0 to 3, and is calculated by summing the scores obtained in each component belonging to the subscale. The external subscale includes the first three components: recognition, external cause, and desire. The mental subscale includes the next three components: belief, reminder, and regulation. The reflective subscale includes the last three components: hiding, mixed, and morality. Participants were also given a pass–fail classification for each subscale. A subscale is scored as passed when all the components of the set are correctly answered. Otherwise, the subscale is scored as failed. The third way of scoring the TEC is using its total score. The overall level of emotion understanding in the TEC is calculated by summing the components correctly answered. Thus, the total scale score ranges from 0 to 9. For a detailed description of the test and its scoring rules, see Pons et al. (2004).
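The three scoring schemes can be sketched in code. The following Python sketch is illustrative only: the component keys, the function name `score_tec`, and the input layout (a list of 0/1 item answers per component) are assumptions made for this example and are not part of the TEC materials. "Four items out of five" is read here as "at least four".

```python
from typing import Dict, List

# (number of questions, correct answers needed to pass) per component,
# following the scoring rules described in the text. Key names are hypothetical.
COMPONENT_RULES = {
    "recognition": (5, 4),
    "external_cause": (5, 4),
    "desire": (2, 2),
    "belief": (1, 1),
    "reminder": (1, 1),
    "regulation": (1, 1),
    "hiding": (1, 1),
    "mixed": (1, 1),
    "moral": (2, 2),
}

SUBSCALES = {
    "external": ["recognition", "external_cause", "desire"],
    "mental": ["belief", "reminder", "regulation"],
    "reflective": ["hiding", "mixed", "moral"],
}


def score_tec(answers: Dict[str, List[int]]) -> dict:
    """Score one child's 0/1 item answers by component, subscale, and total."""
    components = {}
    for name, (n_items, n_to_pass) in COMPONENT_RULES.items():
        items = answers[name]
        if len(items) != n_items:
            raise ValueError(f"{name}: expected {n_items} answers, got {len(items)}")
        components[name] = int(sum(items) >= n_to_pass)
    # Subscale score (0 to 3) and pass-fail classification (all three passed).
    subscale_scores = {s: sum(components[c] for c in cs) for s, cs in SUBSCALES.items()}
    subscale_pass = {s: int(v == 3) for s, v in subscale_scores.items()}
    return {
        "components": components,
        "subscale_scores": subscale_scores,
        "subscale_pass": subscale_pass,
        "total": sum(components.values()),  # 0 to 9
    }


# Hypothetical child, for illustration only.
child = {
    "recognition": [1, 1, 1, 1, 0],
    "external_cause": [1, 1, 1, 0, 0],
    "desire": [1, 1],
    "belief": [1],
    "reminder": [0],
    "regulation": [1],
    "hiding": [0],
    "mixed": [0],
    "moral": [1, 0],
}
result = score_tec(child)
print(result["total"], result["subscale_scores"])
```

The dichotomous component scores are the units used in the DIF analyses, whereas the 0 to 3 subscale scores and the 0 to 9 total are ordinal responses.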

### Data Analyses

#### Testing DIF. Mantel–Haenszel procedure (MH)

As mentioned in the introduction, DIF detection methods should compare individuals from each group who are on the same level of the construct measured, so as not to confuse impact with DIF. The MH procedure usually uses the total score as an estimate of the construct measured by the test. Therefore, the total TEC score is the stratification variable used to make the necessary group comparison (reference group = girls/focal group = boys). The logic behind the MH procedure is simple: if the variables group and response were independent, the odds of responding to the item correctly (*π*) rather than incorrectly (1 − *π*) would be equal in the reference (R) and focal (F) groups. That is,

\[\frac{{\pi _{\rm R}}}{{1 - \pi _{\rm R}}} = \frac{{\pi _{\rm F}}}{{1 - \pi _{\rm F}}}\]

The above equality can be expressed as a ratio such that the ratio of the odds, referred to as the odds ratio, will be 1. Assuming homogeneity of the odds ratios of each stratum, the MH measure of association is the common odds ratio estimator (\(\hat \alpha _{{\rm MH}}\)). \(\hat \alpha _{{\rm MH}}\) can be used as a measure of DIF effect size in a metric that varies between 0 and ∞. A value of 1 indicates independence between rows and columns (no DIF). \(\hat \alpha _{{\rm MH}}\) > 1 indicates DIF in favour of the reference group (girls) and \(\hat \alpha _{{\rm MH}}\) < 1 indicates DIF in favour of the focal group (boys).
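As an illustration of the estimator, the sketch below computes the standard Mantel–Haenszel common odds ratio from a set of 2 × 2 tables, one per total-score stratum. It is not the GMHDIF implementation; the function name, the tuple layout, and the counts are hypothetical and chosen only for the example.

```python
def mh_common_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio.

    strata: iterable of (ref_correct, ref_incorrect, focal_correct, focal_incorrect),
    one tuple per level of the stratification variable (e.g. total TEC score).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # empty stratum carries no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("common odds ratio is undefined for these tables")
    return numerator / denominator


# Hypothetical counts for three strata (not data from the study).
tables = [(8, 2, 6, 4), (12, 3, 10, 5), (15, 1, 14, 2)]
alpha_mh = mh_common_odds_ratio(tables)
print(round(alpha_mh, 3))  # a value above 1 favours the reference group

# Tables with identical odds in both groups give a common odds ratio of 1.
alpha_no_dif = mh_common_odds_ratio([(5, 5, 5, 5), (8, 2, 4, 1)])
```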

Holland and Thayer (1988) proposed the MH chi-square statistic, \(\chi _{{\rm MH}}^2\) (Mantel and Haenszel 1959), to test the null hypothesis of no DIF (\(\alpha _{{\rm MH}}\) = 1). The \(\chi _{{\rm MH}}^2\) statistic follows a chi-squared distribution with one degree of freedom. Simulation studies suggest that the \(\chi _{{\rm MH}}^2\) statistic without the continuity correction tends to be less conservative than with the continuity correction (Paek 2010). For this reason, we will compute \(\chi _{{\rm MH}}^2\) omitting the continuity correction.
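A minimal sketch of the test statistic, using the usual hypergeometric expectation and variance of the number of correct responses in the reference group within each stratum. The function name, the `continuity` flag, and the counts are assumptions for this example; the *p* value uses the identity that the upper tail of a chi-squared distribution with one degree of freedom equals `erfc(sqrt(x / 2))`.

```python
import math


def mh_chi_square(strata, continuity=False):
    """Mantel-Haenszel chi-square (1 df) and its p value.

    strata: iterable of (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    continuity: subtract 0.5 from the absolute deviation when True.
    """
    observed = expected = variance = 0.0
    for a, b, c, d in strata:
        n_ref, n_focal = a + b, c + d
        m_correct, m_incorrect = a + c, b + d
        n = n_ref + n_focal
        if n < 2:
            continue  # variance is undefined for fewer than two observations
        observed += a
        expected += n_ref * m_correct / n
        variance += n_ref * n_focal * m_correct * m_incorrect / (n ** 2 * (n - 1))
    if variance == 0:
        raise ValueError("no informative strata")
    deviation = abs(observed - expected)
    if continuity:
        deviation = max(deviation - 0.5, 0.0)
    chi2 = deviation ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts for three strata (not data from the study).
tables = [(8, 2, 6, 4), (12, 3, 10, 5), (15, 1, 14, 2)]
chi2_plain, p_plain = mh_chi_square(tables)
chi2_corrected, p_corrected = mh_chi_square(tables, continuity=True)
print(round(chi2_plain, 3), round(chi2_corrected, 3))
```

With the same tables, the corrected statistic is always the smaller of the two, which is why omitting the correction yields the less conservative test.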

In order to assess and identify DIF items, the Educational Testing Service (ETS) DIF classification criteria will be used (Zwick 2012). The categorical rating of the severity of DIF is based on both the statistical significance of the results and the size of the effect. Because of the skewness of the distribution of \(\hat \alpha _{{\rm MH}}\), it is more convenient to use its natural logarithm, \(\hat \lambda _{{\rm MH}} = \ln (\hat \alpha _{{\rm MH}})\). According to this classification,

DIF is negligible if \(\lambda _{{\rm MH}}\) is not significantly different from 0 (*p* ≥ .05) or \(\left| {\hat \lambda _{{\rm MH}}} \right| < 0.426\).

DIF is moderate if \(\lambda _{{\rm MH}}\) is significantly different from 0 (*p* < .05) and \(\left| {\hat \lambda _{{\rm MH}}} \right| \ge 0.426\) and either: (a) \(\left| {\hat \lambda _{{\rm MH}}} \right| < 0.638\), or (b) \(\left| {\lambda _{{\rm MH}}} \right|\) is not significantly greater than 0.426 (*p* ≥ .05).

DIF is large if \(\left| {\lambda _{{\rm MH}}} \right|\) is significantly greater than 0.426 (*p* < .05) and \(\left| {\hat \lambda _{{\rm MH}}} \right| \ge 0.638\).
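The three rules can be written as a small decision function. This is a sketch under stated assumptions: the function name is hypothetical, the standard error of \(\hat \lambda _{{\rm MH}}\) is supplied by the caller, and both significance tests use a normal approximation (two-sided against 0, one-sided against 0.426), which is one common way to operationalise the criteria rather than the procedure confirmed for this study.

```python
import math


def ets_dif_category(lambda_hat, se, low=0.426, high=0.638, alpha=0.05):
    """Classify DIF as 'negligible', 'moderate', or 'large' from the log odds ratio.

    lambda_hat: estimated log of the MH common odds ratio.
    se: its standard error (assumed to be available from the analysis).
    """
    size = abs(lambda_hat)
    # Two-sided test of H0: lambda = 0 (normal approximation).
    p_zero = math.erfc(size / se / math.sqrt(2.0))
    # One-sided test of H0: |lambda| <= low (normal approximation).
    p_low = 0.5 * math.erfc((size - low) / se / math.sqrt(2.0))

    if p_zero >= alpha or size < low:
        return "negligible"
    if size >= high and p_low < alpha:
        return "large"
    return "moderate"


# Hypothetical estimates and standard errors, for illustration only.
examples = [(0.20, 0.10), (0.50, 0.10), (0.90, 0.10), (0.90, 0.50), (0.70, 0.25)]
for lam, se in examples:
    print(lam, se, ets_dif_category(lam, se))
```

The last example shows why both conditions matter: an estimate above 0.638 is still rated moderate when its standard error is too large for it to be significantly greater than 0.426.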

A modification of the GMHDIF program (Fidalgo 2011a, b) was used to compute all the MH statistics.

#### Testing DIF. Logistic regression (LR)

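LR was first proposed for detecting DIF by Swaminathan and Rogers (1990). It assesses to what extent item scores (1 correct response, 0 incorrect response) can be predicted from total scores alone (no DIF, model 1), from total scores and group membership (uniform DIF, model 2), or from total scores, group membership, and the interaction between total scores and group membership (non-uniform DIF, model 3):

\[{\rm Model\ 1:}\quad \ln \left( {\frac{p}{{1 - p}}} \right) = \beta _0 + \beta _1 X\]

\[{\rm Model\ 2:}\quad \ln \left( {\frac{p}{{1 - p}}} \right) = \beta _0 + \beta _1 X + \beta _2 G\]

\[{\rm Model\ 3:}\quad \ln \left( {\frac{p}{{1 - p}}} \right) = \beta _0 + \beta _1 X + \beta _2 G + \beta _3 XG\]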

In our case, *ln* is the natural logarithm, *p* is the probability of a correct response to the studied component, *X* is the total TEC score, *G* is a dummy variable representing group membership (1 = reference group/girls, 0 = focal group/boys), *XG* is the interaction term between total TEC scores and group membership, and the *β*s are the parameters in the model. The strategy for evaluating DIF is based on the search for the most parsimonious model that best fits the data. To use LR for DIF analysis, Models 1, 2 and 3 were fit to the data using SPSS (version 18).
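The model comparison strategy can be illustrated with a self-contained sketch. The study fitted the models in SPSS; the code below instead uses a plain Newton–Raphson fit written for this example, on made-up data, so every name and number in it is hypothetical. It fits the three nested models and compares them with 1-d*f* likelihood-ratio chi-square tests.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_logistic(predictors, y, max_iter=50, tol=1e-10):
    """Fit a logistic regression with intercept; return (coefficients, log-likelihood)."""
    rows = [[1.0] + list(r) for r in predictors]
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(rows, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            for i in range(k):
                grad[i] += (yi - p) * xi[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * xi[i] * xi[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    loglik = 0.0
    for xi, yi in zip(rows, y):
        p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
        loglik += math.log(p) if yi == 1 else math.log(1.0 - p)
    return beta, loglik


# Made-up data: 10 children per total score (1..8) and group (1 = reference, 0 = focal).
x, g, y = [], [], []
for score in range(1, 9):
    for group in (0, 1):
        n_correct = round(10 / (1 + math.exp(-(0.6 * (score - 4.5) + 0.8 * group))))
        for child in range(10):
            x.append(score)
            g.append(group)
            y.append(1 if child < n_correct else 0)

b1, ll1 = fit_logistic([[xi] for xi in x], y)                                # model 1
b2, ll2 = fit_logistic([[xi, gi] for xi, gi in zip(x, g)], y)                # model 2
b3, ll3 = fit_logistic([[xi, gi, xi * gi] for xi, gi in zip(x, g)], y)       # model 3

# 1-df likelihood-ratio tests: uniform DIF (2 vs 1) and non-uniform DIF (3 vs 2).
chi2_uniform = max(2.0 * (ll2 - ll1), 0.0)
chi2_nonuniform = max(2.0 * (ll3 - ll2), 0.0)
p_uniform = math.erfc(math.sqrt(chi2_uniform / 2.0))
p_nonuniform = math.erfc(math.sqrt(chi2_nonuniform / 2.0))
print(round(b2[2], 3), round(p_uniform, 4), round(p_nonuniform, 4))
```

In practice a statistics package would be used for the fitting; the point of the sketch is only that each added term is judged by whether it improves fit over the more parsimonious model.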

LR also gives an estimate of the magnitude of uniform DIF, the \(\hat \beta _{\rm 2}\) coefficient calculated in model 2. The criteria for assessing the severity of DIF are the same as for the MH procedure, because \(\hat \lambda _{{\rm MH}}\) and \(\hat \beta _{\rm 2}\) are equivalent. That is, the ETS DIF classification system described above was applied (for more detailed information, see Monahan et al. 2007).

This study employs an additional measure of the magnitude of DIF based on Nagelkerke’s \(R^2\). This measure enables the magnitude of both uniform and non-uniform DIF to be estimated. Thus, non-uniform DIF is equal to the difference in Nagelkerke’s \(R^2\) between the non-uniform and uniform DIF models: \(\Delta R_{\rm N}^2 = R^2\left( {model\,3} \right) - R^2\left( {model\,2} \right)\). And uniform DIF is equal to: \(\Delta R_{\rm U}^2 = R^2\left( {model\,2} \right) - R^2\left( {model\,1} \right)\). The guidelines proposed by Jodoin and Gierl (2001) to quantify the magnitude of DIF are as follows:

Negligible DIF: \(\Delta R^2 < 0.035\)

Moderate DIF: \(0.035 \le \Delta R^2 \le 0.070\)

Large DIF: \(\Delta R^2 > 0.070\)

Following the criteria of Jodoin and Gierl (2001), an item is considered to have DIF if the *p* value of either 1-d*f* \(\chi ^2\) test is less than .05 and the corresponding \(\Delta R^2 \ge .035\).
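The effect size rule can be sketched as follows, using the standard definition of Nagelkerke’s \(R^2\) in terms of the log-likelihoods of an intercept-only model and the fitted model. The function names and the log-likelihood values are hypothetical and serve only to show the arithmetic.

```python
import math


def nagelkerke_r2(loglik_null, loglik_model, n):
    """Nagelkerke's R^2 from the null and fitted log-likelihoods and sample size n."""
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    maximum = 1.0 - math.exp(2.0 * loglik_null / n)
    return cox_snell / maximum


def jodoin_gierl_category(delta_r2):
    """Magnitude of DIF according to the Jodoin and Gierl (2001) cut-offs."""
    if delta_r2 < 0.035:
        return "negligible"
    if delta_r2 <= 0.070:
        return "moderate"
    return "large"


def shows_dif(p_value, delta_r2, alpha=0.05):
    """Flag DIF only when the test is significant and the effect is at least moderate."""
    return p_value < alpha and delta_r2 >= 0.035


# Hypothetical log-likelihoods for the null model and models 1 to 3 (n = 353).
n = 353
ll_null, ll_1, ll_2, ll_3 = -240.0, -200.0, -198.5, -198.0
r2_1 = nagelkerke_r2(ll_null, ll_1, n)
r2_2 = nagelkerke_r2(ll_null, ll_2, n)
r2_3 = nagelkerke_r2(ll_null, ll_3, n)
delta_uniform = r2_2 - r2_1
delta_nonuniform = r2_3 - r2_2
print(jodoin_gierl_category(delta_uniform), jodoin_gierl_category(delta_nonuniform))
```

Requiring both conditions keeps a statistically significant but trivially small change in \(R^2\) from being flagged as DIF.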

The reader can find a detailed description of LR for DIF analysis in Fidalgo et al. (2014).

#### Testing gender differences

The \(\chi _{{\rm MH}}^2\) statistic (Mantel and Haenszel 1959) and the Mantel test (Mantel 1963) were employed to examine whether there are statistically significant differences between boys and girls in the different measures of emotion comprehension provided by the TEC, while controlling for age. To do so, the responses on the TEC (response variable) of girls and boys (factor) were compared within the same age group (stratification variable or covariate). The null hypothesis (\(H_0\)) they test establishes that, in each stratum of the covariate (age), the response variable (TEC scores) is distributed randomly with respect to the gender of the children. That is, the answers on the TEC are independent of the child’s gender.

The analysis was conducted by applying the \(\chi _{{\rm MH}}^2\) statistic to dichotomous scores, such as the components or the subscales scored as a pass–fail classification. The \(\chi _{{\rm MH}}^2\) statistic follows a chi-squared distribution with one degree of freedom. When the response variable has more than two categories and is measured on an ordinal scale, the pertinent statistic is the Mantel test. Under \(H_0\), the Mantel test has approximately a chi-squared distribution with d*f* = (*R* − 1), where *R* is the number of groups. The choice of statistics included in the MH methodology, instead of an analysis of covariance (ANCOVA), which would be the most common parametric alternative, is determined by the non-randomized nature of the sample available. Model-based methods, such as ANCOVA, require that participants constitute a random sample of subjects from a well-defined population (Manly 2006; Zheng and Zelen 2008). Unfortunately, that is a very unrealistic assumption in this field of research. In contrast, MH statistics permit the use of samples of convenience because they do not assume a known sampling link to a larger reference population (Koch et al. 1980). This is possible because the \(H_0\) of interest (that the distribution of the responses is random with respect to the levels of the factor) induces a probabilistic structure (the multiple hypergeometric distribution) that allows its compatibility with the observed data to be judged without the need for external assumptions. More detailed information about this methodology and its use in the behavioral sciences can be found in Fidalgo (2005).
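For two groups and an ordinal response, the Mantel statistic can be sketched as below, using the usual expectation and variance of the sum of scores in one group under the multiple hypergeometric distribution. The function name, the data layout, and the counts are hypothetical; with two groups the statistic has one degree of freedom, and with scores 0 and 1 it reduces to \(\chi _{{\rm MH}}^2\) without the continuity correction.

```python
import math


def mantel_test(strata, scores):
    """Mantel test for two groups and an ordinal response, stratified (e.g. by age).

    strata: list of (group_a_counts, group_b_counts), each a list of counts per
            response category, one pair per stratum.
    scores: numeric score attached to each response category.
    """
    observed = expected = variance = 0.0
    for group_a, group_b in strata:
        n_a, n_b = sum(group_a), sum(group_b)
        n = n_a + n_b
        if n < 2 or n_a == 0 or n_b == 0:
            continue  # stratum carries no information about group differences
        totals = [a + b for a, b in zip(group_a, group_b)]
        s1 = sum(y * m for y, m in zip(scores, totals))
        s2 = sum(y * y * m for y, m in zip(scores, totals))
        observed += sum(y * a for y, a in zip(scores, group_a))
        expected += n_a * s1 / n
        variance += n_a * n_b * (n * s2 - s1 ** 2) / (n ** 2 * (n - 1))
    if variance == 0:
        raise ValueError("no informative strata")
    chi2 = (observed - expected) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail, chi-squared with 1 df
    return chi2, p_value


# Hypothetical subscale scores (0 to 3) for boys and girls in two age groups.
ordinal_strata = [
    ([6, 8, 4, 2], [4, 7, 6, 3]),
    ([2, 5, 8, 5], [1, 4, 8, 7]),
]
chi2_ordinal, p_ordinal = mantel_test(ordinal_strata, scores=[0, 1, 2, 3])

# Dichotomous case (incorrect, correct): same result as the MH chi-square.
binary_strata = [([4, 6], [2, 8]), ([5, 10], [3, 12]), ([2, 14], [1, 15])]
chi2_binary, p_binary = mantel_test(binary_strata, scores=[0, 1])
print(round(chi2_ordinal, 3), round(chi2_binary, 3))
```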

In addition to determining statistical significance, measures of effect size were used to evaluate the extent of the association between gender and the responses on the TEC. In the case of dichotomous responses, \(\hat \alpha _{{\rm MH}}\) was used, as described in the section on *Testing DIF*. When the response variable has more than two categories, the pertinent statistic is the Liu–Agresti estimator of the cumulative common odds ratio (\(\hat \psi _{{\rm LA}}\)) (Penfield and Algina 2003). It should be noted that \(\hat \psi _{{\rm LA}}\) is a generalization of \(\hat \alpha _{{\rm MH}}\) for this case (Liu and Agresti 1996).

## Results

The first psychometric property of the TEC evaluated was its internal consistency, which had a Cronbach’s alpha of .66. Next, the DIF analyses were conducted. Table 2 shows the \(\chi _{{\rm MH}}^2\) statistics and the related effect size measure (\(\hat \alpha _{{\rm MH}}\)), along with the results derived from the ETS DIF classification. As can be seen, none of the TEC components functions differentially by gender. Results were identical when LR was applied to detect uniform and non-uniform DIF (see Table 3). None of the components showed DIF, by either the ETS classification system or the criteria proposed by Jodoin and Gierl (2001).

The results of the analysis of the distribution of TEC scores are presented below (see Table 4). At the total test score level, we found statistically significant differences in favour of girls (Mantel test = 7.207, *p* = .007, \(\hat \psi _{{\rm LA}}\) = 1.691). In the analysis of subscales, we only found differences in the mentalistic subscale. At the component level, we only found statistically significant differences in the Belief component. When the effect size was evaluated, it was found that the odds of answering the Belief component correctly are estimated to be 1.75 times greater for girls than for boys, adjusting for age. If we reanalyse the mentalistic subscale, eliminating the Belief component from the calculation, there are no longer any statistically significant differences between boys and girls, whether scoring on the 0 to 2 scale (Mantel test = 1.343, *p* = .247, \(\hat \psi _{{\rm LA}}\) = 1.286) or dichotomously (\(\chi _{{\rm MH}}^2\) = 1.06, *p* = .301, \(\hat \alpha _{{\rm MH}}\) = 1.318). Equally, these differences decrease, although they remain statistically significant (*α* = .05), when the Belief component is eliminated from the total TEC score (Mantel test = 3.897, *p* = .048, \(\hat \psi _{{\rm LA}}\) = 1.464). It may therefore be concluded that the Belief component is largely responsible for the differences between boys and girls in the TEC scores.

## Discussion

Developed by the International Test Commission (ITC), the International Guidelines for Test Use are a set of guidelines that provide an international view on what constitutes “good practice” in test use. In Section 2.3 on issues of fairness in testing, the ITC recommends conducting DIF studies when tests are to be used with individuals from different groups (International Test Commission 2001). In fact, the study of differential item functioning is one of the routine stages in the construction and evaluation of tests in aptitude and educational testing. Unfortunately, in other areas of psychology, DIF analyses between groups that are subject to frequent comparison are not common. This is the case, for example, with the tests designed to evaluate emotion comprehension in children, and more specifically, with the TEC. Therefore, the first goal of this study was to determine whether the TEC components display gender DIF. The results indicate that none of the nine components of the TEC functions differentially in boys and girls. That is, children with the same level of emotion comprehension have the same probability of passing the component, regardless of their gender.

Next, we examined whether there are differences between boys and girls in the different measures of emotion comprehension provided by the TEC. To date, the study of gender differences has always been a secondary goal of studies employing the TEC. Furthermore, these studies have typically used the total TEC score as the dependent variable. When the subscales were analysed, we found statistically significant differences only in the Mentalistic subscale. An individual analysis of the various components showed that the differences between boys and girls on this subscale were due exclusively to the Belief component (see Table 4). Similarly, the Belief component is largely responsible for the differences between boys and girls in the total TEC scores.

There are several hypotheses that could explain the differences found. The first, and most general, is that girls have slightly earlier neurocognitive maturation that may serve ToM development which is at the base of much emotion comprehension (Thompson and Thornton 2014). In ToM studies reporting gender differences, the results have typically favoured girls (Calero et al. 2013; Devine and Hughes 2013). And more specifically, some research has shown better emotion comprehension by girls (Bajgar et al. 2005; Bosacki and Moore 2004), which is in accordance with the results found here (see Table 4 and Fig. 1).

This hypothesis of maturational differentiation would explain the small differences in favour of females in the total TEC score found across all ages. However, it would not explain why this difference is only statistically significant and of a relevant magnitude for the Belief component. The second explanation is much more specific and has to do with the differences between boys and girls in cognitive knowledge of false belief. In the TEC (Pons et al. 2004), children are first asked about a rabbit who cannot see a fox behind a bush. After being asked if the rabbit cannot see the fox (and being corrected if they are incorrect), children are asked how the rabbit feels. As accurately described by Morra et al. (2011), “the component ‘Belief’ of the TEC is similar to a classical false-belief task, because it involves (a) an element of factual information and (b) a representation of the protagonist’s state-of-knowledge, but in addition, the rabbit/fox problem also involves a third element (c) that represents the affective value of state (a) for the protagonist”. It seems that the attribution of emotions based on false beliefs is a task which is acquired later than cognitive knowledge of false belief (Bradmetz and Schneider 1999; de Rosnay et al. 2004), and that this can be partially explained in terms of a differential working memory load (Morra et al. 2011). As Harris (2008) argues, to pass false belief on this task, one must set aside knowledge of imminent danger. Given boys’ greater propensity for crying at a young age (Weinberg 1992), this finding suggests that boys continue to find it difficult to ignore knowledge of negative emotions. Nevertheless, the second hypothesis assumes the first hypothesis of brain maturational differences (Charman et al. 2002).

### Limitations

This study introduces DIF as a necessary part of the study of TEC validity and, by extension, of other tests and questionnaires designed to measure emotion comprehension. The data analysed are compatible with the hypothesis that the scores on the various TEC components are independent of the gender of the children evaluated. That is, the TEC does not show gender DIF. Methodologically, one of the limitations of our study is the use of age in years as the stratification variable. Clustering the children by age in years means that children who might be in different periods of maturation are grouped together. The use of months as a measure of age instead of years would no doubt increase the precision of the analyses.

These findings add to the accumulation of contradictory evidence in research on gender differences. Whereas in the expression of emotions there seem to be small but significant gender differences (Chaplin and Aldao 2013; Chaplin 2015), in the field of emotion comprehension the evidence is not so clear. Our data are compatible with the hypothesis of independence between gender and level of comprehension in 8 of the 9 components of the TEC. Given that the Belief component is basically a false belief task, the differences found seem to support findings in the literature indicating that girls perform better on this task (Charman et al. 2002; Devine and Hughes 2013) rather than studies that do not find gender differences (Hughes et al. 2011; Kolodziejczyk and Bosacki 2015). It should be stressed that the basis of our inferences is the randomization mechanism implicit in the MH tests and not random sampling from a target population. This study evaluated gender differences in emotion comprehension controlling for age. Other variables that might influence results, such as verbal ability or family characteristics (number of siblings, mother’s education), were not controlled for and could act as confounding variables. In sum, our findings suggest that on the majority of components of emotion understanding, boys’ and girls’ understanding is more similar than different.

## References

Albanese, O., Grazzani, I., Molina, P., Antoniotti, C., Arati, L., Farina, E., & Pons, F. (2006). Children’s emotion understanding: preliminary data from the Italian validation project of Test of Emotion Comprehension (TEC). *Toward emotional competences*, 39–53.

Albanese, O., & Molina, P. (2008). *Lo sviluppo della comprensione delle emozioni e la sua valutazione. La standardizzazione italiana del Test della Comprensione delle Emozioni (TEC)* [The development of emotion understanding and its evaluation. Italian standardization of the Test of Emotion Understanding (TEC)]. Milano, I: Unicopli.

Aldrich, N. J., Tenenbaum, H. R., Brooks, P. J., Harrison, K., & Sines, J. (2011). Perspective taking in children’s narratives about jealousy. *British Journal of Developmental Psychology*, *29*, 86–109.

Aznar, A., & Tenenbaum, H. R. (2013). Spanish parents’ emotion talk and their children’s understanding of emotion. *Frontiers in Psychology*, *4*.

Bajgar, J., Ciarrochi, J., Lane, R., & Deane, F. P. (2005). Development of the Levels of Emotional Awareness Scale for Children (LEAS-C). *British Journal of Developmental Psychology*, *23*, 569–586.

Belacchi, C., & Farina, E. (2010). Prosocial/hostile roles and emotion comprehension in preschoolers. *Aggressive Behavior*, *36*, 371–389.

Bennett, D. S., Bendersky, M., & Lewis, M. (2005). Antecedents of emotion knowledge: Predictors of individual differences in young children. *Cognition & Emotion*, *19*, 375–396.

Bosacki, S. L., & Moore, C. (2004). Preschoolers’ understanding of simple and complex emotions: Links with gender and language. *Sex Roles*, *50*(9–10), 659–675.

Bradmetz, J., & Schneider, R. (1999). Is Little Red Riding Hood afraid of her grandmother? Cognitive vs. emotional response to a false belief. *British Journal of Developmental Psychology*, *17*, 501–514.

Brown, J. R., & Dunn, J. (1996). Continuities in emotion understanding from three to six years. *Child Development*, *67*, 789–802.

Calero, C. I., Salles, A., Semelman, M., & Sigman, M. (2013). Age and gender dependent development of Theory of Mind in 6- to 8-years old children. *Frontiers in Human Neuroscience*, *7*.

Capps, L., Yirmiya, N., & Sigman, M. (1992). Understanding of simple and complex emotions in non‐retarded children with autism. *Journal of Child Psychology and Psychiatry*, *33*, 1169–1182.

Chaplin, T. M. (2015). Gender and emotion expression: A developmental contextual perspective. *Emotion Review*, *7*, 14–21.

Chaplin, T. M., & Aldao, A. (2013). Gender differences in emotion expression in children: A meta-analytic review. *Psychological Bulletin*, *139*, 735–765.

Charman, T., Ruffman, T., & Clements, W. (2002). Is there a gender difference in false belief development? *Social Development*, *11*, 1–10.

Cicchetti, D., Ackerman, B. P., & Izard, C. E. (1995). Emotions and emotion regulation in developmental psychopathology. *Development and Psychopathology*, *7*, 1–10.

de Rosnay, M., Pons, F., Harris, P. L., & Morrell, J. M. B. (2004). A lag between understanding false belief and emotion attribution in young children: Relationships with linguistic ability and mothers’ mental-state language. *British Journal of Developmental Psychology*, *22*, 197–218.

Denham, S. A. (1986). Social cognition, prosocial behavior, and emotion in preschoolers: Contextual validation. *Child Development*, 194–201.

Denham, S. A., McKinley, M., Couchoud, E. A., & Holt, R. (1990). Emotional and behavioral predictors of preschool peer ratings. *Child Development*, *61*, 1145–1152.

Denham, S. A. (1998). *Emotional development in young children*. New York: Guilford Press.

Denham, S., & Kochanoff, A. T. (2002). Parental contributions to preschoolers' understanding of emotion. *Marriage & Family Review*, *34*, 311–343.

Denham, S. A., Bassett, H. H., & Zinsser, K. (2012). Early childhood teachers as socializers of young children’s emotional competence. *Early Childhood Education Journal*, *40*, 137–143.

Devine, R. T., & Hughes, C. (2013). Silent films and strange stories: Theory of mind, gender, and social experiences in middle childhood. *Child Development*, *84*, 989–1003.

Farina, E., & Belacchi, C. (2014). The relationship between emotional competence and hostile/prosocial behavior in Albanian preschoolers: An exploratory study. *School Psychology International*, *35*, 475–484.

Fidalgo, Á. M. (2005). Mantel-Haenszel methods. In B. S. Everitt & D. C. Howell (Eds.), *Encyclopedia of Statistics in Behavioral Science* (Vol. 3, pp. 1120–1126). Chichester, England: Wiley & Sons Ltd.

Fidalgo, Á. M. (2011a). GMHDIF: A computer program for detecting DIF in dichotomous and polytomous items using generalized Mantel-Haenszel statistics. *Applied Psychological Measurement*, *35*, 247–249.

Fidalgo, Á. M. (2011b). A new approach for differential item functioning detection using Mantel-Haenszel methods. The GMHDIF program. *The Spanish Journal of Psychology*, *14*, 1018–1022.

Fidalgo, Á. M., Alavi, S. M., & Amirian, S. M. R. (2014). Strategies for testing statistical and practical significance in detecting DIF with logistic regression models. *Language Testing*, *31*, 433–451.

Fivush, R., Brotman, M. A., Buckner, J. P., & Goodman, S. H. (2000). Gender differences in parent–child emotion narratives. *Sex Roles*, *42*, 233–253.

Gadermann, A., Guhn, M., & Zumbo, B. D. (2008). An empirical comparison of Cronbach's alpha with ordinal reliability coefficients alpha and theta. *International Journal of Psychology*, *43*, 55.

Garner, P. W., & Waajid, B. (2008). The associations of emotion knowledge and teacher–child relationships to preschool children's school-related developmental competence. *Journal of Applied Developmental Psychology*, *29*, 89–100.

Grazzani, I., & Ornaghi, V. (2012). How do use and comprehension of mental-state language relate to theory of mind in middle childhood? *Cognitive Development*, *27*, 99–111.

Green, B. F. (1956). A method of scalogram analysis using summary statistics. *Psychometrika*, *21*, 79–88.

Harris, P. L. (1989). *Children and emotion: The development of psychological understanding*. Oxford: Basil Blackwell.

Harris, P. L. (2008). Children’s understanding of emotion. In M. Lewis, J. M. Haviland-Jones, & L. F. Barrett (Eds.), *Handbook of emotions* (3rd ed., pp. 320–331). Guilford Press.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), *Test validity* (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum Associates.

Hughes, C., & Dunn, J. (1998). Understanding mind and emotion: Longitudinal associations with mental-state talk between young friends. *Developmental Psychology*, *34*, 1026.

Hughes, C., Ensor, R., & Marks, A. (2011). Individual differences in false belief understanding are stable from 3 to 6 years of age and predict children’s mental state talk with school friends. *Journal of Experimental Child Psychology*, *108*, 96–112.

International Test Commission. (2001). International guidelines for test use. *International Journal of Testing*, *1*, 93–114.

Izard, C., Fine, S., Schultz, D., Mostow, A., Ackerman, B., & Youngstrom, E. (2001). Emotion knowledge as a predictor of social behavior and academic competence in children at risk. *Psychological Science*, *12*, 18–23.

Jodoin, M. G., & Gierl, M. J. (2001). Evaluating type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. *Applied Measurement in Education*, *14*, 329–349.

Karstad, S. B., Kvello, O., Wichstrom, L., & Berg-Nielsen, T. S. (2014). What do parents know about their children’s comprehension of emotions? Accuracy of parental estimates in a community sample of pre-schoolers. *Child: Care, Health and Development*, *40*, 346–353.

Karstad, S. B., Wichstrom, L., Reinfjell, T., Belsky, J., & Berg-Nielsen, T. S. (2015). What enhances the development of emotion understanding in young children? A longitudinal study of interpersonal predictors. *British Journal of Developmental Psychology*, *33*, 340–354.

Koch, G. G., Gillings, D. B., & Stokes, M. E. (1980). Biostatistical implications of design, sampling, and measurement to health science data analysis. *Annual Review of Public Health*, *1*, 163–225.

Kolodziejczyk, A. M., & Bosacki, S. L. (2015). Children’s understandings of characters’ beliefs in persuasive arguments: Links with gender and theory of mind. *Early Child Development and Care*, *185*, 562–577.

Laible, D. J., & Thompson, R. A. (2000). Mother–child discourse, attachment security, shared positive affect, and early conscience development. *Child Development*, *71*, 1424–1440.

Liu, I. M., & Agresti, A. (1996). Mantel-Haenszel-type inference for cumulative odds ratios with a stratified ordinal response. *Biometrics*, *52*, 1223–1234.

Manly, B. F. (2006). *Randomization, bootstrap and Monte Carlo methods in biology* (3rd ed.). New York: Chapman & Hall/CRC.

Mantel, N. (1963). Chi-square tests with one degree of freedom; extensions of the Mantel-Haenszel procedure. *Journal of the American Statistical Association*, *58*, 690–700.

Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies. *Journal of the National Cancer Institute*, *22*, 719–748.

Martin, R. M., & Green, J. A. (2005). The use of emotion explanations by mothers: Relation to preschoolers’ gender and understanding of emotions. *Social Development*, *14*, 229–249.

Michalson, L., & Lewis, M. (1985). What do children know about emotions and when do they know it? In *The socialization of emotions* (pp. 117–139). US: Springer.

Molina, P., Bulgarelli, D., Henning, A., & Aschersleben, G. (2014). Emotion understanding: A cross-cultural comparison between Italian and German preschoolers. *European Journal of Developmental Psychology*, *11*, 592–607.

Monahan, P. O., McHorney, C. A., Stump, T. E., & Perkins, A. J. (2007). Odds ratio, delta, ETS classification, and standardization measures of DIF magnitude for binary logistic regression. *Journal of Educational and Behavioral Statistics*, *32*, 92–109.

Morra, S., Parrella, I., & Camba, R. (2011). The role of working memory in the development of emotion comprehension. *British Journal of Developmental Psychology*, *29*, 744–764.

Osterlind, S. J., & Everson, H. T. (2009). *Differential item functioning*, Vol. 161. Thousand Oaks, CA: Sage Publications.

Paek, I. (2010). Conservativeness in rejection of the null hypothesis when using the continuity correction in the MH chi-square test in DIF applications. *Applied Psychological Measurement*, *34*, 539–548.

Penfield, R. D., & Algina, J. (2003). Applying the Liu-Agresti estimator of the cumulative common odds ratio to DIF detection in polytomous items. *Journal of Educational Measurement*, *40*, 353–370.

Penfield, R. D., & Camilli, G. (2007). Differential item functioning and item bias. In C. R. Rao & S. Sinharay (Eds.), *Handbook of statistics* (Vol. 26, pp. 125–167). Amsterdam: Elsevier.

Pons, F., de Rosnay, M., Bender, P. K., Doudin, P.-A., Harris, P. L., & Gimenez-Dasi, M. (2014). The impact of abuse and learning difficulties on emotion understanding in late childhood and early adolescence. *Journal of Genetic Psychology*, *175*, 301–317.

Pons, F., & Harris, P. L. (2005). Longitudinal change and longitudinal stability of individual differences in children’s emotion understanding. *Cognition & Emotion*, *19*, 1158–1174.

Pons, F., Harris, P. L., & de Rosnay, M. (2004). Emotion comprehension between 3 and 11 years: Developmental periods and hierarchical organization. *European Journal of Developmental Psychology*, *1*, 127–152.

Pons, F., Harris, P. L., & Doudin, P. A. (2002). Teaching emotion understanding. *European Journal of Psychology of Education*, *17*, 293–304.

Pons, F., Lawson, J., Harris, P. L., & de Rosnay, M. (2003). Individual differences in children’s emotion understanding: Effects of age and language. *Scandinavian Journal of Psychology*, *44*, 347–353.

Saarni, C. (1999). *The development of emotional competence*. New York: Guilford Press.

Sun, W., Chou, C. P., Stacy, A. W., Ma, H., Unger, J., & Gallaher, P. (2007). SAS and SPSS macros to calculate standardized Cronbach’s alpha using the upper bound of the phi coefficient for dichotomous items. *Behavior Research Methods*, *39*, 71–81.

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. *Journal of Educational Measurement*, *27*, 361–370.

Tenenbaum, H. R., Visscher, P., Pons, F., & Harris, P. L. (2004). Emotional understanding in Quechua children from an agro-pastoralist village. *International Journal of Behavioral Development*, *28*, 471–478.

Thompson, R. B., & Thornton, B. (2014). Gender and theory of mind in preschoolers’ group effort: Evidence for timing differences behind children’s earliest social loafing. *Journal of Social Psychology*, *154*, 475–479.

Weinberg, M. K. (1992). Sex differences in 6-month-old infants' affect and behavior: Impact on maternal caregiving. Unpublished doctoral dissertation.

Zheng, L., & Zelen, M. (2008). Multi-center clinical trials: Randomization and ancillary statistics. *The Annals of Applied Statistics*, *2*, 582–600.

Zwick, R. (2012). A review of ETS differential item functioning assessment procedures: Flagging rules, minimum sample size requirements, and criterion refinement. *ETS Research Report Series*, *2012*, 1–30.

### Author Contributions

A.M.F.: designed the study; analyzed the data; wrote the results; collaborated in writing and editing of the final manuscript. H.R.T.: collaborated in the writing and editing of the final manuscript; coordinated the data collection. A.A.: collaborated in the writing and editing of the final manuscript; executed the data collection.

## Ethics declarations

### Conflict of Interest

The authors declare that they have no competing interests.

### Ethical Approval

The Faculty of Health and Medical Sciences at the University of Surrey granted ethical approval to the data collection and all data collection procedures have been performed in accordance with the ethical standards as laid down in the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards.

### Informed Consent

Letters describing the study to parents were sent home through the children’s schools. Parents provided written consent and their children gave verbal assent before being interviewed.

## Rights and permissions

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## About this article

### Cite this article

Fidalgo, A.M., Tenenbaum, H.R. & Aznar, A. Are There Gender Differences in Emotion Comprehension? Analysis of the Test of Emotion Comprehension. *J Child Fam Stud* **27**, 1065–1074 (2018). https://doi.org/10.1007/s10826-017-0956-5

### Keywords

- Emotion understanding
- Test of Emotion Comprehension
- Gender differences
- Differential item functioning
- False belief task