Introduction

Attention deficit hyperactivity disorder (ADHD) is characterized by difficulties with both inattention and hyperactivity or impulsiveness that interfere with a child’s daily functioning. At school, children with ADHD have, for example, difficulty remaining in their seats and paying attention for a longer period of time. Oppositional defiant disorder (ODD) is characterized by hostile and defiant behavior towards authority figures that goes beyond normal childhood behavior. Children with ODD argue with their teacher and often lose their temper (American Psychiatric Association 2000). Numerous studies have found a negative association between ADHD and educational achievement (Polderman et al. 2010) and children with ODD receive lower grades at school (Greene et al. 2002). Children with ADHD and children with ODD are both more likely to attend specialized schools.

The American Psychiatric Association (APA) estimates that 3–7 % of all school-aged children are diagnosed with ADHD, while estimates of the prevalence of ODD in children range from 2 to 16 % (American Psychiatric Association 2000). It must be noted that more than 50 % of the children diagnosed with ADHD also have ODD (Angold et al. 1999; Wilens et al. 2002). In the general population, the ratio between boys and girls with ADHD is estimated to be 3:1, while the ratio is higher in a clinical population (Gaub and Carlson 1997). A potential explanation of the discrepancy in the ratio between boys and girls at the population versus the clinical level is bias in the ratings of the teacher (Abikoff et al. 2002; Derks et al. 2007b; Sciutto et al. 2004), because one criterion for a Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) diagnosis is that symptoms are present in at least two settings, and the evaluation of the teacher is often taken into account. In a study focusing on children diagnosed with ADHD (Derks et al. 2007b), teachers reported more disruptive behavior at school for boys than for girls, while there was no difference in mother ratings. For ODD, teachers also report higher prevalence rates in boys than in girls while parents do not (Meisel et al. 2013). To further complicate matters, teacher bias may depend on the teacher’s gender. An alternative explanation of the discrepancy is that the gender differences in ADHD and ODD behavior are more pronounced in the school environment, which may demand more of a child than the home environment.

When analyzing questionnaire data concerning psychiatric disorders, researchers often use sum scores to combine multiple items of a scale. A meaningful interpretation of a sum score is only possible when a scale measures the same disorder in all specified groups. Mellenbergh (1989) defined measurement invariance (MI) with respect to group as an identical distribution of the observed sum score, conditional on the disorder that the test measures, across groups. The interpretation of group differences with respect to sum scores is only meaningful when MI holds for the scale (Slof-Op ‘t Landt et al. 2009). MI does not hold, for example, if boys score on average higher on some of the items than girls without actually scoring higher on the underlying disorder. In this case, a boy and a girl who have the same degree of a disorder obtain systematically different sum scores. Group differences in the sum score will then reflect measurement bias instead of true underlying differences (Dolan 2000; Mellenbergh 1989; Meredith 1993; Millsap and Yun-Tein 2004).
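The consequence of a violation of MI can be illustrated with a minimal numerical sketch. It assumes a single ordinal item driven by a latent response with a standard normal residual; the threshold values and the function name `expected_item_score` are hypothetical and are not estimates from this study.

```python
from statistics import NormalDist

def expected_item_score(eta, thresholds, loading=1.0):
    """Expected score on an ordinal item given the latent score eta.

    The observed category is the number of thresholds exceeded by the
    latent response y* = loading * eta + e, with e ~ N(0, 1).
    """
    std_normal = NormalDist()
    # E[Y] = sum over thresholds of P(y* > tau) = 1 - Phi(tau - loading * eta)
    return sum(1.0 - std_normal.cdf(tau - loading * eta) for tau in thresholds)

# Hypothetical thresholds (assumption): the same for both groups ...
invariant = [0.5, 1.5]
# ... versus thresholds that are shifted downwards for boys only.
boys_shifted = [0.2, 1.2]

eta = 1.0  # a boy and a girl with the same latent level of the disorder
girl_score = expected_item_score(eta, invariant)
boy_score_invariant = expected_item_score(eta, invariant)
boy_score_biased = expected_item_score(eta, boys_shifted)
```

With invariant thresholds the two children have the same expected item score; with the shifted thresholds the boy is expected to score higher even though his latent level is identical.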

Behavioral genetic studies have established that ADHD is amongst the most heritable psychiatric childhood disorders. According to a review of 20 twin studies, the mean estimate of the heritability of ADHD in children is over 75 % (Faraone et al. 2005). Estimates for ODD are somewhat lower with a heritability of around 50 % (Hudziak et al. 2005). Heritability estimates of problem behavior in primary school children vary widely between twins taught in the same classroom compared to twins with different teachers (Saudino et al. 2005). It is a general finding that twin correlations are larger when one teacher rates both children compared to when two teachers each rate one child. One hypothesis is that ratings could be biased due to the same person rating both children when twins are taught in the same classroom. Each teacher has his or her own perception on behavior, which can make children seem more similar when they have the same teacher (Kan et al. 2013; Simonoff et al. 1998). The second hypothesis is that there is gene-environment (GxE) interaction (Eaves 1984), which holds that the variation in the behavior of children in different classroom environments may depend on their genetic make-up. The classroom environment, teacher characteristics and peers differ when the twins do not share a classroom in primary school, and different environments might trigger different behavior depending on a child’s genes. A study of internalizing and externalizing behavior in primary school children concluded that this was the case, and that the heritability was higher in children sharing a classroom compared to children in different classrooms because of GxE interaction (Lamb et al. 2012). The question is whether this is also true for ODD and ADHD behavior and which differences between classrooms play a role.

In behavioral genetic studies, the absence of MI may have important consequences for heritability estimates. Absence of MI for an environmental factor, for example, gender of the teacher, could lead to differences in heritability estimates between groups (GxE interaction). Absence of MI for student’s gender may lead to what is known as scalar sex limitation: the effect of the genetic and environmental factors may, for example, be larger in boys than in girls (Lubke et al. 2004; Neale et al. 2006). The short Conners’ Teacher Rating Scale–Revised (CTRS-R) is often filled out by teachers to assess ODD and ADHD behavior in a school setting (Conners et al. 1998). The scales of this instrument have been tested for MI in 7-year-old boys and girls (Derks et al. 2007a), showing no evidence for measurement bias regarding the gender of the student. However, that study did not take into account possible differences between male and female teachers in the perception of ODD and ADHD behavior, nor did it evaluate MI at older ages. Therefore, the first objective of this study is to determine whether the scales of the CTRS-R, measuring ODD and ADHD behavior, are measurement invariant for gender of the student as well as gender of the teacher throughout primary school. When MI holds, the second objective of this study is to focus on GxE interaction, and investigate whether classroom sharing, gender of the student and gender of the teacher moderate the heritability of teacher-rated ODD and ADHD behavior.

Methods

Participants

The Netherlands Twin Register (NTR), established around 1987 by the Department of Biological Psychology at the VU University Amsterdam, registers approximately 40 % of all multiple births in the Netherlands. A survey about the development of the children is sent to the parents of the twins every 2 years until the twins are 12 years old (Boomsma et al. 2002, 2006; van Beijsterveldt et al. 2013). Since 1999, when the twins are approximately 7, 9 and 12 years old and attend primary school, parents are asked for their consent to approach the teacher(s) of their children with a survey. The survey sent to the primary school teachers includes items on background information of the teacher, functioning at school and educational achievement, as well as two standardized questionnaires: the Teacher Report Form (TRF) (Achenbach 1991) and the short version of the Conners’ Teacher Rating Scale–Revised (CTRS-R) (Conners 2001).

Since 2001, data collection has yielded surveys with information on gender of the teacher for 9,365 7-year-olds, 8,775 9-year-olds and 6,649 12-year-olds. We excluded children who had a disease or handicap that interfered severely with daily functioning (Age 7: N = 97; Age 9: N = 128; Age 12: N = 95) or who attended specialized education, that is, special schools for children with extra needs (Age 7: N = 109; Age 9: N = 237; Age 12: N = 226). Surveys were excluded if they were filled out by more than one teacher (Age 7: N = 431; Age 9: N = 259; Age 12: N = 83), filled out by someone other than the regular teacher (Age 7: N = 64; Age 9: N = 68; Age 12: N = 57), or if familiarity with the student was below average (Age 7: N = 53; Age 9: N = 62; Age 12: N = 34). This resulted in a total sample for the MI analyses of 8,611 surveys for 7-year-olds, 8,021 surveys for 9-year-olds and 5,954 surveys for 12-year-olds.

The sample for the GxE interaction analyses included complete phenotype data for most twin pairs (Age 7: N = 3,793; Age 9: N = 3,470; Age 12: N = 2,534). Incomplete data are due to only one of the teachers returning the survey. The sample consisted of 1,208, 1,102, and 762 twin pairs of opposite sex for respectively age 7, 9 and 12. For the same-sex twin pairs (Age 7: N = 2,585; Age 9: N = 2,368; Age 12: N = 1,772), determination of zygosity status was based on blood or DNA polymorphisms (Age 7: N = 224; Age 9: N = 331; Age 12: N = 393) or on the basis of parental report of items on resemblance in appearance and confusion of the twins by parents and others (Age 7: N = 2,321; Age 9: N = 1,987; Age 12: N = 1,356). This last method established zygosity with an accuracy of approximately 93 % (Rietveld et al. 2000). Zygosity was unavailable for some twins and these twin pairs were excluded from the analyses (Age 7: N = 40; Age 9: N = 50; Age 12: N = 23).

Measures

The short Conners’ Teacher Rating Scale–Revised (CTRS-R) is a measurement instrument to assess ODD and ADHD behavior at school. Teachers had to indicate whether a child displayed a certain type of behavior currently or in the prior month. The short version of the CTRS-R consists of 28 items scored on a 4-point scale from 0 (not true or never) to 3 (completely true or very often) (Conners et al. 1998; Conners 2001). The CTRS-R includes 4 scales measuring Oppositional Behavior (OPP, 5 items), Cognitive Problems/Inattention (ATT, 5 items), Hyperactivity (HYP, 7 items) and Attention Deficit Hyperactivity Disorder Index (ADHD, 12 items). One item is included in both the HYP and ADHD scale (‘Easily excited, impulsive’). The item ‘Inattentive, gets distracted easily’ of the ADHD scale was excluded from the MI analyses as it was highly correlated with some of the other items, especially ‘Easily distracted or difficulty maintaining attention’ (Age 7: r = 0.812; Age 9: r = 0.805; Age 12: r = 0.789) and ‘Short attention span’ (Age 7: r = 0.777; Age 9: r = 0.716; Age 12: r = 0.745). When this item was included, the more stringent MI models did not converge due to multicollinearity. For the GxE interaction analyses, a sum score of a scale was computed when there was at most one missing item (OPP, ATT and HYP) or at most two missing items (ADHD) for a scale. Missing items were imputed by the rounded average item score of the scale for that child. The sum scores of the scales showed an L-shaped distribution and therefore the data were square root transformed prior to the analyses.
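The scoring rule described above can be sketched as follows. This is a minimal illustration, not the authors’ scoring script: the function name `scale_sum_score` and the example item values are hypothetical, and the rounding rule is an assumption because the text does not specify how ties are rounded.

```python
import math

def scale_sum_score(items, max_missing=1):
    """Square-root-transformed sum score, or None if too many items are missing.

    items: item scores (0-3) for one child on one scale, None for a missing item.
    max_missing: 1 for OPP, ATT and HYP; 2 for ADHD (as stated in the text).
    """
    observed = [x for x in items if x is not None]
    n_missing = len(items) - len(observed)
    if n_missing > max_missing or not observed:
        return None
    # Impute each missing item with the rounded mean of the observed items.
    # Assumption: Python's round() is used, which rounds halves to even.
    fill = round(sum(observed) / len(observed))
    total = sum(observed) + fill * n_missing
    # Square root transformation to reduce the L-shaped skew.
    return math.sqrt(total)

# Hypothetical OPP responses with one missing item.
opp_score = scale_sum_score([1, 0, None, 2, 1], max_missing=1)
# Two missing items exceed the limit for a 5-item scale.
too_sparse = scale_sum_score([0, None, None, 1, 0], max_missing=1)
```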

Statistical analyses

Measurement invariance

The factor structure of the four CTRS-R scales was investigated with exploratory factor analyses (EFA) with an Oblimin rotation. The number of latent factors was decided based on the scree plot and eigenvalues (larger than 1) of the factors. To test whether the scales of the CTRS-R were MI across student (‘boy’ or ‘girl’) gender and teacher (‘male’ or ‘female’) gender, multigroup (4 groups) confirmatory factor analyses (CFA) for ordinal item level data were carried out (Dolan 2000; Meredith 1993; Millsap and Yun-Tein 2004) using Mplus Version 6.1 (Muthén and Muthén 2010). With ordinal item level data an underlying continuously distributed liability is assumed and thresholds that categorize the disorder are estimated based on the response frequencies (Flora and Curran 2004). Because of the low frequencies of the most extreme response categories, the highest two response categories were combined. The EFA and CFA models were fitted with the Theta parameterization and the weighted least squares with mean variance adjusted (WLSMV) estimator. Correction for dependency of the observations due to family clustering was done by the ‘complex’ option. This ‘complex’ option computes the standard errors and a χ2 of model fit taking into account this dependency.
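How thresholds follow from response frequencies can be sketched in a few lines, assuming a standard normal liability for a single item. The counts are hypothetical and the helper names are illustrative; in the actual analyses the thresholds were estimated in Mplus as part of the full factor model.

```python
from statistics import NormalDist

def collapse_top_categories(counts):
    """Merge the two highest response categories into one."""
    return counts[:-2] + [counts[-2] + counts[-1]]

def thresholds_from_counts(counts):
    """Thresholds on a standard normal liability implied by category counts.

    Assumes the cumulative proportions lie strictly between 0 and 1;
    otherwise inv_cdf raises an error.
    """
    total = sum(counts)
    std_normal = NormalDist()
    thresholds = []
    cumulative = 0
    for count in counts[:-1]:  # K categories give K - 1 thresholds
        cumulative += count
        thresholds.append(std_normal.inv_cdf(cumulative / total))
    return thresholds

# Hypothetical counts for the responses 0, 1, 2 and 3 on one item.
counts = [700, 200, 80, 20]
collapsed = collapse_top_categories(counts)
thresholds = thresholds_from_counts(collapsed)
```

After collapsing, three categories remain, so two thresholds are obtained for the item.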

Different levels of MI were tested by constraining the model parameters step by step. The first level is configural invariance (configural MI), where the factor structure is the same across groups. Factor means are fixed to zero for identification purposes while factor variances, thresholds, loadings and residual variances of the continuous latent response variables are group specific. One of the factor loadings is constrained to be equal to 1 for scaling purposes. A stricter model is strong factorial invariance (strong MI), where differences in latent response means are the result of differences in the latent factor means. This model is tested by constraining both the factor loadings and thresholds to be equal across groups. The factor mean of the first group is fixed to zero and freely estimated in the other groups. The last model, strict factorial invariance (strict MI) implies that the differences in the latent response means reflect true differences in the latent factor means and variances. This is tested by constraining the factor loadings, thresholds and residual variances of the continuous latent response variables to be equal across all groups. The factor mean is still fixed to zero in the first group and freely estimated in the other groups (Dolan 2000; Mellenbergh 1989; Meredith 1993; Millsap and Yun-Tein 2004).

The root mean square error of approximation (RMSEA) and the comparative fit index (CFI) were chosen as indices of model fit. An RMSEA value smaller than 0.05 indicates a good fit, as does a CFI value of 0.97 or higher (Schermelleh-Engel and Moosbrugger 2003). With the WLSMV estimator, the difference in χ2 values between two nested models is not distributed as a χ2, and as a consequence regular χ2 difference testing is not appropriate (Muthén and Muthén 2010). Instead, the ‘difftest’ option in Mplus can be used to obtain a correct χ2 difference test by using the derivatives of the variables from both models. Due to the large sample sizes, these χ2 difference tests might reject a model on the basis of a significant χ2 difference even though the model actually fits. Interpreting the χ2 as a goodness-of-fit index has been suggested as an alternative to using the χ2 as a formal test statistic. Although there are no absolute standards, a ratio of χ2 to degrees of freedom of 2 and 3 is proposed to be indicative of, respectively, a good and an acceptable model fit (Schermelleh-Engel and Moosbrugger 2003). Therefore, a difference in χ2 of more than 3 times the difference in estimated parameters was interpreted as a worsening of the fit of the model. In addition, we looked at the parameter estimates and the magnitude of the modification indices to make reliable decisions on acceptance of MI.
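The decision rules in this paragraph can be written down as a small sketch. The function names and the example values are hypothetical; the cut-offs are the ones quoted in the text, and the final decision in the study also relied on parameter estimates and modification indices, which this sketch does not cover.

```python
def fit_is_good(rmsea, cfi):
    """Good fit according to the quoted cut-offs: RMSEA < 0.05 and CFI >= 0.97."""
    return rmsea < 0.05 and cfi >= 0.97

def fit_worsened(delta_chi2, delta_params, ratio=3.0):
    """True when the chi-square difference between two nested models exceeds
    `ratio` times the difference in the number of estimated parameters."""
    return delta_chi2 > ratio * delta_params

# Hypothetical comparison of a strong MI model against a configural MI model.
good = fit_is_good(rmsea=0.04, cfi=0.98)
worse = fit_worsened(delta_chi2=25.0, delta_params=10)
```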

Gene-environment interaction models

The contribution of genetic and environmental effects to the variance of the CTRS-R scales was estimated in a classical twin model (Boomsma et al. 2002; Plomin et al. 2008) in the R (R Core Team 2014) package OpenMx Version 3.1.0 (Boker et al. 2011, 2012) with maximum likelihood estimation. First, a saturated model was fitted to the data in which means, variances and covariances were estimated in the different zygosity-by-gender groups rated by same (ST) and different (DT) teachers. Mean and variance differences between children taught by male and female teachers, between boys and girls, between children sharing a classroom or in different classrooms and across zygosity were tested in the saturated model. It was tested whether the twin correlations could be equated between twins sharing a classroom and twins in different classrooms.

Next, GxE interaction models for gender of the student, classroom sharing and gender of the teacher were fitted to the data. GxE interaction was modelled by using multiple group designs for classroom sharing and gender of the student, and by a moderation model for teacher’s gender (Fig. 1) (Purcell 2002). The models included additive genetic effects (A), dominant genetic effects (D) (or common environmental effects (C), shared by twins) and unique environmental effects (E), not shared by twins. To correct for possible confounding by gene-environment correlation (rGE), means were allowed to be different between boys and girls, between twins rated by the same or different teachers and between children rated by male or female teachers (Purcell 2002). In the first models, differences in heritability between boys and girls were tested by constraining the estimates to be equal over gender of the student. Total variances between boys and girls were allowed to differ. Next, it was tested whether estimates could be constrained to be equal for twins rated by the same and by different teachers. Differences in genetic and environmental variance between the same and different teacher groups could be due to GxE interaction, but may also be the result of rater bias. Therefore, a correlated errors model was applied, which is an extension of the univariate twin model as it allows the unique environmental (E) effects to be correlated for twin pairs rated by the same teacher (Simonoff et al. 1998). In the last models, GxE interaction by gender of the teacher was tested by dropping from the model the moderation of the A, D (C) and E estimates by gender of the teacher.
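The expectations behind such a twin model can be sketched as follows. The coefficients (MZ twins share all additive and dominant genetic effects, DZ twins on average half of the additive and a quarter of the dominant genetic effects) are the standard ones for the classical twin design; the function name, the `error_corr` parameter used to represent the correlated errors model, and the example values are assumptions for illustration and not the OpenMx specification used in the study.

```python
def expected_twin_covariances(a, d=0.0, c=0.0, e=0.0, error_corr=0.0):
    """Expected variance and MZ/DZ covariances for variance components.

    a, d, c, e: additive genetic, dominant genetic, common environmental and
    unique environmental variance.
    error_corr: correlation between the unique environmental effects of the
    two twins (non-zero only for pairs rated by the same teacher).
    """
    variance = a + d + c + e
    shared_error = error_corr * e
    cov_mz = a + d + c + shared_error
    cov_dz = 0.5 * a + 0.25 * d + c + shared_error
    return variance, cov_mz, cov_dz

# Hypothetical ADE components for twins rated by different teachers ...
var_dt, mz_dt, dz_dt = expected_twin_covariances(a=0.5, d=0.2, e=0.3)
# ... and the same components with correlated errors for same-teacher pairs.
var_st, mz_st, dz_st = expected_twin_covariances(a=0.5, d=0.2, e=0.3, error_corr=0.5)
```

In this sketch correlated errors raise the MZ and DZ covariances by the same amount, whereas a change in the genetic components affects the two covariances differently; this is what allows the two explanations to be compared.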

Fig. 1 Gene-environment interaction (GxE) model with moderation by gender of the teacher
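A moderation model of this kind lets each path coefficient depend linearly on the moderator, so that a variance component equals the squared sum of an unmoderated path and a moderated part (Purcell 2002). The sketch below assumes teacher gender is coded 0 or 1; the coding, the function names and all numerical values are hypothetical.

```python
def moderated_components(paths, betas, moderator):
    """Variance components when each path is moderated: (path + beta * M) ** 2.

    paths, betas: dicts keyed by component name ('A', 'D', 'E') with the
    unmoderated path coefficient and its moderation coefficient.
    moderator: value of the moderator M (assumed coding: 0 or 1).
    """
    return {k: (paths[k] + betas[k] * moderator) ** 2 for k in paths}

def standardized(components):
    """Proportion of the total variance explained by each component."""
    total = sum(components.values())
    return {k: v / total for k, v in components.items()}

# Hypothetical path and moderation coefficients.
paths = {"A": 0.6, "D": 0.4, "E": 0.5}
betas = {"A": 0.2, "D": 0.0, "E": -0.1}

group0 = standardized(moderated_components(paths, betas, moderator=0))
group1 = standardized(moderated_components(paths, betas, moderator=1))
```

Setting all moderation coefficients to zero gives the same components for both teacher genders, which corresponds to the restricted model without moderation.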

The difference in goodness of fit of the nested models was assessed with a log-likelihood ratio test (LRT), which calculates the difference in −2log-likelihood (−2LL) between two models and evaluates this χ2-statistic with the difference in the number of estimated parameters between the models as degrees of freedom. A p value smaller than 0.01 was considered significant. Constraints were kept when a more restrictive model did not significantly decrease the goodness of fit, as a more parsimonious model is preferred.
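The test can be sketched with the standard library only. The upper-tail probability of the χ2 distribution is computed here with the closed-form series for integer degrees of freedom; the function names and the example −2LL values are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2.0) / r
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: normal tail plus a finite series in chi = sqrt(x).
    chi = math.sqrt(x)
    tail = math.erfc(chi / math.sqrt(2.0))  # equals 2 * P(Z > chi)
    pdf = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return tail + 2.0 * pdf * total

def likelihood_ratio_test(m2ll_restricted, m2ll_full,
                          n_params_restricted, n_params_full, alpha=0.01):
    """Return (chi2, df, p, significant) for two nested models."""
    chi2 = m2ll_restricted - m2ll_full
    df = n_params_full - n_params_restricted
    p = chi2_sf(chi2, df)
    return chi2, df, p, p < alpha

# Hypothetical fit values: two constraints raise -2LL by 12 points.
chi2, df, p, significant = likelihood_ratio_test(1012.0, 1000.0, 8, 10)
```

A significant result means that the constraints worsen the fit and are dropped; otherwise the more parsimonious model is retained.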

Results

Measurement invariance

MI of the four scales (OPP, ATT, HYP and ADHD) of the CTRS-R was tested across gender of the student (‘boy’ or ‘girl’) and gender of the teacher (‘male’ or ‘female’) at age 7 (Age: Mean = 7.44 and SD = 0.47), age 9 (Age: Mean = 9.92 and SD = 0.53) and age 12 (Age: Mean = 12.15 and SD = 0.30), resulting in a 4-group comparison. Information on the gender of the teacher was available for 8,611 7-year-olds (boy-male: N = 322; boy-female: N = 3,918; girl-male: N = 317; girl-female: N = 4,054), 8,021 9-year-olds (boy-male: N = 1,050; boy-female: N = 2,841; girl-male: N = 1,111; girl-female: N = 3,019) and 5,954 12-year-olds (boy-male: N = 1,332; boy-female: N = 1,503; girl-male: N = 1,381; girl-female: N = 1,738). Table 1 shows the frequencies of the item responses and the factor loadings of the items for all scales estimated from the EFA. Factor loadings were overall relatively high. On the basis of the scree plots and eigenvalues, a one-factor solution was chosen for OPP, ATT and HYP and a two-factor solution for ADHD (attention problems (AP) and hyperactivity/impulsivity (HI)) in all age groups (see Table 1).

Table 1 Frequencies of the item responses and factor loadings as estimated in the EFA

Results for the tests of the three levels of MI are reported in Table S1. For OPP, HYP and ADHD the configural, strong and strict invariance models all showed an acceptable to good fit, based on the RMSEA and CFI, for all age groups. Differences in χ2 between the models with increasing equality constraints were rather small and, for the strong MI level, did not exceed three times the number of degrees of freedom. For the strict MI level, the difference in χ2 for OPP at age 9 and HYP at age 7 and 12 was somewhat larger than this criterion, but these differences were accompanied by minor changes in RMSEA and CFI. Inspection of the modification indices revealed that they were larger for female teachers compared to male teachers for both boys and girls. Taken together, we could accept MI for the scales OPP, HYP and ADHD, for all ages, with respect to gender of the student and, more tentatively, for gender of the teacher. The fit of the MI models was acceptable to mediocre for ATT in 7-year-olds while the fit of the models was unacceptable for 9 and 12-year-olds. Even the models without constraints on the factor structure did not fit the data very well. Increasing MI levels led to a large decrease in model fit for all ages. Therefore, we could not accept MI across gender of the student and teacher for the ATT scale.

Gene-environment interaction models

Table 2 gives the means and standard deviations of the measurement invariant CTRS-R scales for boys and girls with the same or different male or female teachers across the three age groups. The saturated models were used to test for mean and variance differences across these groups. For OPP, there were mean and variance differences between boys and girls at all ages and variance differences across zygosity at age 7, between children sharing a classroom and children in different classrooms at age 12 and between children with the same or different male or female teachers at age 12. For HYP, there were mean and variance differences between boys and girls at all ages, mean differences across zygosity and between children sharing a classroom and children in different classrooms at age 7 and variance differences between children sharing a classroom and children in different classrooms at age 12. For ADHD, there were mean and variance differences between boys and girls at all ages and mean differences between children sharing a classroom and children in different classrooms at all ages.

Table 2 Means and standard deviations of the untransformed sum scores of the CTRS-R scales at age 7, 9 and 12

Twin correlations for each gender by zygosity group rated by the same teacher or by different teachers are given in Table 3. For all scales, correlations in monozygotic (MZ) twin pairs were higher than correlations in dizygotic (DZ) twin pairs, sometimes more than twice as high, suggesting additive (and in some cases dominant) genetic effects. Only for the OPP scale were DZ correlations larger than half the MZ correlations, suggesting common environmental effects. The GxE interaction model fitting results are reported in the online supplementary materials for the OPP (Table S2), HYP (Table S3) and ADHD (Table S4) scales of the CTRS-R. The standardized estimates (Table 4) and the contribution of the variance components (Fig. 2) are given for the most parsimonious and best fitting models.
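The reasoning from twin correlations to variance components can be made concrete with Falconer-type formulas. This is a rough approximation under the classical twin model and not the maximum likelihood estimation used in this study; the function name and the example correlations are hypothetical and are not taken from Table 3.

```python
def falconer_estimates(r_mz, r_dz):
    """Rough standardized variance components from MZ and DZ correlations.

    ADE pattern (r_dz below half of r_mz): r_mz = A + D, r_dz = A/2 + D/4.
    ACE pattern (otherwise):               r_mz = A + C, r_dz = A/2 + C.
    Estimates can fall outside [0, 1] for extreme correlation patterns.
    """
    e = 1.0 - r_mz
    if r_dz < 0.5 * r_mz:
        a = 4.0 * r_dz - r_mz
        d = 2.0 * r_mz - 4.0 * r_dz
        return {"A": a, "D": d, "C": 0.0, "E": e}
    a = 2.0 * (r_mz - r_dz)
    c = 2.0 * r_dz - r_mz
    return {"A": a, "D": 0.0, "C": c, "E": e}

# Hypothetical correlations: MZ more than twice DZ suggests dominance ...
ade = falconer_estimates(r_mz=0.8, r_dz=0.3)
# ... while DZ above half of MZ suggests common environment.
ace = falconer_estimates(r_mz=0.6, r_dz=0.4)
```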

Table 3 Twin correlations for the CTRS-R scales rated by the same teacher or different teachers at age 7, 9 and 12
Table 4 Standardized estimates [95% Confidence intervals] of the total genetic (G), additive genetic (A), dominant genetic (D), common environmental (C) and unique environmental (E) effects on the four CTRS-R scales for 7, 9 and 12-year-olds in the best-fitting models
Fig. 2 The relative contribution of the additive genetic, dominant genetic, common environmental and unique environmental effects for the most parsimonious and best fitting models for Oppositional Behavior (a), Hyperactivity (b) and Attention Deficit Hyperactivity Disorder Index (c)

Classroom sharing

Correlations between twins rated by the same teacher could not be constrained to be equal to correlations between twins with different teachers. Constraining the variance components to be equal across same and different teachers also resulted in a significant deterioration of the model fit. A model with correlated errors was fitted to the data to check whether the differences between the same teacher and different teacher groups could be explained by rater bias. For none of the scales did the correlated errors model provide a better fit. In general, the proportion of the variance explained by genetic effects (heritability) was higher, at all ages, for children taught by the same teacher (ST) than for children rated by different teachers (DT) for OPP in boys (ST 62–80 %; DT 12–57 %) and girls (ST 33–46 %; DT 25–55 %), HYP in boys (ST 76–84 %; DT 48–51 %) and girls (ST 66–75 %; DT 43–51 %) and ADHD (ST 78–88 %; DT 46–61 %).

Gender of the student

For the scales OPP and HYP, the contribution of the variance components differed between boys and girls at all ages, while this was not the case for the ADHD scale. Heritability of OPP was higher for boys (ST 62–80 %; DT 12–57 %) than girls (ST 33–46 %; DT 25–55 %). The influence of common environmental effects was, at most ages, negligible in boys (ST 0–6 %; DT 1–19 %) while it had some influence in girls (ST 9–36 %; DT 0–21 %). Heritability of HYP was slightly higher for boys (ST 76–84 %; DT 48–51 %) than girls (ST 66–75 %; DT 43–51 %). Differences between boys and girls on this scale could mainly be attributed to differences in the influence of dominant genetic effects.

Gender of the teacher

Moderation by gender of the teacher was significant for OPP at age 9 and 12, HYP at age 12 and ADHD at age 7. For OPP at age 9, the relative influence of genetic effects was larger in boys with female teachers (ST 78 %; DT 21 %) than in boys with male teachers (ST 62 %; DT 12 %), while it was somewhat larger for girls with male teachers (ST 44 %; DT 44 %) than for girls with female teachers (ST 38 %; DT 44 %). For OPP at age 12, the opposite was true: heritability was larger in boys with male teachers (ST 80 %; DT 57 %) than in boys with female teachers (ST 66 %; DT 43 %), while heritability was somewhat larger when girls were taught by a female teacher (ST 46 %; DT 55 %) than when they were taught by a male teacher (ST 33 %; DT 50 %). For HYP at age 12, heritability was almost equal in boys and girls with male and female teachers, but the extent to which dominant genetic effects played a role differed across gender of the teacher. For ADHD at age 7, heritability was larger for children with male teachers (ST 88 %; DT 61 %) than for children with female teachers (ST 78 %; DT 55 %).

Discussion

Three (OPP, HYP and ADHD) of the four scales of the short Conners’ Teacher Rating Scale–Revised (CTRS-R) (Conners 2001), used in a school setting to assess ODD and ADHD behavior, were measurement invariant across gender of the student and teacher. This means that gender differences in means and variances may be interpreted as reflecting true differences on the underlying disorder. In contrast, MI did not hold for the Cognitive Problems/Inattention (ATT) scale. Explanations for the absence of MI could be the low factor loadings and the moderate test–retest reliability of this scale. Problems with the item content have been previously suggested (Conners et al. 1998). In our sample, the internal reliability of the Cognitive Problems/Inattention scale of the short CTRS-R ranged from 0.78 to 0.82. The results of the MI analyses strongly question the reliability of this scale and its use in clinical practice. Revision of this scale is recommended as the ratings might reflect a bias instead of true differences.

Heritability of ODD and ADHD behavior, measured with the OPP, HYP and ADHD scales of the CTRS-R, is substantial. Common environmental effects had some influence on ODD behavior while dominant genetic effects had an influence on ADHD behavior. The finding of common environmental effects is consistent with earlier studies of ODD behavior using parental ratings (Burt et al. 2001; Tuvblad et al. 2009). The influence is larger in girls, which may be explained by the fact that girls appear to be more sensitive to reprimands from the teacher than boys. Earlier research already concluded that girls more often feel the pressure from peers or others to behave prosocially (Roberts and Strayer 1996). Girls might be more inclined to adapt their behavior when they are called upon by the teacher. In younger girls the common environment also has an influence when they do not share a classroom. Factors in the home environment that have been proposed to have an influence on ODD behavior are, for example, parental discipline and parental involvement (Frick et al. 1992) and the influence of these factors could depend on the gender of a child and decrease when a child grows older. The finding of dominant genetic effects for ADHD behavior, especially in children sharing a classroom, could also be due to rater contrast effects. Only when one teacher rates both children of a twin pair can the behavior of the children be contrasted and result in negative interaction effects. A higher rating for ADHD behavior in one of the children of a twin pair could lead to a lower rating for ADHD behavior in the co-twin. However, the variance in ADHD behavior is not significantly smaller in MZ twin pairs compared to DZ twin pairs, which disconfirms the presence of this type of rater bias. This is in accordance with the results of a study looking into mother and teacher ratings of hyperactivity. A contrast effect was found for the maternal ratings while the teacher ratings did not show this form of rater bias (Simonoff et al. 1998).

Heritability estimates for ADHD behavior are comparable to those found in studies taking differences between same and different teachers into account. For example, Merwood et al. (2013) also found differences in heritability between 12-year-old children sharing a classroom (76 %) and not sharing a classroom (49 %). One study included only twin pairs sharing a classroom and observed a heritability of 74 % (Hartman et al. 2007) while another included only twins not sharing a classroom and estimated a heritability of 46 % (Towers et al. 2000). GxE interaction was the most plausible explanation for internalizing and externalizing problems, assessed with the Teacher Report Form, in 7 to 12-year-old twin pairs of which approximately 60 % shared a classroom (Lamb et al. 2012). Other studies looking into GxE interaction for ADHD in 11–12-year-olds (Merwood et al. 2013), and hyperactivity in 7-year olds (Saudino et al. 2005) also observed that heritability was larger when children shared a classroom. On the other hand, a study in 7-year-olds did not observe a difference between children sharing a classroom and children in different classrooms in the heritability of ODD and ADHD behavior (Derks et al. 2007a), but it could be that this study did not have enough power to detect these differences in the heritability (Derks et al. 2004).

Studies of the heritability of teacher-rated ODD behavior are scarce. The findings of gender differences and common environmental effects were in accordance with the results of a study by Hudziak et al. (2005) that was based on a subsample of the present study. In contrast with current findings, none of the heritability estimates of the maternal-rated ODD behavior differed between boys and girls (Dick et al. 2005; Tuvblad et al. 2009). The differences between parent and teacher ratings of ODD behavior could be due to the fact that children can express different behavior in the classroom than they do at home. The OPP scale of the CTRS-R takes these differences into account by including different items for the teacher survey. One study observed that, although parents rated children rather similarly over time, teachers with different teaching styles rated the same children very differently across grades, suggesting that behavior differed in response to different teaching styles (Vitaro et al. 1995). Another explanation is that teachers have highly informed views on general childhood behavior for both boys and girls and are better able to assess which behavior is normative for a child of a certain age and gender.

Heritability of ODD and ADHD behavior was larger in children who shared a classroom compared to those who did not. The correlated errors model did not provide a better explanation for the differences in correlations between children rated by the same and different teachers, which excludes teacher bias as an explanation; these findings are therefore in line with GxE interaction for classroom sharing. In general, the heritability of ODD and ADHD behavior was lower in children not sharing a classroom, which implies a larger impact of the environment and suggests that different behavior is elicited by different classroom environments. The children are taught by different teachers, with different rules and teaching methods, and have different peers. All these factors could contribute to differences between children. For example, how teachers handle disruptive behavior is related to the behavior of a child (Rydell and Henricsson 2004). The unique environmental variance also contains measurement error, which might be increased when different teachers rate the two children of a twin pair, as rater variance ends up in the measurement error (Hoyt 2000). An important question is which differences between classroom environments play a role. Peer problems are related to ODD and ADHD behavior (Paap et al. 2013). Peer victimization moderates genetic variance in childhood aggression and might also moderate the heritability of ODD and ADHD (Brendgen et al. 2008). A study of differences between monozygotic twins in their perception of the classroom environment identified, for example, a student's perception of the relationship with the teacher as a unique environmental factor that differed between the genetically identical twins and was linked to hyperactivity as rated by the teacher (Somersalo et al. 2002).
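To illustrate how lower twin correlations in separate classrooms translate into lower heritability and a larger unique environmental component, the sketch below applies Falconer's classical formulas to twin correlations. This is only a rough approximation and not the structural equation modelling used in the present study; the function name `falconer_ace` and all correlation values are hypothetical and chosen purely for illustration.

```python
def falconer_ace(r_mz, r_dz):
    """Approximate ACE variance components from twin correlations.

    Uses Falconer's formulas (a classical approximation, not the
    structural equation models fitted in the study):
      A (additive genetic)      = 2 * (r_mz - r_dz)
      C (common environment)    = 2 * r_dz - r_mz
      E (unique env. + error)   = 1 - r_mz
    """
    return {
        "A": 2 * (r_mz - r_dz),
        "C": 2 * r_dz - r_mz,
        "E": 1 - r_mz,
    }


# Hypothetical correlations, NOT taken from the study:
# twins rated by the same teacher versus by different teachers.
same_class = falconer_ace(r_mz=0.80, r_dz=0.40)
diff_class = falconer_ace(r_mz=0.50, r_dz=0.25)

for label, est in [("same classroom", same_class),
                   ("different classrooms", diff_class)]:
    print(label, {k: round(v, 2) for k, v in est.items()})
```

With these illustrative inputs, the lower monozygotic correlation in different classrooms shifts variance from A to E. Because E also absorbs rater variance, part of such a shift may reflect measurement error rather than genuine classroom effects.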

For one teacher characteristic, gender, we investigated whether it moderated genetic effects on behavior in the classroom. The expression of a child’s genetic vulnerability for displaying ODD and ADHD behavior at school depended in some cases on the gender of the teacher. The direction of the difference in heritability may provide support for one of two hypotheses. Male teachers and female teachers could provide a different learning and classroom environment with regard to, for example, structure and rules. The bioecological model (Bronfenbrenner and Ceci 1994) predicts that the heritability of a phenotype will be lower in an adverse environment because risk environments will prevent the amplification of underlying genetic differences between children, while the diathesis-stress model suggests that heritability will be higher in an adverse environment due to the expression of a genetic vulnerability that is triggered by a risk environment (Rende and Plomin 1992). A same-gender teacher might be seen as a supportive environment, as it is suggested to have a positive influence on the behavior and educational achievement of a child (Carrington et al. 2008). According to the bioecological model, genetic variation will be higher when children are taught by a same-gender teacher, while the diathesis-stress model predicts that heritability will be lower. However, in our study, the direction of the effects of teacher gender was not consistent, which makes interpreting the GxE interaction findings difficult.

To summarize, three of the four scales of the short CTRS-R measuring teacher-rated ODD and ADHD behavior in 7-, 9- and 12-year-olds were measurement invariant for student gender and teacher gender. Revision of the fourth scale (ATT) is highly recommended before it can be used in clinical practice. The heritability of ODD and ADHD behavior was lower for children in different classrooms compared to children sharing a classroom, suggesting that different behavior is elicited by different classroom environments. Apparently, teachers, the classroom and/or peers are important environmental factors that influence the expression of ODD and ADHD behavior in primary school. The direction of the moderation of the heritability of ODD and ADHD behavior by gender of the teacher was not consistent, which makes interpretation difficult. Finding environmental factors with a moderating influence on the heritability of ODD and ADHD might help improve learning environments at school to prevent manifestation of ODD and ADHD behavior in children with an increased genetic vulnerability for these disorders.