Evidence for Gender-Dependent Genotype by Environment Interaction in Adult Depression

Depression in adults is heritable with about 40 % of the phenotypic variance due to additive genetic effects and the remaining phenotypic variance due to unique (unshared) environmental effects. Common environmental effects shared by family members are rarely found in adults. One possible explanation for this finding is that there is an interaction between genes and the environment which may mask effects of the common environment. To test this hypothesis, we investigated genotype by environment interaction in a large sample of female and male adult twins aged 18–70 years. The anxious depression subscale of the Adult Self Report from the Achenbach System of Empirically Based Assessment (Achenbach and Rescorla in Manual for the ASEBA adult: forms and profiles, 2003) was completed by 13,022 twins who participate in longitudinal studies of the Netherlands Twin Register. In a single group analysis, we found genotype by unique environment interaction, but no genotype by common environment interaction. However, when conditioning on gender, we observed genotype by common environment interaction in men, with larger common environmental variance in men who are genetically less at risk to develop depression. Although the effect size of the interaction is characterized by large uncertainty, the results show that there is at least some variance due to the common environment in adult depression in men.

The heritability of depression in adults is estimated at around 40 % (Sullivan et al. 2000;Nivard et al. 2015). Interestingly, in adults the remaining phenotypic variance is consistently found to be solely due to the unique environment. In adolescents, however, at age 12 years variation in anxious depression is explained also by shared environmental factors, while at ages 14 and 16 these shared environmental effects were absent (Lamb et al. 2010). The absence of evidence for common environmental influences on depression after age 12 is remarkable, as it has been argued that, theoretically, at least some phenotypic variance in depression is expected to be due to the familial effects in childhood that persist into adulthood (Gatz et al. 1992). For instance, cognitive styles related to depression may be learned in the family (Monroe and Simons 1991; Mezulis et al. 2006;Ingram 2003), and familial traumatic events in childhood, such as divorce, affect children similarly (Bowlby 1977;Kessler et al. 1997;Silberg et al. 2001). Therefore, it has been argued that the recurring finding of no common environmental effects on adult depression may be spurious. Duncan et al. (2014) hypothesized that the true effects of the common environment underlying depression are masked by non-linear effects. Specifically, the effects of the common environment (C) may depend on the genotype of the subject (A) that is an A9C interaction. If not explicitly modeled, such an interaction effect is included in the estimate of the genetic variance (Molenaar et al. 1990). The question arises whether common environmental effects on adult depression can be revealed by taking into account such non-linear effects.
Another question is why some individuals develop a depression after an adverse environmental event and others do not. Both linear and non-linear effects could explain this phenomenon. Given the ongoing debate of the usefulness of genetic variant by environment interaction studies, either in a candidate gene study or in a genome-wide association study, it is important to know whether nonlinear effects are present for unique environmental effects (see e.g., Dick et al. 2015).
Therefore we investigated whether common and unique environmental variance influencing the vulnerability for adult depression can be detected by taking into account the non-linear effects of genotype by environment interaction. We also tested whether interaction effects differ between males and females. Gender differences in the prevalence of depression arise in adolescence and remain until older age (Kessler et al. 1993). The exact mechanism underlying the higher prevalence of depression in adult females is generally unknown (Piccinelli and Wilkinson 2000), although studies have indicated that environmental factors associated with depression are different for males than for females (Kendler et al. 2011;Klose and Jacobi 2004).
We analyzed data from a large sample of twins between the ages of 18 and 70 years from the Adult Netherlands Twin Register (Nivard et al. 2015). The twins completed the anxious depression subscale of the Adult Self-Report from the Achenbach System of Empirically Based Assessment (Achenbach and Rescorla 2003). We tested for unmeasured genotype by unmeasured environment interaction effects in these data using the heteroscedastic ACE model (Jinks and Fulker 1970;Molenaar et al. 2012). Using the item level data methodology (see Molenaar and Dolan 2014; Schwabe and van den Berg 2014), we modeled a latent depression factor as a function of additive genetic (A), common environment (C), unique environment (E), and non-linear effects (A9E and A9C). In addition, we extended the approach to enable tests on gender differences in these interaction effects. In the resulting model adopted here both the genetic and the environment effects are treated as latent factors. By studying the interaction at the level of the latent genetic and environmental variance, we did not require measured candidate genes and measured candidate environments. In addition, the modeling approach is insensitive to the scale properties of the data, which may otherwise result in spurious non-additive effects (Eaves et al. 1977;Molenaar and Dolan 2014;Schwabe and Van den Berg 2014).

Method Participants and measures
The Netherlands Twin Register (NTR; see Nivard et al. 2015) includes the Young NTR (YNTR; van Beijsterveldt et al. 2013) and the Adult NTR (ANTR; Willemsen et al. 2013). In the YNTR, twins have been registered at birth by their parents since 1987 (Bartels et al. 2007). When twins reach the age of 18, they are enrolled in the ANTR. The ANTR originally included adolescent and adult twins who were recruited through city councils or who volunteered through the NTR website. Here we analyze the data from all twins aged 18 years and older. The dataset comprises 6511 twin pairs (no missing: 5923) between the age of 18 and 70 with information on depression and zygosity. These pairs consist of 3146 (no missing: 2895) are MZ twins and 3365 (no missing: 3028) are DZ twins.
The twin pairs completed the anxious depression subscale of the Adult Self Report (ASR), which is part of the Achenbach System of Empirically Based Assessment (Achenbach and Rescorla 2003). The twins were asked to indicate to what degree various statements concerning anxious depressive behavior and attitudes apply to them on a 3-point scale ('not true', somewhat or sometimes true', 'very or often true'). The ASR anxious depression subscale was included in 8 of the 11 surveys that have been collected for the NTR since 1991NTR since (respectively in 1991NTR since , 1995NTR since , 1997NTR since , 2000NTR since , 2002NTR since , 2009NTR since , 2011NTR since , and 2013. The different surveys contained slightly different versions of the ASR, as over the years, the ASR has changed. However, for anxious depression there was a common set of 13 items included in all surveys that was analyzed in this project (see Appendix 1). Cronbach's alpha for these items on the various measurement occasions and twin samples ranged between 0.83 and 0.90 which is an indication for good reliability. In addition, the correlations between the sum scores based on these 13 items and the sum scores based on all items at each measurement occasion are between 0.95 and 0.98. The validity of the ASR has been established by Achenbach and Rescorla (2003).
Not all twins participated at each measurement occasion (see Table 1). For instance, 2131 twin-1 members have 'None' means that only the co-twin has data available on 1 or more measurement occasions data available at only one occasion and 1273 twin-2 members have data available at three occasions. We selected the data vector from the first occasion that has the least missing values for each twin. This data vector contains missing values due to twins not completing all items of the questionnaire. Here, we assumed that these missing values are missing at random so that we can take all missing values in the data into account in the model fitting approach described below.

Analysis
Our main objective is to test for genotype by environment interaction for males and females in the complete sample of twin pairs. However, we first needed to establish that the data are homogeneous with respect to measurement occasion and age. Below we discuss a standard measurement model and the biometric model. Next, within these models, we discuss how we assessed homogeneity of the data with respect to measurement occasion and age. Then, we introduce the gender dependent genotype by environment interactions in the biometric model.

Standard measurement and biometric model
As advocated by Van den Berg et al. (2007), we analyzed the data at item level by separating between a measurement model for the items and a biometric model for the genetic and environmental variance. In the present study, the socalled graded response measurement model for ordinal item responses was used (Samejima 1969). Using this model, we separated the measurement properties of the item scores, X i , from the underlying latent phenotypic factor (anxious-depression), denoted h. Here, p = 1, …, N is used to index the twin pairs, j = 1, 2 is used to index the twin members, and i = 1, …, n is used to index the items.
In the graded response measurement model, the observed item scores, X i , are regressed on the latent phenotypic factor, h (using a multinomial probit regression function). The intercept and slope parameters in this regression are respectively referred to by threshold and discrimination parameters. These parameters are purported to capture the measurement characteristics of the item scores. In the present case, where we have a three point scale, we have 2 threshold parameters, s i1 , and s i2 . These parameters model the relative attractiveness of the answer options, that is, the degree to which the subjects use the different answer options. In an extreme depression item, for instance ''I often think of suicide'', s i1 will likely be large, reflecting that the first answer option is attractive (most subjects score in the first answer category indicating that they do not think of suicide) and the second option is not. As the item scores are assumed to be ordered, the thresholds are also ordered, that is, s i2 should always be larger than s i1 .
The slope, or discrimination parameter a i , in the regression of the item scores on the latent phenotypic variable, models the degree to which the item scores can distinguish between subjects with different levels of h. The higher the value of a i , the better indicator the item is for h. Besides the threshold and discrimination parameters, measurement models for twin data often also include the residual correlations between the item scores of the twin 1 and twin 2 members in the MZ sample (r MZ,i ) and the DZ sample(r DZ,i ). Such a correlation may indicate shared item specific genetic and/or environmental variance or it may indicate measurement problems resulting from filling in the questionnaire by two twins together. See Fig. 1 for a graphical representation of the measurement model including the parameters, and see Appendix 2 for a more technical discussion of the measurement model. As the measurement model above captures the measurement properties of the item data in the a i , s i1 , s i2 , r MZi , and r DZi parameters, the latent phenotypic factor, h, is unaffected by the scale properties in principle. 1 In the standard phenotypic model, the phenotypic variable, h, is decomposed into an additive genetic (A), common environmental (C), and unique environmental (E) variance component and an intercept, m, that is where COR(A 1 , A 2 ) = 1 for MZ twins and COR(A 1 , A 2 ) = 0.5 for DZ twins. In addition, COR(C 1 , C 2 ) = 1 and COR(E 1 , E 2 ) = 0, and E(A) = E(C) = E(E) = 0 for all twins. Under the assumption that the genetic and environmental variance components are mutually uncorrelated, the variance of h can be decomposed as follows: where the standardized estimate of VAR(A) is referred to as the heritability, h 2 .

Homogeneity with respect to measurement occasion
To investigate whether the eight measurement occasions differed in their biometric properties, we relied on estimates of the variance components (A, C, and E) within each occasion. We thus assumed that there are no important differences between the measurement models at each occasion. We fitted the measurement model measurement model and the biometric model simultaneously to the data of each occasion. Note that this was thus a standard ACE model at the level of the latent depression factor (biometric model). On basis of the 99 % highest posterior density regions (HPD) of these variance components, we judged whether the variance components differed importantly over measurement occasion. If not, we concluded that the data collected at the separate occasions were homogenous and could be aggregated.

Full homogeneity with respect to age
As the age range of our sample is wide (18-70), we also established homogeneity of the data with respect to age. As age is an important moderator in the literature, we wanted a more explicit test on homogeneity than the one discussed above. Here, we followed Nivard et al. (2015), and created age subsamples. We used the following groups: 18-19, 20-21, 22-24, 25-34, and 35-70 years. A major consideration in creating these subsamples was that the sample sizes within each age group need to be large enough to have sufficient power to detect heterogeneity. The resulting number of twin pairs within each age category that were in the analysis (i.e., twin pairs with a full or incomplete data record) is given in Table 2. In the case that the twin members have data in different age categories (due to a twin completing the questionnaire at a different age than his/her co-twin) we omitted their data from the present analysis as it requires independent groups. However, we included their data in the aggregated data analysis. We tested if some age groups differed importantly from other age groups (e.g., more variance in the phenotypic factor due to heterogeneity). As the age groups are independent (the members of a twin pair are always in the same age group), we could conduct a multi-group analysis and test for homogeneity of the measurement model (i.e., invariant a i , s i1 , s i2 , r MZ,i , r DZ,I ; also referred to as measurement invariance, see Millsap and Yun-Tein 2004) and homogeneity of h across ages [i.e., invariant MEAN(h), VAR(h), and COR(h 1 ,h 2 )]. First, we tested for invariance of the a i , s i1 , s i2 , r MZ,i , and r DZ,i parameters in the five age groups. To this end we fitted the measurement model without the ACE decomposition, but with a correlation between h 1 and h 2 , in the five age groups. We did this separately for the MZ and DZ twins. We started with an unconstrained model (step 0) in which all parameters were free to vary over age groups. Next, step-by-step, we constrained a i (step 1a), r MZ,i , and r DZ,i (step 1b), and s i1 and s i2 (step 1c) to be equal across groups. In step 1a we allowed for differences in the variance of h between the age groups (i.e., we allowed for the possibility that older subjects have a higher/lower variance on the phenotypic depression variable as compared to the younger subjects). To do so, we constrained VAR(h) = 1 in group 1, and estimated it freely in the remaining groups. In step 1c we The number of twin pairs with a full data record (i.e., with data available for all 13 items in both twins) are in brackets 'Agg' denotes the data aggregated over age. If the twin members of the same pair have data in two separate age categories, this pair is omitted from the age grouping to enable multi-group analysis (which requires independent groups). However, this pair is not omitted from the interaction analysis in the aggregated data, leading to data in 6511 pairs for interaction analyses allowed for a mean difference in h between the age groups (i.e., we allowed for the possibility that older subjects have on average higher/lower levels on the phenotypic depression variable as compared to the younger subjects). To do so, we constrained m = 0 in group 1, and estimated it freely in the remaining groups. If homogeneity with respect to the measurement model is tenable (i.e., step 1c), we could subsequently test for homogeneity of the population model with respect to age. To this end we tested whether the mean and variance of h (step 2a), and COR(h 1 ,h 2 ) were equal across age groups (step 2b).

Measurement invariance with respect to gender
As we compared males and females in the interaction models, we analyzed whether homogeneity of the measurement model (measurement invariance) holds for males and females. We did not need to establish homogeneity of h as we explicitly took possible differences in h into account in the interaction model. Thus, we assessed whether the parameters a i , s ic , r MZi , and r DZi were invariant over gender using the procedure from step 1a to 1c as described above. For the MZ subsample this was thus a two group analysis (males-females), and for the DZ subsample this was a three group analysis (male, female, and oppositesex pairs).

Testing for interactions in a multi-group model
Recently, within the heteroscedastic ACE methodology (Jinks and Fulker 1970;Molenaar et al. 2012) an approach was presented to enable tests on genotype by environment interactions using the model discussed above (Molenaar and Dolan 2014; Schwabe and Van den Berg 2014). Specifically, retaining the measurement model for h as discussed above, the biometric model can also be formulated as a conditional model. To this end, we condition h on A, denoted h | A. This results in for the variance decomposition. Now, a genotype by environment interaction is operationalized as an AxC interaction and an A9E interaction: that is, the variance of C and E are made a function of A. Within this function, c 0 and b 0 are the intercept parameters for log[VAR(C)] and log[VAR(E)] respectively which model the size of VAR(C) and VAR(E) at A = 0. In addition, c 1 and b 1 are the interaction parameters, which model the increase or decrease of VAR(C) and VAR(E) across A. The presence of A9C and/or A9E is established by testing whether c 1 and/or b 1 depart from 0. If so, the parameter estimates can be used to infer the direction of the interaction effect. For instance, a b 1 [ 0 denotes that the unique environmental variance is larger for subjects with a higher genetic predisposition, A. In addition, b 1 \ 0 denotes that the unique environmental variance is smaller for subjects with a higher genetic predisposition, A. The same holds for the A9C parameter, c 1 . This conceptualization of genotype by environment interaction is inspired by Jinks and Fulker (1970), who treated a genotype by environment interaction as an environmental variance that is heteroscedastic across the additive genetic factor. This conceptualization is somewhat different from that of Purcell (2002), who models genotype by environment interactions by making the variance of A a function of a measured moderator (which is not necessarily purely environmental). For a more technical discussion of the biometric model, see Appendix 2.

Gender effects
As we wanted to allow for gender effects in the aggregated data analysis, an extension of the model by Molenaar and Dolan (2014) was necessary. To account for gender differences in VAR(C|A), VAR(E|A), we used the following parameterization where GENDER pj is coded 0 if twin j from twin pair p is a male and 1 if it is a female. In this way, the new parameters c 0,female and b 0,female account for differences in the intercept parameters c 0,overall and b 0,overall in the female group as compared to the male group. Similarly, the new parameters c 1,female and b 1,female account for differences in the A9C and A9E parameters, c 1,overall and b 1,overall , in the female group as compared to the male group. Thus, the AxE parameter b 1 in the male group is equal to b 1,overall , and the A9E parameter b 1 in the female group is equal to b 1,overall ? b 1,female . The same holds for the AxC parameter. In the model above, the intercept parameter, m, captures a possible mean difference in h between males and females.
To account for gender differences in VAR(A) we analogously defined: that is, exp(x 0,overall ) is the variance of A in the male group, and exp(x 0,overall ? x female ) is the variance of A in the female group.

Identification and estimation
To identify the model, traditional scale and location constraints were imposed on h (see Molenaar and Dolan 2014).
We identified the scale of h by fixing a 1 = 1 for the MZ and DZ twin samples. In single group applications, the location of h was fixed by imposing m = 0. As discussed above, in the multi-group model including gender, m was a free parameter in the female group and fixed to 0 in the male group. We used a Bayesian approach to model fitting (Eaves and Erkanli 2003). Specifically, we implemented the model in the open-source OpenBUGS software package (Lunn et al. 2009). To this end, we extended the implementation by Molenaar and Dolan (2014) to include the multi-group components as discussed above. The adapted script is available from the website of the first author. Using this script, one can draw samples from the posterior distribution of the parameters using Markov Chain Monte Carlo (MCMC) sampling. From these samples one can determine the parameter means and HPD regions which can be used for statistical inference. The parameters of interest to be estimated were: a i , s ic . r MZi , r DZ,i b 0,overall , b 1,overall , c 0,overall , c 1,overall , and x 0,overall in the single group analysis. For the multi-group gender analysis we additionally estimated x 0,female , b 0,female , c 0,female , b 1,female , c 1,female , and m. As we used a Bayesian model fitting approach, we specified prior distributions for the parameters. Following Molenaar and Dolan (2014), we used an uniform distribution between -5 and 5 for all parameter except s ic , r MZi , and r DZ,i . For s i1 we used a uniform distribution between -? and s i2 ; and for s i2 we used an uniform distribution between s i2 and ?. These priors were meant to ensure that s i2 is larger than s i1 as discussed above. For r MZi and r DZi we specified a uniform prior between 0 and 1 for ffiffiffiffiffiffiffiffi r MZ i p and ffiffiffiffiffiffiffi ffi r DZ i p to prevent sign switching. Note that the missing data in our dataset provide no problem for parameter estimation: In the MCMC procedure these values are considered parameters and are included in the sampling routine. For more technical details concerning the implementation of the model see Molenaar and Dolan (2014).

Homogeneity with respect to measurement occasion
Estimates for the contributions of heritability, common and unique environment are given in Table 3. Note that these estimates are based on data from the same that were selected for the interaction analyses as described above. Hence the number of participants in Table 3 is smaller than the total number of twins who took part at each measurement occasion. As can be seen, estimates for heritability tend to be higher than the estimates of 0.4 that are commonly found. Also, there are no differences across measurement occasion in the scaled contributions of the A, C, and E factors to the latent phenotypic depression factor, that is, all 99 % HPD regions overlap. We therefore aggregated the data over the measurement occasions, to obtain a sufficiently large sample size for the heteroscedastic ACE model fitting.

Full homogeneity with respect to age
The RMSEA model fit statistic for the MZ and DZ twin samples is depicted in Table 4 for the different models. As can be seen, all RMSEA values were well below the 0.05 criterion of good model fit (Schermelleh-Engel et al. 2003). Although in both the MZ and DZ twin samples the invariance of s i1 and s i2 was associated with a small deterioration of model fit by 0.002 RSMEA points, there was no obvious source of misfit as indicated by the modification indices (the largest modification index was 11.68 for s i1 of item 4 in the first age group). We therefore concluded that measurement invariance is tenable. Table 5 contains the estimated means and variances of h in the different age groups in step 1c. As can be seen, only the mean in the 25-34 age group of the DZ sample was significantly different from zero at a 0.01 significance level. For the variances, only the variance of h in the 20-21 age group of the MZ sample departed significantly from 1. In addition, restricting the means of h in all age groups to equal 0 and all variances of h to be equal to 1 (step 1d) did not deteriorate the model fit, see Table 4. Finally, we tested the latent phenotypic twin correlation, COR(h 1 , h 2 ) to be equal across age groups (step 1e). As can be seen from Table 4, a model with equal latent phenotypic correlations across age groups improved the RMSEA in both the MZ and DZ twin samples. We therefore concluded that there was no overall age effect detectable. That is, either there is no age effect in the data or the age effect is very small. In both cases we can safely conclude that age did not confound the analysis on the aggregated data as reported below.

Measurement invariance with respect to gender
We started with a baseline model (step 0) in which all measurement model parameters a i , s i1 , s i2 , r MZ,i , and r DZ,I were allowed to differ across males and females. Next, we fitted the models from step 1a, 1b, and 1c to the data as discussed above. The results are in Table 6. As can be seen all models fitted well according to the 0.05 criterion. However, in step 1c, the model fit deteriorated notable in both the MZ and DZ twin samples. The modification indices suggested that s i1 of item 3 ('I cry a lot') accounts for this misfit (the modification index equaled 112.20 in the male MZ sample). Indeed, as can be seen from the table in step 1c 0 , freeing this parameter improved the model fit. Results showed that for both the MZ and DZ twins, the threshold parameter s i1 of item 3 was estimated to be much larger for the males as compared to the females indicating that the males tend to use the lower category too often as compared to the females (or similarly, females use the lower category too little as compared to the males). In the final model (step 1c 0 ), the mean difference on h between males and females (i.e., parameter m) was estimated to be 0.47 (se 0.04) in the MZ sample and 0.47 (se 0.03) in the DZ sample. In addition, the variance in the female group was estimated to be 1.02 (se 0.03) in the MZ sample and 1.06 (se 0.03) in the DZ sample indicating that there was no variance difference between males and females (the male variance was fixed to 1).
This final model without s i1 for item 3 fitted acceptable as compared to the other models. In addition, there was no obvious source of misfit as judged by the   modification indices (the largest modification index equaled 12.17 for s i1 of item 4 in the male MZ sample). We concluded that measurement invariance was tenable for all items except item 3. As item 3 was not associated with the same measurement properties for males and females, we omitted this item from the remaining analysis to ensure a meaningful comparison.

Results of the interaction model
We drew 20,000 samples from the posterior parameter distribution of which we discarded the first 10,000 as burnin. From the Gelman and Rubin (1992) diagnostic (based on two chains) and trace plots of the parameters this number of samples appeared to be sufficient to ensure that the chains converged to their stationary distribution. See Fig. 2 for example trace plots of the interaction parameters, b 1,overall and c 1,overall of the full interaction model including gender. We fitted a model without gender differences (i.e., model M1) and a model with gender differences (i.e., model M2, the full gender interaction model). The parameter estimates of the measurement model parameters for model M2 are in Table 7. As can be seen, these correlations were notably smaller in the DZ twin group. This indicates that some item specific genetic and/or shared environmental variance may underlie the scores. By means of the residual correlation, we accounted for this common variance.
In Table 8, the parameter estimates of the interaction parameters in model M1 and M2 are depicted. As can be seen in model M1 without gender differences, b 1 departed from 0 and was positive indicating the presence of A9E the variance of E increasing for increasing levels of A. In addition, the c 1 parameter did not depart from 0, indicating the absence of A9C in the full sample.
As can be seen in Table 8, when gender differences were taken into account (model M2), a different pattern of results emerged. That is, for both males and females, A9E was present with positive b 1 , but for males, there was evidence for A9C as the HPD region of c 1 did not include 0, while for females there was no evidence for A9C. The mean difference between males and females in the latent phenotypic factor, h, was hardly affected by taking the A9E and A9E interactions into account. That is, m in the female sample was estimated to be 0.51 (99 % HPD 0.45; 0.59) which was about equal to the estimate reported above in the case of no interactions. It can also be seen from Table 8 that the results from the females follow the results from the entire sample (i.e., M1), while the results from the males are different from the entire sample. We will return to this point in the discussion.
Results in terms of the contributions of heritability, common and unique environment are given in Table 9. Note that these estimates are based on the marginal variance of C and E, as the conditional variance differs across A. The marginal variance for C and E can be calculated using exp(b 0 ? 0.5 9 b 1 2 ) and exp(c 0 ? 0.5 9 c 1 2 ) respectively (see Hessen and Dolan 2009). As can be seen from the Table by taking into account the gender differences in the interactions (M2) the heritability (h 2 ) drops from 0.54 in the full sample to 0.35 in the male group. In addition, the contribution of the common environment increases from 0.04 in the full sample to 0.22 in the male group. It should be noted however that the uncertainty in this estimate is relatively large, reflected by the wide 99 % HPD region which runs from 0.09 to 0.37. But at least we can conclude that there is some contribution of the common environment to depression for males.
From the results in Table 9 it can be calculated that 33 % of the heritability in males is due to genotype by environment interaction (i.e., 1 -0.35/0.52; see Molenaar et al. in press). It is clear that this percentage is due to A9C to a large extent, however the exact amount of A9C variance in the male group is difficult to assess. That is, theoretically, the distinction between the effects of A9C and A9E interactions is clear: when not included into the model, regular genetic covariance structure analysis cannot distinguish between an additive genetic factor A and an A9C interaction factor, or between an environmental factor E and an A9E interaction factor (Molenaar et al. 1990). In practical applications however, the parameter estimates for A9C and A9E are correlated which complicate quantification of the exact amount of A9C and A9E variance in the data (see Molenaar et al. 2012).

Discussion
We studied whether an additive genetic by unique environment interaction (A9E) and/or an additive genetic by common environment interaction (A9C) play any role in adult depression. In a measurement model for categorical item scores, the depression phenotype was operationalized as a latent variable. In a first set of analyses, omitting interaction effects, we found heritability estimates of around 0.5-0.6, which are somewhat larger than those commonly found using an MDD diagnosis or symptom count sum score (i.e., around 0.3-0.4; Sullivan et al. 2000;Nivard et al. 2015). This discrepancy is both of interest and expected, as latent variables always contain less measurement error as compared to observed measures (see Van den Berg et al. 2007).
The present undertaking was aimed at testing the hypothesis that common environmental variance in depression is masked by interaction effects (Duncan et al. 2014). In a single group analysis of the complete sample, we found that the unique environmental variance is larger for individuals with a higher predisposition to develop depression (i.e., higher A factor score). However, we obtained no evidence for A9C interaction in the single a This parameter has been constrained for identification purposes. In addition, item 3 was omitted from the analysis as it violated measurement invariance across gender Table 8 Parameter estimates (99 % highest posterior density region) of the A9E and A9C parameters in the aggregated data analysis using a model without (M1) Table 9 The proportion of variance in the latent depression phenotype explained by the additive genetic factor (heritability; h 2 ), unique environment (e 2 ) and common environment (c 2 ) in the full genotype-by-environment interaction model group analysis. Next, in a multi-group analysis, we took gender differences into account. We found A9E in both males and females similarly as in the single group analysis, but additionally, there was A9C for the male group. Specifically, in males, common environmental variance is smaller in twins with a higher genetic predisposition to develop depression. The marginal contribution of C increased from 0.04 in the full sample to 0.22 for males by taking gender differences in AxC into account. Although the uncertainty in this estimate is large, it can be concluded that in males, at least some common environment variance is masked by non-linearity due to an A9C interaction. As mentioned above, the results of the female sample follow the results of the entire sample while the results of the males are different from the results in the entire sample. As interaction effects result in non-normality, the present method detects specific departures from bivariate normality in the latent phenotypic factor h that are due to A9C (see Molenaar et al. 2012). In the entire sample (collapsing over gender), the non-normality due to A9C in the male sample is masked due to the females that score higher on h. The distribution of h in the entire sample is thus approximately normal. In the male sample, the distribution of h departs from normality resulting in different estimates of the parameters as compared to the entire sample. As there is no A9C interaction in the female sample, the distribution of h is approximately normal and the results follow the results from the entire sample.
As we argued in this paper, using a measurement model for the item data in testing for genotype by environment interactions may solve the scaling issues commonly encountered in genotype by environment interaction research (Eaves 2006). However, some common limitations of genotype by environment interaction research remain (see Molenaar et al. in press). That is, the presence of a non-linear genotype by environment correlation may conflate the genotype-by-environment interaction. In addition, a genotype-by-environment interaction may spuriously arise if the twin sample is unrepresentative of the population (e.g., the higher phenotypes are underrepresented). Note however that these shortcomings are not unique to the item level approach used here, as they are also problematic in, for instance, the popular genotype by measured environment approach (Purcell 2002).
It would be of interest to model the complete longitudinal dataset including all items at all measurement occasions. That is, the measurement model approach adopted here allows for so-called 'item linking' (Kolen and Brennan 2004). In addition, extending the genotype by environment model with a longitudinal component would allow inclusion of the data from all measurement occasions into the analysis, resulting in the largest power possible to detect an interaction effect. However, such an analysis is currently impossible as the required longitudinal genotype by environment models are not yet developed and because of the large sample size, the mathematically complex model, and the tremendous number of missing data in the complete dataset, a full longitudinal item linking approach is numerically intractable.
The present approach provides what may be considered an omnibus test of A9C and A9E, as the interactions are modeled at the level of the latent variables A, C, and E. We consider this an advantage as we do not need to include measured moderators (candidate genes, environmental variables). We emphasize that a failure to detect A9C or A9E using the present method should be interpreted as a result pertaining to A, C, and E. We do not consider the absence of say A9C in females necessarily incompatible with the presence of an interaction detected with a measured moderator, as the power to detect the effect of an interaction with a measured moderator may be greater than the power to detect A9C. The question of which mechanisms underlie gender differences in depression is important. With the present results we hope to have provided a point of departure for further research into the etiology of differences between males and females in the development of depression. Most importantly, we found some empirical evidence for the claim by Duncan et al. (2014) that effects of the common environment underlying depression are masked by non-linear effects. It is therefore advisable to account for these non-linearity when studying the genetic and environmental underpinning of depression.
creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
DZi in the DZ sample. As the residual polychoric variance of X i (i.e., the polychoric variance of X i conditional on h) is constrained to be equal to 1r MZi for MZ and to 1r DZi for DZ twins (see the equation above), the residual (polychoric) correlation between X i1 and X i2 (i.e., the polychoric correlation between X i1 and X i2 conditional on h 1 and h 2 ) is equal to r MZi and r DZi respectively, see Molenaar and Dolan (2014).
In the biometric model, h is submitted to the ACE decomposition, that is, where COR(A 1 , A 2 ) = 1 for MZ twins and COR(A 1 , A 2 ) = 0.5 for DZ twins. In addition, COR(C 1 , C 2 ) = 1 and COR(E 1 , E 2 ) = 0 for all twins. Under the assumption that the genetic and environmental variance components are mutually uncorrelated, the variance of h can be decomposed as follows As discussed in the main text, genotype by environment interactions are operationalized by conditioning h on the additive genetic factor A. The conditional variance is given by As the variance of C and E are now conditional on A, we can make them a function of A, that is, The variance of C and E now depend on the level of A, that is, the variance due to the environment depends on the genotypic factor. Note that because COR(C 1 , C 2 ) = 1 by definition, imposing VAR C ð Þ ¼ exp c 0 þ c 1 A ð Þimplies that COV C 1 ; C 2 ð Þ ¼ exp c 0 þ c 1 :5A Á . Now we can condition on h and specify the distribution of the observed data: X $ cat P X pij ¼ 0jh pj À Á ; . . .; P X pij ¼ Qjh pj À Á Â Ã where the probabilities in the categorical distribution can be determined using the graded response model above. See Molenaar and Dolan (2014) for a discussion on how these distributions are exactly implemented in OpenBUGS (Lunn et al. 2009).