The classical twin model (CTM) is often approached from a structural equation modeling (SEM) framework (Bentler and Stein 1992; Boomsma and Molenaar 1986; Heath et al. 1989; Neale and Cardon 1992; Rijsdijk and Sham 2002). In this framework, it is a one-level model with the family as the level 1 sampling unit. The analysis of twin data can, however, also be approached from a multilevel model (MLM) perspective. MLMs were developed specifically for the analysis of clustered data (Goldstein 2011; Laird and Ware 1982; Longford 1993; Paterson and Goldstein 1991). Classical examples are children (level 1 units), who are clustered in classes (level 2) within schools (level 3; Sellström and Bremberg 2006). Other examples are fMRI measures (level 1) that are clustered in individuals (level 2), who are clustered in scanner type (level 3; Chen et al. 2012), or biomarker data (level 1) that are clustered in measurement batches (level 2; Scharpf et al. 2011). The classical twin design is based on data that also have natural clustering, namely, twins are clustered within pairs. For this reason, the MLM framework can accommodate the CTM (Guo and Wang 2002; McArdle and Prescott 2005; Rabe-Hesketh et al. 2008; Van den Oord 2001). Hunter (2020) provides a detailed account of the CTM in the MLM framework with example code and several extensions. While the MLM specification of the CTM is equivalent to the SEM approach, it also has some interesting, yet underexplored, advantages. In this paper we aim to elaborate on these advantages, and to provide an empirical illustration of a multilevel twin model, in which we study the clustering of children’s height in geographical regions in the Netherlands, and consider the role therein of genetic ancestry.
In the SEM approach to the CTM, the covariance structure of twin pairs is modelled to decompose phenotypic variance into multiple components that represent genetic and non-genetic influences. Given the biometrical underpinning of the twin model (Eaves et al. 1978; Falconer and MacKay 1996; Fisher 1918), the phenotypic variance can be decomposed into additive genetic variance (A), non-additive or dominance genetic variance (D), common environmental variance (C), and unique environmental variance (E) components. Variance decomposition is based on the premise that monozygotic (MZ) twins share 100% of their DNA and dizygotic (DZ) twins share on average 50% of their segregating genes. Hence, additive and non-additive genetic variance are fully shared by MZ twins, whereas DZ twins share 50% of the additive and 25% of the non-additive genetic variance. In the CTM, all influences that are not captured by segregating genetic variants are labeled as “environment”. These influences can be categorized as common environment (i.e., shared by twins from the same family) or unique or unshared environment (i.e., creating variation among members from the same family). These are also referred to as between-family and within-family environmental influences, respectively. The full ACDE model is not identified when analyzing one phenotype per twin, and only three of the four components can be simultaneously estimated. In this SEM approach to modeling twin data, the variance decomposition is based on the bivariate data observed in twin pairs (i.e., one phenotype for twin 1, and one for twin 2, which are both level 1 units).
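The expected covariance structure implied by these sharing coefficients can be written out explicitly. The following is a minimal sketch for illustration only, not the code used in this study; the function name `expected_twin_cov` and the variance component values are assumptions chosen to make the arithmetic easy to follow.

```python
import numpy as np

def expected_twin_cov(a, c, e, d=0.0, zygosity="MZ"):
    """Expected 2x2 covariance matrix of a twin pair under the CTM.

    a, d, c, e are variance components (hypothetical values, not estimates).
    """
    if zygosity == "MZ":
        r_a, r_d = 1.0, 1.0    # MZ twins share all A and D variance
    elif zygosity == "DZ":
        r_a, r_d = 0.5, 0.25   # DZ twins share 50% of A and 25% of D
    else:
        raise ValueError("zygosity must be 'MZ' or 'DZ'")
    total = a + d + c + e          # phenotypic variance
    cov = r_a * a + r_d * d + c    # twin covariance; E is never shared
    return np.array([[total, cov], [cov, total]])

# Illustrative ACE values only (assumed, not taken from the data)
mz = expected_twin_cov(a=0.6, c=0.2, e=0.2, zygosity="MZ")
dz = expected_twin_cov(a=0.6, c=0.2, e=0.2, zygosity="DZ")
```

With one phenotype per twin and equal variances across zygosity groups, the data supply only three distinct statistics (the phenotypic variance, the MZ covariance, and the DZ covariance), which is why at most three of the four components can be estimated.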
In the MLM framework the phenotypic variance can be decomposed into a within-pair (level 1) and a between-pair (or family; level 2) component. This requires reparameterization of the model into level 1 and level 2 variance components. Because the E component captures variance that is not shared by twins, this component is an individual level 1 variance component. The C component is by definition shared by twins, regardless of zygosity, and is a family level 2 variance component. The A component, however, is more complicated, as it is a level 2 component in MZ twin pairs, but both a level 1 and a level 2 component in DZ twin pairs. To account for this, the A component is divided into two orthogonal components, unique additive (AU) and common additive (AC). Here, AU is a first-level component representing the A variance at the individual level (within pairs or within families), while AC is a second-level component (between pairs or between families), representing the A variance at the twin-pair level. These definitions are consistent with the classical notations in which AC refers to the between-family genetic variance known as A2 (Boomsma and Molenaar 1986; Martin and Eaves 1977), or the average breeding value variance (Barton et al. 2017), while AU refers to the within-family genetic variance known as A1 (Boomsma and Molenaar 1986; Martin and Eaves 1977), or the segregating genetic variance (Barton et al. 2017). In MZ pairs, the AU variance component is 0, since all the variance explained by A is shared by both twins from a pair. For DZ twins, the variances of AC and AU are each constrained to equal 0.5 times the A variance, since on average 50% of the A variance is shared by the twins and 50% of the A variance is unique to the individual.
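The reparameterization can be made explicit with a short worked decomposition. The notation below is an assumption introduced for illustration ($y_{ij}$ is the phenotype of twin $j$ in pair $i$, and $\sigma^2_A$, $\sigma^2_C$, $\sigma^2_E$ are the variance components); it is a sketch of the two-level ACE case rather than the exact model specification used later.

```latex
\begin{align*}
y_{ij} &= \mu + A_{C,i} + C_i + A_{U,ij} + E_{ij} \\[4pt]
\text{MZ:}\quad
  \operatorname{var}(A_{C,i}) &= \sigma^2_A, &
  \operatorname{var}(A_{U,ij}) &= 0 \\
\text{DZ:}\quad
  \operatorname{var}(A_{C,i}) &= 0.5\,\sigma^2_A, &
  \operatorname{var}(A_{U,ij}) &= 0.5\,\sigma^2_A \\[4pt]
\text{Level 2 (between pairs):}\quad
  \sigma^2_{B,\mathrm{MZ}} &= \sigma^2_A + \sigma^2_C, &
  \sigma^2_{B,\mathrm{DZ}} &= 0.5\,\sigma^2_A + \sigma^2_C \\
\text{Level 1 (within pairs):}\quad
  \sigma^2_{W,\mathrm{MZ}} &= \sigma^2_E, &
  \sigma^2_{W,\mathrm{DZ}} &= 0.5\,\sigma^2_A + \sigma^2_E
\end{align*}
```

In both zygosity groups the level 1 and level 2 variances sum to $\sigma^2_A + \sigma^2_C + \sigma^2_E$, and the level 2 variance equals the twin covariance expected under the SEM specification, which is why the two approaches are equivalent.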
An important, yet underexplored, advantage of the MLM approach is the possibility to include higher-level variables in which lower-level units are nested. By including these higher-level variables, we can identify variance components that are attributable to higher-level clustering. Such clusters may be a consequence of data acquisition or design, e.g., clustering of biomarker data that are measured in batches, or clustering of brain imaging data by fMRI scanner type. They may also occur naturally, for example, families in regions, neighborhoods or schools. If the higher-level variable is not included in the variance decomposition models, the variance that it explains will be captured as part of the C-component, since both twins, regardless of zygosity, share the higher-level variable (i.e., the twin pair is nested in the higher-level variable).
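A small numerical example shows why an omitted higher-level variable ends up in the C-component. The sketch below uses simple moment-based (Falconer-type) formulas rather than the maximum likelihood estimation typically used for twin models, and all variance values, as well as the function name `ace_from_covariances`, are hypothetical.

```python
def ace_from_covariances(var_total, cov_mz, cov_dz):
    """Moment-based ACE decomposition from twin covariances (two-level view)."""
    a = 2.0 * (cov_mz - cov_dz)   # additive genetic variance
    c = 2.0 * cov_dz - cov_mz     # common environmental variance
    e = var_total - cov_mz        # unique environmental variance
    return a, c, e

# Hypothetical "true" components in a three-level world (illustrative values)
a_true, c_family, region, e_true = 0.6, 0.1, 0.1, 0.2

# Both twins of a pair live in the same region, so the region variance
# adds to the MZ and the DZ covariance alike.
var_total = a_true + c_family + region + e_true
cov_mz = a_true + c_family + region
cov_dz = 0.5 * a_true + c_family + region

a_hat, c_hat, e_hat = ace_from_covariances(var_total, cov_mz, cov_dz)
region_share_of_c = region / c_hat   # proportion of two-level C due to region
```

Under these assumed values the two-level decomposition returns a C-component equal to the sum of the family and region variances, so half of the apparent common environmental variance would in fact reflect regional clustering.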
Within the SEM framework, higher-level variables can be included in the model as a fixed effect on the individual level (i.e., as a covariate) by means of (linear) regression. For nominal covariates (i.e., factors in the ANOVA sense), this approach requires the variable to be dummy coded, which may be impractical, for example when the number of assays for a biomarker or the number of schools that twins are enrolled in is large. In the MLM framework, however, the higher-level variable is treated as a random rather than a fixed effect, which reduces the number of additional parameters to a single variance component. That is, given a factor with L categories, the fixed effects approach requires L-1 additional parameters, whereas the random effects approach requires one additional parameter (a variance component). In addition, the MLM approach is better suited than the SEM framework to dealing with unequal group sizes (Gelman 2005). Finally, an MLM approach allows us to evaluate the contribution of the higher-level component to the C-component, as estimated in the standard twin model. This can be achieved by comparing the C-component estimate of the two-level model (i.e., the standard twin model) to the estimate of the three-level model.
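The difference in parameter count can be illustrated with a toy factor. This is a minimal sketch; the helper `dummy_code` and the region labels are hypothetical and serve only to show that dummy coding yields L-1 columns, each of which needs its own regression coefficient.

```python
import numpy as np

def dummy_code(labels):
    """Dummy-code a nominal factor, using the first (sorted) level as reference."""
    levels = sorted(set(labels))
    # One indicator column per non-reference level: L - 1 columns in total
    columns = [[1.0 if lab == lev else 0.0 for lab in labels]
               for lev in levels[1:]]
    return np.array(columns).T, levels

# Hypothetical region labels for six individuals (L = 4 categories)
regions = ["north", "south", "east", "west", "north", "south"]
X, levels = dummy_code(regions)

n_fixed_params = X.shape[1]   # L - 1 regression coefficients
n_random_params = 1           # a single region variance component
```

With four categories the fixed effects approach already needs three coefficients, and this number grows with every additional category, whereas the random effects approach keeps a single variance parameter regardless of L.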
In this paper, we illustrate the use of multilevel twin models by investigating the regional clustering of children’s height with twin data from the Netherlands Twin Register (Boomsma et al. 1992; Ligthart et al. 2019). Height serves as an indicator of the general development of a country, and is known to decrease in times of scarcity and increase in times of prosperity (Baten and Blum 2014; Baten and Komlos 1998). Also, children’s height is an indicator of overall development, where height is associated with cognitive development and school achievement (Karp et al. 1992; Spears 2012). In 7-year-old children, resemblance between family members for height is explained by additive genetic (approximately 60%) and common environmental (approximately 20%) factors (Jelenkovic et al. 2016; Silventoinen et al. 2004, 2007).
In the Netherlands, the association between height and geographical region is well established (Abdellaoui et al. 2013), which makes region a clustering variable of interest. Inhabitants of different geographical regions may display genetic and environmental differences. Location is associated with genetic differences (e.g., Abdellaoui et al. 2019) and with differences in social and cultural traditions, diet, socio-economic status, and living circumstances such as rural versus urban environments (e.g., Colodro-Conde et al. 2018). By analyzing height and geographical region data in a three-level MLM, we can determine whether variance in children’s height is associated with geographical region, and estimate the proportion of the common environmental or between-family variance that can be explained by these regional effects.
In a subsample of 7-year-old participants, we investigated the extent to which regional clustering may be due to genetic ancestry by including the first three genetic principal components (PCs; Hotelling 1933). The genetic PCs are obtained through principal component analysis of the covariance matrix of the single nucleotide polymorphism (SNP) genotype data (Reich et al. 2008). In the Netherlands, the first genetic PC is associated with a north–south height gradient (Abdellaoui et al. 2013; Boomsma et al. 2014). This gradient is likely a result of social, geographical and historical divisions between the north and the south. Southern regions were conquered by the Roman empire, adopted Catholicism, and were geographically separated from the northern regions by five large rivers in the Netherlands (Schalekamp 2009). This first Dutch PC also shows a strong correlation with the European PC that differentiates northern from southern European populations (1000 Genomes PC4; Abdellaoui et al. 2013). The second PC is associated with the east–west division of the Netherlands. This PC may reflect differences between rural and urban environments, since the east of the Netherlands is characterized by less populous and rural areas, while the west includes the largest concentration of urban areas in the Netherlands. Alternatively, it could also be a result of geographical separation by the IJssel river or the Veluwe hillridge. The third PC is associated with the more central regions of the country (Abdellaoui et al. 2013). By adding the PCs to our models, we assessed the role of individuals' genetic ancestry in the differences between regions.
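The general idea of deriving genetic PCs from genotype data can be sketched in a few lines. The example below is an illustration under stated assumptions and not the pipeline used to compute the PCs in this study: the genotypes are simulated allele counts, each SNP is standardized, and the PCs are taken as the leading eigenvectors of the between-individual covariance matrix. In practice, quality control and pruning steps precede this computation, and dedicated software is normally used.

```python
import numpy as np

rng = np.random.default_rng(0)
n_individuals, n_snps = 50, 200

# Simulated allele counts (0, 1 or 2 copies); purely illustrative data
genotypes = rng.binomial(2, 0.3, size=(n_individuals, n_snps)).astype(float)

# Standardize each SNP and drop SNPs without variation (assumed preprocessing)
sd = genotypes.std(axis=0)
keep = sd > 0
z = (genotypes[:, keep] - genotypes[:, keep].mean(axis=0)) / sd[keep]

# Individuals x individuals covariance matrix of the standardized genotypes
cov = z @ z.T / z.shape[1]

# eigh returns eigenvalues in ascending order, so reorder to descending
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
top3_vals = eigvals[order[:3]]
top3_pcs = eigvecs[:, order[:3]]   # one score per individual on PC1 to PC3
```

The columns of `top3_pcs` could then be merged with the phenotype data and entered as individual-level covariates. Because the simulated individuals here come from a single homogeneous population, these PCs carry no ancestry signal; the block only shows the mechanics.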
In this paper, we first considered regional clustering of children’s height in a large data set of MZ and DZ twins (N = 7436). Second, we considered the model within a subgroup of children who were genotyped on genome-wide SNP arrays (N = 1375). Subsequently, we determined whether the region effects represent genetic ancestry. Finally, we analyzed the relationship between the three PCs and height in 7-year-old children, and included the genetic PCs that showed an association as individual-level (level 1) covariates in the model.