
Design and Analysis of Association Studies using Pooled DNA from Large Twin Samples

Evidence is mounting that multiple genes are involved in complex traits and that each accounts for a very small proportion of the overall phenotypic variance. Association studies of many markers in thousands of individuals will be required to identify such genes. A number of large twin cohorts have already been collected and provide a valuable resource for carrying out studies that are robust to the effects of population stratification. Technologies based on microarrays will soon allow 1,000,000 SNPs to be typed at one time; however, financial considerations prevent most researchers from using these approaches to genotype all individuals. Recently, microarrays have been shown to give accurate allele frequency measurements in pooled DNA samples and provide a simple way to select the best markers for individual genotyping. This drastically reduces the cost and workload of large-scale association studies. One limitation of this methodology is that the analytical procedures have only been developed to allow comparison of two pools, e.g. case and control pools. In this paper we use meta-regression to analyze pooled DNA data, allowing the allele frequency in each pool to be related to the average quantitative phenotypic measure of the individuals whose DNA was used to construct the pool. Alongside this we describe a technique that can be used to determine the power of such studies. We present results from some preliminary investigations of different pooling strategies that can be applied to large twin samples and demonstrate that the method retains a large proportion of the power available from individual genotyping.


Fig. 1.


  1. Antoniades L., et al. (2003). Association of birth weight with osteoporosis and osteoarthritis in adult twins. Rheumatology (Oxford) 42(6): 791–796

  2. Bader J. S., et al. (2001). Efficient SNP-based tests of association for quantitative phenotypes using pooled DNA. Genescreen 1: 143–150

  3. Bader J. S., Sham P. (2002). Family-based association tests for quantitative traits using pooled DNA. Eur. J. Hum. Genet. 10(12): 870–878

  4. Barratt B. J., et al. (2002). Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design. Ann. Hum. Genet. 66(Pt 5–6): 393–405

  5. Butcher L. M., et al. (2004). Genotyping pooled DNA on microarrays: a systematic genome screen of thousands of SNPs in large samples to detect QTLs for complex traits. Behav. Genet. 34(5): 549–555

  6. Butcher L. M., et al. (2005). SNPs, microarrays, and pooled DNA: identification of four loci associated with mild mental impairment in a sample of 6000 children. Hum. Mol. Genet. 14: 1315–1325

  7. Downes K., et al. (2004). SNP allele frequency estimation in DNA pools and variance components analysis. Biotechniques 36(5): 840–845

  8. Fulker D. W., et al. (1999). Combined linkage and association sib-pair analysis for quantitative traits. Am. J. Hum. Genet. 64(1): 259–267

  9. Hirschhorn J. N., Daly M. J. (2005). Genome-wide association studies for common genes and complex traits. Nat. Rev. Genet. 6: 95–108

  10. Jawaid A., et al. (2002). Optimal selection strategies for QTL mapping using pooled DNA samples. Eur. J. Hum. Genet. 10(2): 125–132

  11. Johnson A. C., Thomopoulos N. T. (2002). Characteristics and tables of the doubly-truncated normal distribution. Proceedings of POM High Tech. 18 pp

  12. Le Hellard S., et al. (2002). SNP genotyping on pooled DNAs: comparison of genotyping technologies and a semi automated method for data storage and analysis. Nucleic Acids Res. 30(15): e74

  13. Mardia K. V., et al. (1979). Multivariate analysis. London, Academic Press

  14. Norton N., et al. (2004). DNA pooling as a tool for large-scale association studies in complex traits. Ann. Med. 36(2): 146–152

  15. Posthuma D., et al. (2003). Theory and practice in quantitative genetics. Twin Res. 6(5): 361–376

  16. Purcell S., et al. (2003). Genetic Power Calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics 19(1): 149–150

  17. Risch N., Merikangas K. (1996). The future of genetic studies of complex human diseases. Science 273(5281): 1516–1517

  18. Risch N., Teng J. (1998). The relative power of family-based and case–control designs for linkage disequilibrium studies of complex human diseases I. DNA pooling. Genome Res. 8(12): 1273–1288

  19. Salonen J. T. (2004). Unraveling the genetics of complex disease using Affymetrix DNA analysis products. American Society of Human Genetics (lunchtime meeting)

  20. Setakis E. (2003). Statistical analysis of the GAMES studies. J. Neuroimmunol. 143(1–2): 47–52

  21. Sham P., et al. (2002). DNA Pooling: a tool for large-scale association studies. Nat. Rev. Genet. 3(11): 862–871

  22. Thompson S. G., Sharp S. J. (1999). Explaining heterogeneity in meta-analysis: a comparison of methods. Stat. Med. 18(20): 2693–2708

  23. Zou G., Zhao H. (2004). The impacts of errors in individual genotyping and DNA pooling on association studies. Genet. Epidemiol. 26(1): 1–10



We are grateful to Robert Plomin (also SGDP, Institute of Psychiatry) for his helpful comments on the manuscript and support of the work. Jo Knight is funded by an MRC advanced training fellowship.

Author information



Corresponding author

Correspondence to Jo Knight.


Appendix A: Derivation of \(\alpha\) and \(\beta\) and E(M)

The component \(Q\) is defined as \(Q=\alpha +\beta M\), where \(\alpha\) and \(\beta\) are constants chosen so that \(Q\) has mean 0 and variance 1. These constants can be derived from the mean (\(\mu\)) and variance (\(\sigma^{2}\)) of \(M\) as \(\alpha = -\mu /\surd \sigma^{2}\) and \(\beta = 1/\surd\sigma^{2}\). They can then be used to calculate \(E(M)\) from \(E(Q)\), since \(E(M)=(E(Q)-\alpha)/\beta\).

The expected mean of \(M\) is \(p^{2}+\tfrac{1}{2}\times 2p(1-p)\) (the frequency of homozygotes plus half the frequency of heterozygotes), which simplifies to \(p^{2} + p(1-p)=p(p+1-p)=p\).

The expected variance of \(M\) is the standard variance of a proportion, \(p(1-p)/2n\), where \(n=1\) because we consider only one person and the factor 2 is present because each person carries two alleles.

Therefore \(\alpha=-p/\surd[p(1-p)/2]\) and \(\beta=1/\surd [p(1-p)/2]\).

E(M) can be calculated from E(Q) as follows:

$$E(M) = p + \sqrt {{{p(1 - p)} \over {2}}} E(Q)$$
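The standardization in Appendix A can be checked numerically. The following is a minimal Python sketch, not the authors' code: the function names are hypothetical, and the three genotype probabilities assume Hardy-Weinberg proportions, which is what the expression for the mean of \(M\) implies.

```python
from math import sqrt

def standardizing_constants(p):
    """Return (alpha, beta) such that Q = alpha + beta * M has mean 0 and variance 1."""
    sd = sqrt(p * (1 - p) / 2)  # SD of M for one person carrying two alleles
    return -p / sd, 1 / sd

def expected_m(p, expected_q):
    """Back-transform an expected Q into an expected allele frequency E(M)."""
    return p + sqrt(p * (1 - p) / 2) * expected_q

def q_moments(p):
    """Mean and variance of Q, enumerating the three genotypes (assumed in HWE)."""
    alpha, beta = standardizing_constants(p)
    # (value of M, probability): no copies, one copy, two copies of the allele
    genotypes = [(0.0, (1 - p) ** 2), (0.5, 2 * p * (1 - p)), (1.0, p ** 2)]
    mean = sum(prob * (alpha + beta * m) for m, prob in genotypes)
    var = sum(prob * (alpha + beta * m - mean) ** 2 for m, prob in genotypes)
    return mean, var

if __name__ == "__main__":
    print(q_moments(0.3))         # mean close to 0, variance close to 1
    print(expected_m(0.5, 0.1))   # allele frequency implied by E(Q) = 0.1
```

Enumerating the genotypes rather than simulating keeps the check exact up to floating-point rounding.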

Appendix B: Calculation of E(Q) for MZ twins, DZ twins and DZ sums and differences

\(E(Q\vert X)\) is approximately \(qX\). Because we have the phenotypes of both twins, we calculate \(E(Q)\) using the multivariate results for conditional expectations (Mardia et al., 1979), which incorporate the correlation between the phenotypes of the twins. As \(Q\) is identical for MZ twins, we only need to calculate one value.

$$\eqalign{E(Q\vert X_{1},X_{2}) &= \left[ \begin{array}{ll} q & q \end{array} \right] \left[ \begin{array}{ll} 1 & r_{M}\\ r_{M} & 1 \end{array} \right]^{-1} \left[ \begin{array}{l} X_{1}\\ X_{2} \end{array} \right]\cr
&= {{1} \over {1 - r_{M}^{2}}}\left[ \begin{array}{ll} q & q \end{array} \right] \left[ \begin{array}{ll} 1 & -r_{M}\\ -r_{M} & 1 \end{array} \right]\left[ \begin{array}{l} X_{1}\\ X_{2} \end{array} \right]\cr
&= {{1} \over {1 - r_{M}^{2}}}\left[ \begin{array}{ll} q - qr_{M} & -qr_{M} + q \end{array} \right] \left[ \begin{array}{l} X_{1}\\ X_{2} \end{array} \right]\cr
&= {{q} \over {1 - r_{M}^{2}}}\left[ \begin{array}{ll} 1 - r_{M} & 1 - r_{M} \end{array} \right] \left[ \begin{array}{l} X_{1}\\ X_{2} \end{array} \right]\cr
&= {{q(1 - r_{M})} \over {(1 + r_{M})(1 - r_{M})}}\left[ \begin{array}{ll} 1 & 1 \end{array} \right]\left[ \begin{array}{l} X_{1}\\ X_{2} \end{array} \right]\cr
&= {{q(X_{1} + X_{2})} \over {1 + r_{M}}}}$$

When calculating both values of \(Q\) for a pair of DZ twins we use the phenotypes of both twins but put most weight on the phenotype of the twin whose \(Q\) is being calculated.

$$\eqalign{E(Q_{1},Q_{2} \vert X_{1},X_{2}) &= \left[ \begin{array}{ll} q & q/2\\ q/2 & q \end{array} \right]\left[ \begin{array}{ll} 1 & r_{D}\\ r_{D} & 1 \end{array} \right]^{-1} \left[ \begin{array}{l} X_{1}\\ X_{2} \end{array} \right]\cr
&= {{1} \over {1 - r_{D}^{2}}}\left[ \begin{array}{ll} q & q/2\\ q/2 & q \end{array} \right]\left[ \begin{array}{ll} 1 & -r_{D}\\ -r_{D} & 1 \end{array} \right]\left[ \begin{array}{l} X_{1}\\ X_{2} \end{array} \right]\cr
&= {{1} \over {1 - r_{D}^{2}}}\left[ \begin{array}{ll} q - {{r_{D}q} \over {2}} & {{q} \over {2}} - r_{D}q\\ {{q} \over {2}} - r_{D}q & q - {{r_{D}q} \over {2}} \end{array} \right]\left[ \begin{array}{l} X_{1}\\ X_{2} \end{array} \right]\cr
&= {{q} \over {2(1 - r_{D}^{2})}}\left[ \begin{array}{ll} 2 - r_{D} & 1 - 2r_{D}\\ 1 - 2r_{D} & 2 - r_{D} \end{array} \right]\left[ \begin{array}{l} X_{1}\\ X_{2} \end{array} \right]\cr
&= {{q} \over {2(1 - r_{D}^{2})}}\left[ \begin{array}{l} (2 - r_{D})X_{1} + (1 - 2r_{D})X_{2}\\ (1 - 2r_{D})X_{1} + (2 - r_{D})X_{2} \end{array} \right]}$$

Once \(Q_{1}\) and \(Q_{2}\) have been calculated, their average can be calculated.

$$\eqalign{E\left({{Q_{1} + Q_{2}} \over {2}}\vert X_{1},X_{2} \right) &= {{q} \over {4(1 - r_{D}^{2})}} \left[ ((2 - r_{D})X_{1} + (1 - 2r_{D})X_{2}) + ((1 - 2r_{D})X_{1} + (2 - r_{D})X_{2}) \right]\cr
&= {{q} \over {4(1 - r_{D}^{2})}} \left[ ((2 - r_{D}) + (1 - 2r_{D}))X_{1} + ((1 - 2r_{D}) + (2 - r_{D}))X_{2} \right]\cr
&= {{q} \over {4(1 - r_{D}^{2})}} \times 3\left[ (1 - r_{D})X_{1} + (1 - r_{D})X_{2} \right]\cr
&= {{3q\left[ (1 - r_{D})X_{1} + (1 - r_{D})X_{2} \right]} \over {4(1 - r_{D})(1 + r_{D})}} = {{3q(X_{1} + X_{2})} \over {4(1 + r_{D})}}}$$

The difference between \(Q_{1}\) and \(Q_{2}\) can also be calculated:

$$\eqalign{E(Q_{1} - Q_{2} \vert X_{1},X_{2}) &= {{q} \over {2(1 - r_{D}^{2})}} \left[ ((2 - r_{D})X_{1} + (1 - 2r_{D})X_{2}) - ((1 - 2r_{D})X_{1} + (2 - r_{D})X_{2}) \right]\cr
&= {{q} \over {2(1 - r_{D}^{2})}} \left[ ((2 - r_{D}) - (1 - 2r_{D}))X_{1} + ((1 - 2r_{D}) - (2 - r_{D}))X_{2} \right]\cr
&= {{q} \over {2(1 - r_{D}^{2})}} \left[ (1 + r_{D})X_{1} - (1 + r_{D})X_{2} \right]\cr
&= {{q\left[ (1 + r_{D})X_{1} - (1 + r_{D})X_{2} \right]} \over {2(1 - r_{D})(1 + r_{D})}} = {{q(X_{1} - X_{2})} \over {2(1 - r_{D})}}}$$
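The closed forms in Appendix B can be checked against the matrix expressions they are derived from. The sketch below is illustrative only: the function names are hypothetical and the 2 × 2 inverse is written out by hand so that no external library is needed. It confirms that the MZ expectation is \(q(X_{1}+X_{2})/(1+r_{M})\), that the DZ average is \(3q(X_{1}+X_{2})/(4(1+r_{D}))\), and that the DZ difference works out to \(q(X_{1}-X_{2})/(2(1-r_{D}))\).

```python
def solve_correlation(r, x1, x2):
    """Return R^{-1} x for R = [[1, r], [r, 1]] and x = (x1, x2)."""
    det = 1 - r ** 2
    return (x1 - r * x2) / det, (x2 - r * x1) / det

def expected_q_mz(q, r_m, x1, x2):
    """E(Q | X1, X2) for MZ twins: [q, q] R^{-1} x."""
    z1, z2 = solve_correlation(r_m, x1, x2)
    return q * z1 + q * z2

def expected_q_dz(q, r_d, x1, x2):
    """E(Q1, Q2 | X1, X2) for DZ twins: [[q, q/2], [q/2, q]] R^{-1} x."""
    z1, z2 = solve_correlation(r_d, x1, x2)
    return q * z1 + (q / 2) * z2, (q / 2) * z1 + q * z2

if __name__ == "__main__":
    q, r_m, r_d, x1, x2 = 0.1, 0.5, 0.25, 1.2, -0.4  # illustrative values
    q1, q2 = expected_q_dz(q, r_d, x1, x2)
    print(expected_q_mz(q, r_m, x1, x2), q * (x1 + x2) / (1 + r_m))
    print((q1 + q2) / 2, 3 * q * (x1 + x2) / (4 * (1 + r_d)))
    print(q1 - q2, q * (x1 - x2) / (2 * (1 - r_d)))
```

Each printed pair should agree up to rounding, whatever values are chosen for the phenotypes.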

Appendix C: Derivation of equations in Table I

To derive the standard deviations of the selection variables we use the law relating to the variance of a sum, i.e. the variance of a sum is the sum of the variances of its components plus twice their covariance. The phenotypes are standardized, so the variances are equal to one and the covariance is equal to the correlation, which is known. For MZ twins (where the selection variable is the sum of the two phenotypes) the standard deviation can therefore be shown to be the square root of \(2(1+{r}_{M})\). When both DZ twins are selected on the basis of the summed variable, the standard deviation can be shown to be the square root of \(2(1+{r}_{D})\). When only one DZ twin is used to investigate the between effect we need to calculate the variance of the weighted phenotype. As the weighted phenotype is not standardized, its variance has to be calculated from the weights and the known covariance.

$$\eqalign{\hbox{Var}\left[ (2 - r_{D})X_{1} + (1 - 2r_{D})X_{2} \right] &= (2 - r_{D})^{2}\,\hbox{Var}(X_{1}) + 2(1 - 2r_{D})(2 - r_{D})\,\hbox{Cov}(X_{1},X_{2}) + (1 - 2r_{D})^{2}\,\hbox{Var}(X_{2})\cr
&= (2 - r_{D})^{2} + 2r_{D}(1 - 2r_{D})(2 - r_{D}) + (1 - 2r_{D})^{2}\cr
&= (1 - r_{D}^{2})(5 - 4r_{D})}$$

where \(\hbox{Var}(X_{1}) = \hbox{Var}(X_{2}) = 1\) and \(\hbox{Cov}(X_{1},X_{2}) = r_{D}\). By symmetry the same variance applies to the second weighted phenotype, \((1 - 2r_{D})X_{1} + (2 - r_{D})X_{2}\).

The method used to calculate the variance of the sum can also be applied to calculate the standard deviation of the difference between the phenotypes of the DZ twins, which is the square root of \(2(1-{r}_{D})\).

These standard deviations are used to standardize the selection variables. The expected allele frequencies are then calculated by inserting the standardized selection variable into the appropriate equation for the calculation of expected \(Q\) (Eqs. (2)–(5)). As the selection variables have been standardized, the rest of each equation (Eqs. (2)–(5)) has to be multiplied by the standard deviation so that the expected allele frequency remains unchanged. The final equations used to calculate the expected allele frequencies are given in the last column of Table I.
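As a quick numerical check of the standard deviations in Appendix C, the sketch below evaluates the variance of the weighted phenotype term by term and compares it with the factorized form \((1-r_{D}^{2})(5-4r_{D})\). The function names are hypothetical, and the phenotypes are assumed standardized as stated above.

```python
from math import sqrt

def sd_of_sum(r):
    """SD of X1 + X2 for standardized phenotypes with correlation r."""
    return sqrt(2 * (1 + r))

def sd_of_difference(r):
    """SD of X1 - X2 for standardized phenotypes with correlation r."""
    return sqrt(2 * (1 - r))

def variance_of_weighted_phenotype(r_d):
    """Var[(2 - r_d) X1 + (1 - 2 r_d) X2], expanded term by term."""
    a, b = 2 - r_d, 1 - 2 * r_d
    return a ** 2 + 2 * r_d * a * b + b ** 2

if __name__ == "__main__":
    for r_d in (0.0, 0.25, 0.5):
        expanded = variance_of_weighted_phenotype(r_d)
        factorized = (1 - r_d ** 2) * (5 - 4 * r_d)
        print(r_d, expanded, factorized, sd_of_sum(r_d), sd_of_difference(r_d))
```

For \(r_{D}=0.25\) both forms of the variance equal 3.75.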

Appendix D: Simulated allele frequencies for DZ twins

  1. Simulations were performed using the same population parameters that were used to derive allele frequencies theoretically (\(p=0.5\), \(r_{D}=0.25\), \(r_{M}=0.5\), \(q^{2}=0.01\)).

  2. For twin 1, two uniform (0,1) random numbers were generated (variables \(C_{1}\) and \(C_{2}\)). These were converted to either 1 (if the random number was less than \(p\), the allele frequency) or 0 (if the random number was greater than or equal to \(p\)). These numbers represent the identity of the two alleles in twin 1.

  3. A variable indicating the total allele frequency within the individual (\(Q_{1}\)) was generated by adding \(C_{1}\) and \(C_{2}\) together and dividing by 2.

  4. The first stage in the generation of the allele frequency of the co-twin (\(Q_{2}\)) was the generation of random numbers for variables \(C_{3}\) and \(C_{4}\) in the same fashion as for variables \(C_{1}\) and \(C_{2}\).

  5. Another uniform (0,1) random variable (\(U_{\rm IBD}\)) was used to indicate whether the twins shared 2, 1 or 0 alleles, according to whether its value was \(\le 0.25\), \(>0.25\) and \(\le 0.75\), or \(>0.75\).

  6. Where \(U_{\rm IBD}\) was less than or equal to 0.25, the genotype of the co-twin was set equal to the genotype of the first twin (\(Q_{2}=Q_{1}\)). Where \(U_{\rm IBD}\) was greater than 0.25 but less than or equal to 0.75, the co-twin's allele frequency was determined by the variables \(C_{1}\) and \(C_{3}\), and \(Q_{2}\) was calculated as the average of these two variables. Where \(U_{\rm IBD}\) was greater than 0.75, the co-twin's alleles were considered to be represented by \(C_{3}\) and \(C_{4}\), and \(Q_{2}\) was calculated accordingly.

  7. Both \(Q_{1}\) and \(Q_{2}\) were standardized to have a mean of 0 and a variance of 1 by subtracting \(p\) and then dividing by \(\surd (p(1-p)/2)\).

  8. The inverse normal transformations of two independent uniform (0,1) random numbers were generated. These standard normal variates were used to represent the environmental influences on the trait of the two twins, designated \(E_{1}\) and \(E_{2}\).

  9. Similarly, the inverse normal transformations of two other independent uniform (0,1) random numbers were generated. The first normal variate was used to represent the residual polygenic background of twin 1 (\(G_{1}\)), and a linear combination of the two normal variates was used to represent the polygenic background of twin 2 (\(G_{2}\)), where the weights were 0.5 and \(\surd 0.75\), so that \(G_{1}\) and \(G_{2}\) have variance 1 and covariance 0.5.

  10. The trait score of each individual was generated using the biometric model \(X=qQ + gG + eE\).

  11. After generating the required number of twin pairs by repeating steps 2–10, samples of twins were taken according to the desired sampling schemes, and the allele frequencies in these samples were calculated.

A similar process was used to calculate allele frequencies for MZ twins with adjustments made to account for the fact that they share all their DNA.
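The DZ simulation steps above can be sketched in Python as follows. This is an illustrative reconstruction, not the authors' program. Two points are assumptions: the values \(g=0.7\) and \(e=\surd 0.5\) are not stated in the appendix and are chosen here so that \(q^{2}+g^{2}+e^{2}=1\), \(r_{M}=q^{2}+g^{2}=0.5\) and \(r_{D}=0.25\); and `random.gauss` is used in place of an inverse normal transformation of a uniform draw, which gives the same distribution. The pooling scheme in `pooled_frequencies` (both twins pooled, selection on the phenotype sum) is one example scheme, and all names are hypothetical.

```python
import random
from math import sqrt

def simulate_dz_pair(rng, p=0.5, q=0.1, g=0.7, e=sqrt(0.5)):
    """Return (M1, M2, X1, X2): raw allele frequencies and trait scores of one DZ pair."""
    def draw_allele():
        # 1 if the uniform draw is below the allele frequency p, else 0
        return 1 if rng.random() < p else 0

    c1, c2, c3, c4 = (draw_allele() for _ in range(4))
    m1 = (c1 + c2) / 2                    # allele frequency within twin 1
    u_ibd = rng.random()
    if u_ibd <= 0.25:                     # both alleles shared
        m2 = m1
    elif u_ibd <= 0.75:                   # one allele shared
        m2 = (c1 + c3) / 2
    else:                                 # no alleles shared
        m2 = (c3 + c4) / 2
    sd = sqrt(p * (1 - p) / 2)
    q1, q2 = (m1 - p) / sd, (m2 - p) / sd  # standardized to mean 0, variance 1
    e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)   # environmental influences
    g1, z = rng.gauss(0, 1), rng.gauss(0, 1)
    g2 = 0.5 * g1 + sqrt(0.75) * z        # polygenic covariance of 0.5 with g1
    x1 = q * q1 + g * g1 + e * e1         # biometric model X = qQ + gG + eE
    x2 = q * q2 + g * g2 + e * e2
    return m1, m2, x1, x2

def pooled_frequencies(n_pairs=20000, fraction=0.1, seed=1):
    """Pool both twins from the lowest and highest fractions of the phenotype sum."""
    rng = random.Random(seed)
    pairs = [simulate_dz_pair(rng) for _ in range(n_pairs)]
    pairs.sort(key=lambda pair: pair[2] + pair[3])
    k = int(n_pairs * fraction)

    def frequency(chunk):
        return sum(m1 + m2 for m1, m2, _, _ in chunk) / (2 * len(chunk))

    return frequency(pairs[:k]), frequency(pairs[-k:]), pairs

if __name__ == "__main__":
    low, high, _ = pooled_frequencies()
    print(low, high)  # the high pool should carry more copies of the increaser allele
```

A seeded `random.Random` instance keeps the run reproducible without fixing the global generator.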

Appendix E: Standardized selection variables when only one twin was used in the construction of the pools

When only one DZ twin is used to construct the pools, twins with the most extreme weighted sum of the phenotype are selected. This is a correlated bivariate distribution for which we need to calculate both the standardized scores (step 2) that represent given proportions of the distribution and the expected score of the most extreme twin from any given section (steps 3–9). For both of these calculations we require the correlation.

  1. The correlation between the weighted phenotypes can be determined if the correlation between the DZ phenotypes is known. In our simulations a DZ correlation of 0.25 was used, resulting in a correlation between the weighted phenotypes of 0.6875.

  2. STATA was used to generate a look-up table giving the standardized score that corresponds to any proportion of the bivariate distribution with a correlation of 0.6875. The calculation involves subtracting the joint cumulative probability of the bivariate distribution from twice the probability of the normal distribution for any given score (Eq. (10)).

    $$P(\hbox{Max}(X_{1},X_{2})>t)=P(X_{1}>t \hbox{ or } X_{2}>t) = P(X_{1}>t)+P(X_{2}>t) - P(X_{1}>t \hbox{ and } X_{2}>t)$$
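The two quantities in steps 1 and 2 can be reproduced with the Python standard library. This is a sketch under stated assumptions: the weights \((2-r_{D})\) and \((1-2r_{D})\) are taken from Appendix B, the joint tail probability is obtained by simple midpoint integration rather than by STATA's routine, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def weighted_correlation(r_d):
    """Correlation between (2-r)X1 + (1-2r)X2 and (1-2r)X1 + (2-r)X2."""
    a, b = 2 - r_d, 1 - 2 * r_d
    cov = 2 * a * b + r_d * (a ** 2 + b ** 2)
    var = a ** 2 + b ** 2 + 2 * r_d * a * b
    return cov / var

def prob_both_above(t, r, step=0.001, upper=8.0):
    """P(X1 > t and X2 > t) for a standard bivariate normal with correlation r.

    Integrates P(X1 > t | X2 = x) * pdf(x) over x > t with the midpoint rule.
    """
    s = sqrt(1 - r ** 2)          # conditional SD of X1 given X2
    total, x = 0.0, t + step / 2
    while x < upper:
        total += STD.pdf(x) * (1 - STD.cdf((t - r * x) / s)) * step
        x += step
    return total

def prob_max_above(t, r):
    """P(max(X1, X2) > t) = 2 P(X > t) - P(X1 > t and X2 > t)."""
    return 2 * (1 - STD.cdf(t)) - prob_both_above(t, r)

if __name__ == "__main__":
    r = weighted_correlation(0.25)
    print(r)                       # 0.6875
    print(prob_max_above(1.0, r))  # proportion of pairs whose extreme twin exceeds 1 SD
```

A look-up table like the one described could be built by evaluating `prob_max_above` over a grid of scores and inverting it by interpolation or bisection.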

Expected Score for Extreme Twin

There is no straightforward way to calculate the mid-interval score of the selected twin (\(X_{1}\)), so instead we calculated the expected weighted sum of the phenotype of the most extreme twin given values for the other twin (\(X_{2}\)) that ranged from 5 standard deviations below to 5 standard deviations above the mean at intervals of 0.1 standard deviations. The mid-interval expected score across the sections of the distribution used for the construction of pools was then determined as a sum weighted by the probability of \(X_{1}\) multiplied by the probability of \(X_{2}\). Equation E.2 shows the expected score of the most extreme twin, given that it lies between the pool selection thresholds and above the score of the other twin.

$$E(X_{1}\vert X_{1}>X_{2}, t_{1}<X_{1}<t_{2})P(X_{1}>X_{2}) + E(X_{2}\vert X_{2}>X_{1}, t_{1}<X_{2}<t_{2})P(X_{2}>X_{1})$$

By symmetry \(E(X_{1}\vert X_{1}>X_{2}, t_{1}<X_{1}<t_{2}) = E(X_{2}\vert X_{2}>X_{1}, t_{1}<X_{2}<t_{2})\), and \(P(X_{1}>X_{2})=P(X_{2}>X_{1})=0.5\). Therefore the expectation is simply \(E(X_{1}\vert X_{1}>X_{2}, t_{1}<X_{1}<t_{2})\).

  3. To calculate the expected score we first made a list of scores for \(X_{2}\) at intervals of 0.1 between −5 and 5 standard deviations from the mean.

  4. The differences in cumulative probability and in density between the scores 0.05 above and 0.05 below each value of \(X_{2}\) were determined, and these two differences were used to determine the mid-interval score for \(X_{2}\) using Eq. (7).

  5. \(X_{1}\) does not follow a standard normal distribution, but for each value of \(X_{2}\) the expected mean of \(X_{1}\) is \(rX_{2}\) (where \(r\) is the correlation between \(X_{1}\) and \(X_{2}\)) and the expected standard deviation is \(s=\surd(1-r^{2})\).

  6. These values can be used to work out the standard normal variates for the thresholds given the distribution of \(X_{1}\). The lower threshold is \((t_{1}-rX_{2})/s\) when \(X_{2}<t_{1}\) (in which case \(X_{1}>t_{1}\) guarantees that \(X_{1}>X_{2}\)) and \((X_{2}-rX_{2})/s\) when \(X_{2}>t_{1}\). The upper threshold is \((t_{2}-rX_{2})/s\).

  7. The expected value of \(X_{1}\) is then the mid-interval value between these two scores.

  8. The conditional value of \(X_{1}\) is then calculated by transforming the mid-interval value back to the distribution of \(X_{1}\) conditional on \(X_{2}\), i.e. by multiplying it by the conditional standard deviation and adding the conditional mean.

  9. The average score for the larger sections of the distribution from which individuals were selected for pool construction is then derived as a sum of the conditional scores of \(X_{1}\), weighted by the probability of \(X_{2}\) and the conditional probability of \(X_{1}\).
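Steps 3–9 can be sketched as a short grid calculation. This is an illustration under assumptions, not the authors' implementation: Eq. (7) is not reproduced in this appendix, so the mid-interval score is taken here to be the mean of a standard normal truncated to the interval (difference in density divided by difference in cumulative probability), and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def truncated_mean(a, b):
    """Return (mean, probability) of a standard normal restricted to (a, b).

    The mean is None when the interval is empty or has zero probability.
    """
    prob = STD.cdf(b) - STD.cdf(a)
    if b <= a or prob <= 0.0:
        return None, 0.0
    return (STD.pdf(a) - STD.pdf(b)) / prob, prob

def expected_extreme_score(t1, t2, r, half_width=0.05):
    """Approximate E(X1 | X1 > X2, t1 < X1 < t2) on a grid of X2 values."""
    s = sqrt(1 - r ** 2)                       # conditional SD of X1 given X2
    numerator = denominator = 0.0
    for i in range(-50, 51):                   # X2 from -5 to 5 in steps of 0.1
        centre = i / 10
        x2, w2 = truncated_mean(centre - half_width, centre + half_width)
        if x2 is None:
            continue
        cond_mean = r * x2
        lower = (max(t1, x2) - cond_mean) / s  # X1 must exceed both t1 and X2
        upper = (t2 - cond_mean) / s
        z, w1 = truncated_mean(lower, upper)
        if z is None:
            continue
        x1 = z * s + cond_mean                 # back to the conditional scale
        numerator += w2 * w1 * x1
        denominator += w2 * w1
    return numerator / denominator

if __name__ == "__main__":
    # Expected score of the more extreme twin in a section between 1 and 2 SD
    print(expected_extreme_score(1.0, 2.0, 0.6875))
```

With very wide thresholds the function approximates the expected maximum of the two scores, which for a standard bivariate normal is \(\surd((1-r)/\pi)\); this gives a convenient sanity check on the grid.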


Knight, J., Sham, P. Design and Analysis of Association Studies using Pooled DNA from Large Twin Samples. Behav Genet 36, 665–677 (2006).



  • Association
  • micro-arrays
  • pooling
  • power
  • statistical methodology
  • twins