Evidence is mounting that multiple genes are involved in complex traits and that each accounts for a very small proportion of the overall phenotypic variance. Association studies of many markers in thousands of individuals will be required to identify such genes. A number of large twin cohorts have already been collected and provide a valuable resource for carrying out studies that are robust to the effects of population stratification. Technologies based on microarrays will soon allow 1,000,000 SNPs to be typed at one time; however, financial considerations prevent most researchers from using these approaches to genotype all individuals. Recently, microarrays have been shown to give accurate allele frequency measurements in pooled DNA samples and so provide a simple way to select the best markers for individual genotyping. This drastically reduces the cost and workload of large-scale association studies. One limitation of this methodology is that the analytical procedures have only been developed to compare two pools, e.g. case/control pools. In this paper we use meta-regression to analyze pooled DNA data, allowing the allele frequency in each pool to be related to the average quantitative phenotypic measure of the individuals whose DNA was used to construct the pools. Alongside this we describe a technique that can be used to determine the power of such studies. We present results from some preliminary investigations of different pooling strategies that can be applied to large twin samples and demonstrate that the method retains a large proportion of the power available from individual genotyping.
References
Antoniades L., et al. (2003). Association of birth weight with osteoporosis and osteoarthritis in adult twins. Rheumatology (Oxford) 42(6): 791–796
Bader J. S., et al. (2001). Efficient SNP-based tests of association for quantitative phenotypes using pooled DNA. Genescreen 1: 143–150
Bader J. S., Sham P. (2002). Family-based association tests for quantitative traits using pooled DNA. Eur. J. Hum. Genet. 10(12): 870–878
Barratt B. J., et al. (2002). Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design. Ann. Hum. Genet. 66(Pt 5–6): 393–405
Butcher L. M., et al. (2004). Genotyping pooled DNA on microarrays: a systematic genome screen of thousands of SNPs in large samples to detect QTLs for complex traits. Behav. Genet. 34(5): 549–555
Butcher L. M., et al. (2005). SNPs, microarrays, and pooled DNA: identification of four loci associated with mild mental impairment in a sample of 6000 children. Hum. Mol. Genet. 14: 1315–1325
Downes K., et al. (2004). SNP allele frequency estimation in DNA pools and variance components analysis. Biotechniques 36(5): 840–845
Fulker D. W., et al. (1999). Combined linkage and association sibpair analysis for quantitative traits. Am. J. Hum. Genet. 64(1): 259–267
Hirschhorn J. N., Daly M. J. (2005). Genome-wide association studies for common genes and complex traits. Nat. Rev. Genet. 6: 95–108
Jawaid A., et al. (2002). Optimal selection strategies for QTL mapping using pooled DNA samples. Eur. J. Hum. Genet. 10(2): 125–132
Johnson A. C., Thomopoulos N. T. (2002). Characteristics and tables of the doubly-truncated normal distribution. Proceedings of POM High Tech. 18 pp
Le Hellard S., et al. (2002). SNP genotyping on pooled DNAs: comparison of genotyping technologies and a semi automated method for data storage and analysis. Nucleic Acids Res. 30(15): e74
Mardia K. V., et al. (1979). Multivariate analysis. London, Academic Press
Norton N., et al. (2004). DNA pooling as a tool for large-scale association studies in complex traits. Ann. Med. 36(2): 146–152
Posthuma D., et al. (2003). Theory and practice in quantitative genetics. Twin Res. 6(5): 361–376
Purcell S., et al. (2003). Genetic Power Calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics 19(1): 149–150
Risch N., Merikangas K. (1996). The future of genetic studies of complex human diseases. Science 273(5281): 1516–1517
Risch N., Teng J. (1998). The relative power of family-based and case–control designs for linkage disequilibrium studies of complex human diseases I. DNA pooling. Genome Res. 8(12): 1273–1288
Salonen J. T. (2004). Unraveling the genetics of complex disease using Affymetrix DNA analysis products. American Society of Human Genetics (lunchtime meeting)
Setakis E. (2003). Statistical analysis of the GAMES studies. J. Neuroimmunol. 143(1–2): 47–52
Sham P., et al. (2002). DNA Pooling: a tool for large-scale association studies. Nat. Rev. Genet. 3(11): 862–871
Thompson S. G., Sharp S. J. (1999). Explaining heterogeneity in meta-analysis: a comparison of methods. Stat. Med. 18(20): 2693–2708
Zou G., Zhao H. (2004). The impacts of errors in individual genotyping and DNA pooling on association studies. Genet. Epidemiol. 26(1): 1–10
Acknowledgments
We are grateful to Robert Plomin (also SGDP, Institute of Psychiatry) for his helpful comments on the manuscript and support of the work. Jo Knight is funded by an MRC advanced training fellowship.
Appendices
Appendix A: Derivation of \(\alpha\) and \(\beta\) and E(M)
The component Q is defined as \(Q=\alpha+\beta M\), where \(\alpha\) and \(\beta\) are constants chosen so that Q has mean 0 and variance 1. These constants can be derived from the mean (\(\mu\)) and variance (\(\sigma^{2}\)) of M as \(\alpha=-\mu/\sqrt{\sigma^{2}}\) and \(\beta=1/\sqrt{\sigma^{2}}\). They can then be used to calculate E(M), since \(E(M)=(E(Q)-\alpha)/\beta\).
The expected mean of M is \(p^{2}+\tfrac{1}{2}\times 2pq\) (the frequency of homozygotes plus half the frequency of heterozygotes), where q here denotes \(1-p\), so that \(E(M)=p^{2}+p(1-p)=p(p+1-p)=p\).
The expected variance of M is the standard variance of a proportion, \(p(1-p)/2n\),
where n=1 as we consider only one person and the 2 is present because each person carries two alleles.
Therefore \(\alpha=-p/\sqrt{p(1-p)/2}\) and \(\beta=1/\sqrt{p(1-p)/2}\).
E(M) can be calculated from E(Q) as follows:
$$E(M)=\frac{E(Q)-\alpha}{\beta}=p+E(Q)\sqrt{p(1-p)/2}$$
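The standardization in this appendix can be checked numerically. The following is a minimal Python sketch, not the authors' code: the function names are hypothetical, and it takes \(\alpha=-\mu/\sigma\) and \(\beta=1/\sigma\), which is what the requirement of zero mean and unit variance for Q implies.

```python
import math

def standardization_constants(p, n=1):
    """Return (alpha, beta) so that Q = alpha + beta * M has mean 0 and variance 1.

    M is the allele frequency within n individuals: mean p, variance p(1-p)/(2n).
    """
    sd_m = math.sqrt(p * (1 - p) / (2 * n))
    alpha = -p / sd_m      # shifts the mean of M to zero
    beta = 1 / sd_m        # rescales the variance of M to one
    return alpha, beta

def expected_m(expected_q, p, n=1):
    """Convert an expected standardized score E(Q) back to an allele frequency E(M)."""
    alpha, beta = standardization_constants(p, n)
    return (expected_q - alpha) / beta

alpha, beta = standardization_constants(0.5)
print(alpha, beta)             # constants for p = 0.5
print(expected_m(0.0, 0.5))    # E(Q) = 0 maps back to the population frequency p
```

An expected standardized score of zero maps back to p, and positive values of E(Q) raise the expected allele frequency in proportion to \(\sqrt{p(1-p)/2}\).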
Appendix B: Calculation of E(Q) for MZ twins, DZ twins and DZ sums and differences
\(E(Q\mid X)\) is approximately \(qX\). As we have the phenotypes of both twins, we calculate E(Q) using multivariate laws (Mardia et al., 1979), which incorporate the correlation between the phenotypes of the twins. As Q is equal for the two members of an MZ pair, we need to calculate only one value.
When calculating the Q values for a pair of DZ twins we use the phenotypes of both twins but put most weight on the twin whose Q is being calculated.
Once \(Q_{1}\) and \(Q_{2}\) have been calculated the average of the phenotypes can be calculated.
The phenotype differences can also be calculated.
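As an illustration of how such conditional expectations can be obtained, the Python sketch below applies the standard regression result for jointly normal variables. It is not a reproduction of the paper's equations. It assumes standardized phenotypes with twin correlation r, \(\mathrm{Cov}(Q_{1},X_{1})=q\), and \(\mathrm{Cov}(Q_{1},X_{2})=q\) times the correlation between the twins' Q values (1 for MZ pairs, and 0.5 assumed here for DZ pairs). All function and parameter names are hypothetical.

```python
def solve_2x2(a, b, c, d, y1, y2):
    """Solve the linear system [[a, b], [c, d]] w = [y1, y2] for w = (w1, w2)."""
    det = a * d - b * c
    return (d * y1 - b * y2) / det, (a * y2 - c * y1) / det

def expected_q_weights(q, r, q_corr):
    """Weights (w1, w2) such that E(Q1 | X1, X2) = w1 * X1 + w2 * X2.

    q      : regression of phenotype on Q (the QTL effect)
    r      : phenotypic correlation between the twins
    q_corr : assumed correlation between the twins' Q values
    """
    # Phenotype covariance matrix is [[1, r], [r, 1]]; Cov(Q1, (X1, X2)) = (q, q * q_corr)
    return solve_2x2(1.0, r, r, 1.0, q, q * q_corr)

def expected_q(x1, x2, q, r, q_corr):
    """Expected Q of twin 1 given both phenotypes."""
    w1, w2 = expected_q_weights(q, r, q_corr)
    return w1 * x1 + w2 * x2

# MZ pair: both twins get equal weight, q / (1 + r)
print(expected_q_weights(0.1, 0.5, 1.0))
# DZ pair: most weight falls on the twin whose Q is being calculated
print(expected_q_weights(0.1, 0.25, 0.5))
```

Under these assumptions the expected Q values for twin 1 and twin 2 of a DZ pair are obtained by swapping the two phenotypes in `expected_q`, and their sum and difference follow directly.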
Appendix C: Derivation of equations in Table I
To derive the standard deviations of the selection variables we use the law for the variance of a sum: the variance of a sum of two variables is the sum of their variances plus twice their covariance. The phenotypes are standardized, so the variances are equal to one and the covariance is equal to the correlation, which is known. For MZ twins (where the selection variable is the sum of the two phenotypes) the standard deviation can therefore be shown to be the square root of \(2(1+r_{M})\). When both DZ twins are selected on the basis of the summed variable, the standard deviation can be shown to be the square root of \(2(1+r_{D})\). When only one DZ twin is used to investigate the between effect we need to calculate the variance of the weighted phenotype. As the weighted phenotype is not standardized and the covariance is not known, the variance requires calculation.
The same method can be applied to the difference between the phenotypes of the DZ twins, whose standard deviation is the square root of \(2(1-r_{D})\).
These standard deviations are used to standardize the selection variables. The expected allele frequencies are then calculated by inserting the standardized selection variable into the appropriate equation for the calculation of expected Q (Eqs. (2)–(5)). As the selection variables have been standardized, the remainder of each equation (Eqs. (2)–(5)) has to be multiplied by the standard deviation so that the expected allele frequency remains unchanged. The final equations used to calculate the expected allele frequencies are given in the last column of Table I.
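The variance rules used in this appendix are easy to verify numerically. The Python sketch below is illustrative only: the function names are hypothetical, and the weights passed to `var_weighted` are placeholder values rather than the weights used in the paper.

```python
import math

def sd_sum(r):
    """SD of X1 + X2 for standardized phenotypes with correlation r."""
    return math.sqrt(2 * (1 + r))

def sd_diff(r):
    """SD of X1 - X2 for standardized phenotypes with correlation r."""
    return math.sqrt(2 * (1 - r))

def var_weighted(a, b, r):
    """Variance of a*X1 + b*X2: a^2 + b^2 + 2*a*b*r."""
    return a * a + b * b + 2 * a * b * r

r_m, r_d = 0.5, 0.25          # correlations quoted in Appendix D
print(sd_sum(r_m))            # MZ sum
print(sd_sum(r_d))            # DZ sum
print(sd_diff(r_d))           # DZ difference
print(var_weighted(0.8, 0.2, r_d))   # placeholder weights, for illustration only
```

Setting both weights to 1 in `var_weighted` recovers the squared standard deviation of the sum, and weights of 1 and −1 recover that of the difference.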
Appendix D: Simulated allele frequencies for DZ twins
1. Simulations were performed using the same population parameters that were used to derive allele frequencies theoretically (\(p=0.5\), \(r_{D}=0.25\), \(r_{M}=0.5\), \(q^{2}=0.01\)).
2. For twin 1, two uniform (0,1) random numbers were generated (variables \(C_{1}\) and \(C_{2}\)). Each was converted to 1 (if the random number was less than p, the allele frequency) or 0 (if the random number was greater than or equal to p). These numbers represent the identity of the two alleles in twin 1.
3. A variable indicating the total allele frequency within the individual (\(Q_{1}\)) was generated by adding \(C_{1}\) and \(C_{2}\) together and dividing by 2.
4. The first stage in the generation of the allele frequency of the co-twin (\(Q_{2}\)) was the generation of random numbers for variables \(C_{3}\) and \(C_{4}\) in the same fashion as for variables \(C_{1}\) and \(C_{2}\).
5. Another uniform (0,1) random variable (\(U_{\rm IBD}\)) was used to indicate whether the twins shared 2, 1 or 0 alleles, according to whether its value was \(\le 0.25\), \(>0.25\) and \(\le 0.75\), or \(>0.75\).
6. Where \(U_{\rm IBD}\) was less than or equal to 0.25, the genotype of the co-twin was set equal to the genotype of the first twin (\(Q_{1}=Q_{2}\)). Where \(U_{\rm IBD}\) was greater than 0.25 but not greater than 0.75, the co-twin's allele frequency was determined by the variables \(C_{1}\) and \(C_{3}\), and \(Q_{2}\) was calculated as the average of these two variables. Where \(U_{\rm IBD}\) was greater than 0.75, the co-twin's alleles were considered to be represented by \(C_{3}\) and \(C_{4}\), and \(Q_{2}\) was calculated accordingly.
7. Both \(Q_{1}\) and \(Q_{2}\) were standardized to have a mean of 0 and a variance of 1 by subtracting p and then dividing by \(\sqrt{p(1-p)/2}\).
8. The inverse normal transformations of two independent uniform (0,1) random numbers were generated. These standard normal variates were used to represent the environmental influences on the trait of the two twins, designated as \(E_{1}\) and \(E_{2}\).
9. Similarly, the inverse normal transformations of two other independent uniform (0,1) random numbers were generated. The first normal variate was used to represent the residual polygenic background for twin 1 (\(G_{1}\)), and a linear combination of the two normal variates was used to represent the polygenic background of twin 2 (\(G_{2}\)), where the weights were 0.5 and \(\sqrt{0.75}\), so that \(G_{1}\) and \(G_{2}\) have variances of 1 and a covariance of 0.5.
10. The trait score of each individual was generated using the biometric model \(X=qQ+gG+eE\).
11. After generating the required number of twin pairs by repeating steps 2–10, samples of twins were taken according to the desired sampling schemes, and the allele frequencies in these samples were calculated.
A similar process was used to calculate allele frequencies for MZ twins with adjustments made to account for the fact that they share all their DNA.
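The DZ simulation steps above can be sketched in Python as follows. This is an illustrative reimplementation, not the authors' code. The loadings g and e are not stated in the appendix, so the sketch assumes \(g^{2}=r_{M}-q^{2}\) and \(e^{2}=1-r_{M}\), which with the quoted parameters gives a DZ phenotypic correlation of 0.25. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

INV_NORMAL = NormalDist().inv_cdf

def std_normal(rng):
    """Inverse-normal transform of a uniform (0,1) draw (steps 8 and 9)."""
    u = rng.random()
    while u == 0.0:              # inv_cdf is undefined at exactly 0
        u = rng.random()
    return INV_NORMAL(u)

def allele(rng, p):
    """1 if the uniform draw is below the allele frequency p, else 0 (step 2)."""
    return 1 if rng.random() < p else 0

def simulate_dz_pair(rng, p=0.5, qtl_var=0.01, r_m=0.5):
    """Return ((m1, x1), (m2, x2)): raw allele frequency and trait score per twin."""
    q = math.sqrt(qtl_var)
    g = math.sqrt(r_m - qtl_var)     # assumption: polygenic loading
    e = math.sqrt(1.0 - r_m)         # assumption: environmental loading
    c1, c2, c3, c4 = (allele(rng, p) for _ in range(4))   # steps 2 and 4
    m1 = (c1 + c2) / 2                                    # step 3
    u_ibd = rng.random()                                  # step 5
    if u_ibd <= 0.25:                # both alleles shared (step 6)
        m2 = m1
    elif u_ibd <= 0.75:              # one allele shared
        m2 = (c1 + c3) / 2
    else:                            # no alleles shared
        m2 = (c3 + c4) / 2
    sd = math.sqrt(p * (1 - p) / 2)
    q1, q2 = (m1 - p) / sd, (m2 - p) / sd                 # step 7
    e1, e2 = std_normal(rng), std_normal(rng)             # step 8
    g1 = std_normal(rng)                                  # step 9
    g2 = 0.5 * g1 + math.sqrt(0.75) * std_normal(rng)
    x1 = q * q1 + g * g1 + e * e1                         # step 10
    x2 = q * q2 + g * g2 + e * e2
    return (m1, x1), (m2, x2)

def pool_frequency(pairs, fraction, top=True):
    """Step 11, one possible scheme: pool both twins from the pairs with the
    most extreme phenotype sums and return the pool's allele frequency."""
    ranked = sorted(pairs, key=lambda pr: pr[0][1] + pr[1][1], reverse=top)
    chosen = ranked[: max(1, int(len(ranked) * fraction))]
    return sum(t1[0] + t2[0] for t1, t2 in chosen) / (2 * len(chosen))

def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2006)
pairs = [simulate_dz_pair(rng) for _ in range(20000)]
print(correlation([a[1] for a, _ in pairs], [b[1] for _, b in pairs]))
print(pool_frequency(pairs, 0.1, top=True), pool_frequency(pairs, 0.1, top=False))
```

The pooling function shows only one scheme (selection on the phenotype sum); other schemes can be expressed by changing the sort key.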
Appendix E: Standardized selection variables when only one twin was used in the construction of the pools
When only one DZ twin is used to construct the pools, twins with the most extreme weighted sum of the phenotype are selected. This is a correlated bivariate distribution for which we need to calculate both the standardized scores (step 2) that represent given proportions of the distribution and the expected score of the most extreme twin from any given section (steps 3–9). For both of these calculations we require the correlation.
1. The correlation can be determined if the correlation between the phenotypes of the DZ twins is known. In our simulations a DZ correlation of 0.5 was used, resulting in a correlation between the weighted phenotypes of 0.6875.
2. STATA was used to generate a look-up table from which the standardized score representing any proportion of the bivariate distribution with a correlation of 0.6875 could be found. The calculation involves subtracting the joint cumulative probability of the bivariate distribution from twice the probability of the normal distribution for any given score (Eq. E.1).
$$P(\hbox{Max}(X_{1},X_{2})>t)=P(X_{1}>t \hbox{ or } X_{2}>t)=P(X_{1}>t)+P(X_{2}>t)-P(X_{1}>t \hbox{ and } X_{2}>t)$$(E.1)
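The paper used STATA for this look-up table. As a stand-in, the Python sketch below evaluates the probability that the larger of two correlated standard normal scores exceeds t, and inverts it by bisection. The joint upper-tail probability is obtained by simple midpoint integration rather than a library routine. Function names, the integration width and the step counts are arbitrary choices for illustration.

```python
import math
from statistics import NormalDist

STD = NormalDist()

def both_exceed(t, r, width=10.0, steps=2000):
    """P(X1 > t and X2 > t) for standard bivariate normal scores with correlation r.

    Integrates over x1 from t to t + width, using X2 | X1 = x1 ~ N(r*x1, 1 - r^2).
    """
    s = math.sqrt(1 - r * r)
    h = width / steps
    total = 0.0
    for i in range(steps):
        x = t + (i + 0.5) * h
        total += STD.pdf(x) * (1 - STD.cdf((t - r * x) / s)) * h
    return total

def prob_max_exceeds(t, r):
    """P(Max(X1, X2) > t): twice the univariate tail minus the joint tail."""
    tail = 1 - STD.cdf(t)
    return 2 * tail - both_exceed(t, r)

def threshold_for_proportion(prop, r, lo=-6.0, hi=6.0, iterations=40):
    """Find t such that a proportion `prop` of pairs has Max(X1, X2) > t."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if prob_max_exceeds(mid, r) > prop:
            lo = mid        # too many pairs selected: raise the threshold
        else:
            hi = mid
    return (lo + hi) / 2

r = 0.6875                  # correlation between weighted phenotypes quoted in the text
for prop in (0.05, 0.10, 0.25):
    print(prop, threshold_for_proportion(prop, r))
```

Tabulating `threshold_for_proportion` over a grid of proportions gives the same kind of look-up table as described in step 2.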
Expected Score for Extreme Twin
There is no straightforward way to calculate the mid-interval score of the selected twin (\(X_{1}\)), so instead we calculated the expected weighted sum of the phenotype of the most extreme twin given values for the other twin (\(X_{2}\)) that ranged from 5 standard deviations below to 5 standard deviations above the mean at intervals of 0.1 standard deviations. The mid-interval expected score across the sections of the distribution used for the construction of pools was then determined by a sum weighted by the probability of \(X_{1}\) multiplied by the probability of \(X_{2}\). Equation E.2 shows the expected score of the most extreme twin, assuming it is between the pool selection thresholds and above the score of the other twin.
By symmetry \(E(X_{1}\vert X_{1}>X_{2};t_{1}<X_{1}<t_{2})=E(X_{2}\vert X_{2}>X_{1};t_{1}<X_{2}<t_{2})\), and \(P(X_{1}>X_{2})=P(X_{2}>X_{1})=0.5\). Therefore the expectation is simply \(E(X_{1}\vert X_{1}>X_{2};t_{1}<X_{1}<t_{2})\).
3. To calculate the expected score we first made a list of scores for \(X_{2}\) at intervals of 0.1 between −5 and 5 standard deviations from the mean.
4. The differences in cumulative probability and in density between the scores 0.05 above and 0.05 below each value of \(X_{2}\) were determined, and these two differences were used to determine the mid-interval score for \(X_{2}\) using Eq. (7).
5. \(X_{1}\) does not follow a standard normal distribution, but for each value of \(X_{2}\) the expected mean of \(X_{1}\) is \(rX_{2}\) (where r is the correlation between \(X_{1}\) and \(X_{2}\)) and the expected standard deviation is \(s=\sqrt{1-r^{2}}\).
6. These values can be used to work out the standard normal variates for the thresholds given the distribution of \(X_{1}\). The lower threshold is \((t_{1}-rX_{2})/s\) when \(X_{2}<t_{1}\) (in which case \(X_{1}>X_{2}\) is always true whenever \(X_{1}>t_{1}\)) and \((X_{2}-rX_{2})/s\) when \(X_{2}>t_{1}\). The upper threshold is \((t_{2}-rX_{2})/s\).
7. The expected value of \(X_{1}\) is then the mid-interval value between these two scores.
8. The conditional value of \(X_{1}\) is then calculated by transforming this value back to the distribution of \(X_{1}\) conditional on \(X_{2}\), i.e. multiplying by the standard deviation and adding the mean.
9. The average score for the larger sections of the distribution from which individuals were selected for pool construction is then derived as a sum of the conditional scores of \(X_{1}\), weighted by the probability of \(X_{2}\) and the probability that \(X_{1}\) lies between its standardized thresholds.
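Steps 3–9 can be sketched as a short numerical routine. The Python code below is an illustrative reading of those steps, not the authors' implementation. It assumes that the mid-interval score of Eq. (7) is the mean of a standard normal variate restricted to an interval, i.e. the difference in densities divided by the difference in cumulative probabilities. Function names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()

def expected_extreme_score(t1, t2, r, step=0.1, limit=5.0):
    """Approximate E(X1 | X1 > X2, t1 < X1 < t2) for correlation r."""
    s = math.sqrt(1 - r * r)                 # step 5: conditional SD of X1
    num = den = 0.0
    n = int(round(2 * limit / step))
    for i in range(n + 1):                   # step 3: grid of X2 values
        x2 = -limit + i * step
        lo2, hi2 = x2 - step / 2, x2 + step / 2
        p2 = STD.cdf(hi2) - STD.cdf(lo2)     # step 4: probability of the X2 interval
        x2_mid = (STD.pdf(lo2) - STD.pdf(hi2)) / p2   # assumed form of Eq. (7)
        mean = r * x2_mid                    # step 5: conditional mean of X1
        lower = max(t1, x2_mid)              # step 6: X1 must exceed both t1 and X2
        if lower >= t2:
            continue
        za, zb = (lower - mean) / s, (t2 - mean) / s
        p1 = STD.cdf(zb) - STD.cdf(za)
        if p1 < 1e-12:                       # negligible weight, skip for stability
            continue
        z_mid = (STD.pdf(za) - STD.pdf(zb)) / p1      # step 7
        x1_cond = z_mid * s + mean                    # step 8: back-transform
        num += p2 * p1 * x1_cond                      # step 9: weighted sum
        den += p2 * p1
    return num / den

# Expected score of the more extreme twin in a pool bounded by two thresholds
print(expected_extreme_score(1.0, 2.0, 0.6875))
```

With r = 0 and very wide thresholds the routine should return approximately \(1/\sqrt{\pi}\), the known expectation of the larger of two independent standard normal variates, which provides a simple check on the grid approximation.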
Knight, J., Sham, P. Design and Analysis of Association Studies using Pooled DNA from Large Twin Samples. Behav Genet 36, 665–677 (2006). https://doi.org/10.1007/s1051900590169
DOI: https://doi.org/10.1007/s1051900590169