Background

With the advent of genome wide association studies large bio-banks will become the stepping stones to tomorrow's medicine. The genome wide association study (GWAS) has recently proved its value with replicable findings for diseases such as coronary artery disease, prostate cancer, and diabetes [1], and many more discoveries are expected in the near future. To determine these associations, studies have relied on large populations of well characterized cases and controls often combining several study populations in order to approach statistical significance [1]. Once these associations have been found in well characterized target cohorts, the findings must be replicated in a population based cohort before translation into clinical practice [2]. These cohorts will come from the large population based cohorts being collected in places like the Marshfield Clinic [3] and Vanderbuilt [4] and Kaiser Permenente [5].

For these biorepositories to be of use rigorous quality control and assurance must be maintained to minimize sample identification errors. The quality of the nation's biobanks is such a priority that the National Cancer Institute has established an Office of Biorepositories and Biospecimen Research with the specific mission to develop standards and determine the best practices and protocols for the burgeoning field [5]. Erroneous sample identification and mistaken sample handling can stay hidden for many years compromising results; this problem has plagued researchers that work with cell culture for years [6, 7].

One strategy for reducing errors in sample identification is to increase the automation of sample acquisition, storage and handling. The PMRP repository used a semi-automated system of sample handling for recruitment in 2001 which included the Marshfield Clinic's practice management and laboratory information systems for subject recruitment and sample collection. PMRP currently uses the Nautilus® (Thermo Fischer, PA, USA) biospecimen tracking system with an integrated plate reader for tracking of sample storage [8]. Other larger biorepositories such as the UK biobank have chosen to automate sample collection and storage to a larger degree in order to further reduce the possibility of incorrect sample identification [9]. While automation does rule out some possible sample identification mistakes it does not eliminate all potential error from the process as technicians are still involved in sample acquisition and processing. In addition, automation does not inform the quality of the specimen to help research personnel determine the usefulness of the sample nor does it provide a process for correcting and identifying errors that have occurred during previous sample collections. Even when samples are collected, processed, and aliquotted with minimal human handling, results from collaborating institutions must be checked to ensure no identification errors occurred in transferring either the samples or the data between institutions. Therefore it is imperative that an initial characterization of a sample includes both unique identifiers and measures of quality. This is particularly true when genotypic and phenotypic information are being combined in large association studies.

One method of identification would be to rely on markers developed for forensic applications. A number of different SNP panels have been developed to identify subjects; these panels range from a 21 SNP multiplex to a 52 SNP multiplex that has been validated in different populations [1013]. However, the use of forensic panels for sample identification may create ethical and legal challenges. Increasingly biorepositories have also chosen to create their own SNP panels for quality control and assurance. The Wellcome Trust used a 38 SNP panel for quality control of the samples used for the genome wide association study of seven common diseases [14]. While any of these panels could be used for sample identification and tracking, they are not useful in characterizing the population in relation to polymorphisms already known to be associated with disease.

In an effort to balance the needs of quality control and genomic research we have designed and validated a panel of polymorphisms for individual sample identification that have been implicated in a wide range of diseases. In this paper we describe our process for designing the panel as well as the ability of this panel to uniquely discriminate samples using known allele frequencies. We also demonstrate how this panel may be used in the future by researchers investigating common diseases such as heart disease and obesity.

Methods

SNP panel design

The inclusion criteria for a polymorphism to be considered for use in the fingerprinting assay were as follows. First the polymorphism must have a confirmed minor allele frequency of approximately 0.20 or greater in a European Caucasian population, this allele frequency was chosen based on a theoretical assay of 34 polymorphisms, an average allele frequency of 0.20 was needed to uniquely identify 20,000 subjects. Second, the polymorphism it must be medically relevant; defined as having been associated with at least one disease state in multiple populations. Finally, the polymorphism must be easily detectable using the Sequenom platform (Sequenom, San Diego, CA, USA).

In order to obtain an initial list of potential polymorphisms all of the investigators within the Marshfield Clinic Research Foundation were asked to submit polymorphisms that were of interest. Garnering a list of 116 polymorphisms, these were tested with the above criteria before proceeding to assay design. In addition, this list was supplemented with polymorphisms from the PharmGKB [15] and Genetic Association Database [16].

The minor allele frequency in a Caucasian population was then confirmed using the published allele frequencies in dbSNP [17] to ensure that the polymorphism was common enough to be considered for inclusion in the panel. A Medline [18] search for disease associations was then performed for each polymorphism that met the frequency requirement. Any polymorphism that had an association with at least one disease in two separate preferably Caucasian populations was then allowed to progress to the next step.

Finally each polymorphism was masked using SNPmasker and evaluated for ease of adaptation to the Sequenom platform. The remaining polymorphisms were used in the Sequenom Assay Design 3.1 [19] program to develop a large multiplex assay. Several assays with between 33–39 polymorphisms as well as the sex marker AMG were created. A single assay of 40 total polymorphisms with at least one polymorphism on each chromosome was chosen for validation.

This study was reviewed and approved by the Marshfield Clinic Institutional Review Board.

Allele detection

Allele detection was carried out using the MALDI-TOF mass spectrometry method of allele determination on a Sequenom platform. Briefly this method of detection depends on the different molecular weights of the polymorphic regions to determine the allele. A multiplexed PCR reaction is carried out on genomic DNA to amplify regions of interest. These products are then annealed with primers that are in the region directly adjacent to the polymorphism and a single base pair primer extension reaction is performed to generate the allele specific products. The products are then placed on a proprietary chip and analyzed using MALDI-TOF mass spectrometry and Sequenom Typer 3.4 software [20] to make the allele determination. A list of the PCR and detection primers is provided in additional file 1. Any of the automatic genotype determinations that were flagged as lower confidence or that the program failed to automatically determine were then double checked by a human to determine if manual allele calls could be made.

Assay validation

The 40 polymorphism fingerprinting assay was tested initially on a panel of 28 Caucasian HapMap samples from the CEPH population provided by Coriell. Our fingerprinting assay contained 32 polymorphisms that had been previously genotyped in this population; we compared our results with those released through the dbSNP Genotyping detail [21] or the cancer500 web site [22] to ensure genotyping accuracy. Further validation for accuracy and precision was completed by repeated assays using anonymous DNA extracted with our protocol from a Caucasian population. One final population of a further set of 21 CEPH samples was used to ensure accuracy of both the automated and manual genotype calls. Polymorphisms that were not automatically determined at least 80% of the time were excluded from the assay. The frequencies of the alleles and genotypes obtained with the assay were compared to the expected frequencies from dbSNP. Finally, the polymorphisms in the assay were checked for disagreement with Hardy Weinberg equilibrium; any polymorphism that failed was excluded from the final assay.

Determination of individual discriminatory power

The expected allele frequencies were used to determine the potential power of the assay using the following equations for the probability of identity. We calculated the probability of identity [P(ID)] for the assay using two different equations. The first equation assumes the population is random and uses the product rule with Hardy-Weinberg equilibrium assumptions [23], and the second equation is more conservative and assumes the population is composed of siblings [24]. We performed these calculations assuming independent assortment of the alleles.

(1)
(2)

Results

Polymorphism selection and assay design

116 polymorphisms were submitted by investigators from the Marshfield Clinic Research Foundation, 6 polymorphisms were taken from the PharmGKB VIP list and polymorphisms in chromosome 9, 18, 21, and 22 were obtained from the Disease Association Database providing an additional 12 polymorphisms. All of these polymorphisms were then evaluated for the inclusion criteria (Figure 1). 27 candidates failed because the allele frequency in a Caucasian population was below 20%. Another 21 polymorphisms failed to pass the criteria of medical relevance. Finally, three more polymorphisms were eliminated after masking because of incompatibility with the Sequenom system. The remaining 77 polymorphisms were placed in the Sequenom Assay Design program and a single 40 polymorphism assay was created. The assay includes 39 polymorphisms that are medically relevant (see additional file 2 for the references used to determine medical relevance) with at least one polymorphism on each chromosome and the AMG gender marker. The polymorphisms used in the assay as well as the chromosome and associated gene are provided along with at least one associated disease state (Table 1).

Table 1 Initial design of unique fingerprint assay using medically relevant polymorphisms
Figure 1
figure 1

Schematic representation of the development the DNA fingerprinting assay. Schematic representation of the inclusion criteria for a polymorphism to be considered for the fingerprinting assay and the number of polymorphisms that failed each step.

Assay validation

A total of 3 polymorphisms (rs1693782, rs4444903, and rs1800012) did not achieve 80% automated call rate and accuracy in the assay. The finalized validated assay consists of 37 polymorphisms, with the AMG sex marker and 36 medically relevant polymorphisms with representation on each chromosome (Table 1).

Since one of the goals of this assay is to provide a quality control mechanism for genotyping using a whole genome scan we assessed the panel for polymorphisms that were represented on either the Affymetrix 6.0 chip (with over a million polymorphisms) or the Illumina 550K bead set (with over half a million polymorphisms). The fingerprint encompasses 16 polymorphisms that are on the Illumina 550K assay and 12 that are on the Affymetrix 6.0 chip assay with 7 polymorphisms in common to both platforms (Table 2).

Table 2 Fingerprinting polymorphisms on either the Illumina or Affymetrix platform

Power of discrimination

Preliminary analysis of a hypothetical SNP panel suggested that 34 polymorphisms of an average minor allele frequency of 0.20 along with a sex marker would be needed to individually identify a biorepositiory of 20,000 samples. Here we have created a panel of 36 polymorphisms with an average MAF of 0.354667 and a sex marker. For the validated assay we obtained a P(ID) of 6.132 × 10-15 and a Psib(ID) of 3.077 × 10-8.

To further ensure the assay could be used to uniquely identify our population we calculated the P(ID) and Psib(ID) of the assay assuming that only polymorphisms located farther than 50 megabases apart segregated independently. Finally we calculated the probability of identity using only the polymorphisms found on the Illumina and Affymetrix platforms, using only the polymorphisms limited to one or the other of these platforms individual identification of samples should be possible. Even using the most conservative Psib(ID) from the polymorphisms on the Affymetrix platform, 3.0 × 10-3, meaning 60 individuals in the population may not have unique patterns (Table 3).

Table 3 Probability of identity for the validated fingerprinting assay

Common Disease Coverage

The second goal of the fingerprinting panel is to provide a resource for investigating the genetic components of common diseases. In order to ensure that the fingerprint assay developed here encompasses a wide range of diseases we compared the coverage of the assay with the 28 focus areas from healthy people 2010 [25]. Many of the focus areas such as cancer, heart disease, diabetes, obesity, and respiratory disease are well represented in the fingerprinting assay. Heart disease, the most prevalent disease in the PMRP population and a high priority disease for healthy people 2010 is represented with 10 polymorphisms that may be important in the development or severity of disease (table 4). Based on the prevalence of these diseases in the population, the minimal detectable odds ratio for these polymorphisms is between 1.5 and 3.2 with the most prevalent disease (heart disease) having the lowest detectable odds ratio of 1.5 [26] (table 4).

Table 4 Common diseases in the PMRP cohort and polymorphism coverage of these diseases.

As would be expected from a collection of medically relevant polymorphisms many of the genes represented in the assay share common pathways or substrates. Examples of pathways with more than one represented gene include the renin-angiotensinogen pathway with three genes (ACE, AGTR1, EGF), and the cholesterol maintenance pathway with four genes in the assay (APOB, CETP, LDLR, LPL, and FABP2). In addition, neurotransmitter pathways are represented by four different genes; two serotonin receptors (COMT, DRD3) and two genes in the dopamine pathway (HTR2A, HTR1B). Furthermore three genes in the assay use folate as a substrate (MTHFD1, ENOSF1, CBS).

Discussion

As we usher in the era of genomic medicine, population based biorepositories linked to growing medical records will become important for translating genetic discoveries into medical practice. For these repositories to be of use, stringent quality control and assurance mechanisms must be established. Furthermore these repositories become the ideal place to test genetic associations that have been discerned in case control cohorts to determine the disease burden in an unselected population. In this paper we described the development of a single assay of 36 medically relevant polymorphisms and a sex marker that will be used to uniquely identify samples in the Personalized Medicine Research Project cohort and simultaneously provide investigators with information on medically relevant polymorphisms for a wide array of diseases. Furthermore we determined which of these polymorphisms were present on the two most popular platforms used for whole genome association, Illumina and Affymetrix, to ensure the panel would be adequate for determining sample handling errors in shipping and genotyping.

Quality control and assurance of biorepositories involves an integrated program of both minimization of potential errors and testing for potential sample problems to quickly rectify errors. The PMRP biorepository incorporates automation through an in house developed system for specimen acquisition and the Nautilis® system for specimen tracking. However, automation is only one aspect of a comprehensive sample assurance program. This newly developed assay will be used for determining the ability of the DNA to be used for genotyping. In addition it will allow us to determine an estimate of error rates in our sample acquisition. We will compare the genotyped sex with the expected sex from the medical record to ensure a match. In addition, we will use known trios and pedigrees to compare individual samples for instances of non-maternity and non-paternity as potential indicators of sample identification error. Because of the unique ability of PMRP to re-contact subjects, as samples are exhausted or potential errors are identified, new samples can be acquired and verified using existing information to confirm sample identity. This ensures that each sample can be correctly attributed to an individual each time a new sample is drawn.

The number of polymorphisms used for our assay compares favourably to panels that have been used for other applications. In order to ensure that our goal of using the fingerprinting assay for individual identification of the PMRP cohort was met, we determined the P(ID) and Psib(ID) of the assay. Even using the most conservative estimates of the assay we can correctly identify approximately three million samples. SNP panels used for forensic applications have ranged from 20 to 52 SNPs [10, 11, 13]. A test of SNP panels used to identify cell lines included panels of 80, 60, 40, and 20 SNPs with the conclusion that 20 SNPs did not discriminate adequately however any of the most frequent 40 polymorphisms would provide adequate power of discrimination. From this assumption an assay of 34 polymorphisms with individual identification ability was developed [12].

Our assay is unique not only because we used only medically relevant polymorphisms but also because we developed the assay with a polymorphism on every chromosome and included polymorphisms on a number of popular genotyping platforms. Even using the most conservative estimate of Psib(ID) and only the polymorphisms on the Affymetrix platform we would have few potential incorrectly identified samples, and these samples could easily be further identified by ascertaining the genotype calls polymorphisms in linkage disequilibrium with polymorphisms that are on the fingerprinting assay. This ensures that any sample handling errors that are potentially introduced in shipping the samples or receiving data from other laboratories can be detected when genotyping results are entered into our database. By comparing returned genotypes with previously determined genotypes provides a quality check for any data that is generated externally. Using the assay in this manner provides an opportunity to correct sample errors or questionable calls from genotyping before samples are linked with phenotypic medical information.

Because of the dual nature of the assay several potential limitations arise. The stringent dual criteria of medical relevance and high minor allele frequency have the potential to limit the usefulness of the assay for both goals. One of the limitations of the assay we developed is that only the European Caucasian MAF was considered as a selection criterion. The PMRP biospecimens are overwhelmingly Caucasian therefore this is appropriate for our application, however this emphasis may decrease the applicability of this assay to other populations. Using the allele frequencies for Asian and African populations from dbSNP 6 polymorphisms in the assay fell below 10% MAF in Africans and 2 polymorphisms were below 10% in Asian populations. Identification panels developed for forensic applications generally consider a large number of different ethnicities when creating the panel to ensure correct identifications across populations [10, 11, 13]. At this time, we have no reason to believe that the choice of medically relevant polymorphisms inherently biased the assay toward the Caucasian population and preliminary analysis of estimated allele frequencies in other populations would suggest the fingerprinting assay to be robust for other populations. Myles et al. [27] recently tested allele frequencies across populations for disease associated SNPs and determined they vary no more across populations than randomly chosen SNPs. However any polymorphism can have varying allele frequencies among different populations and until the allele frequencies for our polymorphisms are verified in other populations we cannot determine how useful it would be for non-Caucasians.

Another limitation for the assay is that the medical relevance of the assay could be limited. Many medically relevant polymorphisms are less frequent in the population than 20%. For instance polymorphisms in the ApoE alleles are highly medically relevant with many pleiotropic effects on disease [28], however they have a minor allele frequency of 0.18 and 0.02 [21], and therefore were excluded from consideration. In addition, because of the evolving nature of gene disease discovery it is possible that some of the polymorphisms we included in the assay will later be determined to have little medical value. For instance, the polymorphism chosen in the RET gene while associated with disease, has only been associated in small studies [29, 30]. Furthermore, newly associated medically relevant polymorphisms such as the cancer susceptibility locus on chromosome 8q24 [31] were not included in the assay because assay construction began before this region was known. Finally, because we focused on including polymorphisms from a range of diseases, it is likely that investigators will need to perform more genotyping for any comprehensive study of disease-gene interactions.

By using a limited number of commonly occurring polymorphisms as an identification panel and maintaining strict access to individual genotyping results, the panel avoids the potential legal and ethical burden of using a forensic panel, does not expose the population to increased identification risk, and may provide insight into population wide allele frequencies. As with any population wide genotyping, the manner of reporting the genotypes must be taken into consideration to minimize the risk of unintended identification. The genotypes created by this fingerprinting assay will be publicly reported only as population wide allele frequencies. Disease associated polymorphisms have been reported in this manner for a number of different populations, most recently the NHANES III population [32] and the CLUE I and CLUEII populations [33]. Even with the recent revelations regarding the ability to discern individual DNA contribution to a mixture, the number of potentially reported polymorphisms from this assay, 36, is well under the smallest estimate of 10,000 needed to discern an individual in a mixture of 0.1%[34]. Individual genotypes are stored in a secure server and not publically available. The PMRP consent form and study guidelines require any investigator that requests individual genomic information for any reason to have an IRB approved protocol before data is released [3].

Despite the potential limitations of the panel, we have demonstrated that an identification panel can be created using only medically relevant polymorphisms. Such panels could be created with disease specific polymorphisms and used in the future not only to identify samples but potentially to access an individual's genetic risk of a disease We are currently using this panel for QA/QC purposes on the DNA samples in the PMRP repository. In the future we intend to test these polymorphisms for association with disease in the entire PMRP population to determine their medical relevance in our population. In the future this panel or other rationally designed panels can be used on large populations not only as a quality control measure but also to begin to translate the results from the large genome wide studies into clinical practice. Because this panel includes polymorphisms that are relevant in a wide range of diseases the information gained from genotyping the entire PMRP population will be used to help facilitate population based research.

Conclusion

We created a single panel of 36 somatic polymorphisms and a sex marker with the ability to discriminate individuals in a large biorepository (of potentially over 100,000,000 samples) using only common medically relevant polymorphisms. These polymorphisms are currently represented on a number of common genotyping platforms making QA/QC flexible enough to accommodate a large number of studies. In addition this panel can serve as a resource for investigators who are interested in the effects of disease in a population particularly for common diseases.