The cancer cell karyotype is often complex and can include a range of molecular alterations, from mutations at the single nucleotide level to extensive rearrangements involving whole chromosomes. The activation of oncogenes as the result of DNA amplifications and the inactivation of tumor suppressor genes as the result of DNA deletions can both contribute to the cancer cell phenotype. With the recent identification of large scale copy number polymorphisms (CNPs) in the human genome as well, it is increasingly clear that a detailed understanding of the role of genomic alterations and structure will be important in the context of both the normal and disease state [1–8]. Over the years many experimental approaches have been described that have increased our knowledge of the cancer genome. These include genome-wide approaches such as array comparative genomic hybridization (array CGH) to cDNA clones [9, 10], bacterial artificial chromosomes (BACs), P1 artificial chromosomes (PACs) [11, 12], and long oligonucleotides [13–15], restriction landmark genome scanning (RLGS) [16], spectral karyotyping (SKY) [17], molecular subtraction such as RDA [18], digital karyotyping [19, 20], and end sequence profiling (ESP) [21] as well as more focused approaches such as high-throughput quantitative PCR (QPCR) and fluorescence in situ hybridization (FISH) [22]. While no single experimental approach allows the comprehensive analysis of all types of chromosomal aberrations, array-based approaches offer the greatest potential for high resolution genome-wide scans.

High density DNA oligonucleotide arrays using light-directed parallel chemical synthesis allow unprecedented levels of genetic information to be captured in single experiments [23–25]. The completion of the human genome sequence, coupled with the emergence of single nucleotide polymorphisms (SNPs) as the most common form of genetic variation among individuals, has led to a variety of applications for high density genotyping arrays. In the past, these arrays have been used in traditional loss of heterozygosity (LOH) analysis using standard approaches of multiplex PCR for DNA target generation [26–28]. More recently, a DNA target generation method using complexity reduction by single primer PCR, termed whole genome sampling assay (WGSA), was developed for simultaneous genotyping of over 10,000 SNPs on a single array [29, 30]. This array has been used for hierarchical tumor clustering based on LOH patterns with human lung cancer cell lines [31], the characterization of LOH progression in samples from children with acute lymphoblastic leukemia who relapse after chemotherapy [32], and for a case-control study of esophageal squamous cell carcinoma (ESCC) [33]. Furthermore, the array has also been shown to accurately detect genome-wide DNA copy number changes [34–36]. By coupling SNP genotypes with copy number information, detailed insight into genomic structure can be gleaned. For example, genomic regions displaying LOH can be differentiated into regions with hemizygous deletions and regions with no change in copy number, i.e. copy neutral events, and genomic regions undergoing copy number loss without LOH can also be detected [34, 37, 38]. Allelic imbalance, of which LOH is one example, can also occur when one allele is preferentially amplified relative to the other allele. The coupling of genotypic information with copy number information from a single array allows genome-wide allele-typing to be carried out [37, 39, 40].
This type of combined analysis cannot be made using approaches such as array CGH (reviewed in [41]) and thus underscores the potential power of identifying novel genomic alterations using high density SNP genotyping arrays.

Recently, the WGSA assay has been extended to allow highly accurate SNP genotyping of over 100,000 SNPs from two arrays [42]. With an average inter-marker distance of 23.6 kb, the arrays provide dense enough coverage to enable whole-genome association studies [43]. In this report we describe a novel algorithm termed CARAT (Copy Number Analysis with Regression And Tree) that uses probe intensity information from the GeneChip® Mapping 100 K set for genome-wide allele-specific copy number estimations. CARAT is predicated on the use of the highly accurate genotypes derived from the array to evaluate allelic dose responses on a SNP-by-SNP basis, thereby allowing the copy number output for each allele to be determined. We show using DNA samples from established cell lines that different types of genetic alterations (amplifications, deletions, and LOH) are readily detectable using an allele-specific copy number approach. Thus the coupling of SNP genotypes with allele-specific copy number information may provide new insight into complex genomic alterations, such as regions undergoing allelic imbalance due to differential allelic amplification.

Results and discussion

We have previously described the use of the 10 K SNP genotyping array for chromosomal copy number analysis [34, 35]. Recently, the ability to genotype 100 K SNPs on a set of arrays has become available and these arrays have been used for high resolution copy number analysis [44]. As with the 10 K array, the 100 K array set uses the WGSA target preparation scheme in which single primer PCR amplification of specific fractions of the genome is carried out. The primary difference with the 100 K WGSA method is the use of two separate restriction enzymes that each generates a higher complexity fraction estimated to be ~300 Mb. In this report we describe a new algorithm called CARAT. In CARAT, a complex normalization scheme that incorporates both restriction fragment and probe sequence information is applied on individual arrays to reduce any systematic error and to increase comparability across experiments. Probes for each SNP are tested for the ability to support an allelic dosage response using a set of normal individuals in which the 'AA', 'AB', and 'BB' genotypes intrinsically represent zero, one, and two copies of the 'B' allele and two, one, and zero copies of the 'A' allele. Probes displaying a strong dosage response are employed in a regression framework to estimate allele-specific copy number. For any target sample, the sum of the copy number estimates from the two alleles is compared against the reference set to derive a significance measure of the deviation from the diploid state. Smoothing is used on the estimated copy number and its corresponding significance to further reduce the experimental and technical noise. Regression trees [45] are applied on the smoothed result to partition the genome into regions with different copy numbers and to assign an overall significance to such changes.
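The allelic dosage regression described above can be illustrated with a minimal sketch. It assumes a simple linear relationship between a summarized probe intensity and allele copy number. The actual CARAT model, its intensity transformation, and its normalization are not specified here, and the function names and intensity values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Copies of the 'B' allele implied by each genotype in a normal diploid sample.
B_COPIES = {"AA": 0, "AB": 1, "BB": 2}


def estimate_b_copies(genotypes, intensities, target_intensity):
    """Calibrate on normal samples, then invert the fitted line for a target sample."""
    doses = [B_COPIES[g] for g in genotypes]
    intercept, slope = fit_line(doses, intensities)
    return (target_intensity - intercept) / slope


# Hypothetical 'B' allele intensities for six normal reference samples (assumption).
genotypes = ["AA", "AA", "AB", "AB", "BB", "BB"]
intensities = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]

# A target sample whose 'B' allele intensity exceeds every normal sample.
estimate = estimate_b_copies(genotypes, intensities, 4.1)
```

The same calibration would be repeated for the 'A' allele, and the two allele estimates summed to give the total copy number that is compared against the reference set.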

WGSA 100 K arrays perform robustly for SNP genotyping, with call rates, reproducibility, and accuracy greater than 99%, 99.7%, and 99.7% respectively [42]. Since CARAT does rely on genotype calls, any SNPs with systematic errors in the calls could potentially bias the results. In order to prevent any such bias, only genotypes with stringent confidence rank scores are used, and SNPs that do not meet this criterion are scored as "no calls". Although the majority of steps in CARAT do not make use of "no call" SNPs, there are several steps that do use them, in which case they are always compared against all genotypes to reduce any systematic bias in the analysis.

Among the full complement of over 116 K SNPs, 91,908 (79.1%) display a high allelic dose response as defined by a linear correlation greater than 0.8 between the target concentration and chip intensity. Among these SNPs, 51,097 (55.6%) incorporated information from all 20 perfect match (PM) probes (10 PM 'A' allele (PMa) and 10 PM 'B' allele (PMb)), 31,857 (34.6%) incorporated information from 15 to 19 probes, 8,268 (8.9%) incorporated information from 10 to 14 probes, 682 (0.74%) incorporated information from 5 to 9 probes, 4 (0.004%) incorporated information from 3 or 4 out of 20 probes, and no SNPs used fewer than 3 probes. This subset of 91,908 SNPs, with an average inter-marker distance of 30.5 kb, was used in CARAT for copy number estimations.
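As a rough illustration of the selection criterion, the sketch below keeps probes whose intensity correlates with allele dose above 0.8. Treating the genotype-implied allele dose (0, 1, or 2 copies) as the "target concentration" is an assumption, and the probe names and intensities are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def select_probes(doses, probe_intensities, threshold=0.8):
    """Return the names of probes whose dose response exceeds the threshold."""
    return [
        name
        for name, values in probe_intensities.items()
        if pearson(doses, values) > threshold
    ]


# Allele dose for six normal samples and two hypothetical probes (assumption).
doses = [0, 0, 1, 1, 2, 2]
probes = {
    "probe_1": [1.0, 1.1, 2.0, 2.1, 3.0, 3.1],  # intensity rises with dose
    "probe_2": [2.0, 1.0, 2.0, 1.0, 2.0, 1.0],  # intensity unrelated to dose
}
selected = select_probes(doses, probes)
```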

The performance of CARAT was evaluated with a set of test samples that included 90 normal individuals, DNA samples with varying numbers of X chromosomes (1X to 5X), and several human breast cancer cell lines that harbor both low level and high level copy number alterations. None of these test samples have any overlap with the 128 training samples that are used to establish and tune the CARAT models.

The relationship between DNA copy number and fluorescent intensity of the SNP hybridization signal was evaluated using genomic DNA derived from cell lines with a defined number of X chromosomes (1X to 5X). Among the 91,908 selected SNPs, 1,955 map to the X chromosome. A normal 2X (NA15029) female sample was used as the reference for comparisons to the 1X, 3X, 4X, and 5X samples. The results are summarized in Figure 1. Panels a-d show that there is a high linear correlation among the sample pairs, and only X-chromosome SNPs (labeled in red) show intensity profile shifts across the four panels while the autosomal SNPs (labeled in black) remain static. Panel e indicates that there is a strong linear relationship between the log transformed copy number and the log transformed intensity. These results show that the 100 K WGSA PCR fractions maintain a consistent dose response between the input template copy number and the post hybridization SNP fluorescent intensity.

Figure 1

Panels a-d show the standardized ln(PMa) + ln(PMb) intensity for the 1X, 3X, 4X, and 5X DNA samples relative to the intensity of the 2X DNA sample. Black data points correspond to autosomal SNPs and red data points correspond to the 1,955 X-chromosome SNPs. The blue line in each panel represents the Y = X line. Panel e shows the relationship between the natural log-transformed copy number and the natural log-transformed intensity. The x-axis is the natural log-transformed copy number and the y axis is the average ln(PMa) + ln(PMb) intensity across 1,955 SNPs. The blue line is the regression using the average intensity as the response and the natural log-transformed copy number as the predictor.

Table 1 summarizes the true positive rates for detection of X chromosome changes using the 1X, 3X, 4X and 5X DNA samples along with the false positive rate of detection of autosomal SNPs deviating from the diploid state using the test set of 90 normal individuals. Values are computed for all samples at several different stages of CARAT and at various significance thresholds. The results indicate that the addition of the kernel smoothing and the tree partitioning steps improves the true positive rate and decreases the false positive rate; only at the most stringent significance cut-off does the false positive rate exceed the expected value. Moreover, with the regression tree partition function, CARAT defines the alterations on the X chromosome as a single region for all four samples with a very high significance. The overall copy number estimates (and significance) for the X chromosome using the 1X to 5X samples are: 1X: 0.92 (1.99 × 10^-4), 3X: 3.21 (8.13 × 10^-6), 4X: 4.36 (6.15 × 10^-12) and 5X: 5.74 (1.50 × 10^-16).
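For orientation, the quoted X chromosome estimates can be compared with their nominal copy numbers. The short calculation below uses only the four estimates given in the text; the helper name is illustrative.

```python
# Nominal X chromosome copy number -> CARAT estimate quoted in the text.
estimates = {1: 0.92, 3: 3.21, 4: 4.36, 5: 5.74}


def percent_deviation(nominal, estimated):
    """Signed deviation of the estimate from the nominal copy number, in percent."""
    return 100.0 * (estimated - nominal) / nominal


# Roughly -8%, +7%, +9%, and +15% for the 1X, 3X, 4X, and 5X samples.
deviations = {n: round(percent_deviation(n, e), 1) for n, e in estimates.items()}
```

In these four samples the estimate falls below the nominal value for the single-copy case and above it for the gains, with the largest relative deviation at 5X.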

Table 1 Estimation of true positive and false positive rates under varying significance thresholds using 1X to 5X samples and 90 normal test samples.

These 90 HapMap CEPH samples (30 trios) thus served as an independent test set to evaluate the accuracy of the SNP copy number estimations as well as the algorithm's false-positive rate (FPR). These samples were assumed to represent normal diploid genomes which do not harbor extensive genomic deletions or amplifications. Although these samples could contain copy number polymorphisms, they are relatively rare and were not considered in this analysis, which potentially could lead to an overestimation of the true false positive rate. There were 89,953 autosomal SNPs among the total of 91,908 selected SNPs that were examined across the 90 individuals for a total of 8,095,770 data points; X chromosome SNPs were excluded due to copy number differences between males and females. One possible explanation for the higher-than-expected false positive rate at the stringent p-value of less than 10^-6 is that they are not false positives but rather true and significant copy number polymorphisms occurring in normal people. For example, there were 167 data points identified with a significance level less than 10^-6. Among them, 72 data points were derived from a common amplified region on chromosome 8 from two samples originating from the same trio, namely NA12802 (child) and NA12814 (father), with each sample showing the same 36 significant SNPs. Although this amplified region (~16.34–16.85 Mb) has not been independently verified using QPCR, it does partially overlap with a BAC clone (RP11-90I3) from 8p22 that has detected a CNP [2] and thus may represent a CNP that is transmitted through generations. The copy number estimation of each autosomal SNP across these 90 test samples also has relatively low variation as shown in the upper panel of Figure 2. The mean copy number estimate across all autosomal SNPs ranges from 1.951 to 2.032 and is similar whether using only kernel smoothing or kernel smoothing combined with regression trees.
However, by adding the regression tree as the final partition step, the standard deviation is dramatically reduced by an average of 81.4%, and the range changes from (0.149, 0.367) to (0.019, 0.037). The lower panel shows the proportion of the genome on a per-sample basis that does not contain any significant changes. Using regression trees, there are many more regions identified as diploid as compared to using kernel smoothing only, indicating an improvement in the apparent false positive rate.
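The scale of the excess at the most stringent threshold can be made concrete with a short calculation. It assumes that, for a truly diploid genome, p-values are uniformly distributed, so the expected number of false positives is the number of data points multiplied by the threshold. That assumption is an illustration, not a statement about how the expected value in Table 1 was derived.

```python
autosomal_snps = 89_953
samples = 90
data_points = autosomal_snps * samples  # 8,095,770 SNP-by-sample tests

threshold = 1e-6
expected_false_positives = data_points * threshold  # about 8.1 under the assumption

observed = 167
shared_region = 2 * 36  # chromosome 8 region seen in NA12802 and NA12814
remaining = observed - shared_region  # data points not explained by that region
```

Under this assumption roughly 8 data points would be expected by chance, against 167 observed, of which 72 come from the single chromosome 8 region shared by the child and father.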

Figure 2

The upper panel shows the mean autosomal SNP copy number and the associated standard deviation using kernel smoothing alone and kernel smoothing combined with the tree partition for each of the 90 normal samples in the independent test set. The solid lines correspond to the mean estimation and the dotted lines represent the mean plus or minus one standard deviation. The lower panel shows the proportion of the genome (autosomal chromosomes only) that is determined to be in the normal diploid state for the 90 individuals. The blue colored lines in both panels represent results using kernel smoothing alone while the red colored lines represent results from kernel smoothing combined with the regression tree partition.

Receiver Operating Characteristic (ROC) curves were used to evaluate the overall sensitivity (true positive fraction) and specificity (true negative fraction) of different stages of CARAT. The curves are calculated using 1,955 X chromosome SNPs, with the false positive rate estimated by averaging the individual false positive signals across the 47 female samples present in the total set of 90 normal individuals. Figure 3 shows the ROC curves derived from different stages of the algorithm using DNA samples with differing numbers of X-chromosomes; Table 2 summarizes the area under those curves depicted in Figure 3. The most significant improvement comes from the adjustment based on fragment length, GC content and reference mean; the AUC (Area Under the Curve) increases about 50% for the 1X, 3X and 4X samples and 21.5% for the 5X sample. The improvement from probe-selection is relatively modest, resulting in an overall increase across the samples of about 5%. Although adding kernel smoothing in stage 4 and tree partitions in stage 5 does not substantially increase the AUC, these steps are nevertheless critical. These two steps drive the AUC towards 1, ensuring high sensitivity while keeping the false positive rate extremely low, which is a necessity since nearly 92 K SNPs are simultaneously being examined. In the tree partitioning step (stage 5), the ROC curves are ideal for the 4X and the 5X samples, rendering an AUC of 1. For the 1X and the 3X samples, the ROC curves are not smooth but rather step-functions that achieve a 100% true positive rate with a minimum false positive rate. This occurs because for each case the regression tree step successfully identifies the variations on the X chromosome as one altered region and assigns the region a high significance score that rarely occurs in normal female samples.
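The area under a ROC curve can be computed directly as the probability that a randomly chosen altered SNP receives a higher score than a randomly chosen normal SNP, with ties counted as one half. The sketch below uses this rank-based formulation with hypothetical significance scores. It is not the procedure used to produce Table 2, which averages false positive signals across the 47 female samples.

```python
def roc_auc(positive_scores, negative_scores):
    """Rank-based AUC: P(positive score > negative score), ties counted as 0.5."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical scores (for example -log10 of the p-value) for X chromosome SNPs
# in an altered sample and in a normal female sample (assumption).
altered_scores = [6.2, 5.1, 7.4, 3.0]
normal_scores = [0.4, 1.3, 3.5, 0.9]

auc = roc_auc(altered_scores, normal_scores)
```

An AUC of 1 corresponds to the ideal case described for the 4X and 5X samples, in which every altered SNP scores higher than every normal SNP.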

Figure 3

Each panel shows a series of ROC curves derived from different stages of CARAT using samples with X chromosome alterations. Stage 1: Single point analysis with no probe selection, no intensity adjustment on fragment length and GC content, and no intensity adjustment on the reference mean. Stage 2: Stage 1 plus probe selection. Stage 3: Stage 2 plus intensity adjustment on the fragment length and GC content and intensity adjustment on the reference mean. Stage 4: Stage 3 plus kernel smoothing with a 100 kb window. Stage 5: Stage 4 plus genome partitioning using the regression tree. This figure should be viewed in conjunction with Table 2, which summarizes the area under the ROC curves.

Table 2 Area under the ROC curve derived from different stages of the CARAT method.

The previous DNA samples with variable X chromosome content provided a means to evaluate the algorithm using large alterations that span the length of an entire chromosome. To better evaluate the performance of CARAT on low level copy number changes that do not span entire chromosomes, and to evaluate CARAT relative to other methods, a series of experiments was carried out. These experiments included QPCR on 69 SNPs chosen from the cancer cell line SK-BR-3, QPCR around the HER2/neu region using three cancer cell lines, and allele-specific TaqMan on nine SNPs across two cell lines coupled with DNA sequence analysis. All experimental results show a high correlation with CARAT-derived estimates, indicating that the algorithm in combination with the Mapping 100 K array set can detect chromosomal copy number changes in an accurate and quantitative manner.

We used QPCR as an independent method to determine the total copy number of 69 autosomal SNPs from SK-BR-3. These results were then compared to copy number output from CARAT and two additional algorithms used for Mapping 100 K copy number analysis, namely dCHIP [46] and CNAG [47]. These SNPs are derived from regions of SK-BR-3 that display copy number gains and losses as well as regions with no detectable changes, covering 16 of the 22 autosomes and more than 60 different regions; 10 of the 69 SNPs have a copy number between 1.5 and 2.5, indicating no major alterations from diploidy; 14 of the 69 SNPs have been excluded from CNAG because the SNPs reside on restriction fragments shorter than 500 bp and are resistant to the compensations used in CNAG (Table 3). Figure 4 summarizes the comparison of the correlations between the copy number derived from the three algorithms at different stages and the copy number derived from QPCR. The results show that the correlation values across the three methods are not significantly different. However, both CNAG and dCHIP under-estimate the total DNA copy number, although to different extents. In CNAG, neither the averaging across neighboring points nor the HMM procedure leads to a significant increase in the correlation. In contrast, the HMM step in dCHIP and the kernel smoothing step in CARAT do improve their respective correlations. In Table 4, the performance of the three methods is examined by evaluating the sensitivity and specificity using the same 69 QPCR results. In CNAG, because the estimation is biased towards the normal diploid state, it achieves perfect specificity while demonstrating substantially lower sensitivity compared to the other two methods. Although dCHIP and CARAT have similar performances with one another, CARAT has a higher sensitivity in the single-point estimation step and the smoothing step while dCHIP has higher specificity in the smoothing step. 
In dCHIP, the averaging and HMM steps steadily improve both the sensitivity and the specificity, while in CARAT the sensitivity remains the same and the specificity is substantially increased through the three stages. Neither dCHIP nor CNAG has a significance measure associated with the estimated copy number at the single SNP level. Thus, in an attempt to compare the three algorithms, only copy number output from CARAT has been used, rather than the combination of copy number output and the associated p-values. The only exception to this is the analysis of the tree partitioning step in which algorithm-based true negatives are defined as SNPs with p-value > 0.005 and algorithm-based true positives are defined as SNPs with p-value < 0.005. Although the p-values from the regression tree step may not have a direct probabilistic interpretation, they nonetheless are derived from individual p-value estimates and thus serve as confidence scores that measure how significantly the region deviates from the diploid state. The use of a significance level rather than a copy number value as a threshold to differentiate altered regions from normal regions is appropriate with CARAT, and provides a more accurate estimation of the true performance of CARAT. In this case, CARAT achieves perfect specificity of one and a very high sensitivity, resulting in overall superior performance.

Table 3 Detailed information on the 69 SNPs with QPCR results on SK-BR-3.
Figure 4

These nine panels compare copy number estimates from CARAT, dCHIP and CNAG with QPCR results for 69 autosomal SNPs from the human breast cancer cell line SK-BR-3. In each scatter plot the x-axis is the copy number derived from QPCR and the y-axis is the copy number derived from one of the three algorithms. ΔCt denotes the difference in Ct between the normal DNA sample and SK-BR-3. The threshold cycle (Ct) is the cycle number at which the reporter fluorescence passes a fixed threshold above baseline. A positive ΔCt suggests an amplification while a negative ΔCt suggests a deletion. The copy number of SK-BR-3 based on QPCR is inferred as 2^(ΔCt + 1). The red points are the 55 SNPs that were included in the CNAG analysis; the black points are the 14 additional SNPs that were included in dCHIP and CARAT analysis but were excluded from CNAG. Correlations are calculated for each of these two different SNP sets. The blue line in each panel represents the Y = X line. Panels (a), (b), and (c) compare single point analysis across the three methods; panels (d), (e), and (f) compare smoothing across neighboring points; panels (g), (h), and (i) compare genome partitioning across the three methods.
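The QPCR copy number conversion in this caption amounts to raising 2 to the power ΔCt + 1, which can be written as a small function. It takes ΔCt as the normal sample Ct minus the SK-BR-3 Ct, so that a positive value indicates amplification, and it implicitly assumes a diploid reference and a doubling of product in each PCR cycle. The function name and the Ct values are illustrative.

```python
def qpcr_copy_number(ct_normal, ct_sample):
    """Copy number inferred from threshold cycles as 2 ** (delta_ct + 1)."""
    delta_ct = ct_normal - ct_sample  # positive when the sample is amplified
    return 2.0 ** (delta_ct + 1)


# Illustrative threshold cycles (assumption), with a diploid normal reference.
no_change = qpcr_copy_number(25.0, 25.0)      # delta_ct = 0  -> 2 copies
one_cycle_gain = qpcr_copy_number(25.0, 24.0)  # delta_ct = 1  -> 4 copies
one_cycle_loss = qpcr_copy_number(25.0, 26.0)  # delta_ct = -1 -> 1 copy
```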

Table 4 Comparison among CARAT, dCHIP and CNAG using QPCR results.

Additional verification of DNA copy number changes detected by CARAT was done using the highly characterized region on chromosome 17q12 harboring the ERBB2 (HER2/neu) proto-oncogene that is amplified in nearly 30% of breast cancers [48]. Figure 5 shows a comparison of chromosome 17 for three human breast cancer cell lines. The genomic region near HER2/neu appears amplified in the two cancer cell lines SK-BR-3 (panel a) and ZR-75-30 (panel e) with moderate to very strong significance (significance data not shown) and does not appear amplified in MCF-7 (panel c). This is consistent with published CGH results that show SK-BR-3 and ZR-75-30, but not MCF-7, contain gains in 17q12 [49] as well as with ERBB2-specific FISH showing amplification in SK-BR-3 (45 signals per cell) but not MCF-7 (2.5 signals per cell) [50]. Quantitative PCR was carried out with a HER2/neu primer pair and confirmed the copy number increase in two of the three cell lines (Table 5). The estimated HER2/neu copy number by QPCR for SK-BR-3, MCF-7, and ZR-75-30 is 12.4, 0.8, and 27.7 respectively. While the array set does not contain SNPs within the HER2/neu gene, the SNPs which flank the locus are SNPs 1720794 and 1738376. CARAT results for these SNPs are also summarized in Table 5 and confirm that the region surrounding HER2/NEU is amplified in two of the three cell lines. In Figure 5 all three cell lines show LOH in this region. Based on CARAT, MCF-7 shows one copy loss at the HER2/neu locus itself and proximal to the locus while there is no apparent copy number change distal to the locus. Additionally, SK-BR-3 and ZR-75-30 both show differential amplification of one allele relative to the other, resulting in allelic imbalance. These regions also serve to underscore how genotypic information can complement copy number information in the detection of complex structural alterations in regions exhibiting LOH. 
In Figure 5 panels b, d and f, results from CARAT are also consistent with additional regional copy-number increases observed by CGH using metaphase chromosomes in MCF-7 (17q22-q24; ~47.5–68.4 Mb), SK-BR-3 (17q24-qter; ~59.9–78.8 Mb), and ZR-75-30 (17cen-q24; ~22.8–68.4 Mb) [49].

Figure 5

Three human breast cancer cell lines are represented by panels a-b (SK-BR-3), panels c-d (MCF-7), and panels e-f (ZR-75-30). The X-axis in all six panels is the physical position of SNPs along chromosome 17. The vertical lines just above the X-axis of each panel represent heterozygous (green) and homozygous (red) genotype calls. The Y-axis in all six panels is the estimated copy number. The points are derived from the kernel smoothing step and the solid horizontal lines are derived from the regression tree. Black colored lines indicate total copy number, the blue colored lines indicate the allele with the higher copy number estimate and the purple colored lines indicate the allele with the lower copy estimate. The vertical black line proximal to 40 Mb indicates the location of the HER2/neu gene. The panels on the left (panels a, c, and e) show an enlarged view of the genomic region harboring HER2/neu while the panels on the right (panels b, d, and f) show a larger view of the chromosome.

Table 5 Comparison of QPCR and CARAT on HER2/neu locus across three cell lines.

We chose 9 SNPs distributed across five different chromosomes for TaqMan analysis as an independent verification of allelic copy number information. These SNPs were identified by CARAT and represent various types of alterations. Allelic copy number results from CARAT and TaqMan for these SNPs across two cell lines are summarized in Table 6. TaqMan reactions for each SNP were done with genomic DNA from SK-BR-3 and ZR-75-30 as well as with normal DNA samples representing AA, AB, and BB genotypes that serve as positive controls for allele dosage. There are a total of 36 allele-specific copy number estimates when combining results for nine SNPs from the two cancer cell lines on both alleles. In general, there is a high linear correlation between the allelic copy number estimates using the algorithm and the allelic copy number derived from TaqMan reactions (Cor = 0.87). Among the 36 data points, there are 12 examples with a TaqMan-determined copy number lower than 0.5, which may indicate the loss of an allele. Ten of these 12 examples also show a CARAT copy number estimation lower than 0.5, indicating a strong consistency between the two methods. These 12 examples can be further separated into four categories: (1) normal homozygous SNP (one allele present in two copies, the other allele absent), which includes SNP 1724728 and SNP 1736669 from ZR-75-30; (2) homozygous deletion (both alleles absent), which includes SNP 1670177 from SK-BR-3; (3) hemizygous deletion (one allele absent, one allele present at one copy), which includes SNPs 1724728 and 1718017 from SK-BR-3 and SNPs 1726250, 112706 and 1670177 from ZR-75-30; (4) hemizygous deletion and one allele amplification (one allele absent and the other amplified), which includes SNP 1700191 from both samples, and SNP 1693987 from SK-BR-3.
There are also 9 examples with a TaqMan-determined copy number higher than 2.5 indicating putative allelic amplification; all of these 9 examples also have a CARAT copy number estimation higher than 2.5. Some examples are explained by category (4) described above, while the remaining examples all appear as asymmetric amplifications (one allele remains intact, one allele amplified), including SNPs 1726250, 1746553, 1710029 from SK-BR-3 and SNPs 1710029 and 1718017 from ZR-75-30. When the TaqMan-determined total copy number is less than 1 or greater than 3, the CARAT determined p-value is always very significant (< 0.0001) with a single exception of SNP 112706 from ZR-75-30 (p-value 0.002).
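The categories used in this comparison can be expressed as a simple decision rule on a pair of allelic copy number estimates. The thresholds of 0.5 (allele absent) and 2.5 (allele amplified) are taken from the text. The 1.5 cut-off separating one copy from two copies of the remaining allele, the function name, and the category labels are assumptions made for illustration.

```python
def classify_alleles(copies_a, copies_b, absent_below=0.5,
                     amplified_above=2.5, single_copy_max=1.5):
    """Assign a pair of allelic copy number estimates to a descriptive category."""
    low, high = sorted((copies_a, copies_b))
    if high < absent_below:
        return "homozygous deletion"
    if low < absent_below:
        if high > amplified_above:
            return "hemizygous deletion with amplification"
        if high <= single_copy_max:  # assumed boundary between one and two copies
            return "hemizygous deletion"
        return "normal homozygous"
    if high > amplified_above:
        return "amplification with both alleles present"
    return "no allelic loss or amplification"


# Illustrative inputs (assumption), one for each category.
examples = {
    (0.1, 0.2): classify_alleles(0.1, 0.2),
    (0.0, 1.0): classify_alleles(0.0, 1.0),
    (2.0, 0.1): classify_alleles(2.0, 0.1),
    (0.15, 6.0): classify_alleles(0.15, 6.0),
    (1.0, 4.0): classify_alleles(1.0, 4.0),
    (1.0, 1.0): classify_alleles(1.0, 1.0),
}
```

Sorting the two estimates first makes the rule symmetric in the 'A' and 'B' alleles, so the same category is returned regardless of which allele is lost or amplified.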

Table 6 Allele-specific TaqMan results

In addition to allelic TaqMan reactions, direct DNA sequencing was carried out on PCR amplicons from both cell lines for seven of the SNPs. Several examples are shown in Figure 6. Panels a and d represent sequence traces using a forward primer for SNP 1693987 from SK-BR-3 and ZR-75-30 respectively. The polymorphic nucleotide in the sense strand is either C (allele A) or T (allele B). SK-BR-3 shows a clear blue peak representing the A allele while ZR-75-30 shows a clear red peak representing the B allele. Both of these base calls are consistent with the predominant allele identified by both CARAT and TaqMan. The copy number of allele B (SNP 1693987) from SK-BR-3 is below 0.5 copies based on CARAT and TaqMan while the copy number of allele A is greater than six with both methods. The DNA sequence trace however does not detect the presence of the minor allele. In contrast, the signal from the minor allele can be detected in the case of SNP 1718017 as shown in panels b, c, e, and f. The polymorphic nucleotide in the sense strand is either G (allele A) or T (allele B). Sequence traces using the forward primer show that in both cell lines the major allele is the A allele (G) as indicated by the black peak. However, ZR-75-30 also shows a smaller red peak indicating the presence of allele B (T). The tracings using the reverse primer also confirm the major allele is the A allele (G) in both cell lines, and ZR-75-30 again shows a minor green peak corresponding to allele B (T). There is no clear detection of the minor allele in the sequence traces from SK-BR-3 (panel b and c) which is consistent with both the CARAT (0.15 copies) and TaqMan (0 copies) results. In ZR-75-30, the ratio of the A allele peak height to the B allele peak height is 3.3 in the forward traces and 4.9 in the reverse traces, which are in general agreement with the allele ratios of 3.2 by CARAT and 5.0 by TaqMan. 
Thus the DNA sequencing results for this SNP confirm the CARAT and TaqMan results which suggested that allele B was present in at least one to two copies in ZR-75-30 but not in SK-BR-3.

Figure 6

DNA sequencing traces surrounding the polymorphic nucleotide are shown in each panel. The SNP corresponds to the underlined base. Panels a and d represent tracings using the forward sequencing primer for SNP 1693987. Panels b and e represent tracings using the forward sequencing primer for SNP 1718017 while panels c and f represent tracings using the reverse sequencing primer for SNP 1718017.


We have developed an algorithm, CARAT, that is used in conjunction with the GeneChip® Mapping 100 K Set and provides accurate copy number estimates in an allele-specific manner. This algorithm makes use of the highly accurate genotypic information across a set of normal individuals to identify probes with strong allele-specific dose responses. The copy number estimation is accompanied by a significance score derived by a comparison to a reference set of normal individuals. Kernel smoothing with a Gaussian kernel and a relatively small bandwidth of 100 kb is applied on the individual estimates in an attempt to achieve a balance between resolution and noise reduction. Regression trees are applied at the final stage as a method to partition the genome into regions that share the same copy number and to assign an overall copy number and significance to every region that deviates from the diploid state. This partitioning step further reduces the random variability from SNP to SNP and increases the interpretability of the output. Although regression trees are conceptually simple, they solve the complex issue of how to define genomic regions with similar alterations. The assumption under regression trees is that different regions of the feature space have a constant outcome. With a series of recursive binary splits, they efficiently and accurately stratify the feature space into groups such that the random deviation from the fitted constant is minimized [51]. In the application of regression trees to DNA copy number analysis, the feature space is one dimensional and corresponds to the physical location on the chromosome while the outcome is the unknown copy number. The non-parametric nature of the tree method thus uncouples it from the many assumptions associated with particular distributions, which is especially appropriate for this array-platform since the behavior of probe intensity can be complex and difficult to summarize.
The kernel smoothing step used for noise reduction and the tree partitioning step used for genome segmentation provide an optimal combination that renders high performance along with simple interpretation of the output [52]. This combination of information allows genomic alterations that lead to allelic imbalance to be characterized in a manner that is not currently possible by approaches such as CGH. Additionally, allelic copy number potentially allows examples of both whole chromosome and segmental uniparental disomy to be identified as well as genome-wide assessments of monoallelic amplification [53].
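The recursive binary partitioning described above can be illustrated with a minimal sketch. It is not the CARAT implementation: the function names, the stopping parameters `min_gain` and `min_size`, and the use of the sum of squared errors as the split criterion are assumptions chosen for illustration. The input is a list of smoothed copy number estimates already ordered by chromosomal position.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(values, min_size):
    """Return (index, gain) of the binary split that most reduces the SSE.

    Only splits that leave at least `min_size` SNPs on each side are
    considered. Returns (None, 0.0) when no split reduces the SSE.
    """
    total = sse(values)
    best_index, best_gain = None, 0.0
    for i in range(min_size, len(values) - min_size + 1):
        gain = total - sse(values[:i]) - sse(values[i:])
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index, best_gain


def segment(values, start=0, min_gain=1.0, min_size=3):
    """Recursively partition `values` into (start, end, mean) segments.

    `min_gain` and `min_size` are hypothetical stopping rules, not
    parameters taken from the paper.
    """
    index, gain = best_split(values, min_size)
    if index is None or gain < min_gain:
        return [(start, start + len(values), sum(values) / len(values))]
    left = segment(values[:index], start, min_gain, min_size)
    right = segment(values[index:], start + index, min_gain, min_size)
    return left + right


# Toy chromosome: diploid, a gained region, then diploid again.
profile = [2.0] * 5 + [4.0] * 5 + [2.0] * 5
print(segment(profile))
```

Each returned tuple gives the half-open index range of a region and its fitted constant copy number, which mirrors the piecewise-constant assumption of the tree model.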

There are a number of alternative statistical methods that have been used to analyze array data for the purpose of copy number variation detection. Several approaches have used Hidden Markov Models (HMMs) [1, 47, 54, 55]. Although in general the Markov chain framework does fit genome-wide copy number variation, determination of the specific parameters in the model can depend on the patterns of variation in the samples. Thus the performance hinges on how well the actual distribution of copy number variation in experimental samples such as cancer cells, which is largely unknown, agrees with the distribution hypothesized by the model. In this study we compared CARAT with two methods that use HMMs, namely dCHIP and CNAG. The performance of dCHIP and CARAT is similar, while CNAG tends to be biased towards the normal diploid state. In addition, dCHIP cannot offer allele-specific information, in contrast to CARAT and CNAG. However, allele-specific estimation in CNAG is only feasible in matched pairs of samples and then only considers those SNPs that are called heterozygous in the normal matched sample. CARAT is free of these constraints, and allele-specific copy number can be estimated on any SNP with any sample.

Additional approaches include change-point analysis [56, 57] or posterior log likelihood [58] to partition the genome into normal versus changed regions. These approaches assume that the intensity variability of probes corresponding to sub-regions of the genome is similar. However, using WGSA and the high density arrays, we observe substantial variation in the intensities of different SNPs. This can result from differences in SNP probe sequences as well as the restriction fragment target sequences. Regression on the probe GC content and the restriction fragment length stabilizes SNP variability and improves sample-to-sample comparability. In addition, the use of a large normal reference set enables the intensity distribution on diploid genomes to be directly estimated at an individual SNP level, thereby improving the accuracy of the model. There is also an algorithm that uses a hierarchical clustering scheme along the chromosome to identify changes. Here the signal threshold is set by directly controlling the false discovery rate (FDR), providing researchers with a high level of confidence regarding their findings [59]. The challenge with such an approach is that a desirable FDR level can preclude the detection of moderate changes that only span a short stretch of the genome. This issue is also relevant to our algorithm: the p-value threshold that separates significance from insignificance is determined empirically with the test set of normal individuals and with ROC analyses using the 1X to 5X samples, but a trade-off between detection power and false positive rate remains. In addition, kernel smoothing across neighboring SNPs can sacrifice single point resolution. The smoothing window chosen is 100 kb with a Gaussian kernel in which the points near the window boundary have minimum weight, rendering an average resolution of no lower than 100 kb. Although this resolution is high compared to traditional CGH, it is nevertheless sub-optimal compared to the average 30 kb resolution of single point analysis. These issues should in part be offset by new advancements that allow the resolution of the high density arrays to be further increased through a decrease in feature size and an increase in target DNA complexity, resulting in the capability to simultaneously genotype over 500,000 SNPs using a pair of arrays.
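As an illustration of the smoothing step discussed above, the following sketch applies a Gaussian-weighted average to per-SNP copy number estimates within a 100 kb window. It is only a sketch under stated assumptions: the function name, the choice of `sigma = window / 6` (so that weights are near zero at the window boundary), and the simple quadratic-time loop are not taken from the paper.

```python
import math


def gaussian_smooth(positions, values, window=100_000, sigma=None):
    """Kernel-weighted average of `values` at each SNP position.

    positions: physical coordinates (bp) of the SNPs on one chromosome.
    values:    per-SNP copy number estimates, in the same order.
    Only SNPs within window/2 of the target SNP contribute. `sigma`
    defaults to window/6, an assumption made for this sketch.
    """
    if sigma is None:
        sigma = window / 6.0
    half = window / 2.0
    smoothed = []
    for center in positions:
        num = den = 0.0
        for pos, val in zip(positions, values):
            dist = pos - center
            if abs(dist) <= half:
                weight = math.exp(-0.5 * (dist / sigma) ** 2)
                num += weight * val
                den += weight
        # den >= 1 because every SNP contributes to itself with weight 1.
        smoothed.append(num / den)
    return smoothed


# Hypothetical coordinates: three nearby SNPs and one isolated SNP.
pos = [0, 20_000, 40_000, 500_000]
raw = [2.0, 5.0, 2.0, 3.0]
print(gaussian_smooth(pos, raw))
```

The isolated SNP keeps its raw value because no neighbor falls inside its window, while the spike at the second SNP is pulled toward its neighbors. This is the loss of single point resolution described in the text.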


Cell lines & DNA samples

All human breast cancer cell lines (MCF-7, SK-BR-3, and ZR-75-30) were obtained from the American Type Culture Collection (ATCC). Genomic DNAs were isolated using the QIAGEN QIAamp DNA Blood Mini Kit. DNA samples used as controls in allelic TaqMan analysis as well as DNA samples derived from cell lines containing 3X (NA04626), 4X (NA01416), and 5X (NA06061) chromosomes were purchased from the NIGMS Human Genetic Cell Repository, Coriell Institute for Medical Research (Camden, NJ).


The whole genome sampling assay (WGSA) was performed using an earlier version of the final protocol. Briefly, 250 ng genomic DNA is digested in 20 μl with 10 U of either Xba I or Hind III restriction enzyme (New England Biolabs) at 37°C for 2 hr followed by heat inactivation at 70°C for 20 min. The digested DNA is ligated in 25 μl with 0.25 μM Xba I adaptors (5'-ATTATGAGCACGACAGACGCCTGATCT-3' and 5'-pCTAGAGATCAGGCGTCTGTCGTGCTCATAA-3') or Hind III adaptors (5'-ATTATGAGCACGACAGACGCCTGATCA-3' and 5'-pAGCTAGATCAGGCGTCTGTCGTGCTCATAA-3') and 250 units T4 DNA ligase (New England Biolabs) at 16°C for 2 hr followed by heat inactivation at 70°C for 20 min. DNA amplification is carried out by PCR under the following conditions: each 100 μl reaction contains 25 ng adaptor-ligated genomic DNA, 1 μM primer (5'-ATTATGAGCACGACAGACGCCTGATCT-3'), 300 μM dNTPs, 1 mM MgSO4, 5 U Pfx polymerase (Invitrogen Corporation) in 1× Pfx Amplification buffer with 1× PCR enhancer (Invitrogen Corporation). Thermal cycling is performed with 94°C for 3 min, followed by 30 cycles of 94°C/30 sec, 60°C/45 sec, 68°C/1 min, and a final extension at 68°C for 7 min. PCR products are purified and concentrated with a QIAGEN mini-elute plate and then spectrophotometrically quantitated using absorbance at 260 nm. 40 μg PCR products are fragmented in 55 μl with 0.2 units DNase I (Affymetrix) at 37°C for 30 min, followed by heat inactivation at 95°C for 15 min. The fragmented DNA products are labeled in 70 μl reactions containing 1× TdT buffer with 105 units TdT (Promega) and 0.214 mM DLR (Affymetrix) at 37°C for 2 hr, followed by heat inactivation at 95°C for 15 min. DNA hybridization to the GeneChip® Human Mapping 50 K Array Xba 240 and GeneChip® Human Mapping 50 K Array Hind 240, washing, staining, and scanning were performed according to the manufacturer's instructions (Affymetrix). SNP genotype calls are made automatically using a likelihood-based model [60].

Quantitative PCR and TaqMan assays

Quantitative PCR was performed using an ABI Prism 7700 Sequence Detection System (ABI). PCR primers were designed by using Primer Express 1.5 software (ABI) and were synthesized by Operon Biotechnologies, Inc. Reactions were prepared using the SYBR-Green PCR Core Reagents kit (ABI). 69 autosomal SNPs were selected and tested. 25 μl reactions containing 25 ng genomic DNA were set up for each SNP. Normal human genomic DNA was purchased from Roche Applied Sciences. Conditions for amplification were as follows: 1 cycle of 50°C for 2 min, followed by 1 cycle of 95°C for 10 min, then followed by 35 cycles of 95°C for 20 sec, 56°C for 30 sec, and 72°C for 30 sec. Threshold cycle numbers were obtained by using Sequence Detector v1.7a software. For all 69 SNPs, Roche human genomic DNA was used as the normal control. All reactions were done in duplicate and threshold cycle numbers were averaged. DNA amounts were measured by UV spectrophotometry and were normalized to LINE-1 elements [19]. Relative quantitation was carried out using the comparative Ct method (ABI User Bulletin #2, 1997). Quantitative PCR assays for HER2/neu were done as described except that the annealing temperature was 60°C. The primer sequences used for HER2/neu were (Fw) 5'-GAACTGGTGTATGCAGATTGC-3' and (Rv) 5'-AGCAAGAGTCCCCATCCTA-3'.

Primers and probes for allelic TaqMan analysis of 9 SNPs were ordered via the Assays-by-Design Service (ABI). TaqMan reactions contained 10 ng genomic DNA in a 25 μl reaction volume containing 1.25 U of Taq Gold DNA polymerase, 5 μM MgCl2, 250 μM of dNTPs, 1 μM of each PCR primer, and both the FAM and VIC labeled TaqMan probes for the SNP at 0.1 μM final concentration. The amplification conditions consisted of an initial incubation step at 95°C for 10 min, followed by 40 cycles at 92°C for 15 sec and 60°C for 1 min using an ABI Prism 7700 sequence detection system. The DNA amounts were normalized to LINE-1 elements. Each SNP was tested with three normal DNA samples that represented AA, AB, and BB genotypes. We estimated the allele-specific copy number with a linear model:

\[
\text{copy number} = \eta_0 + \eta_1 \times 2^{\Delta Ct} \qquad (1)
\]

The parameters of this model (i.e. $\eta_0$ and $\eta_1$) were estimated using the three normal samples that represented AA, AB and BB genotypes; "copy number" is the inherent two, one, or zero doses of the A allele and zero, one, or two doses of the B allele of the AA, AB, and BB genotypes from the corresponding normal samples; "ΔCt" is the TaqMan Ct difference between samples. In total, 18 such models were fit (9 SNPs × 2 alleles per SNP). In general this linear framework fits very well, with a mean R-square value of 0.993 and standard deviation of 0.009 across all 18 models. After $\eta_0$ and $\eta_1$ had been estimated for each allele of the nine SNPs, these 18 models were used to predict copy number from the Ct values associated with the experimental samples (SK-BR-3 and ZR-75-30).
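The calibration in Eq 1 can be sketched as an ordinary least squares fit on three points, one per normal genotype. The ΔCt values below are invented for illustration, and the sign convention (a larger ΔCt meaning more template) is an assumption, since the text only defines ΔCt as a Ct difference between samples. The function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = eta0 + eta1 * x; returns (eta0, eta1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    eta1 = sxy / sxx
    return mean_y - eta1 * mean_x, eta1


def predict_copy_number(delta_ct, eta0, eta1):
    """Eq 1: copy number = eta0 + eta1 * 2**delta_ct."""
    return eta0 + eta1 * 2.0 ** delta_ct


# Made-up calibration data for the A allele of one SNP.
# Known A-allele doses of the AA, AB and BB normal samples:
a_dose = [2, 1, 0]
# Hypothetical delta-Ct of each normal sample (assumed convention:
# larger delta-Ct means more template for this allele).
normal_delta_ct = [1.0, 0.0, -8.0]

eta0, eta1 = fit_line([2.0 ** d for d in normal_delta_ct], a_dose)

# Predict the A-allele copy number of a hypothetical tumor sample.
print(predict_copy_number(2.0, eta0, eta1))
```

With nine SNPs and two alleles per SNP, the same fit would be repeated 18 times, once per allele-specific probe.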

DNA sequence analysis

PCR primer pairs were designed for a subset of the SNPs. The primer pair sequences (Fw 5'-3'/Rv 5'-3') were SNP1724728 (GCTGAGGCTCTGGGAGTTC/ATGGAACTGCTGGAGGTTTG); SNP1726250 (ACATGGGCTGCAATATCCTC/GGGAGGTGGAAGAGAAAACC); SNP1700191 (GGGCAAAGGATCTGAATAAGC/CACATGCAGGTTTTTGTGTG); SNP1710029 (GACTGCCACAGTGGAAAGG/CCGTAGGCCTTCACTAGCAG); SNP1693987 (TTTTGGCCTTTGAGGCTATG/GGGTTCACCTTCCACACTTG); SNP1718017 (GAATCAGGTCACCAACATGG/AGTTCACAGCAAAGCACCAG); SNP1670177 (CCTCACAAAGAAGATTTGACCTG/TTGTCTTTCGGTCTTTGTGG). PCR products were sequenced by dideoxy DNA sequencing using the individual PCR primers as sequencing primers. Sequencing chromatograms were visualized using Chromas 2.3.


The following samples were used as a reference set during CNAG [61] analysis: NA17011, NA17101, NA17115, NA17201, and NA17214. The following samples were used as a reference set during dCHIP [62] analysis: NA15029, NA15385, NA15590, NA17011, NA17052, NA17053, NA17101, NA17115, NA17144, NA17172, NA17201, NA17214, NA17253, and NA17279. Default parameter settings were used for both methods.

Data analysis

Intensity transformation and standardization

The basic premise underlying copy number estimation is that the natural log-transformed chip intensity, following adjustments on SNP-specific affinity and non-specific hybridization, is linearly related to the natural log of the DNA target copy number:

\[
\ln(C + \delta_m) = \alpha_{m,0} + \alpha_{m,1}(I_m - A_m) + \varepsilon \qquad (2)
\]

where $m = 1, \ldots, M$ is the SNP index, $I_m$ is the natural log-transformed probe intensity on SNP $m$, $\alpha_{m,0}$, which is always a negative value, represents the quantity to be subtracted due to the SNP-specific optical background, $\alpha_{m,1}$ is the scaling factor, $\delta_m$ is the non-specific hybridization, $C$ is the DNA target concentration (i.e. copy number), $A_m$ is the affinity term determined by probe and target fragment sequences, and $\varepsilon$ is the random noise. The allele-specific copy number estimation (Eq 9 and 10) is based on this fundamental assumption. The only major difference is the affinity term $A_m$ in Eq 2, which has already been estimated and regressed out using a quadratic regression model with probe GC content and fragment length as the predictors (Eq 5 and 6). The main steps of the algorithm are summarized in a flowchart (Figure 7). The algorithm implements two rounds of standardization (Eq 3, 7, and 8). The first is applied to the natural log-transformed raw intensity (Eq 3) and establishes comparability across samples. The second is applied prior to the copy number estimation and realigns the target intensity according to the mean from the reference pool (Eq 7 and 8), thereby eliminating any systematic intensity shift that has not been adjusted by the previous standardization (Eq 3) or the affinity-based correction (Eq 5 and 6). The algorithm also employs a probe selection procedure (Eq 4) following the first round of standardization, in which only probes that show a strong dosage response across samples (as described in Eq 2) are selected for further analysis. Kernel smoothing is applied to the estimated copy number at the level of individual SNPs to further reduce the experimental and technical noise. Regression trees are applied to the smoothed result to partition each chromosome into regions with different copy numbers, to assign significance to each region, and to increase the interpretability of the overall results. All the parameters in CARAT are optimized using a training set of 128 individuals. The training set (Coriell Repositories) consists of 42 African Americans, 20 Asians, 42 Caucasians and 24 samples from the polymorphism discovery panel [63]. Among them, 71 are females and 57 are males. The information from the training set, including the intensities and genotypes, is publicly available upon request. Researchers can also use their own training set for CARAT.

Figure 7

The CARAT algorithm is summarized as a flow chart, indicating the major steps in both the training set and the test set. "CN" refers to copy number. The black dotted line indicates how and where the information from the training set is used in the test set.

Each SNP on the 100 K array set is represented by 40 unique features (probes): 10 perfect match (PM) probes and 10 mismatch (MM) probes for each of the A and B alleles. The natural log-transformation of the raw intensity is first applied at the probe level for all SNPs. After the transformation, standardization is performed based on the MM probe intensities that best represent background signals. This gives the background intensity a standard Gaussian distribution, which increases the comparability across chips. For each array with a single DNA sample, background intensity is defined as the MMa probe intensity for all SNPs with 'BB' genotype calls and the MMb probe intensity for all SNPs with 'AA' genotype calls. All PMa probes are linearly transformed such that under the same transformation the MMa probes for SNPs with 'BB' genotype calls have a variance of one and a mean of zero; all PMb probes are linearly transformed such that under the same transformation the MMb probes for SNPs with 'AA' genotype calls have a variance of one and a mean of zero.

\[
S_{a,lmn} = \frac{\ln(PM_{a,lmn}) - \hat{\mu}_{a,l}}{\hat{\sigma}_{a,l}}, \qquad
S_{b,lmn} = \frac{\ln(PM_{b,lmn}) - \hat{\mu}_{b,l}}{\hat{\sigma}_{b,l}} \qquad (3)
\]

with $\hat{\mu}_{a,l}$ and $\hat{\sigma}_{a,l}$ the sample estimates under the assumption

\[
\ln(MM_{a,lmn} \mid \text{SNP } m \text{ is autosomal and called genotype BB on sample } l) \sim N(\mu_{a,l}, \sigma_{a,l})
\]

and with $\hat{\mu}_{b,l}$ and $\hat{\sigma}_{b,l}$ the sample estimates under the assumption

\[
\ln(MM_{b,lmn} \mid \text{SNP } m \text{ is autosomal and called genotype AA on sample } l) \sim N(\mu_{b,l}, \sigma_{b,l})
\]

Here $l = 1, \ldots, L$ is the sample index, $m = 1, \ldots, M$ is the SNP index, and $n = 1, \ldots, N$ is the probe index. $\hat{\mu}_{a,l}$, $\hat{\mu}_{b,l}$, $\hat{\sigma}_{a,l}$, and $\hat{\sigma}_{b,l}$ are sample-specific parameters and are subject to change for any future experiments. Following natural log-transformation and standardization, the 20 PM probe intensities, in conjunction with the genotype information, are then analyzed for copy number information.
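The first standardization round (Eq 3) amounts to a z-score of the log PM intensities against the log MM background of the same sample and allele. The sketch below assumes the background probes have already been collected (MMa probes on autosomal SNPs called 'BB', or MMb probes on autosomal SNPs called 'AA'). The function name is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not specify the estimator.

```python
import math
import statistics


def standardize_pm(pm, mm_background):
    """Apply Eq 3 to one sample and one allele.

    pm:            raw PM probe intensities (all SNPs, all probes).
    mm_background: raw MM intensities chosen as background for this
                   allele, e.g. MMa probes on autosomal SNPs called BB.
    """
    log_bg = [math.log(v) for v in mm_background]
    mu_hat = statistics.mean(log_bg)
    # Sample standard deviation (n - 1); an assumption of this sketch.
    sigma_hat = statistics.stdev(log_bg)
    return [(math.log(v) - mu_hat) / sigma_hat for v in pm]


# Made-up intensities for illustration only.
background = [100.0, 150.0, 220.0, 90.0, 130.0]
pm_a = [800.0, 1200.0, 140.0]
print(standardize_pm(pm_a, background))
```

Because the same linear transformation would give the background probes mean zero and variance one, standardized PM values from different arrays are expressed on a common background scale.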

Probe Selection

PM probes that display a strong dosage response are selected for use in the algorithm. Each SNP has three possible genotypes, AA, AB and BB, which contain two, one, or zero doses of the A allele and zero, one, or two doses of the B allele, respectively. This provides an inherent positive control to examine dosage performance at the individual probe level on a SNP-by-SNP basis. Probe intensity information from the normal reference set is compared with genotypic information from the same individuals. Features with a linear correlation greater than 0.6 between the known allelic dosages, based on the genotype calls, and the probe intensity are selected. Intensity across selected probes is averaged and used in subsequent calculations.

\[
A_m = \{\, n \mid \mathrm{Cor} > 0.6 \text{ between } S_{a,rmn} \text{ and genotype } G_{rm};\ r = 1, \ldots, R \text{ is the reference set} \,\}
\]

\[
B_m = \{\, n \mid \mathrm{Cor} > 0.6 \text{ between } S_{b,rmn} \text{ and genotype } G_{rm};\ r = 1, \ldots, R \text{ is the reference set} \,\}
\]

\[
\bar{S}_{a,lm} = \frac{\sum_{n \in A_m} S_{a,lmn}}{\#\{A_m\}}; \qquad
\bar{S}_{b,lm} = \frac{\sum_{n \in B_m} S_{b,lmn}}{\#\{B_m\}} \qquad (4)
\]

SNPs that do not have at least one selected probe for both PMa and PMb probe sets were excluded from further analysis. $A_m$ and $B_m$ are parameters determined by the training set and are fixed for any given training set.
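The probe selection step (Eq 4) can be sketched for a single SNP and a single allele as follows. The data layout, the function names, and the decision to skip reference samples without a genotype call are assumptions of this sketch rather than details given in the text.

```python
import math

# Dose of the A allele implied by each genotype call.
A_DOSE = {"AA": 2, "AB": 1, "BB": 0}


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_probes(ref_intensity, ref_genotypes, dose=A_DOSE, threshold=0.6):
    """Indices of probes whose intensity tracks allele dose (the set A_m).

    ref_intensity[r][n]: standardized PM intensity of probe n in
                         reference sample r, for one SNP and one allele.
    ref_genotypes[r]:    genotype call of reference sample r at that SNP.
    Samples without a call are skipped (an assumption of this sketch).
    """
    rows = [(row, dose[g]) for row, g in zip(ref_intensity, ref_genotypes)
            if g in dose]
    doses = [d for _, d in rows]
    n_probes = len(ref_intensity[0])
    return [n for n in range(n_probes)
            if pearson([row[n] for row, _ in rows], doses) > threshold]


def mean_selected(sample_intensity, selected):
    """Eq 4: average a sample's intensity over the selected probes."""
    return sum(sample_intensity[n] for n in selected) / len(selected)


# Toy reference set with two probes; only probe 0 follows the dose.
ref = [[2.1, 0.3], [1.0, 0.5], [0.1, 0.2], [1.9, 0.4]]
calls = ["AA", "AB", "BB", "AA"]
chosen = select_probes(ref, calls)
print(chosen, mean_selected([1.5, 9.0], chosen))
```

For the B allele the same functions would be called with a dose mapping of `{"AA": 0, "AB": 1, "BB": 2}`. A SNP for which either call returns an empty list would be excluded, as stated above.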

Regression on probe GC content and restriction fragment length

Variation of the SNP intensities can in part be explained by properties of the probe and restriction fragment sequences [47, 64]. These properties include but are not limited to the length and GC content of the restriction fragment target, GC content of the probe sequences, and secondary structure of the probe and target sequences. An evaluation of these factors identified the GC content of the probes and the restriction fragment length as main contributors to variability in probe intensities using the 100 K WGSA assay. Linear regression, which included linear and square terms of both variables, was applied to reduce the intensity variations.

\[
\bar{S}_{a,l} = (\bar{S}_{a,l1}, \bar{S}_{a,l2}, \ldots, \bar{S}_{a,lm}, \ldots, \bar{S}_{a,lM}); \qquad
\bar{S}_{b,l} = (\bar{S}_{b,l1}, \bar{S}_{b,l2}, \ldots, \bar{S}_{b,lm}, \ldots, \bar{S}_{b,lM})
\]

\[
\begin{aligned}
\bar{S}_{a,l} &= \beta_{a,0} + \beta_{a,1} X_{a,1} + \beta_{a,2} X_{a,1}^{2} + \beta_{a,3} X_{2} + \beta_{a,4} X_{2}^{2} + \varepsilon_a \\
\bar{S}_{b,l} &= \beta_{b,0} + \beta_{b,1} X_{b,1} + \beta_{b,2} X_{b,1}^{2} + \beta_{b,3} X_{2} + \beta_{b,4} X_{2}^{2} + \varepsilon_b
\end{aligned}
\qquad (5)
\]

$X_{a,1}$ is the probe GC content averaged across the selected PMa probes; $X_{b,1}$ is the probe GC content averaged across the selected PMb probes; and $X_2$ is the restriction fragment length. The regression coefficients are sample-specific and thus are re-estimated for each new sample.

\[
\tilde{S}_{a,l} = \hat{\beta}_{a,0} + \hat{\varepsilon}_a; \qquad
\tilde{S}_{b,l} = \hat{\beta}_{b,0} + \hat{\varepsilon}_b \qquad (6)
\]

\[
\tilde{S}_{a,l} = (\tilde{S}_{a,l1}, \tilde{S}_{a,l2}, \ldots, \tilde{S}_{a,lm}, \ldots, \tilde{S}_{a,lM}); \qquad
\tilde{S}_{b,l} = (\tilde{S}_{b,l1}, \tilde{S}_{b,l2}, \ldots, \tilde{S}_{b,lm}, \ldots, \tilde{S}_{b,lM})
\]

The residuals plus the constant term are used as the adjusted intensity in subsequent steps, with the effects of the probe GC content and the fragment length regressed out.
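A minimal sketch of Eq 5 and 6 for one sample and one allele is given below. It fits the quadratic model by solving the normal equations with a small Gaussian elimination routine, so that it runs with the standard library only. The function names are hypothetical, the fragment length is expressed in kb purely to keep the toy system well conditioned, and a production implementation would more likely use a numerically robust least squares solver.

```python
def solve(a, b):
    """Solve the square system a.x = b by Gaussian elimination
    with partial pivoting (assumes a is non-singular)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def adjust_intensity(s_bar, gc, frag_len):
    """Eq 5 and 6: regress intensity on GC content and fragment length
    (linear and squared terms) and return intercept + residual per SNP.

    s_bar:    probe-averaged intensity per SNP (Eq 4).
    gc:       mean GC content of the selected probes per SNP.
    frag_len: restriction fragment length per SNP (kb in this sketch).
    """
    rows = [[1.0, g, g * g, f, f * f] for g, f in zip(gc, frag_len)]
    k = 5
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, s_bar)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    return [beta[0] + (y - fit) for y, fit in zip(s_bar, fitted)]


# Toy data in which intensity depends only on GC and fragment length.
gc = [0.30, 0.40, 0.50, 0.60, 0.35, 0.55, 0.45]
frag = [0.4, 1.2, 0.8, 1.6, 2.0, 0.6, 1.0]
s = [3.0 + 2.0 * g - 1.5 * g * g + 0.5 * f - 0.2 * f * f
     for g, f in zip(gc, frag)]
print(adjust_intensity(s, gc, frag))
```

In this toy case every adjusted value is close to the intercept of 3.0, since all variation was generated from the two covariates. In real data, the residual term retains the copy number signal.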

Regression adjustment

Following the standardization and the regression on the probe GC content and the restriction fragment length, a further correction of systematic intensity deviations is made by a regression on the reference set mean intensities. The reference mean intensity for a given probe set (PMa or PMb) and genotype (AA, AB or BB) is calculated for each SNP. For a given test sample, two regressions are performed in this adjustment step: one for PMa and one for PMb. In each regression, the PM intensity of the test sample across all SNPs is regressed against the average PM intensity of the reference samples that share the same genotype as the test sample. With the estimated regression coefficients, the test sample is linearly transformed by subtracting the intercept and then dividing by the slope, such that after the transformation the regression line of the test sample intensity against the average reference intensity is Y = X.

\[
R_{m,AA} = \{ r \mid G_{rm} = AA;\ r = 1, \ldots, R \}; \qquad
R_{m,AB} = \{ r \mid G_{rm} = AB;\ r = 1, \ldots, R \};
\]

\[
R_{m,BB} = \{ r \mid G_{rm} = BB;\ r = 1, \ldots, R \}; \qquad
R_{m,all} = \{ r \mid G_{rm} = AA, AB \text{ or } BB;\ r = 1, \ldots, R \};
\]

\[
\tilde{S}_{a,l} = \alpha_{a,0} + \alpha_{a,1} \times U_{a,l} + \varepsilon; \qquad
\tilde{S}_{b,l} = \alpha_{b,0} + \alpha_{b,1} \times U_{b,l} + \varepsilon \qquad (7)
\]

\[
U_{a,l} = (U_{a,l1}, U_{a,l2}, \ldots, U_{a,lm}, \ldots, U_{a,lM}); \qquad
U_{b,l} = (U_{b,l1}, U_{b,l2}, \ldots, U_{b,lm}, \ldots, U_{b,lM})
\]

\[
U_{a,lm} =
\begin{cases}
\dfrac{\sum_{r \in R_{m,AA}} \tilde{S}_{a,rm}}{\#\{R_{m,AA}\}} & G_{lm} = AA \\[2ex]
\dfrac{\sum_{r \in R_{m,AB}} \tilde{S}_{a,rm}}{\#\{R_{m,AB}\}} & G_{lm} = AB \\[2ex]
\dfrac{\sum_{r \in R_{m,BB}} \tilde{S}_{a,rm}}{\#\{R_{m,BB}\}} & G_{lm} = BB \\[2ex]
\dfrac{\sum_{r \in R_{m,all}} \tilde{S}_{a,rm}}{\#\{R_{m,all}\}} & G_{lm} = \text{No call}
\end{cases}
\qquad
U_{b,lm} =
\begin{cases}
\dfrac{\sum_{r \in R_{m,AA}} \tilde{S}_{b,rm}}{\#\{R_{m,AA}\}} & G_{lm} = AA \\[2ex]
\dfrac{\sum_{r \in R_{m,AB}} \tilde{S}_{b,rm}}{\#\{R_{m,AB}\}} & G_{lm} = AB \\[2ex]
\dfrac{\sum_{r \in R_{m,BB}} \tilde{S}_{b,rm}}{\#\{R_{m,BB}\}} & G_{lm} = BB \\[2ex]
\dfrac{\sum_{r \in R_{m,all}} \tilde{S}_{b,rm}}{\#\{R_{m,all}\}} & G_{lm} = \text{No call}
\end{cases}
\]

\[
I_{a,l} = \frac{\tilde{S}_{a,l} - \hat{\alpha}_{a,0}}{\hat{\alpha}_{a,1}}; \qquad
I_{b,l} = \frac{\tilde{S}_{b,l} - \hat{\alpha}_{b,0}}{\hat{\alpha}_{b,1}} \qquad (8)
\]

\[
I_{a,l} = (I_{a,l1}, I_{a,l2}, \ldots, I_{a,lm}, \ldots, I_{a,lM}); \qquad
I_{b,l} = (I_{b,l1}, I_{b,l2}, \ldots, I_{b,lm}, \ldots, I_{b,lM})
\]

Where $R_{m,AA}$, $R_{m,AB}$, $R_{m,BB}$ and $R_{m,all}$ are the subsets of the reference samples whose genotypes on SNP m (m = 1 to M) are "AA", "AB", "BB", and the union of the three groups, respectively; $U_{a,l}$ and $U_{b,l}$ are two vectors of the average PMa and PMb intensities, across all SNPs, of the reference samples that share the same genotype as the target sample l; $G_{lm}$ is the genotype of test sample l on SNP m; $\tilde{S}_{a,rm}$ and $\tilde{S}_{b,rm}$ are the PMa and PMb intensities of reference sample r (r = 1 to R) on SNP m (m = 1 to M); $\tilde{S}_{a,l}$ and $\tilde{S}_{b,l}$ are the PMa and PMb intensities of test sample l before this adjustment step; and $I_{a,l}$ and $I_{b,l}$ are the PMa and PMb intensities of test sample l after this adjustment step.
The regression coefficients are sample dependent and thus are estimated for each specific sample.

Single point copy number prediction and significance calculation

An ln-ln model was used to estimate the copy number of each allele under the assumption that the natural log of the DNA target copy number has a linear relationship with the natural log-transformed intensity. The model is fitted on the reference set, indexed by r = 1,...,R, whose samples are assumed to have diploid genomes.

$$
\begin{aligned}
I_{a,m} &= (I_{a,1m}, I_{a,2m}, \ldots, I_{a,rm}, \ldots, I_{a,Rm}); \quad I_{b,m} = (I_{b,1m}, I_{b,2m}, \ldots, I_{b,rm}, \ldots, I_{b,Rm}) \\
C_{a,m} &= (C_{a,1m}, C_{a,2m}, \ldots, C_{a,rm}, \ldots, C_{a,Rm}); \quad C_{b,m} = (C_{b,1m}, C_{b,2m}, \ldots, C_{b,rm}, \ldots, C_{b,Rm}) \\
\ln(C_{a,m} + \delta_{a,m}) &= \alpha_{a0,m} + \alpha_{a1,m} I_{a,m} + \varepsilon_{a,m} \\
\ln(C_{b,m} + \delta_{b,m}) &= \alpha_{b0,m} + \alpha_{b1,m} I_{b,m} + \varepsilon_{b,m}
\end{aligned}
\tag{9}
$$

The parameters in these formulas were estimated using the reference set and its known genotypes. $I_{a,rm}$ and $I_{b,rm}$ are the PMa and PMb intensities of reference sample r on SNP m; $C_{a,rm}$ and $C_{b,rm}$ are the known copy numbers of the A and B alleles of reference sample r on SNP m. Since the allelic copy number can be equal to zero, two small positive numbers $\delta_{a,m}$ and $\delta_{b,m}$, which represent the non-specific hybridization that accounts for the baseline intensity, were added for each SNP m. The values of $\delta_{a,m}$ and $\delta_{b,m}$ were tested over a range of 0 to 5 in increments of 0.01, and the value that generated the highest linear correlation between the natural log-transformed copy number $\ln(C_{a,m} + \delta_{a,m})$ (respectively $\ln(C_{b,m} + \delta_{b,m})$) and the natural log-transformed chip intensity $I_{a,m}$ (respectively $I_{b,m}$) was selected. SNPs whose highest correlation value was below 0.8 were removed from further analysis. After $\delta_{a,m}$ and $\delta_{b,m}$ were fixed, the terms representing the effect of optical background, $\alpha_{a0,m}$ and $\alpha_{b0,m}$, and the scaling factors, $\alpha_{a1,m}$ and $\alpha_{b1,m}$, were estimated by least squares regression with the normal references as the training set. After the estimation, the final copy number equation is:

$$
\begin{aligned}
\hat{C}_{a,lm} &= \max\left(\exp(\hat{\alpha}_{a1,m} I_{a,lm} + \hat{\alpha}_{a0,m}) - \hat{\delta}_{a,m},\ 0\right) \\
\hat{C}_{b,lm} &= \max\left(\exp(\hat{\alpha}_{b1,m} I_{b,lm} + \hat{\alpha}_{b0,m}) - \hat{\delta}_{b,m},\ 0\right)
\end{aligned}
\tag{10}
$$

All the parameters involved in the allele-specific copy number model are fixed with a given training set.
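The estimation procedure for one allele of one SNP can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the returned dictionary, and the toy data are hypothetical, and the grid is started at 0.01 rather than 0 (an assumption) so that the logarithm stays defined when an allelic copy number is zero.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def fit_allele_model(log_intensity, copy_number, min_corr=0.8):
    """Fit ln(C + delta) = alpha0 + alpha1 * I for one allele of one SNP.

    log_intensity: ln-transformed intensities of the reference samples
    copy_number:   known allelic copy numbers of the same reference samples
    Returns None when the best correlation is below min_corr (SNP dropped).
    """
    best_delta, best_corr = None, -math.inf
    # Grid 0.01, 0.02, ..., 5.00 (delta = 0 skipped: assumption, see lead-in).
    for step in range(1, 501):
        delta = step / 100.0
        y = [math.log(c + delta) for c in copy_number]
        corr = pearson(log_intensity, y)
        if corr > best_corr:
            best_delta, best_corr = delta, corr
    if best_corr < min_corr:
        return None
    # Ordinary least squares of ln(C + delta) on I with delta held fixed.
    y = [math.log(c + best_delta) for c in copy_number]
    n = len(y)
    mx, my = sum(log_intensity) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in log_intensity)
    sxy = sum((a - mx) * (b - my) for a, b in zip(log_intensity, y))
    alpha1 = sxy / sxx
    alpha0 = my - alpha1 * mx
    return {"delta": best_delta, "alpha0": alpha0, "alpha1": alpha1,
            "corr": best_corr}

def predict_copy(fit, log_intensity):
    """Equation (10): back-transform an intensity and clip at zero."""
    value = math.exp(fit["alpha1"] * log_intensity + fit["alpha0"])
    return max(value - fit["delta"], 0.0)

# Toy reference set: A-allele copy numbers implied by genotypes BB, AB, AA.
copies = [0, 1, 2, 0, 1, 2]
# Intensities simulated from the model with delta=0.5, alpha0=-3, alpha1=0.5.
intensities = [(math.log(c + 0.5) + 3.0) / 0.5 for c in copies]
fit = fit_allele_model(intensities, copies)
one_copy = predict_copy(fit, intensities[1])
```

On this noise-free toy input the grid search recovers the simulated offset, and the fitted model maps the intensity of a one-copy reference back to a copy number of about one. With real data the correlation filter decides whether the SNP is retained at all.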

The copy number calculation is allele specific and SNP specific. The values for the two alleles can be summed to give the total copy number. The significance of the total copy number is calculated to identify putative amplifications and deletions. To this end, the reference samples are passed back through the fitted ln-ln linear models and their predicted total copy numbers are recorded. For a given SNP m and a given reference sample r (r = 1 to R), the predicted total copy number is:

$$
\begin{aligned}
\hat{C}_{t,rm} = \hat{C}_{a,rm} + \hat{C}_{b,rm} ={} & \max\left(\exp(\hat{\alpha}_{a0,m} + \hat{\alpha}_{a1,m} \tilde{S}_{a,rm}) - \hat{\delta}_{a,m},\ 0\right) \\
& + \max\left(\exp(\hat{\alpha}_{b0,m} + \hat{\alpha}_{b1,m} \tilde{S}_{b,rm}) - \hat{\delta}_{b,m},\ 0\right)
\end{aligned}
\tag{11}
$$

For each SNP, there will be a range of variability across the normal reference samples based on their estimated copy numbers and such variability is summarized as the reference distribution under the Gaussian assumption. Target samples are compared to this reference distribution and significance is calculated accordingly [35].

$$
\begin{aligned}
\hat{C}_{t,m} &= (\hat{C}_{t,1m}, \hat{C}_{t,2m}, \ldots, \hat{C}_{t,rm}, \ldots, \hat{C}_{t,Rm}) \\
\hat{C}_{t,m} &\sim N(\mu_{tm}, \sigma_{tm}^{2}) \\
\hat{\mu}_{tm} &= \frac{1}{R} \sum_{r=1}^{R} \hat{C}_{t,rm} \\
\hat{\sigma}_{tm} &= \sqrt{\frac{1}{R-1} \sum_{r=1}^{R} \left(\hat{C}_{t,rm} - \hat{\mu}_{tm}\right)^{2}}
\end{aligned}
\tag{12}
$$

For a given test sample l on SNP m, the total copy number estimate is:

$$
\begin{aligned}
\hat{C}_{t,lm} = \hat{C}_{a,lm} + \hat{C}_{b,lm} ={} & \max\left(\exp(\hat{\alpha}_{a0,m} + \hat{\alpha}_{a1,m} I_{a,lm}) - \hat{\delta}_{a,m},\ 0\right) \\
& + \max\left(\exp(\hat{\alpha}_{b0,m} + \hat{\alpha}_{b1,m} I_{b,lm}) - \hat{\delta}_{b,m},\ 0\right)
\end{aligned}
\tag{13}
$$

And the significance is calculated as:

$$
p_{lm} = \min\left(1 - \Phi\left(\frac{\hat{C}_{t,lm} - \hat{\mu}_{tm}}{\hat{\sigma}_{tm}}\right),\ \Phi\left(\frac{\hat{C}_{t,lm} - \hat{\mu}_{tm}}{\hat{\sigma}_{tm}}\right)\right)
\tag{14}
$$

The significance at each SNP tests whether the copy number value associated with the SNP deviates from the diploid state.
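The single-point significance calculation of equations (10) to (14) can be sketched as follows for one SNP. This is an illustrative sketch only: the function names, the parameter tuples `(alpha0, alpha1, delta)`, and all numeric values are hypothetical and are not taken from the study.

```python
import math
from statistics import NormalDist, mean, stdev

def allele_copy(log_intensity, alpha0, alpha1, delta):
    """Equation (10): allele-specific copy number, clipped at zero."""
    return max(math.exp(alpha0 + alpha1 * log_intensity) - delta, 0.0)

def total_copy(int_a, int_b, par_a, par_b):
    """Equations (11) and (13): sum of the two allele-specific estimates."""
    return allele_copy(int_a, *par_a) + allele_copy(int_b, *par_b)

def significance(test_total, reference_totals):
    """Equations (12) and (14): smaller Gaussian tail probability of the test
    sample relative to the reference distribution at the same SNP."""
    mu = mean(reference_totals)
    sigma = stdev(reference_totals)  # sample SD, divisor R - 1
    z = (test_total - mu) / sigma
    cdf = NormalDist().cdf(z)
    return min(1.0 - cdf, cdf)

# Hypothetical fitted parameters (alpha0, alpha1, delta) for alleles A and B.
par_a = (-3.0, 0.5, 0.5)
par_b = (-3.0, 0.5, 0.5)

# Hypothetical predicted total copy numbers of five diploid references.
reference_totals = [1.9, 2.1, 2.0, 2.05, 1.95]

# Test sample with one copy of allele A and no copy of allele B.
int_one = (math.log(1.0 + 0.5) + 3.0) / 0.5
int_zero = (math.log(0.0 + 0.5) + 3.0) / 0.5
test_total = total_copy(int_one, int_zero, par_a, par_b)

p_deleted = significance(test_total, reference_totals)
p_normal = significance(2.0, reference_totals)
```

A test value sitting at the reference mean gives a probability near 0.5, while the hemizygous example lies far in the lower tail. Which tail attains the minimum indicates whether the deviation points toward a deletion or an amplification.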

Define regions with significant alterations

Before defining regions with significant alterations, kernel smoothing is applied to reduce the effect of outliers caused by inherent experimental error as well as by the occasional true single-marker copy number variant. A Gaussian kernel with a bandwidth of 100 kb is applied to the total copy number, the significance associated with the total copy number (i.e. the log10-transformed p-values), and the allele-specific copy number. The bandwidth is fixed for all analyses. For allele-specific copy number, smoothing is applied separately to the lower and the higher copy number estimate at each marker in an effort to present phased data. To achieve a better estimate, putative regions of LOH are first identified, defined as more than k (k = 10) contiguous homozygous calls along the genome; intermittent "no calls" are allowed but are not counted toward k. In such regions all markers, i.e. homozygous calls and no-calls, participate in the allele-specific smoothing. In all other regions, only markers with heterozygous genotypes are used for smoothing, to prevent the underestimation of one strand and the overestimation of the other. In the idealized case of a perfect copy number prediction on a normal diploid region of the genome, heterozygous SNP genotype calls are interleaved with homozygous calls. For heterozygous calls, the lower and the higher copy number estimates of the two alleles will both be close to one. For homozygous calls, the lower estimate will be near zero and the higher estimate will be near two. If both homozygous and heterozygous calls were used for allele-specific smoothing, the single-point estimates on the "lower-copy-number" strand would alternate between values close to zero and values close to one, and the smoothed copy number would therefore fall below one.
Similarly, for the other DNA strand, the single-point estimates would contain values close to two or one, and the smoothed values would therefore be higher than one. Using only heterozygous calls in these normal regions thus largely removes such under- and overestimation. In regions with long stretches of homozygous calls, which rarely occur by chance and are more likely caused by an asymmetry between the two strands, it is more appropriate to use all markers for the allele-specific copy number smoothing.
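The marker selection and smoothing described above can be illustrated with a small sketch. Several points are assumptions rather than details given in the text: the bandwidth is treated as the standard deviation of the Gaussian kernel, the genotype codes "AA", "AB", "BB" and "NC" (no call) are hypothetical, and the selected markers are smoothed jointly rather than region by region.

```python
import math

def gaussian_smooth(positions, values, bandwidth=100_000.0):
    """Kernel-weighted average at each marker (Gaussian kernel).
    The bandwidth is used as the kernel SD in base pairs (assumption)."""
    smoothed = []
    for x0 in positions:
        weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
                   for x in positions]
        total = sum(w * v for w, v in zip(weights, values))
        smoothed.append(total / sum(weights))
    return smoothed

def loh_mask(genotypes, k=10):
    """Flag markers in putative LOH regions: runs containing more than k
    homozygous calls; no-calls may sit inside a run but are not counted."""
    calls = list(genotypes) + ["AB"]  # sentinel het call closes a final run
    mask = [False] * len(genotypes)
    start, hom_count = None, 0
    for i, call in enumerate(calls):
        if call == "AB":
            if start is not None and hom_count > k:
                for j in range(start, i):
                    mask[j] = True
            start, hom_count = None, 0
        else:
            if start is None:
                start = i
            if call != "NC":
                hom_count += 1
    return mask

def smooth_allele_specific(positions, copy_a, copy_b, genotypes,
                           k=10, bandwidth=100_000.0):
    """Smooth the lower and higher allelic estimates separately, using all
    markers inside LOH regions and only heterozygous markers elsewhere."""
    mask = loh_mask(genotypes, k)
    keep = [i for i, g in enumerate(genotypes) if mask[i] or g == "AB"]
    pos = [positions[i] for i in keep]
    low = [min(copy_a[i], copy_b[i]) for i in keep]
    high = [max(copy_a[i], copy_b[i]) for i in keep]
    return (pos, gaussian_smooth(pos, low, bandwidth),
            gaussian_smooth(pos, high, bandwidth))

# Idealized diploid region: het calls interleaved with homozygous calls.
positions = [i * 50_000 for i in range(8)]
genotypes = ["AB", "AA", "AB", "BB", "AB", "AA", "AB", "BB"]
copy_a = [1.0, 2.0, 1.0, 0.0, 1.0, 2.0, 1.0, 0.0]
copy_b = [1.0, 0.0, 1.0, 2.0, 1.0, 0.0, 1.0, 2.0]

pos, low, high = smooth_allele_specific(positions, copy_a, copy_b, genotypes)
# Naive alternative: smooth the lower estimate over every marker.
naive_low = gaussian_smooth(
    positions, [min(a, b) for a, b in zip(copy_a, copy_b)])
```

In this idealized region the heterozygous-only smoothing keeps both allelic estimates at one, whereas the naive smoothing over all markers pulls the lower estimate below one, which is the underestimation the text describes.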

After smoothing, regression trees [45] are applied with the physical location of each marker as the sole predictor and the natural log of the total copy number plus one as the outcome (one is added to avoid negative infinity in the case of a homozygous deletion). The log transformation is used because the variation in intensity has been observed empirically to increase with copy number; the transformation stabilizes the variance and better fits the regression tree framework. The complexity parameter is set to a small value, cp = 0.0001, so that a sufficiently complex partition is explored while splits that do not increase the overall R-squared value by at least 0.01% are not attempted. In addition, regions with three or fewer points are not split further. Once a sufficiently complex partition has been obtained, 10-fold cross-validation and the one-standard-deviation rule are applied to prune the large tree back to an appropriate size and so control over-fitting. After the final partitioning, the average across a region is assigned as the copy number of that region (this is a geometric average of the original copy number estimates, since the regression tree is fitted in the log-transformed copy number space); the significance of the region is the average of the log10-transformed p-values (log10 for deletions, -log10 for amplifications). Within each chromosome, regions with overall non-significant p-values (p-value > 0.01) are merged, and their copy number and significance values are then updated under the assumption that they represent the same normal diploid state. For allele-specific copy number estimation, the regression-tree partition is performed in an allele-specific manner within each region defined by the total copy number partition. The results are pruned back using the same cross-validation approach as for the total copy number estimation.
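The growing phase of such a partition can be sketched in a few lines. This is a simplified illustration, not the regression tree software used in the study: markers are assumed to be sorted by physical position so that a split on position is a split on index, the cp rule is implemented as a minimum reduction in the sum of squares relative to the root, and the cross-validation pruning step is omitted. All function names are hypothetical.

```python
import math

def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(y):
    """Return (index, gain) of the split y[:index] | y[index:] that gives
    the largest reduction in the sum of squares; index is None if no gain."""
    total = sse(y)
    best_index, best_gain = None, 0.0
    for i in range(1, len(y)):
        gain = total - sse(y[:i]) - sse(y[i:])
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index, best_gain

def partition(total_copy, cp=0.0001, min_points=4):
    """Grow a one-predictor regression tree on ln(copy number + 1).

    Returns (segments, region_copy): half-open index ranges in genomic
    order and the back-transformed mean of each range.
    """
    y = [math.log(c + 1.0) for c in total_copy]
    root_sse = sse(y)
    segments = []

    def grow(lo, hi):
        node = y[lo:hi]
        if len(node) >= min_points:  # three or fewer points: no split
            index, gain = best_split(node)
            # Keep the split only if it improves R-squared by at least cp.
            if index is not None and gain >= cp * root_sse:
                grow(lo, lo + index)
                grow(lo + index, hi)
                return
        segments.append((lo, hi))

    grow(0, len(y))
    # Mean in log space, then back-transform (geometric-type average).
    region_copy = [math.exp(sum(y[lo:hi]) / (hi - lo)) - 1.0
                   for lo, hi in segments]
    return segments, region_copy

# Noise-free toy profile: diploid, amplified, then homozygously deleted.
profile = [2.0] * 5 + [4.0] * 5 + [0.0] * 4
segments, region_copy = partition(profile)
```

With noisy data and such a small cp the unpruned tree would also split on noise, which is why the text follows the growing phase with cross-validated pruning.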