## Abstract

Copy number variations (CNVs) are abundant in the human genome. They have been associated with complex traits in genome-wide association studies (GWAS) and expected to continue playing an important role in identifying the etiology of disease phenotypes. As a result of current high throughput whole-genome single-nucleotide polymorphism (SNP) arrays, we currently have datasets that simultaneously have integer copy numbers in CNV regions as well as SNP genotypes. At the same time, haplotypes that have been shown to offer advantages over genotypes in identifying disease traits even though available for SNP genotypes are largely not available for CNV/SNP data due to insufficient computational tools. We introduce a new framework for inferring haplotypes in CNV/SNP data using a sequential Monte Carlo sampling scheme ‘Tree-Based Deterministic Sampling CNV’ (TDSCNV). We compare our method with polyHap(v2.0), the only currently available software able to perform inference in CNV/SNP genotypes, on datasets of varying number of markers. We have found that both algorithms show similar accuracy but TDSCNV is an order of magnitude faster while scaling linearly with the number of markers and number of individuals and thus could be the method of choice for haplotype inference in such datasets. Our method is implemented in the TDSCNV package which is available for download at http://www.ee.columbia.edu/~anastas/tdscnv.

## Introduction

Copy number variations (CNVs) are a form of a structural genomic variation referring to duplications and deletions of DNA segments larger than 1 kilobase in size. CNVs are abundant in the human genome, and it is estimated that they can occupy as much as 4% to 6%.

Recently, large-scale genome-wide studies have shed light in many aspects and characteristics of CNVs providing unique insights into the origins, mechanisms, formation, and population genetics of CNVs [1–3]. At the same time, CNVs have been associated with complex traits unexplained by recent genome wide association studies (GWAS) [2] and are believed to make a substantial contribution in uncovering the mechanisms and etiology of disease phenotypes that result from complex patterns of inheritance [2, 4].

A variety of techniques exist for CNV detection. Initially, experimental studies have been performed primarily by array CGH, but lately due to improved resolution and genome coverage of genotyping arrays, a number of methods have been developed relying on whole-genome single-nucleotide polymorphism (SNP) genotyping arrays which offer a more sensitive approach and are more suitable for high-resolution CNV detection. As a result, there is currently simultaneously information on the integer copy number (CN) genotypes along a CNV region and on SNPs outside these regions, in which we will refer in the following as CNV-SNP genotypes.

For diploid organisms, theoretical and empirical arguments have been made for the use of haplotypes as opposed to genotypes. It has been shown that the study of haplotypes can improve the power of detecting associations with diseases, and a variety of methods exist in the literature that use haplotypes to detect causal relationships between a genetic region and a phenotype. Furthermore, haplotypes enable unique insides in the study of populations and are required for many population genetics analyses. Specifically, methods for inferring selection [5] for studying recombination [6, 7] as well as historical migration [8, 9] build their subsequent analysis on existing haplotype data.

The statistical determination of haplotype phase from genotype data is thus potentially very valuable if the estimation can be done accurately and has received an increasing amount of attention over recent years. A number of well-known algorithms have been developed based on coalescent theory [10], imperfect phylogeny [11], Markov chain Monte Carlo [10, 12], Gibbs sampler [13], hidden Markov models [14], expectation minimization algorithm [15], etc. However, only recently, this problem has drawn attention when haplotypes are inferred in a CNV-SNP region.

If we focus within a specific CNV region in a sample of individuals and assume that the ploidy is fixed for each individual along the region, then the problem of inferring the haplotypes is identical to the problem of inferring the haplotypes in polyploid organisms or estimating haplotypes from pooling data. A number of algorithms have been proposed for frequency estimation and inference on these settings, and not surprisingly, many have been applied to the associated CNV haplotype inference problem described above.

Apart from the previous scenarios, a number of methodologies have been specifically developed and tailored for CNV data. Kato et al. [16] have developed a methodology MOCSphaser based on the EM algorithm to assign copy numbers in their respective chromosomes in regions that include CN and SNPs. A core limitation of MOCSphaser as described above is that it takes into consideration only the total CN and not the alleles themselves, assigning on each chromosome a raw CN. As a consequence, even though it provides information about the total copies on a chromosome that could be potentially useful, it does not provide information on the diplotypes themselves.

Another algorithm recently proposed by Kato et al. [17], CNVphaser uses an EM approach to perform inference. The core limitation of that method is that the inference is performed within a CNV region and that the ploidy is considered fixed for an individual within the region. To address these problems and thus enabling the phasing of regions where the ploidy of an individual varies along the region and each individual can have different breakpoints, Su et al. [18] suggested polyHap(v2.0) in which they extended the functionality of their original methodology for pooling data [19]. In their study, they discern the phasing within a CNV into non-internal phasing in which the CNV in a chromosome is inferred as a diplotype and internal phasing in which the specific haplotypes comprising the CNV in a chromosome are further identified. We will use these definitions in our current work.

In their algorithm, Su et al. use an HMM methodology that has separate emission states for the internal and non-internal phasing. They treat the transition between states conceptually in a hierarchical two-level model where the first level is for the transition among CN states and the second for the transition among the haplotype states given the CN states. polyHap(v2.0) is the only currently available method that can phase complex CNV regions by allowing arbitrary changes of CN within individuals and along the genomic sequence.

In this paper, we propose a related new sequential Monte Carlo algorithm for haplotype phasing of CNV-SNP data. In our method, samples are processed sequentially and our method scales linearly with the number of samples as well as the number of individuals. We demonstrate that using our methodology, we can achieve state-of-the-art performance while our method is an order of magnitude faster than polyHap (v2.0).

## Methods

The structure of this section is as follows. In the beginning of the section, we introduce some notation that we will use throughout the remaining manuscript. In the subsections that follow, we present the modified version of our TDS methodology for the case of CNV-SNP data. For completeness, we develop again our framework in detail as presented in [20, 21]. We first present some modeling results for the prior and posterior distributions for the population haplotype frequencies given the observed data. We then present the TDS methodology for the cases of known population frequencies and subsequently extend it to the case of unknown frequencies. In the derivation of the later, we use the previously derived results for the prior and posterior distributions for the haplotype frequencies. We end the exposition of our method by deriving the state update equations for the ‘Tree-Based Deterministic Sampling CNV’ (TDSCNV) estimator and presenting the modified partition-ligation procedure adjusted for the CNV-SNP dataset scenario. In the end of the section, we describe the procedure for creating the datasets which we have used in the ‘Results’ section to evaluate our methodology.

### Definitions and notation

Suppose we are given a set of CNV-SNP genotypes on *L* diallelic loci. We denote the two alleles at each locus by 0 and 1. In the following, we will use the counts of allele 1 as the provided measurement for each allele on each sample. In our method, we allow in a specific position a single amplification or deletion. Therefore, if we are within a CNV region in a chromosome, the allele counts could range from 0 to 2 but could range from 0 to 1 outside these regions.

Suppose that we have *T* individuals and we denote ${\mathit{c}}_{\mathit{t}}=\left\{{\mathit{c}}_{\mathit{t}}^{1},\dots ,{\mathit{c}}_{\mathit{t}}^{\mathit{L}}\right\}$ to be the observed genotype of the *t*-th sample where ${\mathit{c}}_{\mathit{t}}^{\mathit{i}}\in \left\{0,1,2,3,4\right\}$ are the observed counts on the *i* th position. Suppose also that *C*_{
t
} = {*c*_{1}, …, *c*_{
t
}} is a set of individuals up to and including individual *t* and let *C* denote the full set of individuals.

In terms of haplotypes, we make an initial distinction in the values that alleles take in internal and non-internal phasing. The framework that follows however will be described generically and will be the same in both cases.

For non-internal phasing, our purpose is to infer haplotypic phase on diploid chromosomes as we are interested in the total copies of an allele at a specific position on a chromosome. Therefore, the possible values for an allele at each position are {−,0,1,01,00,11}. On the contrary for internal phasing, we infer haplotypic phase on polyploid chromosomes and the possible alleles at each position are {−,0,1}.

For individual *t*, we denote the haplotypes occurring in that individual as *h*_{
t
}. In the case of non-internal phasing, *h*_{
t
} = {*h*_{t,1}, *h*_{t,2}}. For internal phasing, *h*_{
t
} = {*h*_{t,1}, …, *h*_{t,p}}, where *p* is the ploidy of the organism, and *p* ∈ {1, 2, 3, 4} as in our methodology, we only consider a single deletion or a single amplification. Therefore, for the case of non-internal phasing *h*_{t,1}, *h*_{t,2} are strings of length *L* in which *h*_{t,i,j} ∈ {−, 0, 1, 01, 00, 11} and for internal phasing, *h*_{t,i} are strings of length *L* in which *h*_{t,i,j} ∈ {−, 0, 1}.

We further denote *H*_{
t
} = {*h*_{1}, …, *h*_{
t
}}, similarly to *C*_{
t
} as the set of haplotypes for each individual up to and including individual *t*.

Let us also define *z* = {*z*_{1}, …, *z*_{
M
}} as the set containing all haplotype vectors of length *L* that are consistent with any genotype in the set C. To obtain Z from the given dataset C, we first enumerate for each *c*_{
i
} the subset ${\mathit{\psi}}_{\mathit{i}}=\left\{{\mathit{h}}_{\mathit{i}}^{1},\dots ,{\mathit{h}}_{\mathit{i}}^{\mathit{Y}}\right\}$*i* = *1*,…,*T* that contains all possible haplotype assignments which are consistent with *c*_{
i
}. The set Z is then given simply as $\mathit{Z}={\mathrm{U}}_{\mathit{i}=1}^{\mathit{T}}{\mathit{\psi}}_{\mathit{i}}$. A set of population haplotype frequencies *θ* = {*θ*_{1}, …, *θ*_{
M
}} is also associated with the set Z of all possible haplotype vectors, where *θ*_{
m
} is the probability with which the haplotype *z*_{
m
} occurs in the total population. We note here once again that we have given the definitions of Z and *θ* generically for both internal and not internal phasing, respectively.

### Prior and posterior distribution for *θ*

Assuming random mating in the population, it is clear that the number of each unique haplotype in *H* is drawn from a multinomial distribution based on the haplotype frequency *θ*[22]. This leads us to the use of the Dirichlet distribution as the prior distribution for *θ* so that *θ* ~ *D*(*ρ*_{1}, …, *ρ*_{
M
}). It is well known in Bayesian statistics that the Dirichlet distribution is the conjugate prior of the multinomial distribution. This implies in our case that if we assume that the prior distribution for *θ* is Dirichlet and we draw haplotypes based on their frequencies (multinomial distribution), then the posterior distribution for *θ* is again a Dirichlet distribution. We prove this fact below.

where we denote *ρ*_{
m
}(*t*) *m* = *1*,…,*M* as the parameters of the distribution of *θ* after the *t*-th pool and *Ι*(*z*_{
m
} − *h*_{t,i}) is the indicator function which equals 1 when *z*_{
m
} − *h*_{t,i} is a vector of zeros, and 0 otherwise. We note here once again that the number of haplotypes (i.e., the index *p* in the assignment) depends on the phasing and is 2 for non-internal phasing while it ranges for internal phasing. Furthermore, in the previous calculations for *θ*, for each genotype vector, we only consider haplotype configurations that are consistent with that genotype.

We have shown that the posterior distribution for *θ* is also Dirichlet with parameters as given in (1) and depends only on the sufficient statistics, *T*_{
t
} = {*ρ*_{
m
}(*t*), 1 ≤ *m* ≤ *M*} which can be easily updated based on *T*_{t − 1}, *h*_{
t
}, *c*_{
t
} as given by (1) i.e., *T*_{
t
} = *T*_{
t
}(*T*_{t − 1}, *h*_{
t
}, *c*_{
t
}).

### TDS estimator with known system parameters *θ*

Similar to traditional sequential Monte Carlo (SMC) methods, we assume that by the time we have processed genotype *c*_{t-1}, we have a set of K potential solution streams (commonly termed as ‘particles’) ${\mathit{H}}_{\mathit{t}-1}^{\left(\mathit{k}\right)}$ (*k* = 1, …., *K*) each associated with its corresponding weight ${\mathit{w}}_{\mathit{t}-1}^{\left(\mathit{k}\right)}$, as $\left\{\left({\mathit{H}}_{\mathit{t}-1}^{\left(\mathit{k}\right)}|{\mathit{w}}_{\mathit{t}-1}^{\left(\mathit{k}\right)}\right),\mathit{k}=1,\dots ,\mathit{K}\right\}.$

At point *t-* 1, we approximate the real continuous distribution *p*(*H*_{t − 1}|*C*_{t − 1}) as a discrete distribution as follows:

where

and *I* (●) is the indicator function such that I (*x* − y) = 1 for *x* = *y* and I (*x* − y) = 0 otherwise.

Processing the next individual *t*, we would like to make an online inference of the haplotypes *H*_{
t
} based on the genotypes *C*_{
t
}. From Bayes' theorem, we have ${\mathit{p}}_{\mathit{\theta}}\left({\mathit{H}}_{\mathit{t}}|{\mathit{C}}_{\mathit{t}}\right)\propto {\mathit{p}}_{\mathit{\theta}}\left({\mathit{c}}_{\mathit{t}}\right|{\mathit{H}}_{\mathit{t}},{\mathit{C}}_{\mathit{t}-1}\left){\mathit{p}}_{\mathit{\theta}}\right({\mathit{H}}_{\mathit{t}}\left|{\mathit{C}}_{\mathit{t}-1}\right)\propto {\mathit{p}}_{\mathit{\theta}}\left({\mathit{c}}_{\mathit{t}}\right|{\mathit{H}}_{\mathit{t}},{\mathit{C}}_{\mathit{t}-1}\left){\mathit{p}}_{\mathit{\theta}}\right({\mathit{h}}_{\mathit{t}}|{\mathit{H}}_{\mathit{t}-1},{\mathit{C}}_{\mathit{t}-1}){\mathit{p}}_{\mathit{\theta}}\left({\mathit{H}}_{\mathit{t}-1}\right|{\mathit{C}}_{\mathit{t}-1})\propto {\mathit{p}}_{\mathit{\theta}}\left({\mathit{h}}_{\mathit{t}}|{\mathit{H}}_{\mathit{t}-1},{\mathit{C}}_{\mathit{t}-1}\right){\mathit{p}}_{\mathit{\theta}}({\mathit{H}}_{\mathit{t}-1}\left|{\mathit{C}}_{\mathit{t}-1}\right)$ where for our purposes, we only consider haplotype assignments for individual *t* that are compatible to its observed genotype.

Assume further that there are *K*^{ext} such assignments. From previous relationships, if we knew the system parameters *θ*, we would be able to approximate the distribution of *p*_{
θ
}(*H*_{
t
}|*C*_{
t
}) as follows:

where $\left[{\mathit{H}}_{\mathit{t}-1}^{\left(\mathit{k}\right)},{\mathit{h}}_{\mathit{t}}^{\left(\mathit{i}\right)}\right]$ represents the vector obtained by appending the element ${\mathit{h}}_{\mathit{t}}^{\left(\mathit{i}\right)}$ to the vector ${\mathit{H}}_{\mathit{t}-1}^{\left(\mathit{k}\right)}$ and ${\mathit{W}}_{\mathit{t}}^{\mathit{ext}}={\displaystyle \sum _{\mathit{i},\mathit{k}}{\mathit{w}}_{\mathit{t}}^{\left(\mathit{k},\mathit{i}\right)}}$ with

### TDS estimator with unknown system parameters *θ*

However, the system parameters are not known. In our model, we use a Dirichlet distribution, as the prior for *θ* and as shown, we obtain a posterior distribution for *θ* (given *H*_{
t
} and *C*_{
t
}) that is Dirichlet and only depends on a set of sufficient statistics.

Using Bayes' theorem and similarly to the previous subsection, we have:

where again we only consider haplotype assignments that are compatible with the observed genotype.

Taking into consideration as argued before that if we know the system parameters *θ*, then the *p*(*h*_{
t
}|*H*_{t − 1}, *θ*, *Z*) term represents sampling from a multinomial distribution and that the mean of the Dirichlet distribution with respect to an element *θ*_{
k
} of the vector *θ* is as follows:

we have from (4) that:

where $\mathit{r}=\left[{\displaystyle \sum _{\mathit{i}=1}^{\mathit{p}}}\mathit{I}\left({\mathit{z}}_{1}-{\mathit{h}}_{\mathit{t},\mathit{i}}\right),\dots ,{\displaystyle \sum _{\mathit{i}=1}^{\mathit{p}}}\mathit{I}\left({\mathit{z}}_{\mathit{M}}-{\mathit{h}}_{\mathit{t},\mathit{i}}\right)\right]\phantom{\rule{0.25em}{0ex}}$ and$\mathit{B}\left(\mathit{\rho}\left(\mathit{t}-1\right)\right)=\frac{{\displaystyle {\prod}_{\mathit{i}=1}^{\mathit{M}}}\mathit{\Gamma}\left({\mathit{\rho}}_{\mathit{i}}\left(\mathit{t}-1\right)\right)}{\mathit{\Gamma}\left({\displaystyle {\sum}_{\mathit{i}=1}^{\mathit{M}}}{\mathit{\rho}}_{\mathit{i}}\left(\mathit{t}-1\right)\right)}.$

Assuming that we have approximated *p*(*H*_{t − 1}|*C*_{t − 1}) as in (2), we can approximate *p*(*H*_{
t
}|*C*_{
t
}) using (5) as follows:

where the weight update formula is given by:

where again $\mathit{r}=\left[{\displaystyle \sum _{\mathit{i}=1}^{\mathit{p}}}\mathit{I}\left({\mathit{z}}_{1}-{\mathit{h}}_{\mathit{t},\mathit{i}}^{\mathit{j}}\right),\dots ,{\displaystyle \sum _{\mathit{i}=1}^{\mathit{p}}}\mathit{I}\left({\mathit{z}}_{\mathit{M}}-{\mathit{h}}_{\mathit{t},\mathit{i}}^{\mathit{j}}\right)\right]$ and *ρ*^{(k)}(*t* − 1) is the parameter vector of the assumed Dirichlet prior which represents how many times we have encountered each haplotype in stream *k* in the solutions up to individual *t* − 1.

### Partition-ligation

In the partition phase, the dataset is divided into small segments of consecutive loci and each of the individual blocks is phased separately. To ligate the individual blocks, we have adjusted the original partition-ligation (PL) method for the case of CNV-SNP data.

In our current implementation, to be able to derive all possible solution combinations for each pool genotype efficiently, we have decided to keep the maximum block length to 5 SNPs. Clearly, the more SNPs are included in a block, the more information about the LD patterns we can capture but at the same time, the number of possible combinations increases and becomes prohibitive for more than 5 SNPs. For our experiments in a dataset with *L* loci, we have considered *L/* 5 blocks of 5 consecutive loci and the remaining SNPs were treated as a separate block.

The result of phasing for each block is a set of haplotype solutions for each genotype. Two neighboring blocks are ligated by creating merged solutions for each genotype from combinations of the block solutions, each associated with the product of the individual solution weights called the *ligation weight*.

Depending on which haplotypes one from each block are going to be assigned on the same chromosome for each individual, a different number of changes in the ploidy of that individual will occur. In our method, we consider only the assignments that will produce the minimum number of such changes. Therefore, if both haplotypes in any block have the same CN, we examine both alternative assignments but we otherwise ligate solutions that have the same CN. The TDS algorithm is then repeated in the same manner as it was for the individual blocks with the weights of the solutions scaled by the associated ligation weight for that solution.

### Summary of the proposed algorithm

### Dataset creation

Our datasets consisted of SNPs from chromosomes 1 and 2 from HapMap CEU population (HapMap3 release 2 - phasing data). For our purposes, we have considered only the parents in each trio which are the unrelated individuals in our dataset thus resulting in a total of 88 individuals. We have initially filtered out SNPs with minor allele frequencies less than 5%, and we have then considered non-overlapping datasets with a fixed number of SNPs. To create artificial CNV regions within each dataset, we have used the following procedure.

First, in each dataset, we have found all the different haplotypes appearing in the dataset. In order to retain as much of the LD structure and also the property that most of the CNVs could be flagged by neighboring SNPs [2], we have randomly replaced specific areas of randomly chosen haplotypes with a CNV haplotype. To perform that procedure, we randomly selected haplotypes based on their frequency in the population and modified them inserting CNV regions sequentially as follows. Each position was considered as the beginning of a CNV region with a probability of 0.1. For each position flagging the beginning of a CNV, we assigned the length of the CNV region uniformly between three to eight SNPs. We then progressed along the haplotype from the end of the CNV region in a similar fashion until we reached the end of a given haplotype.

## Results

### Measurement of phasing accuracy

We have used a number of different measures to evaluate the performance of our methodology. First, the switch error rate [23, 24] is defined as the percentage of switches among all possible switches in haplotype orientation used to recover the correct phase in an individual.

In the case of a small number of loci where haplotype vectors can be expected to be reconstructed exactly, we have used two figures of merit namely the × ^{2} and *l*_{1} distance to evaluate the accuracy of frequency estimation. Suppose that *f* are the predicted haplotype frequencies from an algorithm and *g* are the gold standard population level haplotype frequencies. The × ^{2} distance between the two distributions is simply the result of the × ^{2} statistic, i.e., ${\mathit{\chi}}^{2}\left(\mathit{f},\mathit{g}\right)={\displaystyle \sum _{\mathit{i}=1}^{\mathit{d}}}{\left({\mathit{f}}_{\mathit{i}}-{\mathit{g}}_{\mathit{i}}\right)}^{2}/{\mathit{g}}_{\mathit{i}}$ where *d* is the number of gold standard haplotypes whereas the *l*_{1} distance between the two distributions is defined as ${\mathit{l}}_{1}\left(\mathit{f},\mathit{g}\right)={\displaystyle \sum _{\mathit{i}=1}^{\mathit{d}}}|{\mathit{f}}_{\mathit{i}}-{\mathit{g}}_{\mathit{i}}|$[25].

### Switch error rate

We have compared the performance of our method with polyHap(v2.0) for haplotypic phase inference using the switch error rate. In this section, the evaluation was done on non-internal haplotypes. In the evaluation of the switch error rate, we consider only CN and SNP positions that are ambiguous. For a marker genotype to have ambiguous phasing, there should be at least two alternative orientation assignments. As an example, all 3CN genotypes are ambiguous positions. This is easy to see, as the choice alone of the chromosome that would have the duplication creates two distinct possible assignments.

The performance of our method when considering the full set of individuals in each dataset is shown in Table 1. We have considered three marker sizes namely 30, 50, and 100 markers. For each marker size, we have simulated 100 datasets and the result presented is the average error rate on these 100 datasets. We can see that for 30 and 50 markers, our method was marginally better than polyHap(v2.0), whereas for the 100 marker datasets, it was marginally worse.

We further demonstrate the accuracy of our approach when ranging the number of individuals in each dataset. The results for a fixed number of 30 and 50 markers are shown in Figure 1. As expected, the performance for both methods improves with increasing number of individuals per dataset.

Finally, we have broken down and calculated the switch error rates based on the CN of the ‘from’ and ‘to’ sites as shown in Table 2. Similarly, to Su et al., we observe the highest switch error rates appearing when the transitions happen between different CNs.

### Haplotype frequency estimation

We have examined the accuracy of our method and compared it against polyHap(v2.0) on datasets of 8 and 10 markers in which individuals had a fixed ploidy. We have evaluated two appropriate figures of merit as described above, the × ^{2} and *l*_{1} distance. We should note here that in order to determine how good frequency estimations with a given method are, a small number of markers should be used. The reason is that for a large number of markers, it would be unlikely that the exact same haplotypes would appear or reconstructed with appreciable frequency. The results for both figures of merit on an increasing number of individuals are shown in Figure 2. Our method demonstrates superior performance for both figures of merit, and again as expected, both methods produced superior performance with an increasing number of individuals.

### Internal phasing

We have further evaluated the performance of our method using the switch error rate inside duplicated regions. In this subsection, the evaluation was done on internal phasing and particularly in duplicated segments of a chromosome as the scope was to detect how good the specific haplotypes comprising the duplicated chromosomal region could be recovered. The switch error rate evaluation within such duplicated regions is exactly the same as the evaluation on a genotype with only SNPs.

We have used the same 100 datasets for each of the three dataset sizes, namely 30, 50, and 100 markers, as in the evaluation of the switch error rate for non-internal phasing described in a previous subsection. We found, as expected, that the results were similar irrespectively of the dataset size, and the average across all datasets was 0.183.

### Timing results

The computational times for the 30, 50, and 100 marker datasets used for the calculation of the switch error rate are displayed in Table 3. We can see that TDSCNV is an order of magnitude faster than polyHap(v2.0) for all marker sizes examined.

## Discussion

We present an algorithm for haplotypic inference in regions of CNV-SNP genotypes. We compare our method with polyHap(v2.0) on a variety of marker sizes and evaluate the accuracy and computational time of each method. Our method has similar accuracy to polyHap(v2.0) but is an order of magnitude faster in all datasets examined.

In all instances of haplotype inference problems, it becomes increasingly significant that methods are able to incorporate prior knowledge in the form of haplotypes or genotypes from the same population as that from which the target samples were drawn. HapMap is a striking example of such database knowledge that could be used for haplotype inference. Furthermore, it is also important for researchers that samples that are phased at some point in time could be used efficiently for the phasing of samples presented at some later point. Our methodology offers a unique framework that can easily incorporate such prior knowledge. Haplotypes can be introduced in the form of a prior for the counts in the TDSCNV algorithm. From our experience with our framework and as expected, the presence of the extra information will improve the phasing accuracy of the target samples.

## Conclusions

In this paper, we propose a new sequential Monte Carlo algorithm for haplotype phasing of CNV-SNP data. In our method, samples are processed sequentially and our method scales linearly with the number of samples as well as the number of individuals.

To demonstrate the performance of our method, we have compared it against polyHap(v2.0), the only currently available software able to perform inference in CNV/SNP genotypes, on datasets of varying number of markers. We have initially compared the accuracy of both methods for haplotypic phase inference on non-internal haplotypes, on datasets of 30, 50, and 100 markers. We have then examined the accuracy of frequency estimation with both methods on datasets with a small number of markers (8 and 10 markers). Finally, we have evaluated the performance of our methodology inside duplicated regions for internal phasing.

We have found that our method demonstrates comparable or better accuracy than polyHap(v2.0) and at the same time is an order of magnitude faster in all datasets and marker sizes examined while scaling linearly with the number of markers and number of individuals. We therefore believe that our method could be the method of choice for haplotype inference in such datasets.

## References

- 1.
Conrad DF, Hurles ME:

**The population genetics of structural variation.***Nat Genet*2007,**39**(7 Suppl):S30-S36. - 2.
Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P, Fitzgerald T, Hu M, Ihm CH, Kristiansson K, Macarthur DG, Macdonald RJ, Onyiah I, Pang AW, Robson S, Stirrups K, Valsesia A, Walter K, Wei J, Tyler-Smith C, Carter NP, Lee C, Scherer SW, Hurles ME, The Wellcome Trust Case Control Consortium:

**Origins and functional impact of copy number variation in the human genome.***Nature*2010,**464**(7289):704-712. 10.1038/nature08516 - 3.
Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, González JR, Gratacòs M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F,

*et al*.:**Global variation in copy number in the human genome.***Nature*2006,**444**(7118):444-454. 10.1038/nature05329 - 4.
McCarroll SA, Altshuler DM:

**Copy-number variation and association studies of human disease.***Nat Genet*2007,**39**(7 Suppl):S37-S42. - 5.
Sabeti PC, Reich DE, Higgins JM, Levine HZ, Richter DJ, Schaffner SF, Gabriel SB, Platko JV, Patterson NJ, McDonald GJ, Ackerman HC, Campbell SJ, Altshuler D, Cooper R, Kwiatkowski D, Ward R, Lander ES:

**Detecting recent positive selection in the human genome from haplotype structure.***Nature*2002,**419**(6909):832-837. 10.1038/nature01140 - 6.
Fearnhead P, Donnelly P:

**Estimating recombination rates from population genetic data.***Genetics*2001,**159**(3):1299-1318. - 7.
Myers SR, Griffiths RC:

**Bounds on the minimum number of recombination events in a sample history.***Genetics*2003,**163**(1):375-394. - 8.
Bahlo M, Griffiths RC:

**Inference from gene trees in a subdivided population.***Theor Popul Biol*2000,**57**(2):79-95. 10.1006/tpbi.1999.1447 - 9.
Beerli P, Felsenstein J:

**Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach.***Proc Natl Acad Sci U S A*2001,**98**(8):4563-4568. 10.1073/pnas.081068098 - 10.
Stephens M, Scheet P:

**Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation.***Am J Hum Genet*2005,**76**(3):449-462. 10.1086/428594 - 11.
Halperin E, Eskin E:

**Haplotype reconstruction from genotype data using Imperfect Phylogeny.***Bioinformatics*2004,**20**(12):1842-1849. 10.1093/bioinformatics/bth149 - 12.
Lin S, Chakravarti A, Cutler DJ:

**Haplotype and missing data inference in nuclear families.***Genome Res*2004,**14**(8):1624-1632. 10.1101/gr.2204604 - 13.
Niu T, Qin ZS, Xu X, Liu JS:

**Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms.***Am J Hum Genet*2002,**70**(1):157-169. 10.1086/338446 - 14.
Browning SR:

**Missing data imputation and haplotype phase inference for genome-wide association studies.***Hum Genet*2008,**124**(5):439-450. 10.1007/s00439-008-0568-7 - 15.
Qin ZS, Niu T, Liu JS:

**Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms.***Am J Hum Genet*2002,**71**(5):1242-1247. 10.1086/344207 - 16.
Kato M, Nakamura Y, Tsunoda T:

**MOCSphaser: a haplotype inference tool from a mixture of copy number variation and single nucleotide polymorphism data.***Bioinformatics*2008,**24**(14):1645-1646. 10.1093/bioinformatics/btn242 - 17.
Kato M, Nakamura Y, Tsunoda T:

**An algorithm for inferring complex haplotypes in a region of copy-number variation.***Am J Hum Genet*2008,**83**(2):157-169. 10.1016/j.ajhg.2008.06.021 - 18.
Su SY, Asher JE, Jarvelin MR, Froguel P, Blakemore AI, Balding DJ, Coin LJ:

**Inferring combined CNV/SNP haplotypes from genotype data.***Bioinformatics*2010,**26**(11):1437-1445. 10.1093/bioinformatics/btq157 - 19.
Su SY, White J, Balding DJ, Coin LJ:

**Inference of haplotypic phase and missing genotypes in polyploid organisms and variable copy number genomic regions.***BMC Bioinform*2008,**9:**513. 10.1186/1471-2105-9-513 - 20.
Iliadis A, Anastassiou D, Wang X:

**Fast and accurate haplotype frequency estimation for large haplotype vectors from pooled DNA data.***BMC Genet*2012,**13:**94. - 21.
Iliadis A, Watkinson J, Anastassiou D, Wang X:

**A haplotype inference algorithm for trios based on deterministic sampling.***BMC Genet*2010,**11:**78. - 22.
Excoffier L, Slatkin M:

**Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population.***Mol Biol Evol*1995,**12**(5):921-927. - 23.
Lin S, Cutler DJ, Zwick ME, Chakravarti A:

**Haplotype inference in random population samples.***Am J Hum Genet*2002,**71**(5):1129-1137. 10.1086/344347 - 24.
Marchini J, Cutler D, Patterson N, Stephens M, Eskin E, Halperin E, Lin S, Qin ZS, Munro HM, Abecasis GR, Donnelly P, International HapMap Consortium:

**A comparison of phasing algorithms for trios and unrelated individuals.***Am J Hum Genet*2006,**78**(3):437-450. 10.1086/500808 - 25.
Kirkpatrick B, Armendariz CS, Karp RM, Halperin E:

**HAPLOPOOL: improving haplotype frequency estimation through DNA pools and phylogenetic modeling.***Bioinformatics*2007,**23**(22):3048-3055. 10.1093/bioinformatics/btm435

## Author information

### Affiliations

### Corresponding author

## Additional information

### Competing interests

All authors declare they have no competing interests.

## Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

## Rights and permissions

This article is published under an open access license. Please check the 'Copyright Information' section for details of this license and what re-use is permitted. If your intended use exceeds what is permitted by the license or if you are unable to locate the licence and re-use information, please contact the Rights and Permissions team.

## About this article

### Cite this article

Iliadis, A., Anastassiou, D. & Wang, X. A sequential Monte Carlo framework for haplotype inference in CNV/SNP genotype data.
*J Bioinform Sys Biology* **2014, **7 (2014). https://doi.org/10.1186/1687-4153-2014-7

Received:

Accepted:

Published:

### Keywords

- Dirichlet Distribution
- Internal Phase
- Sequential Monte Carlo
- Haplotype Inference
- Copy Number State