Background

Increased affordability of high-throughput DNA sequencing has enabled studies of genetic variation and of its effects on health and medical treatments. In diploid organisms, such as humans, chromosomes come in pairs. The chromosomes in a pair of autosomes are homologous, i.e., they have similar composition and carry the same type of information but are not identical. The most common type of DNA sequence variation is a single nucleotide polymorphism (SNP), where a single base in the genome differs between individuals or between paired chromosomes. Each of those variants is referred to as an allele; a SNP has at least two different alleles. If the two alleles at a SNP site are the same, the site is homozygous; if they differ, it is heterozygous. SNP calling is concerned with identifying the locations and types of such alleles, and is followed by genotype calling to decide the genotypes associated with the locations of the detected SNPs. Accurate SNP and genotype calling are challenging due to uncertainties caused by base calling and read alignment errors. The low-to-medium coverage typical of large-scale sequencing projects is often associated with erroneous SNP and genotype calling [1]. As an illustration, in the low-coverage (2−6×) 1000 Genomes Project pilot, the genotype accuracy at heterozygous sites was 90 % for the lowest allele frequencies (minor allele frequency (MAF) <3 %), 95 % for the intermediate frequencies (MAF 50 %), and 70−80 % for the highest frequency variants (MAF >97 %) [2].

SNP and genotype calling do not assign alleles to specific chromosomes in the pairs. Such detailed information is provided by haplotypes, ordered collections of SNPs on the chromosomes. Haplotypes have been of fundamental importance for studies of human diseases and the effectiveness of drugs [3]. The International HapMap Project's pursuit of a haplotype map of the human genome reflects the significance of acquiring and understanding haplotype information [4]. Haplotype inference typically refers to the task of reconstructing haplotypes from the genotype samples of a population. Haplotype assembly, or single individual haplotyping, aims to reconstruct the haplotypes of a single individual from high-throughput sequencing data. Since the SNP sites are assumed to be bi-allelic (i.e., each SNP site contains one of only two possible nucleotides), the alleles are labelled as 0 and 1 and the haplotypes are represented by binary sequences. Therefore, haplotype assembly is often cast as the problem of phasing two binary sequences from their short samples (i.e., reads), which are represented by ternary strings (where the third symbol denotes missing information). The majority of existing haplotype assembly algorithms rely on this formulation of the problem [5].

Several haplotype assembly criteria and algorithms to optimize them were considered in [6, 7]. The minimum error correction (MEC) criterion, in particular, has received a considerable amount of attention and has been broadly used in practice. Most of the haplotype assembly problem formulations have been shown to be NP-hard [6, 8], which has motivated numerous computationally efficient heuristic solutions [5]. FastHare, proposed in [9], was an early heuristic method that was followed by several approximate techniques in [10, 11]. In [12, 13], the use of clustering approaches for splitting reads into two sets, each associated with one chromosome in a pair, was proposed. In addition to the approximate methods, several algorithms that search for the exact solution to the MEC formulation of the problem were developed, including the branch-and-bound technique in [14]. However, as argued in [15], the exact algorithms are often infeasible in practice; HASH, the approach in [16] based on Markov chain Monte Carlo (MCMC) methods, is more accurate than heuristics but also incurs high computational cost. As a follow-up to HASH, [17] presented a significantly faster heuristic algorithm, HapCUT, which suffers only a minor loss of accuracy. To minimize the MEC score, HapCUT iteratively computes the max-cut in a graph that represents the assembly problem. In [18], another max-cut based heuristic, ReFHap, was proposed; ReFHap relies on a different graph structure to achieve higher speed while maintaining accuracy similar to that of HapCUT. Other methods include a dynamic programming solution in [19]; a method that solves an appropriate integer linear program [20]; and several other heuristics including [21], H-BOP [22], HMEC [23], and HapCompass [24, 25].

A probabilistic framework for haplotype assembly was first introduced in [26]. There, in order to deal with inherently random errors in sequencing data, the probability that a site in the fragments is incorrectly sequenced is defined for each of the four nucleotide bases. The most likely haplotype phases between SNP sites are determined using joint a posteriori probabilities whose calculation is limited to two or three adjacent SNPs, because the computational cost grows exponentially with the number of SNP sites. The locally estimated haplotype segments are linked if the corresponding confidence levels exceed a certain threshold. The follow-up work [27] enabled reconstruction of longer haplotype segments using a Gibbs sampling procedure. However, this iterative approach still first assembles short haplotype segments that are then connected, and requires runtimes infeasible for block lengths typically encountered in practice. More recently, [28] proposed a new probabilistic mixture model, MixSIH, which leads to a more efficient computation of the haplotype likelihoods than those in [26, 27]. However, MixSIH is still about 10-fold slower than either HapCUT [17] or ReFHap [18] while having comparable accuracy, and the model there is restricted to the bi-allelic representation as in [6-25].

It is worth pointing out that most of the existing algorithms for haplotype assembly allow no more than two alleles at a SNP site and only deal with the errors caused by substitutions between those two alleles [6-25, 28]. In practice, when sequencing errors lead to reads that report more than two alleles at a SNP site, either all of the tri- or tetra-allelic sites are discarded [10, 17] or the alleles that match neither the reference nor its alternative are thrown away [24, 25]. The former drastically reduces not only the number of SNP sites to be reconstructed but also the chance of reliable full haplotype reconstruction (by further shortening already short reads). The main drawback of the latter is that it fully trusts the genotype information provided by SNP/genotype calling; when the genotype calls are erroneous, the true haplotypes may be reconstructed incorrectly (i.e., alleles corresponding to an incorrect genotype are preserved while the alleles corresponding to the true genotype are discarded).

In this paper, we propose a novel method that relies on a probabilistic model of the data to incorporate genotype calling in the haplotype assembly procedure. Unlike [26, 27], the proposed method infers both the most likely genotypes and haplotype phases by examining the complete set of SNP loci in a computationally efficient manner. To this end, we employ the sequential Monte Carlo (SMC) algorithm (i.e., a particle filter). Particle filters are capable of sequentially estimating the posterior density of unknown variables by representing them with a set of particles and associated weights [29]. When the solution space is discrete and finite, a deterministic form of SMC can be derived [30, 31]; this has been exploited for solving various problems in genomics [32, 33]. Noting that the set of possible haplotype pairs is discrete and finite, we develop a modified deterministic sequential Monte Carlo (DSMC) method for solving the haplotype assembly problem. Our algorithm, ParticleHap, relies on a 2nd-order Markov model of the haplotype sequence to search for the most likely association of the SNPs with haplotypes. Phasing of the SNPs is done sequentially: the a posteriori probability of a partial haplotype comprising n SNPs is calculated using the read information about the SNP in the n-th position of the haplotype sequence and the a posteriori probability of the previously inferred (n−1)-base-long partial haplotype. By working with SNP calls rather than their binary representations (the latter typically used by state-of-the-art haplotype assembly algorithms), ParticleHap can reliably infer the most likely genotypes. Our extensive computational studies demonstrate that the proposed scheme is more accurate and computationally more efficient than the state-of-the-art methods in [17, 18].

Methods

Problem formulation

In our model and the subsequently proposed haplotype assembly algorithm, we focus on the SNP sites where two or more alleles are observed. The sites with only one observed allele are declared homozygous and not used for the assembly. Assume there are m paired-end short reads covering n remaining SNP sites. Such data can be represented by an m×n matrix where the rows contain information provided by the reads while the columns correspond to the SNP sites.

By adopting the notation used in [26, 27], let X with elements \(X_{ij}=x_{ij}\), \(x_{ij}\in \mathcal {B}\), be the matrix of potentially erroneous observations, and let Y with entries \(Y_{ij}=y_{ij}\), \(y_{ij}\in \mathcal {A}\), denote the corresponding error-free data matrix, where 1≤i≤m, 1≤j≤n, \(\mathcal {A}=\{\mathsf {A},\mathsf {C},\mathsf {G},\mathsf {T}\}\) and \(\mathcal {B}=\{\mathsf {A},\mathsf {C},\mathsf {G},\mathsf {T},-\}\). Here − denotes a gap, i.e., a site not covered by a read or an ambiguous base call. Let the 2×n matrix S with entries \(S_{kj}=s_{kj}\), \(s_{kj}\in \mathcal {A}\), k∈{1,2}, denote the true haplotype pair, and let us collect the indicators of the origin of the reads, \(f_{i}\in \{1,2\}\), 1≤i≤m, into a vector F. With this notation, the true bases relate to the true haplotypes as \(Y_{ij}=S_{f_{i},j}\), where 1≤i≤m, 1≤j≤n. We assume that the composition probabilities \(\text {Pr}(S_{kj}=s)\), \(s\in \mathcal {A}\), at each SNP position are mutually independent and constant across haplotypes. The measurement model is given by \(\text {Pr}(X_{ij}=x_{ij} | Y_{ij}=y_{ij})\), \(y_{ij}\in \mathcal {A}\), \(x_{ij}\in \mathcal {B}\). A sequencing error occurs when the true base \(Y_{ij}=y_{ij}\) is misread as \(X_{ij}=x_{ij}\), \(x_{ij}\neq y_{ij}\). Note that the a posteriori probability p(S|X) can be computed from p(S,X) and \(p(\mathbf {X})=\sum _{\mathbf {S}}p(\mathbf {S},\mathbf {X})\) using Bayes' rule, where

$$ \begin{aligned} p(\mathbf{S},\mathbf{X}) &= \left(\frac{1}{2}\right)^{m}\prod_{j=1}^{n}p(S_{1j})\prod_{j=1}^{n}p(S_{2j}) \\ &\quad \times\prod_{i=1}^{m}\left[\prod_{j=1}^{n}p(X_{ij} \mid S_{1j}) +\prod_{j=1}^{n}p(X_{ij} \mid S_{2j})\right]. \end{aligned} $$
(1)

We assume that each read is generated from one of the two haplotypes with probability \(\frac {1}{2}\), i.e., \(\text {Pr}(f_{i}=1)=\text {Pr}(f_{i}=2)=\frac {1}{2}\).
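For concreteness, the model above can be prototyped in a few lines. The following minimal Python sketch evaluates the joint probability of Eq. (1) under a uniform per-base error model (error rate e, composition probabilities 0.25); the function names and the convention that reads are padded with − to length n are illustrative assumptions, not taken from ParticleHap's C implementation.

```python
# Minimal sketch of Eq. (1); all names are illustrative. Reads are strings
# over "ACGT-" padded with '-' to length n; S = (S1, S2) are strings over "ACGT".

def emission(x, s, e=0.01):
    """p(X_ij = x | Y_ij = s): 1 - e on a match, e/3 on a mismatch."""
    if x == '-':                      # a gap carries no information
        return 1.0
    return 1.0 - e if x == s else e / 3.0

def joint_probability(S, X, e=0.01):
    """Evaluate p(S, X) of Eq. (1) for the haplotype pair S = (S1, S2)."""
    S1, S2 = S
    n = len(S1)
    p = 0.25 ** (2 * n)               # priors: prod_j p(S1j) * prod_j p(S2j)
    for read in X:                    # each read originates from either
        like1 = like2 = 1.0           # haplotype with probability 1/2
        for j in range(n):
            like1 *= emission(read[j], S1[j], e)
            like2 *= emission(read[j], S2[j], e)
        p *= 0.5 * (like1 + like2)    # the (1/2)^m factor, distributed per read
    return p
```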

The ParticleHap algorithm

Following the adopted notation, the goal of haplotype assembly is to determine the matrix S from the observation matrix X. A Bayesian approach to solving this problem involves maximization of the a posteriori distribution p(S|X). Let \(S_{\cdot j}\) and \(X_{\cdot j}\) denote the j-th columns of S and X, respectively, and define \(\mathrm{S}_{1:t}=\{S_{\cdot 1},S_{\cdot 2},\cdots,S_{\cdot t}\}\) and \(\mathrm{X}_{1:t}=\{X_{\cdot 1},X_{\cdot 2},\cdots,X_{\cdot t}\}\). Recursive Bayesian estimation (i.e., Bayesian filtering) is concerned with recursively finding the conditional probability density function \(p(\mathrm{S}_{1:t} | \mathrm{X}_{1:t})\). Having obtained the estimate \(\hat {p}(\mathrm {S}_{1:t} | \mathrm {X}_{1:t})\), we can determine the most likely \(\mathrm{S}_{1:t}\). However, finding an analytical form of this probability density function is often infeasible, as is the case for the haplotype assembly problem.

Sequential Monte Carlo (SMC), often referred to as particle filtering [29], describes \(p(\mathrm{S}_{1:t} | \mathrm{X}_{1:t})\) using a set of discrete points (particles) and their corresponding weights. SMC can be interpreted as a dynamical system which, in the context of haplotype assembly, comprises the initial state model \(p(S_{\cdot 1})\), the state transition model \(p(S_{\cdot t} | S_{\cdot t-1})\), and the measurement model \(p(X_{\cdot t} | S_{\cdot t})\). The distribution \(p(\mathrm{S}_{1:t} | \mathrm{X}_{1:t})\) can be propagated using an importance sampling technique where samples from a proposal density \(q(\mathrm{S}_{1:t} | \mathrm{X}_{1:t})\) are generated and appropriately weighted. Having drawn K samples \(\left \{\mathrm{S}_{1:t}^{(1)}, \mathrm{S}_{1:t}^{(2)},\cdots,\mathrm{S}_{1:t}^{(K)}\right \}\) from \(q(\mathrm{S}_{1:t} | \mathrm{X}_{1:t})\) and assigned them weights \(w_{t}^{(k)}\), \(p(\mathrm{S}_{1:t} | \mathrm{X}_{1:t})\) can be approximated by

$$ \hat{p}(\mathrm{S}_{1:t} | \mathrm{X}_{1:t}) = \frac{1}{W_{t}}\sum_{k=1}^{K}w_{t}^{(k)}\,\delta\left(\mathrm{S}_{1:t}-\mathrm{S}_{1:t}^{(k)}\right), \qquad w_{t}^{(k)}= \frac{p(\mathrm{S}_{1:t} | \mathrm{X}_{1:t})} {q(\mathrm{S}_{1:t} | \mathrm{X}_{1:t})}, $$
(2)

where \(W_{t}=\sum _{k=1}^{K}w_{t}^{(k)}\) and δ(·) is an indicator function, i.e., \(\delta (s-s_{0})=1\) for \(s=s_{0}\) and \(\delta (s-s_{0})=0\) otherwise. The weight \(w_{t}^{(k)}\) can be further derived as [29]

$$ \begin{aligned} w_{t}^{(k)} & \propto w_{t-1}^{(k)}\,p\left(X_{\cdot t} \mid S_{\cdot t-1}^{(k)}\right) \\ & \propto w_{t-1}^{(k)} \sum_{l=1}^{L}p(X_{\cdot t} \mid S_{\cdot t}=s_{l})\,p\left(S_{\cdot t}=s_{l} \mid S_{\cdot t-1}^{(k)}\right). \end{aligned} $$
(3)

In the traditional SMC, the set \(\left \{\left (\mathrm {S}_{1:t}^{(k)},w_{t}^{(k)}\right), k=1,\cdots,K\right \}\) is recursively generated from the previous set of properly weighted samples \(\left \{\left (\mathrm {S}_{1:t-1}^{(k)},w_{t-1}^{(k)}\right), k=1,\cdots,K\right \}\) by using the optimal proposal distribution \(q\left (S_{\cdot t} | \mathrm {S}_{1:t-1}^{(k)},\mathrm {X}_{1:t}\right) = p\left (S_{\cdot t} | \mathrm {S}_{1:t-1}^{(k)},\mathrm {X}_{1:t}\right)\),

$$ q\left(S_{\cdot t}=s_{l} | \mathrm{S}_{1:t-1}^{(k)},\mathrm{X}_{1:t}\right) \propto p(X_{\cdot t} | S_{\cdot t}=s_{l})\, p\left(S_{\cdot t}=s_{l} | S_{\cdot t-1}^{(k)}\right). $$
(4)

In contrast to the conventional SMC, the deterministic sequential Monte Carlo (DSMC, [30, 31]) explores all possible states in each step of the recursive procedure. In particular, each particle at step t−1, \(S_{\cdot t-1}^{(k)}\), k=1,⋯,K, is propagated to L possible states at step t instead of to a single particle, where L denotes the number of possible extensions of the partially reconstructed haplotype sequence in our haplotype assembly problem. Maintaining and further propagating all such particles would increase their number exponentially; to remedy this, in each step only the K particles with the highest weights among the KL candidate particles are retained. Then, given a set \(\left \{\left (\mathrm {S}_{1:t-1}^{(k)},w_{t-1}^{(k)}\right), k=1,\dots,K\right \}\) that does not contain duplicate paths, (2) and Bayes' theorem lead to an approximation of the posterior distribution of \(\mathrm{S}_{1:t}\),

$$ \hat{p}^{\text{DSMC}}(\mathrm{S}_{1:t} | \mathrm{X}_{1:t}) = \frac{1}{W_{t}^{\text{DSMC}}}\sum_{k=1}^{K}\sum_{l=1}^{L} w_{t}^{(k,l)}\, \delta\left(\mathrm{S}_{1:t}-\left[\mathrm{S}_{1:t-1}^{(k)} \; s_{l}\right]\right), $$
(5)

where \(W_{t}^{\text {DSMC}}=\sum _{k,l} w_{t}^{(k,l)}\) and \(\left [\mathrm {S}_{1:t-1}^{(k)} \; s_{l}\right ]\) is obtained by appending the state \(s_{l}\) to \(\mathrm {S}_{1:t-1}^{(k)}\). Each weight \(w_{t}^{(k,l)}\) is calculated as

$$ w_{t}^{(k,l)} \propto w_{t-1}^{(k)}\, p(X_{\cdot t} | S_{\cdot t}=s_{l})\, p\left(S_{\cdot t}=s_{l} | S_{\cdot t-1}^{(k)}\right). $$
(6)

The procedure continues until \(\mathrm {S}_{1:n}^{(k)}=\left (\mathrm {S}_{1:n-1}^{(k)},S_{\cdot n}^{(k)}\right)\) and the corresponding weights are obtained. Figure 1 illustrates the propagation of particles in the DSMC.

Fig. 1

Procedure of propagating particles in deterministic sequential Monte Carlo. Each of K particles at step t−1 is propagated to L possible states at step t. Among K×L possible particles, only K particles with the highest weight are selected
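As a concrete illustration of the propagate-and-prune recursion in (5)-(6), the sketch below implements one DSMC step in Python. The weight function is passed in as a callable so that the generic step stays independent of the measurement model; all names are illustrative, and partial haplotype pairs are stored as two strings.

```python
import heapq
from itertools import permutations

BASES = "ACGT"
# the L admissible extensions: ordered heterozygous base pairs (up to 4*3 = 12)
EXTENSIONS = [a + b for a, b in permutations(BASES, 2)]

def dsmc_step(particles, K, weight_fn):
    """One DSMC step per Eqs. (5)-(6): every particle spawns one child per
    admissible state s_l, and of the K*L weighted candidates only the K
    heaviest are kept (with weights renormalized).

    particles : list of ((h1, h2), weight) partial haplotype pairs
    weight_fn : callable giving p(X_.t | S_.t = s_l) * p(S_.t = s_l | history)
    """
    candidates = []
    for (h1, h2), w in particles:
        for s in EXTENSIONS:
            child = (h1 + s[0], h2 + s[1])
            candidates.append((child, w * weight_fn(s, (h1, h2))))
    top = heapq.nlargest(K, candidates, key=lambda c: c[1])
    total = sum(w for _, w in top) or 1.0        # renormalize the survivors
    return [(h, w / total) for h, w in top]
```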

The conditional distribution \(p(X_{\cdot t} | S_{\cdot t}=s_{l})\) in the DSMC weight update (6) reflects the dependence of \(X_{\cdot t}\) on the current state \(S_{\cdot t}\) only, and does not include the phase information between nearby SNP sites (note that it does enable detection of the most likely genotypes at the t-th site). To incorporate the phasing information, we extend the representation of the particle trajectories to a 2nd-order Markov model. In particular, we modify (6) so that the weight updates in ParticleHap depend on the history of the state and the observation at t−1 as well as on the current state,

$$ \begin{aligned} w_{t}^{(k,l)} & \propto w_{t-1}^{(k)}\, p\left(X_{\cdot t} | S_{\cdot t}=s_{l},S_{\cdot t-1}^{(k)},X_{\cdot t-1}\right) \\ & \quad \times p\left(S_{\cdot t}=s_{l} | S_{\cdot t-1}^{(k)},X_{\cdot t-1}\right), \end{aligned} $$
(7)

where \(s_{l}=(s_{l_{1}},s_{l_{2}})\), l=1,…,L. In particular, at step t, ParticleHap examines the potential extensions of the partially reconstructed haplotype by a single SNP site, which requires no more than 12 likelihood calculations: one for each possible ordered heterozygous pair \((s_{l_{1}},s_{l_{2}})\) at site t, and there are 12 such pairs when all 4 bases appear in the t-th column of X.

While the conditioning on \(S_{\cdot t-1}^{(k)}\) and \(X_{\cdot t-1}\) in \(p\left (X_{\cdot t} | S_{\cdot t}=s_{l},S_{\cdot t-1}^{(k)},X_{\cdot t-1}\right)\) in (7) introduces phase information between the SNPs at positions t−1 and t, gaps in the data matrix X remain a major challenge for reconstruction of the unknown haplotype. Even with the 2nd-order Markov model of particle trajectories described above, phase information between two consecutive SNP sites cannot be retrieved from a read that does not cover both of those sites. For example, in Fig. 2, column t+2 contains 5 informative entries (i.e., entries which are not −). However, 2 of them belong to reads which do not cover the SNP at position t+1, i.e., are immediately preceded by −, and thus do not contribute phase information if the weights are computed according to (7). To this end, we modify (7) so that the information spread across gaps within paired-end reads can be utilized for generating particle trajectories. In particular, let us introduce a new variable \(\mathbf {Pos}^{t} = \left \{pos_{i}^{t},\; i = 1,\cdots,m,\; pos_{i}^{t} \in \{0,1,2,\cdots,t\}\right \}\), where \(pos_{i}^{t}\) is the nearest informative (non-gap) position in the i-th row to the left of column t; note that \(pos_{i}^{t+1} = pos_{i}^{t}\) if \(X_{i,t+1}=-\). Also, note that \(pos_{i}^{t-1} = 0\) implies that there are no informative positions in the i-th row left of column t (i.e., \(X_{ij}=-\) for all j≤t−1). With this notation, we rephrase (7) as

$$ \begin{aligned} w_{t}^{(k,l)} & \propto w_{t-1}^{(k)}\, p\left(X_{\cdot t} | S_{\cdot t}=s_{l},S_{\cdot \mathbf{Pos}^{t-1}}^{(k)},X_{\cdot \mathbf{Pos}^{t-1}}\right) \\ & \quad \times p\left(S_{\cdot t}=s_{l} | S_{\cdot \mathbf{Pos}^{t-1}}^{(k)},X_{\cdot \mathbf{Pos}^{t-1}}\right). \end{aligned} $$
(8)
Fig. 2

Information about heterozygous sites provided by paired-end reads and organized in the observation matrix X. Erroneous base characters are highlighted in red font
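The bookkeeping behind \(\mathbf{Pos}^{t}\) is simple to implement; a minimal sketch (illustrative names, reads stored as strings over "ACGT-") is given below.

```python
def advance_pos(pos, X_col_t, t):
    """Compute Pos^{t+1} from Pos^t: a read that covers column t (1-based)
    gets pos_i = t; otherwise pos_i is inherited. pos_i = 0 encodes that the
    read has no informative column to the left of the current one."""
    return [t if x != '-' else p for p, x in zip(pos, X_col_t)]

# usage: pos = [0] * m, then after processing column t update
#   pos = advance_pos(pos, [read[t - 1] for read in X], t)
```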

The measurement model in (8) assumes that the i-th read is randomly generated from one of the two haplotypes, i.e.,

$$ \begin{aligned} p&\left(X_{\cdot t} | S_{\cdot t}=s_{l},S_{\cdot \mathbf{Pos}^{t-1}}^{(k)},X_{\cdot \mathbf{Pos}^{t-1}}\right)\\ &=\prod\limits_{i=1,x_{it}\neq -}^{m} p\left(X_{i,t} | S_{\cdot t}=s_{l},S_{\cdot {pos}_{i}^{t-1}}^{(k)},X_{i,{pos}_{i}^{t-1}}\right), \end{aligned} $$

where

$$ p\left(X_{i,t} | S_{\cdot t}=s_{l},S_{\cdot {pos}_{i}^{t-1}}^{(k)},X_{i,{pos}_{i}^{t-1}}\right) =\left\{ \begin{array}{ll} p(x_{it} | s_{l_{1}}), & \text{if}~ X_{i,{pos}_{i}^{t-1}}=S_{1,{pos}_{i}^{t-1}}^{(k)},\\ p(x_{it} | s_{l_{2}}), & \text{if}~X_{i,{pos}_{i}^{t-1}}=S_{2,{pos}_{i}^{t-1}}^{(k)},\\ \frac{p(x_{it} | s_{l_{1}}) + p(x_{it} | s_{l_{2}})}{2}, & \text{otherwise.} \end{array} \right. $$
(9)

When computing \(p\left (S_{\cdot t}=s_{l} | S_{\cdot \mathbf {Pos}^{t-1}}^{(k)}, X_{\cdot \mathbf {Pos}^{t-1}}\right)\), we assume that there is no correlation between consecutive SNPs. Therefore, the state transition distribution is formed using the composition probabilities, i.e., \(p\left (S_{\cdot t}=s_{l} | S_{\cdot \mathbf {Pos}^{t-1}}^{(k)},X_{\cdot \mathbf {Pos}^{t-1}}\right) =\text {Pr}(S_{1t}=s_{l_{1}})\,\text {Pr}(S_{2t}=s_{l_{2}})\). Note that, in principle, side information such as genotype frequencies or patterns of linkage disequilibrium (LD) can be incorporated into the state transition probabilities.
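Combining Eq. (9) with this uniform transition prior, each candidate extension's weight multiplies \(\text{Pr}(S_{1t}=s_{l_{1}})\,\text{Pr}(S_{2t}=s_{l_{2}})\) by a per-read factor. A minimal sketch of that factor follows, reusing emission() from the Eq. (1) sketch; the partial haplotype pair is stored as two strings and pos_i follows the convention above (illustrative names throughout).

```python
def read_likelihood(x_it, s_l, hap, pos_i, X_i, e=0.01):
    """Per-read factor of the weight update, following Eq. (9): the base x_it
    is scored against the haplotype the read agreed with at its nearest
    informative position pos_i (1-based; 0 means no history), and against the
    average of the two haplotypes otherwise. hap = (h1, h2) is the partial
    haplotype pair S_{1:t-1}^{(k)}; X_i is the i-th read as a string."""
    s1, s2 = s_l
    if pos_i > 0:
        if X_i[pos_i - 1] == hap[0][pos_i - 1]:    # read agrees with haplotype 1
            return emission(x_it, s1, e)
        if X_i[pos_i - 1] == hap[1][pos_i - 1]:    # read agrees with haplotype 2
            return emission(x_it, s2, e)
    return 0.5 * (emission(x_it, s1, e) + emission(x_it, s2, e))
```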

Going back to the example illustrated in Fig. 2, the modification of the weights in (8) now allows ParticleHap to retrieve phase information at position t+2 from the reads (highlighted in orange) that have a gap at position t+1 but cover SNPs at positions to the left of the (t+1)st one. This often has a beneficial effect on the switch error rate, defined as the fraction of phased SNP positions at which the two chromosomes of the assembled haplotype pair must be switched in order to recover the true phase. As an illustration, consider column t in the observation matrix shown in Fig. 2. Since none of the reads that cover the SNP in column t provide any phasing information, the partially reconstructed haplotype pair would equally likely be extended by either (C,G) or (G,C), which might lead to a switch error at position t. This ambiguity is resolved at position t+2 by the reads \(i_{1}=4\) and \(i_{2}=13\) (highlighted in orange), which cover neither site t−1, t, nor t+1, but do cover sites 4 and 1, respectively: relying on those reads in (8)-(9), ParticleHap assigns a larger weight to the particle propagated along the correct state path.

To initialize the algorithm at t=1, all possible assignments of the first SNP site are considered as \(S_{\cdot 1}^{(k)}\), k=1,…,K, with the corresponding weights \(w_{1}^{(l)}\), l=1,…,L, computed as \(w_{1}^{(l)}\propto p(X_{\cdot 1} | S_{\cdot 1}=s_{l})\). From t=2 on, all possible extensions of the (t−1)-long haplotype are considered, and the extensions having non-zero weights are used to generate particles until K such particles are created. Once the set of K particles is formed, a subsequently generated particle is included in the set if its weight is greater than the weight of at least one particle already in the set; the latter is then excluded from the set so that its cardinality remains K.

The ParticleHap algorithm is formalized below.

  • Step 1 (Initialization): For the first SNP position, compute \(w_{1}^{(l)}\propto p\left (X_{\cdot 1} | S_{\cdot 1}^{(l)}=s_{l}\right)\), l=1,…,L. Normalize \(w_{1}^{(l)}\) and store the corresponding possible haplotype pairs in \(S_{\cdot 1}^{(k)}\), k=1,2,⋯,K.

  • Step 2 (Run iterations for 2≤tn): For each step t, t=2,…,n, enumerate all possible extensions of the existing particles \(S_{\cdot t-1}^{(k)}\), thus generating \(S_{\cdot t}^{(k,l)}=\left [S_{\cdot t-1}^{(k)},s_{l}\right ]\), l=1,⋯,L. For all l, compute the weights \(w_{t}^{(k,l)}\) using (8).

  • Step 3 (Particle selection): Select and store K particles \(\left \{S_{\cdot t}^{(k)},k=1,\cdots,K\right \}\) with the highest importance weights \(\left \{w_{t}^{(k)},k=1,\dots,K\right \}\) from the set \(\left \{S_{\cdot t}^{(k,l)},w_{t}^{(k,l)},k=1,\dots,K,l=1,\dots,L\right \}\). Normalize the weights of the selected particles. Go back to Step 2 and repeat until t=n.

  • Step 4 (Haplotype reconstruction): At t=n, assemble the entire haplotype sequence by selecting the particle with the largest weight.
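The steps above translate directly into code. The following condensed Python sketch of the forward pass reuses EXTENSIONS, emission() and read_likelihood() from the earlier sketches; it assumes the read matrix is given as m strings of length n over "ACGT-" and works with plain products of probabilities (a production implementation would use log-weights to avoid underflow on long blocks). It is an illustrative sketch, not the authors' C implementation.

```python
def prune(candidates, K):
    """Step 3: keep the K highest-weight particles and renormalize."""
    top = sorted(candidates, key=lambda c: c[1], reverse=True)[:K]
    total = sum(w for _, w in top) or 1.0
    return [(h, w / total) for h, w in top]

def particlehap(X, K=50, e=0.01):
    """Condensed sketch of Steps 1-4 (forward pass only)."""
    m, n = len(X), len(X[0])
    # Step 1: initialization at t = 1 with all candidate heterozygous pairs
    particles = []
    for s in EXTENSIONS:
        w = 1.0
        for i in range(m):
            if X[i][0] != '-':       # reads covering column 1, origin unknown
                w *= 0.5 * (emission(X[i][0], s[0], e) + emission(X[i][0], s[1], e))
        particles.append(((s[0], s[1]), w))
    particles = prune(particles, K)
    pos = [1 if X[i][0] != '-' else 0 for i in range(m)]     # Pos^1
    # Steps 2 and 3: sequential extension and particle selection
    for t in range(2, n + 1):
        candidates = []
        for (h1, h2), w in particles:
            for s in EXTENSIONS:
                w_new = w * 0.25 * 0.25          # prior Pr(S_1t) * Pr(S_2t)
                for i in range(m):
                    if X[i][t - 1] != '-':
                        w_new *= read_likelihood(X[i][t - 1], (s[0], s[1]),
                                                 (h1, h2), pos[i], X[i], e)
                candidates.append(((h1 + s[0], h2 + s[1]), w_new))
        particles = prune(candidates, K)
        pos = [t if X[i][t - 1] != '-' else pos[i] for i in range(m)]
    # Step 4: report the particle with the largest weight
    return max(particles, key=lambda p: p[1])[0]
```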

Complexity analysis of ParticleHap

Since ParticleHap searches for the most likely haplotypes by sequentially extending the partially reconstructed candidate haplotypes one position at a time, it is computationally very efficient and has complexity that scales linearly with the haplotype length, O(n). On average, the amount of computation needed for haplotype assembly with ParticleHap is \(nKL_{a}C_{a}\), where \(C_{a}\) denotes the average number of bases covering a heterozygous site and \(L_{a}\) denotes the average number of possible extensions of the partially reconstructed haplotype at a heterozygous SNP site. It is worth pointing out that while there are in principle 12 possible pairs \((s_{l_{1}},s_{l_{2}})\) at a site, the 3rd and 4th most frequently reported nucleotides are never selected as potential SNP genotypes by the likelihood calculations. Therefore, ParticleHap can be implemented even more efficiently by retaining the two (or, in the case of ties, three) most frequently observed nucleotides at each site while the others, considered errors, are replaced by −; this step can significantly reduce the number of likelihood calculations without compromising the accuracy of computing the most likely genotype.
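This pre-filtering amounts to a simple per-column frequency filter; one possible sketch is shown below (illustrative; a tie for second place may keep more than three bases, which a production version might cap).

```python
from collections import Counter

def prefilter_column(column):
    """Keep the two most frequently observed nucleotides in a column of X
    (more in the case of ties for second place); all other bases are treated
    as sequencing errors and replaced by '-'."""
    counts = Counter(b for b in column if b != '-')
    if len(counts) <= 2:
        return list(column)
    cutoff = counts.most_common()[1][1]          # count of the 2nd most frequent
    keep = {b for b, c in counts.items() if c >= cutoff}
    return [b if b == '-' or b in keep else '-' for b in column]
```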

Postprocessing

We can further improve the accuracy of the assembled haplotypes by pooling the information obtained from multiple runs of ParticleHap, whose performance may be affected by the choice of the starting position. In particular, we also run the algorithm in the opposite direction, i.e., we perform the sequential Monte Carlo reconstruction of the most likely haplotype starting from site t=n and terminating at site t=1. For the sites where the haplotype pairs reconstructed by the two runs differ, we compare the likelihoods of the solutions (i.e., the weights in (8)) and choose the one with the larger likelihood. If the differing sites are consecutive, we instead compare the MEC scores of the two solutions and choose the one with the smaller MEC.
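One possible way to organize this post-processing is sketched below, given the forward-pass solution and the (re-reversed) backward-pass solution. As a simplification, the sketch resolves each disagreeing site by comparing the joint probabilities (via the Eq. (1) sketch) of the two full solutions with and without the swap; this is a stand-in for the per-site weight comparison of (8) and the MEC tie-breaking for consecutive disagreements described above.

```python
def merge_passes(fwd, bwd, X, e=0.01):
    """Reconcile a forward-pass haplotype pair `fwd` with a backward-pass pair
    `bwd` (already re-reversed into forward site order); both are pairs of
    strings. joint_probability() is the Eq. (1) sketch."""
    f1, f2 = fwd
    b1, b2 = bwd
    # align the arbitrary haplotype labels of the two passes
    if sum(x == y for x, y in zip(f1, b1)) < sum(x == y for x, y in zip(f1, b2)):
        b1, b2 = b2, b1
    h1, h2 = list(f1), list(f2)
    for j in range(len(f1)):
        if (f1[j], f2[j]) != (b1[j], b2[j]):     # the two passes disagree here
            t1, t2 = h1.copy(), h2.copy()
            t1[j], t2[j] = b1[j], b2[j]
            if joint_probability(("".join(t1), "".join(t2)), X, e) > \
               joint_probability(("".join(h1), "".join(h2)), X, e):
                h1, h2 = t1, t2                  # keep the likelier variant
    return "".join(h1), "".join(h2)
```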

Results and discussion

We implemented ParticleHap in C and compared its performance with the publicly available implementations of HapCUT [17] and ReFHap [18]. All three methods were run on a Linux desktop with a 3.06 GHz Intel Core i7 880 processor and 8 GB of RAM. Both real and simulated data are used in the experiments, as described in the remainder of this section.

1000 Genomes Project data

We first study the performance of ParticleHap on 454 sequencing data for the CEU NA12878 genome (1000 Genomes Pilot Project [2]). The short-read data aligned to a reference genome, as well as variant and genotype calls for the individual, are provided. The data comprise a total of 2.58 million reads covering 1.65 million variants on 22 chromosomes. Due to the short read lengths and limited insert sizes, the data are split into a number of disconnected blocks. We use ParticleHap to reconstruct each such block.

To evaluate the performance of haplotype assembly, we adopt three measures: the number of phased SNPs (nPhased), the minimum error correction (MEC) score, and the running time (Time). In particular, nPhased is the number of SNPs phased by a haplotype assembler; the MEC score is the smallest number of entries in the data matrix that need to be changed so that the sequencing information is consistent with an error-free haplotype pair (we report the total MEC score, evaluated as the sum of the MEC scores obtained for the individual haplotype blocks); and Time is the runtime of an algorithm in seconds, measured for each algorithm on the same processor.
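Given an assembled haplotype pair, the MEC score as defined above can be computed directly: each read is charged the smaller of its two Hamming distances (over non-gap positions) to the haplotypes, and the charges are summed. A minimal sketch, with reads as strings over "ACGT-":

```python
def mec_score(X, hap):
    """MEC score of the haplotype pair hap = (h1, h2) against read matrix X:
    the number of base calls that must be changed so every read is consistent
    with the haplotype it best matches."""
    h1, h2 = hap
    total = 0
    for read in X:
        d1 = sum(1 for x, s in zip(read, h1) if x != '-' and x != s)
        d2 = sum(1 for x, s in zip(read, h2) if x != '-' and x != s)
        total += min(d1, d2)      # assign the read to its closer haplotype
    return total
```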

We choose the number of particles K separately for each block: the longer the block length n, the larger the number of particles (K=12 for n≤12, K=n/2 for 12<n<100, and K=50 for n≥100). We assume that the bases in the sequence are equally likely, i.e., the composition probabilities of the 4 nucleotides are all equal to 0.25. We assume the error rate e=0.01, which is consistent with the typical sequencing accuracy of the 454 platform; the sequencing error probabilities are then \(\text {Pr}(X_{ij}=x_{ij} | S_{kj}=s_{l_{k}}) = e/3\) if \(x_{ij}\neq s_{l_{k}}\) and 1−e otherwise.

Table 1 shows that ParticleHap assembles more haplotypes with higher accuracy and better computational efficiency than HapCUT and ReFHap (the lower the MEC score, the better the performance). Clearly, more SNP sites are phased by ParticleHap than by either HapCUT or ReFHap (columns 2, 5 and 8). This can be attributed to the fact that ParticleHap processes reads containing actual base calls (i.e., reads represented over A, C, G, T and −, rather than their binary post-genotype-calling counterparts), thus allowing more than two nucleotides at a site, while HapCUT and ReFHap discard some information by relying on the simplified representation of the reads (the ternary alphabet with elements 0, 1 and −). It turns out that roughly 1.2−1.5 % of the heterozygous positions in the dataset are tri- or tetra-allelic SNP sites. As can be seen from columns 3, 6 and 9, ParticleHap outperforms HapCUT and ReFHap by achieving lower MEC scores on most of the chromosomes. Note that the MEC scores are calculated only for the haplotype pairs phased by an algorithm; thus, the lower MEC of ReFHap on some chromosomes does not necessarily imply that it achieves better performance there, since it may be phasing a smaller number of SNP sites. ParticleHap simultaneously provides longer phased haplotypes and lower MEC scores, demonstrating the high accuracy of the proposed algorithm (note that the total number of reads and allele calls involved in the MEC calculation for ParticleHap is larger than those for HapCUT and ReFHap). This can partly be attributed to the fact that ParticleHap works with genotype likelihoods and allows thorough examination of tri- and tetra-allelic SNP sites, which improves genotype accuracy in situations where the “hard” genotype calls used by the competing methods contain errors. It is also worth pointing out that ParticleHap is designed to sequentially find the maximum-likelihood solution to the haplotype assembly problem rather than to optimize the MEC criterion, whereas HapCUT uses MEC as its optimization objective (ParticleHap uses the MEC score only to correct potential errors in the post-processing step); therefore, the superior MEC performance of ParticleHap demonstrates the robustness of the approach. Finally, as reported in columns 4, 7 and 10, ParticleHap assembles haplotypes significantly faster than either HapCUT or ReFHap. In particular, ParticleHap can complete haplotype assembly for each of the 22 chromosomes within 6 seconds, while HapCUT and ReFHap require 59 and 14 seconds for the same task, respectively.

Table 1 The performance comparison on a CEU NA12878 data set sequenced using the 454 platform in the 1000 Genomes Project

Simulated data

We further test the performance of the proposed method on simulated data sets. In particular, we examine how genotype calling errors affect the performance of haplotype assembly. The data are generated using a strategy similar to the one in [15], except that genotype calling errors are included in our simulated data. We generate a pair of phased heterozygous SNP sequences of length n which have genotype calling errors with probability \(g_{e}\). The parameter \(g_{e}\) is judiciously chosen as 0.04 and 0.08 in order to emulate practical scenarios reported in [1] (there, for the 62 CEU individuals, the genotype call accuracy at high call rates was up to 96 % with the use of LD information, and 78 % and 87 % with the use of a single sample and multiple individuals, respectively). The true haplotype sequences are generated by emulating genotyping errors: each base in the erroneous heterozygous SNP sequences is flipped to one of the other three nucleotides with equal probability \(g_{e}/3\). To generate the observation data matrix, instead of sampling (with reads) from the true haplotype sequences c times as in [15], we sample the true haplotype sequences \(\frac {c}{2}\) times and the sequences containing genotype calling errors \(\frac {c}{2}\) times. Each replicate is randomly partitioned into non-overlapping fragments of length between 3 and 7 (the lengths typical of the benchmarking data sets in [15]). In order to simulate paired-end (or mate-pair) sequences, we randomly merge some of the generated fragments (fragments whose SNPs are in the first half of the haplotype sequence are merged with fragments whose SNPs are in the second half; as a result, half of the fragments in the dataset are paired-end sequences). Once the fragments are arranged in a SNP matrix, we emulate sequencing errors by randomly flipping a base to one of the other three nucleotides with equal probability. The probability that each base is flipped is 0.03 and 0.01 for \(g_{e}=0.04\) and \(g_{e}=0.08\), respectively, and thus the total error rate for the entries in the SNP matrix is e=0.05. To explore the performance of the algorithm over a broad range of experimental parameters, we generate datasets with different SNP sequence lengths (n=100, 200 and 300) and vary the coverage (c=4, 6, 8 and 10) for each genotype calling error rate (\(g_{e}=0.04\) and \(g_{e}=0.08\)). For each of the 24 combinations of parameters, the experiment is repeated 100 times and the results averaged over the 100 instances are reported for each case.
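A compact generator following the spirit of this procedure is sketched below. It is illustrative rather than the exact benchmark generator: in particular, the paired-end merging of fragments is omitted and genotype errors are drawn independently per base.

```python
import random

BASES = "ACGT"

def simulate(n, c, ge, e_seq):
    """Illustrative simulator: a heterozygous haplotype pair of length n is
    sampled c/2 times directly and c/2 times via a genotype-corrupted copy;
    each replicate is cut into fragments of length 3-7 with per-base
    sequencing errors. Returns the true pair and the padded read strings."""
    h1 = [random.choice(BASES) for _ in range(n)]
    h2 = [random.choice([b for b in BASES if b != a]) for a in h1]  # heterozygous
    def corrupt(seq, p):
        """Flip each base to one of the other three with probability p."""
        return [random.choice([b for b in BASES if b != a])
                if random.random() < p else a for a in seq]
    g1, g2 = corrupt(h1, ge), corrupt(h2, ge)    # genotype-calling errors
    reads = []
    for r in range(c):
        pair = (h1, h2) if r < c // 2 else (g1, g2)
        for src in pair:
            j = 0
            while j < n:                          # non-overlapping fragments
                k = min(random.randint(3, 7), n - j)
                frag = ['-'] * n
                frag[j:j + k] = corrupt(src[j:j + k], e_seq)  # sequencing errors
                reads.append("".join(frag))
                j += k
    return ("".join(h1), "".join(h2)), reads
```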

We quantify the ability of an algorithm to reconstruct a haplotype by means of the reconstruction rate [15] defined as

$$ R=1-\frac{\min\left(D(h_{1},\hat{h}_{1})+D(h_{2},\hat{h}_{2}),\;D(h_{1},\hat{h}_{2})+ D(h_{2},\hat{h}_{1})\right)}{2n}, $$

where \((h_{1},h_{2})\) is the pair of true haplotypes, \((\hat {h}_{1},\hat {h}_{2})\) is the pair of reconstructed haplotypes, \(D(h_{i},\hat {h}_{j})=\sum _{k=1}^{n} d(h_{i}[k],\hat {h}_{j}[k])\) is the generalized Hamming distance between \(h_{i}\) and \(\hat {h}_{j}\), and \(d(h_{i}[k],\hat {h}_{j}[k])=0\) if \(h_{i}[k]=\hat {h}_{j}[k]\) and 1 otherwise. The running time (Time(s)) of each algorithm is evaluated along with the reconstruction rate (ReconRate). In addition, we report the rate at which ParticleHap infers the true genotypes at the locations where genotype calling errors were induced, i.e., the improvement rate of genotyping accuracy (labeled ImpGeAc).
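The reconstruction rate follows directly from this definition; a minimal sketch is:

```python
def reconstruction_rate(true_pair, est_pair):
    """Reconstruction rate R as defined above: 1 minus the better of the two
    haplotype-to-haplotype matchings' total Hamming distance, normalized by 2n."""
    def D(h, g):
        return sum(a != b for a, b in zip(h, g))   # generalized Hamming distance
    h1, h2 = true_pair
    g1, g2 = est_pair
    n = len(h1)
    return 1 - min(D(h1, g1) + D(h2, g2), D(h1, g2) + D(h2, g1)) / (2 * n)
```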

Tables 2 and 3 compare the results of ParticleHap, HapCUT and ReFHap for \(g_{e}=0.04\) and \(g_{e}=0.08\), respectively. As can be seen in those tables, ParticleHap assembles haplotypes with reconstruction rates of 97.85 % and 95.68 % when the data are affected by genotype calling error rates of 4 % and 8 %, respectively. This highly accurate performance is achieved in part due to ParticleHap's ability to correct more than 50 % of the mis-called (or uncertain) genotypes in all the considered scenarios, as shown in column 3 of Tables 2 and 3 (i.e., ParticleHap improves the genotype accuracy from 96 % to 98 % and from 92 % to 96 % in Tables 2 and 3, respectively). It is worth pointing out that, in these simulations, we assumed equal prior probabilities of all genotypes; imposing more judicious choices of priors may lead to further improvement of genotyping accuracy. In contrast, the corresponding reconstruction rates of HapCUT and ReFHap do not exceed 96 % and 92 %, respectively. Evidently, incorporating genotyping in the haplotype assembly procedure pushes the limits of the achievable accuracy of haplotype assembly. Note that ParticleHap is consistently much faster than HapCUT and ReFHap in all the considered scenarios. As expected, the running time of ParticleHap increases with both the haplotype length and the sequencing coverage.

Table 2 The performance comparison on a simulated data set for \(g_{e}=0.04\)
Table 3 The performance comparison on a simulated data set for \(g_{e}=0.08\)

Conclusions

In this paper, we presented a novel deterministic sequential Monte Carlo (i.e., particle filtering) algorithm for solving the haplotype assembly problem. ParticleHap sequentially infers the haplotype sequence, one SNP site at a time, by exhaustively searching for the most likely extension of the partially assembled haplotype in each step, examining both the possible genotypes and phase. We tested the performance of ParticleHap on 1000 Genomes Project data, showing that it achieves better minimum error correction scores and phases more heterozygous sites than two of the most accurate existing methods while being significantly more computationally efficient. The results of testing ParticleHap on the simulated dataset also demonstrate that the proposed method can reconstruct haplotypes with higher accuracy and efficiency than those of competing techniques over a wide range of the haplotype assembly problem parameters.

The main goal of ParticleHap is accurate haplotype assembly rather than genotype calling. However, methods that can improve accuracy of genotype calling can be incorporated in the proposed algorithm. For example, the prior information about allele and genotype frequencies or linkage disequilibrium patterns can be incorporated in the proposed algorithm, which may further improve the accuracy of genotype calling and thus of haplotype assembly. In conclusion, the proposed method provides a framework for joint genotyping and haplotyping that leads to accurate haplotype assembly.