Haplotypes are combinations of alleles from multiple genetic loci on the same chromosome that are inherited together; the term haplotype can encompass as few as two loci or refer to a whole chromosome (that is, chromosome-scale haplotype). For diploid genomes, a given length of chromosomal DNA will have two haplotypes, one inherited from each parent, whereas several haplotypes exist for any given chromosomal region at the population level or for polyploid genomes. DNA microarrays and short-read sequencing can determine the collection of alleles at genetic loci (that is, genotypes) but provide no information at the level of haplotypes, whether alleles are co-located on the same copy of a chromosome, or which of the parental chromosomes harbors a particular allele. Hence, computational reconstruction of haplotypes using upcoming sequencing technologies, by either read mapping to a reference genome or de novo assembly, is required.

Haplotype information is fundamental for medical and population genetics [1, 2], where it is used to study genetic variation associated with human diseases [3, 4]. Traditionally, specific SNP locus-specific association to diseases was studied with respect to a linear reference sequence, for example, two SNPs, rs9494885 and rs2230926 in the TNFAIP3 gene w.r.t Grch37 reference, have known correlation with scleritis disease [5]. However, individual haplotypes (or their collection in the form of a pan-genome graph [6], which represents the genetic variations from populations and medical samples) can help to discover highly complex variations such as nested structural variation, inversions, and other complex rearrangements (reviewed in [7]) and to access the full spectrum of rare inherited variants and de novo mutations [8]. For example, the haplotype information is helpful to detect a rare case of keratitis-ichthyosis-deafness syndrome that exhibits a spontaneous correction of a pathogenic mutation by another mutation on the whole-chromosome scale [9]. Additionally, the phenomenon of compound heterozygosity on homologous chromosomes is responsible for recessive Mendelian disorders [4]. The chromosome-scale haplotypes also have functional relevance—the distribution of cis- and trans-acting variants between homologous chromosomes, that is, the phase of variants, can affect gene expression; chromosome-scale haplotypes help study interactions between variants in regulatory elements (long-range promoter-enhancer interactions) [4]. Another highly relevant chromosome-scale haplotyping example is to understand the context of aneuploidy (chromosome loss or gain) in cancer genomes, for example, large copy number gain in centromere 17 for chromosomal instability in breast cancer [10] also requires recent haplotyping approaches. The inference of whole-chromosome haplotypes has clinical relevance: having both variants on the same allele (cis) lead to a specific (for example, super-responder) phenotype, while those variants were on separate alleles (trans) do not. Haplotypes also play an important role in understanding the interplay of evolutionary processes that shape genetic variation, such as recombination, gene conversion, mutation, and selection. For example, modification of plant breeding strategies based on evolutionary processes identified through haplotype reconstruction can result in agricultural improvements [11]. Another highly relevant application occurs in the analysis of viral infections [12], where haplotype reconstruction can help to identify drug resistance and virulence factors and aid treatment decisions [13, 14].

Despite recent advances, sequencing technologies are limited in their ability to cover repetitive genomic regions to produce chromosome-scale haplotypes. Therefore, local (short-range) and genome-wide (long-range) information must be computationally integrated to assemble chromosome-scale haplotypes [15]. The integrative algorithms used for reconstruction must be tuned for the specific genome characteristics of a species, such as genome size, number of haplotypes, and repeat or haplotype variation. Many large-scale sequencing initiatives, such as the Vertebrate Genomes Project [16], the DNA Zoo project (, Darwin Tree of Life (, the Human Microbiome Project (, and the Human Pangenome Project (, have begun to leverage diverse recent sequencing data types (Table 1) to reconstruct haplotypes for various species. These projects have designed and integrated bioinformatic pipelines in a common platform for large-scale genome analyses [24].

Table 1 Third-generation sequencing initiatives and reference data sets

In this Review, we discuss the bioinformatic methods—reference-based de novo and strain-resolved metagenome assembly—to reconstruct haplotypes in diploids, polyploids, and microbial communities. We present the strengths and weaknesses of these methods, alongside examples of their biological applications. Finally, we conclude with challenges and future directions, with an emphasis on both the algorithmic and technological advances required to achieve routine high-quality haplotypes for further biological discoveries.

Evolution of sequencing technologies

Early advancements in sequencing technologies [25], such as next-generation sequencing with read lengths of 150–250 bp and accuracy > 99.8%, revolutionized haplotype reconstruction [26,27,28] and helped to characterize the genomic landscape. However, the fairly short read lengths limit the ability to uniquely span repeats and identify regions of heterozygosity, and these technologies are unable to produce whole-chromosome haplotypes. More recently, developments in long-read sequencing technologies [25, 29] have begun to substantially increase the utility and application of haplotype reconstruction.

Short-range sequence and haplotype information

In the era of third-generation sequencing technologies, we define short-range (or local) sequencing that produces genomic fragments (reads) spanning up to megabases of the genome, but cannot connect across multi-megabase sized regions on the whole-genome scale. For example, long-read sequencing technologies such as single-molecule real time sequencing from Pacific Biosciences (PacBio) [30] and nanopore sequencing [31] (including ultra-long [32]) from Oxford Nanopore Technologies produce reads of the order of hundreds of kilobases in length, with error rates of 6–10% [33] (Fig. 1). The latest developments of PacBio’s HiFi technology can produce reads with an average read length of 15–20 kb at error rates similar to short-read sequencing (that is, an accuracy of > 99%) [34]. These advancements have made it possible to achieve near-complete human haplotypes that include microsatellites, repetitive elements, and other complex structural variations [35], which were previously inaccessible. In addition to these “true” long-read platforms, the Chromium technology from 10x Genomics [34, 36] (10xG) employs genome partitioning and barcoding to generate linked reads that span tens to thousands of bases. Finally, new optical mapping instruments from BioNano Genomics [37,38,39] can rapidly fingerprint megabase segments of a genome, enabling the detection of structural variation at a fairly low cost (Fig. 1). However, local (short-range) sequencing technologies suffer from inability to uniquely resolve near-perfect repeats above the size of their read length to produce full haplotypes. These limitations have necessitated the development of methods that can resolve haplotypes at the chromosome (or genome-wide) scale.

Fig. 1
figure 1

Third-generation sequencing technologies and their characteristics (read length, error rate and scale of information). The read length and scale of information (local versus chromosome-scale) together determine the haplotype range that can be achieved; moving down the schematic this range increases (orange arrow). Sequencing costs per sample increase moving from short-read sequencing down to nanopore sequencing, and then decrease again for BioNano and Hi-C (yellow arrows). Similarly, read length and error rate first increase moving down to nanopore sequencing, and then decrease again for BioNano and Hi-C (green arrows)

Long-range sequence and haplotype information

Long-range (or chromosome-scale) sequencing consists of technologies that produce genomic fragments spanning across centromeres, thereby providing information to connect p- and q- arms over the entire genome, and connecting multi-megabases regions. Chromosome conformation capture methods such as Hi-C and related chromatin crosslinking protocols and produce long-range, mate-pair data for short-read sequencing [37] (Fig. 1). Hi-C technology [40] generates chimeric DNA fragments from two interacting chromosomal regions that are covalently linked together. These fragments are sequenced to produce paired-end reads representing genomic segments of a few kilobases and tens of megabases in physical distance.

Strand-seq is a recent single-cell sequencing advancement that allows independent sequencing of parental template strands and thereby characterization of individual homologues [41, 42]. Specifically, in the presence of bromodeoxyuridine (BrdU) during the DNA replication, sister chromatids generate one original template strand and one newly synthesized, BrdU-incorporated strand. The template strand and its directionality are preserved during the cell division phase that helps to separate the individual homologs.

These sequencing methods provide long-range information on genomic structure across centromeres, and they can be computationally assembled into chromosome-scale [43] at low cost. However, these haplotypes contain many gaps, especially in larger repeat regions. This limitation has led to further advancements in computational approaches for haplotyping (Table 2), such as the use of a hybrid approach that combines data from long-read and chromosome-scale sequencing technologies.

Table 2 Methods and computational tools for haplotype reconstruction

Reference-based haplotype reconstruction

When a reference genome is available, haplotype reconstruction of the target sample comprises identifying co-occurring alleles of paternal and maternal copies over variant sites from sequencing data aligned to the reference. The process of obtaining these haplotypes is known as haplotype phasing [2].

Traditionally, reference panels of more than 100,000 individuals (large-scale projects such as UK10K) are genotyped and used to assign, probabilistically, the most likely local phase of the target sample based on the underlying evolutionary model [72,73,74]. This statistical phasing technique limits chromosome-scale haplotype [25, 75] production because the ancestry tracts from populations or the Mendelian laws of inheritance from trios have only local information to produce the haplotypes [72,73,74]. Phasing directly from sequencing reads, that is, the direct observation of two or more variants on a single molecule or in paired reads derived from the same molecule, overcomes the above limitations [50, 76]. The process of obtaining haplotypes directly from long-read and chromosome-scale sequencing data of a single individual—as opposed to phasing from genotypes by population inference or genetic analysis of pedigrees—is known as molecular haplotyping [2, 37, 77, 78]. Molecular haplotyping can produce chromosome-level phasing [79] that is highly accurate as determined by evaluation metrics (switch error rates and Hamming error rates < 1%). In molecular haplotyping, the key challenge is to disambiguate sequencing errors from true genetic variation.

Diploid phasing

Reconstruction of haplotypes depends on how the heterozygous sites are connected on the chromosome-scale. If there are no reads that connect these sites, then the phasing is fragmented. Thus, the sites must be connected directly or indirectly via sequencing reads to achieve chromosome-scale phasing (Fig. 2). Long- and linked-read sequencing datasets, which span longer segments of heterozygous variants than short reads (Fig. 2), have improved the production of high-quality local phasing segments and the discovery of de novo and rare genomic variants.

Fig. 2
figure 2

Molecular haplotyping techniques in reference-based phasing. Individual haplotypes are derived directly from sequencing data of the target sample based on read alignments to the reference genome. Local and chromosome-scale haplotype phasing make use of short- and long-range sequencing data, respectively; hybrid haplotype phasing combines the two data types

The most widely used, state-of-the-art phasing methods are WhatsHap [44], HapCut2 [45], and ProbHap [46], which generate considerably longer haplotype blocks than short reads, in the order of several megabases in length with a switch error rate of < 0.5%. The performance of these methods is comparable and are optimized for different input data types, for example, all three methods produce comparable phasing completeness/accuracy/continuity using short-range sequencing, while HapCut2 and WhatsHap produce one large haplotype block on the chromosome-level using a combination of short-range and long-range sequencing data. The core aim is to assign all reads to two haplotypes while minimizing the number of sequencing error corrections or flips, also known as the minimum error correction (MEC) problem and weighted minimum letter flip (WMLF) problem [80]. More specifically, the MEC formalism, which is the most widely used, is the process of finding the minimum cost of correcting the sequencing data to partition the read set into two homologous sets such that the alleles between any two reads in any partition match [22, 44]. The MEC formulation is NP-hard [80, 81]. In practice, this formulation is solved using computational techniques such as dynamic programming, probabilistic modeling, graph-based optimization, and linear programming [82]. To scale these algorithms to human-sized genomes and beyond, a combination of greedy heuristics and dynamic programming is prominent [45, 83].

Genome-wide molecular phasing, a task that computes combinatorial solutions at chromosome-scale by using long-range sequencing technologies such as Hi-C and Strand-seq (Fig. 2), is a more challenging algorithmic task. Computational tools used in practice, such as HapCut2 [45] and StrandPhaseR [79], reduce the search space using greedy heuristics based on the MEC formulation. Remarkably, these tools generate haplotype blocks spanning full chromosomes [45]. However, they can typically phase only 50–70% of variants using Hi-C/Strand-seq [48]. This phasing sparseness can be improved with uniform-coverage data that is often difficult to produce experimentally. In this new era of advancements across technologies, hybrid algorithms that combine different data types at local and genome-wide scale are prominent. For example, WhatsHap [44] and HapCut2 [45, 84] both have local as well as chromosome-scale phasing modes. In addition, WhatsHap [44] can perform family-based phasing, which has been shown to give better results than single-individual approaches in terms of accuracy and phasing completeness [85]. The disadvantage is the unavailability of trio sequencing data for various species.

Hybrid approaches (combining long or 10xG reads with Strand-seq or Hi-C datasets [79]) for single individuals are leading the way into production-level efforts and provide competitive phasing performance at chromosome-scale with hamming error rates < 1% and switch error rates < 0.5% by using ~30x HiFi or 10x linked-reads combined with ~30x Hi-C (Fig. 2). State-of-the-art hybrid phasing tools are WhatsHap and HapCut2. HapCut2 directly works on reads from long-read and Hi-C sequencing using likelihood inference optimization, while WhatsHap operates on Strand-seq haplotypes and long-read reads using MEC formulation. These methods have enabled impressive advances in the production of high-quality chromosome-scale phasing, for example, phasing Ashkenazi, PGP-1, Chinese human genomes [15, 86], as well as genomes from the 1000 Genome project for a comprehensive SV callset.

Beyond the above bulk sequencing methods, single-cell phasing [76] has recently been used to study single-cell genomic heterogeneity. However, extremely low sequencing coverage (< 0.05x per cell) has restricted its use in phasing of large multi-megabase segments in individual cells for genome-scale analysis. Recent single-cell phasing methods such as CHISEL [47], Satas et al. [48], and RCK [49] use probabilistic models at a single-cell level that have the advantage of haplotyping rare alleles, which can be used to determine local relationships in allele-specific somatic aberrations, but cannot phase all variants across the genome. Thus, in the near future, combining single-cell and bulk sequencing approaches for phasing may enable accurate and complete genome-wide characterization of genomic heterogeneity, including rare alleles and cancer genomes.

Polyploid phasing

In phasing diploid genomes, the haplotypes are complementary: given the genotype data, determining one haplotype sequence directly identifies the other. However, polyploidy is common in plant genomes, and in the case of a k-ploid sample, k–1 haplotypes need to be computed before the final haplotype can be inferred. For example, there are k! possibilities (instead of two in diploid) to connect a pair of SNPs in the polyploid. A higher number of haplotypes also requires a greater overall sequencing depth, resulting in a larger number of reads per genome to be processed. This additional complexity requires specialized, highly optimized algorithms to resolve polyploid phasing (Fig. 2).

To solve polyploid phasing problems, the maximum likelihood framework is a common algorithmic strategy. HapTree [50, 87] uses the relative likelihood algorithm to identify k-ploidy phasing for first n SNPs, that is conditioned on previous n-1 SNPs. This approach lays the first theoretical foundation of polyploidy phasing problems. A few works have attempted to formulate the problem using approximate MEC formulations, for example, SDhaP [88] solves approximate MEC using semidefinite programming, and H-PoP [53] partitions the reads into haplotypes by solving a generalization of the MEC problem. However, there is an inherent problem in MEC based methods that it leads to inaccurate phasing as demonstrated by Motazedi et al. [89].

To address these shortcomings, local phasing methods such as Ranbow [90] follows graph-based algorithms by leveraging allele co-occurrence in overlapping short reads to produce accurate polyploid phasing, but lacks in haplotype block N50 length. An alternative phasing approach, designed specifically for long-read sequencing, is WhatsHap-polyphase [52] (available as part of the diploid phasing tool WhatsHap) that produces accurate phasing (switch error rates < 1% and hamming error rates < 2%) and better N50 compared to short-read methods. A recently linked-read based method [51]: Hap++ and Hap10 produces slightly more accurate and comparable haplotype block N50 compared to WhatsHap-polyphase, at the cost of efficiency. However, these methods have limitations to produce chromosome-scale haplotypes.

Similar to diploid phasing, additional long-range information can allow for chromosome-scale haplotype reconstruction of polyploid genomes. For example, TriPoly [91] uses family information (parent-offspring trios) to infer haplotypes from either short- or long-read sequencing data. This results in larger haplotype blocks compared to other approaches, in particular in regions with low divergence between haplotypes.

While these methods represent an important step forward in polyploid phasing, hybrid methods that leverage HiFi, or alternatively linked-read data, and long-range Hi-C data can potentially produce chromosome-scale haplotypes for complex repetitive polyploids. In the near future, further algorithmic developments in local k-mer strategies and graphs-based approaches that focus on every haplotype could enable chromosome-scale phasing in large polyploid genomes.

De novo haplotype assembly

De novo genome assembly exploits the overlaps between sequencing reads, without any bias towards reference sequence. The main steps in a standard genome assembly workflow are sequence graph construction, error correction, contig formation, scaffolding, and polishing of the assembled sequences [92]. The most widely used assembly software is Canu [93], FALCON [94], Flye [95], wtdbg2 [96], and shasta [97]; we refer the reader to a review by Sedlazeck et al. [98] for a literature survey on genome assembly using the latest technologies. Of particular interest to haplotype assembly are assemblers specifically designed for PacBio HiFi datasets, which, for the first time, improved the per-base quality of assemblies dramatically [99] and reduced the need for computational intensity of the error correction step.

After contig construction, the next step is to create scaffolds by ordering and orienting contigs along the chromosomes using chromosome-scale information sources, such as Hi-C data. Scaffolding with the chromosome-scale data types has resulted in chromosome-scale consensus assemblies of human genomes.

Due to sequencing errors, the reads often undergo error correction before contigs are formed; this is particularly relevant when using error-prone long-read sequencing technologies. Despite the error correction process, contigs and scaffolds may still be erroneous and thus another round of error correction is performed (now referred to as polishing) using tools such as Racon [100].

However, for diploid and polyploid genomes, most standard de novo assemblers collapse the haplotypes into a single consensus sequence, but it is important to realize that all haplotype information is ignored. Nevertheless, the consensus assembly is useful for de novo haplotype assembly approaches as discussed below.

Reconstructing every individual haplotype from sequencing data instead is known as de novo haplotype assembly (Fig. 3) and is even more challenging than consensus generation (de novo assembly) due to varying repetitive and heterozygosity rates, noisy sequencing data, chimeric reads, insufficient read length, and non-uniform coverage.

Fig. 3
figure 3

Haplotype-aware de novo assembly. Collapsed assembly approaches identify sequence variants on a consensus assembly and subsequently phase these variants into haplotypes using chromosome-scale data (Hi-C). Semi-collapsed approaches follow a similar approach, but after phasing, variants in the initial assembly graph are updated and final contigs are produced based on this updated graph. Uncollapsed approaches directly determine haplotype-specific overlaps in local sequencing reads by retaining SNPs and repeat variation in all possible overlaps and construct haplotypes based on the selected overlaps

In de novo haplotype assembly, there are two major related challenges: finding ordering of sequencing reads and distinguishing reads to haplotypes. To find ordering of reads, the brute-force approach is to align all reads to all other reads, where the performance is directly proportional to the square of the number of reads. In repetitive regions, finding alignments of reads is even more expensive. For systematic study, overlap-based [101] or de Bruijn graph [102]-based techniques are used. To solve another challenge of finding haplotype of reads, the commonly used approach was heterozygous SNPs informative sites to partition reads to haplotypes in the space of single consensus sequence (due to high error rates in long-read PacBio and ONT data); however, latest advancements in Hifi allowed to separate reads to haplotypes during the overlapping step as discussed below.

Diploid haplotype assembly

Algorithms for long-read sequencing are now able to produce megabase contigs for haplotypes and improve the availability of reference-quality genomes for humans and various other eukaryotic organisms [23, 103, 104]. This technique has been applied to assemble phased sequences of humans [55, 103] (Table 1), diploid potato [105], zebra finch [54], cattle [54], and goat genomes [106]. Broadly, the bioinformatic approaches for diploid assembly fall into three classes: collapsed, semi-collapsed, and uncollapsed (Fig. 3).

In collapsed diploid assembly, generic de novo assemblers are used to generate a consensus sequence. Subsequently, by using heterozygous SNP information from reads aligned to the consensus sequence, long-read and chromosome-scale sequencing reads are partitioned into haplotype-specific read sets, which are then separately assembled into haplotypes. This technique is used by tools such as DipAsm [55] or Porubsky et al. [56], resulting in phased contigs of up to several tens of megabases and chromosome-scale phased scaffolds, with haplotype sizes of ~ 3 Gb each and overall base quality scores of >Q48. The preferred input data types are PacBio Hifi and Hi-C. This technique works well for the human genome in regions of low heterozygosity, but fails in highly repetitive and high heterozygosity regions.

Alternatively, the widely used FALCON-unzip [23] method uses a semi-collapsed approach for diploid assembly from long noisy reads, where the initial assembly graph is generated using FALCON and a consensus sequence is generated. Similar to the collapsed approach, reads are partitioned into haplotype-specific sets using SNP information. Phased read information is then used to update the initial assembly graph, and phased contigs (size of about several tens of megabases with quality score < Q48) are reported [94]. These phased contigs are then combined into scaffolds using phase information (> 1 Mb) provided by ultra-long nanopore or Hi-C data, as employed by FALCON-Phase [54], producing a chromosome-scale diploid assembly. The preferred input data types are PacBio CLR and Hi-C. Similar to the collapsed approach, it works particularly well for human genomes when the heterozygosity rate is low, but fails in regions or genomes with high repeat and heterozygosity rates. However, the most promising uncollapsed approaches overcome these limitations by directly determining haplotype-specific overlaps in the overlap step of graph generation using SNP information from overlapping reads [58]. The core idea is to preserve heterozygosity and repeat information from various data types in the graph space. To achieve this, on every reference read, similar reads from the same haplotype and repeat are detected based on shared alleles at SNP sites and are clustered together. Standard tools use run-length encoding or base-level alignment [57] in the overlap step. Thus, a haplotype and repeat-aware overlap graph is generated with subsequent graph cleaning steps, finally reporting phased contigs.

The recent invention of PacBio HiFi technology has made the diploid assembly process, that entains ordering as well as the phasing in the assembly process, easier [58]. A whole generation of new algorithms based on uncollapsed approaches have become possible due to the availability of accurate long-read data and are implemented in tools such as Hifiasm (, HiCanu [57], and SDip [58], producing contigs with lengths of several tens of Mb having base quality scores >Q50, but phased blocks of only a few hundreds of kb. In these systems, the field is moving towards accurate HiFi data using k-mer based strategies for haplotype-aware error correction of phased contigs, which can be completed in a few hours for human-scale genomes. Similar to semi-collapsed approaches, these phased contigs can be combined into phased scaffolds using long-range information to produce a chromosome-scale diploid assembly. For phased scaffolding, one of the largest challenges is the development of computational models that combine both phasing and scaffolding information together, an approach that is recently explored by the pstools method ( in the graph sequence space. The preferred input data types are PacBio Hifi and Hi-C.

When trios are available, methods such as TrioCanu [59] search for k-mers from maternal and paternal haplotypes in reads that were sequenced from the child to produce haplotype-specific read sets and then assemble these separately. New methods (Hifiasm+trio and WHdenovo [60]) use both trio and local phasing information from sequencing data, thus resulting in high-quality phased contigs. Although pedigree-based haplotype assembly allows for improved accuracy as compared to haplotype assembly of individuals, it requires sequencing of three individuals which limit its applications. Moving forward, substantial improvements to the uncollapsed approach using graphs and k-mers from local and chromosome-scale sequencing datasets in single individuals are expected to become routine for chromosome-scale diploid assembly within the next few years.

Polyploid haplotype assembly

Polyploid assembly is in principle an immediate extension of diploid assembly; however, an increase in the number of haplotypes inflates the search space dramatically on the whole-genome scale. Some progress has been made in computational approaches for local haplotype assembly in polyploids as a potential step towards chromosome-scale polyploid assembly. For example, using Illumina short-read sequencing reads, POLYTE [62] performs overlap graph-based de novo assembly for diploids and polyploids. Since short reads cannot span difficult-to-assemble regions such as long repeats or variant deserts, the haplotype-specific contigs produced by this algorithm remain relatively short, yet highly accurate. Alternatively, linked-read technologies have been used to obtain long, contiguous, polyploid genome assemblies [107]. Long-read based methods such as SDA [61] and SDip [58] have demonstrated their ability to assemble polyploid regions in human genomes several megabases in length (Fig. 3). With some further algorithm engineering, it should be feasible to apply these methods to obtain chromosome-scale haplotypes of polyploid genomes with chromosome sizes of less than tens of megabases in size. However, for large complex repetitive polyploid genomes, algorithm development is required to exploit the latest HiFi sequencing, and its combination with other technologies such as Hi-C, by separating all haplotypes during the assembly process in graphs on the whole-genome scale.

Strain-resolved metagenome assembly

Haplotype reconstruction plays an important role in strain-resolved metagenome assembly, that is, the computational reconstruction of haplotypes from pooled sequencing to identify microbial strains (Fig. 4a). Variation within and across species and low per-strain (haplotype) sequencing depth across different datasets make it extremely difficult to distinguish genetic variation from sequencing errors [108, 109]—a microbial sample can contain several hundreds of haplotypes with levels of variation ranging from < 1 to > 5%. Similar to de novo haplotype assembly for diploid and polyploid genomes, additional hurdles are the longer repeats and homologous regions between closely related strains (that is, intergenomic repeats) relative to sequencing read lengths.

Fig. 4
figure 4

Strain-resolved metagenome assembly. a Given a pooled sequencing sample, the goal of strain-resolved metagenome assembly is to reconstruct all individual microbial strains. b A typical workflow consists of four steps: de novo assembly, contig binning, bin-wise re-assembly, and assembly curation. Each step can be performed at the species-level or at the strain-level, as illustrated in the left and middle column, respectively. Some workflows skip the initial de novo assembly step and perform strain-resolved binning directly on the sequencing reads, which can be reference-guided (right column)

The bioinformatic approaches for metagenome assembly are highly related to haplotype reconstruction in diploids and polyploids, as noted by Kolmogorov et al. [110] and Nicholls et al. [111]. In practice, various diploid or polyploid haplotyping approaches are adapted to solve the strain-resolved metagenome assembly problem [67, 68, 110] and vice versa [62]. Here, we classify two classes of methods: species-level and strain-resolved metagenome assembly—species-level reconstruction aims at constructing a single (consensus) haplotype per species, while strain-resolved assembly aims at every strain of species. Each class of methods can further be distinguished into reference-based (database of species/strains) and de novo approaches—similar to diploids and polyploids approaches. The advantage of reference-based approaches is that they are efficient, but they often lead to biases towards the database(s) being used, for example, due to incompleteness of reference databases, as many microbes on Earth remain uncharacterized [112]. This type of reference bias is even more pronounced than the reference bias observed in diploid and polyploid assembly, hence de novo algorithms are an essential component of any complete, unbiased analysis of metagenomes.

From an author’s perspective, we present a general workflow for metagenome assembly that consists of several steps [113,114,115,116] (Fig. 4b): (1) de novo metagenome assembly to produce contigs or scaffolds; (2) contig binning per genome, either de novo or reference-guided; (3) mapping reads back to individual bins and reassembling each bin; and (4) curation of the resulting assembly per bin. In theory, each of these steps can be performed at the species-level or at the strain-level, depending on the goals of the study.

Short-read metagenome assembly

The commonly used data structures for metagenome assembly are de Bruijn and overlap graphs with special tuning of parameters related to sequencing depth, variations, and errors. For example, IDBA-UD [63] is the first metagenome assembly method based on de Bruijn graphs, and LSA [117], a method that uses k-mers to identify (partial) bacterial strains in short-read sequencing data with relative abundances as low as 0.00001%. Other de novo approach tools are MEGAHIT [118] and metaSPAdes [119], but these can only produce species-level assemblies. Alternatively, several approaches are based on single-nucleotide variants (SNVs), which are identified using metagenome assemblies or reference databases, or entirely de novo—see REF [120] for a detailed review. The major limitation of such approaches is that structural variants are completely ignored. Available methods for SNV-based metagenome assembly with strain-resolution include ConStrains [66] and StrainFinder [67], both of which can trace strain identities across multiple samples (a longitudinal time series). Recently, a Bayesian model for local haplotype reconstruction was proposed in a promising approach called Gretel [68], which is based on a new data structure designed to efficiently store variation across sequencing reads. All of these methods (ConStrains, PathFinder, Gretel) aim at strain-level sensitivity in step 1 (Fig. 4b). Another class of methods achieve strain-level sensitivity in step 3 while relying on species-level sensitivity described in steps 1 and 2 (Fig. 4b). DESMAN [64] is one such method, which leverages base haplotype frequencies in a Bayesian model. Finally, if strain-level assembly is not achieved in steps 1–3, further curation in step 4 can help to identify intra-species variation [121] (Fig. 4b).

These short-read methods take an important step in strain-level metagenome assembly field and are widely used in studying the human microbiome, health and disease [122, 123], as well as the biodiversity of marine ecosystems [64]. These methods establish the first step towards producing chromosome-scale metagenome assembly.

Hybrid metagenome assembly

Local and chromosome-scale sequencing is essential in achieving chromosome-scale, strain-resolved metagenome assemblies. Recently, a hybrid metagenomic assembly approach (OPERA-MS) was proposed that combines short-read contig assembly with long-read scaffolding and binning to obtain high-quality, strain-resolved metagenomes [65]. OPERA-MS provides an order-of-magnitude improvement in contiguity compared to short-read metagenomic assemblers and a 200% increase compared to generic long-read assemblers. As little as 7x haplotype coverage with long reads was sufficient to obtain megabase N50 genomes [65]. Alternatively, the first long-read metagenome assembler (MetaFlye [110]) proposes the use of local k-mer distributions to identify species of low abundance. MetaFlye can assemble haplotypes with as little as 10x per-haplotype coverage [110], though the extent to which it can distinguish between closely related strains remains to be evaluated. Another approach by Anoton et al. [124] uses long-read assembly (with MetaFlye), followed by assembly curation using short- and long-read data. Yet, another approach, MetaMaps [69], offers strain-level long-read binning, but this requires a reference database and therefore complicates discovery of new haplotypes (Fig. 4b).

Alternatively, the combination of Hi-C and shotgun sequencing enables chromosome-scale, strain-resolved metagenome assembly through improved clustering of metagenome-assembled contigs at strain level, as well as linking of plasmid sequences to the chromosomes of their hosts [40, 70, 71]. Such an approach has recently been used to leverage structural information obtained from Hi-C data of the human gut microbiome to perform strain-level assembly and enable tracking of microbial evolution over time [125].

For complex repetitive metagenomes, HiFi reads, in combination with Hi-C, have the ability to become the strategy of choice to produce complete, strain-level resolved metagenome assemblies in the near future.

Remaining challenges and perspectives

Repetitive regions

Haplotype reconstruction remains challenging in multi-megabase complex repetitive regions. Despite considerable time and effort, the current version of the human reference genome either contains gaps or is collapsed in these regions without haplotype-level resolution. These regions include tandem repeats [126], segmental duplications [127, 128], sex chromosomes (containing complex heterochromatin repeat structures) [129, 130], the mitochondrial genome [131], pseudo-autosomal regions [132] (PARs), centromeres (or pericentromeric regions) [133], ribosomal DNAs [134] (or acrocentric regions), and subtelomeric regions [135]. For example, the human genome includes complex satellite arrays of repeats in centromeres. α-satellite DNA contains ∼171-bp tandem repeats that are organized into higher-order repeats (HORs), with a single repeat structure reiterated over hundreds or thousands of times with high (>99%) sequence conservation [136]. Some human chromosomes comprise ~3200 repeats of ~2 kb HORs and ~1,100 repeats of a 1.8-kb HOR unit [137]. The centromere assembly produced by state-of-the-art tools (centroFlye [138], HiCanu [57]) using HiFi and ultra-long nanopore reads is haploid [20]. Humans are diploid and should produce two haplotype sequences in centromeres; however, currently, there are no algorithms, technologies, or tools to achieve this goal.

Recent developments in long, accurate long reads (Hifi) as well as ultra-long nanopore reads could pave the way for new advancements in finishing centromeric and other highly repetitive regions in humans and polyploids. Computationally, Hifi reads can be decomposed into monomers [139] that are represented in the graph, where monomers are nodes and edges represent the adjacencies of node sequences from reads. In this process, the haplotype variation is also considered in the monomers that can result in a haplotype-aware graph. Through this graph, the ultra-long nanopore reads are anchored to potentially find ordering between repeating units and disentangle the graph. On a complex centromeric region involving >2000 repeat units, the in situ information such as chromosome visualization [140] can further be helpful to order the repeating units. Further increase in read lengths to several megabases in size, and/or reads with spatial coordinates, and/or longer reads with > 99% accuracy, as well as innovations in k-mer and graph-based strategies to distinguish variations/motifs may enable exploration of high-resolution haplotypes in these human centromeric regions.


Developments are required to scale haplotype reconstruction efforts to overcome current limitations and enable routine application to more than hundreds of genomes at a time. Such developments require innovations in technologies that are cheaper and easy to use than long-read, HiFi and Hi-C sequencing. Alternatively, a further reduction in sequencing costs of existing technologies will be required to scale up efforts.

With an exponential growth in datasets, the real challenge will be to store and access haplotyping data in an efficient way, which can potentially be achieved by applying massive parallelism (detailed reviews in [141, 142]). In addition, cloud-based strategies will be required for storing, accessing, and sharing data (for example, Building a collaborative haplotyping platform that can serve as a repository of data and computational tools and enable exchange of ideas for the scientific community may help to usher in a new era of biological discoveries.

Further integration of datasets using scalable bioinformatics approaches will be important. Innovative algorithm engineering (for example, using sequence sketches instead of full sequences has been shown to vastly reduce storage and memory requirements [143]) could enable production-level integration of datasets for haplotype reconstruction. Beyond engineering efforts, combining reference-based and de novo approaches will improve scalability. More specifically, genomes that are similar to known samples can be reconstructed efficiently using reference-based approaches, thus reducing de novo efforts to the remaining highly divergent genomes.

Validation, benchmarking, and annotation

For the final haplotype assemblies of diploid genomes, many high-quality benchmarks are available, and validation is done with standardized evaluation metrics as a standard practice for non-repetitive regions. This is not the case for polyploids, tumors, and complex repetitive regions in diploids. Innovations in algorithms (beyond k-mer approaches) that improve the capability for assessment and biological validation could benefit from a public collection of high-quality benchmarks, for example in the form of a community-driven assessment initiative similar to the Critical Assessment of Metagenome Interpretation [108] (CAMI), Assemblathon [144, 145], and Genome Assembly Gold-standard Evaluations [146] (GAGE). As the field advances to produce high-quality chromosome-scale phased sequences, the next critical step will be in the development of new gene annotation tools [147] to enable more precise downstream analyses in the coming decade.


Another challenge is the visualization of large-scale haplotyping raw sequencing datasets and haplotype sequences from multiple species. The combination of long haplotype sequences and divergence across or within genomes and the large diversity of haplotyping data types pose numerous visualization challenges. While a number of tools exist (reviewed in [148, 149]), none can be used to visualize large-scale phased sequences. New visualization techniques will be required that enable abstractions or reductions in data dimensions from multi-scale, multiple data measurements, binary encoding of variations and divergence across haplotypes for visual maps, and discovery of informative patterns in the haplotyping data. Interactive visualization or animation in the chromosome-scale coordinate system can be useful.


Chromosome-scale haplotype reconstruction has yielded new insights into the genetic underpinnings of disease pathogenesis, evolution, and comparative biology. To overcome the limitations of sequencing reads to cover genomic repeats, chromosome-scale haplotype reconstruction using a combination of long-read (HiFi and ultra-long ONT) and chromosome-scale sequencing (Hi-C) datasets, along with integrative algorithms, has become a common strategy to produce haplotypes in diploids, but not polyploids yet.

Improvements in fragment lengths and combining complementary technologies through innovative algorithms (graphs, k-mers and data-driven) will be state-of-the-art to reconstruct high-quality haplotypes with fewer gaps in the near future. Both fragment accuracy and length—a few megabases size with accuracy of > 98%—could be important to finish haplotypes. Major reductions in sequencing and computing costs will be critical to scale efforts to thousands of genomes at a time. In the next decade, algorithmic and technological advances, paired with the incorporation of haplotypes with disparate layers of biological information, could mark a new era of gapless end-to-end haplotypes and further our understanding of complex biological phenomena.