1 Introduction

The genetic differences between a sample and a reference genome are called variants. Variant calling is the process of finding these variants. A typical variant caller detects only genotypes, that is, the set of alleles present at each locus. Phasing is the process of resolving these genotypes further into haplotypes, that is, determining whether an allele of one variant is on the same chromosome copy as an allele of another variant or on a different one. Read-based phasing, also called haplotype assembly, uses sequencing reads covering at least two heterozygous variants to achieve this.

WhatsHap is a read-based phasing tool whose main component is an exact fixed-parameter tractable (FPT) algorithm that finds an optimal solution to the phasing problem, which is often formulated as the Minimum Error Correction (MEC) problem [1]. WhatsHap is highly accurate in practice [2].
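To make the Minimum Error Correction objective concrete, the following Python sketch brute-forces it on a toy example. It is not WhatsHap's FPT algorithm: the function name `mec_brute_force`, the read encoding (dictionaries mapping a variant index to the observed allele), and the toy reads are all assumptions made for illustration. It only shows which quantity is minimized for a diploid sample in which all considered variants are heterozygous.

```python
from itertools import product


def mec_brute_force(reads, n_variants):
    """Return (cost, haplotype) minimizing the number of allele corrections.

    reads: list of dicts mapping variant index -> observed allele (0 or 1).
    All variants are assumed heterozygous, so the second haplotype is the
    complement of the first. Exponential in n_variants: toy inputs only.
    """
    best_cost, best_hap = None, None
    for hap1 in product((0, 1), repeat=n_variants):
        if hap1[0] == 1:
            continue  # skip complements; they describe the same phasing
        hap2 = tuple(1 - a for a in hap1)
        cost = 0
        for read in reads:
            # Each read is assigned to the haplotype it disagrees with least;
            # the disagreements are the alleles that would need correcting.
            d1 = sum(allele != hap1[i] for i, allele in read.items())
            d2 = sum(allele != hap2[i] for i, allele in read.items())
            cost += min(d1, d2)
        if best_cost is None or cost < best_cost:
            best_cost, best_hap = cost, hap1
    return best_cost, best_hap


# Hypothetical reads over four heterozygous variants (indices 0..3).
toy_reads = [
    {0: 0, 1: 0, 2: 1},
    {1: 0, 2: 1, 3: 0},
    {0: 1, 1: 1, 2: 0},  # comes from the complementary haplotype
    {2: 0, 3: 1},
    {1: 0, 2: 1, 3: 1},  # conflicts with the second read at variant 3
]
cost, hap = mec_brute_force(toy_reads, 4)
print(cost, hap)
```

For these toy reads, one allele has to be corrected, and the reported first haplotype is 0010 (the second is its complement, 1101). The brute-force search is exponential in the number of variants and is only meant to illustrate the objective.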

Since WhatsHap comes with extra tools to analyze already phased data, it is also often used for other phasing-related tasks. It expects and outputs data in standard formats (BAM/CRAM/VCF/BCF), making it interoperable with other tools.

WhatsHap is open-source software released under the MIT license; documentation is available at https://whatshap.readthedocs.io/.

2 Installation

  1. 1.

    Follow the installation instructions for Conda and Bioconda [3] at http://bioconda.github.io/.

  2. 2.

    Create a new Conda environment containing WhatsHap:

    conda create -n whatshap-env whatshap

  3. 3.

    Activate the environment:

    conda activate whatshap-env

    This needs to be repeated every time you open a new shell (or terminal window) before you can use WhatsHap.

  4. 4.

    Optionally, install samtools and bcftools [4], which are usually needed for preprocessing input files:

    conda install samtools bcftools

  5. 5.

    Ensure you have the most recent WhatsHap version (1.2 at the time of writing) by running

    whatshap --version

3 Using WhatsHap

Installing WhatsHap will make available a single whatshap command, which offers various subcommands, provided as the second word on the command line, to give access to WhatsHap’s functionality. Run whatshap --help to see a list of subcommands. The ones discussed here are

  • whatshap phase—Phase diploid samples

  • whatshap polyphase—Phase polyploid samples

  • whatshap stats—Show statistics for a phased variant file

  • whatshap compare—Compare two or more phased variant files

  • whatshap haplotag—Assign reads in an alignment file to their haplotype

Display help for a subcommand by running whatshap subcommandname --help.

For brevity, the commands provided in this chapter use short file names such as ref.fa for the reference FASTA file, which will need to be adjusted in actual use. The examples show usage of .bcf and .bam files, but .vcf, .vcf.gz and .cram are also supported.

See Fig. 1 for a flowchart summarizing the procedure described in the following sections.

Fig. 1 Overview of the process for obtaining phased variant calls, statistics, and a haplotagged alignment file starting from unaligned reads. The flowchart starts from short and long reads in FASTQ format and a reference in FASTA format; all inputs pass through several processing steps before reaching whatshap phase. For better readability, the index files that must accompany the FASTA file, the alignment files, and the variant files are not shown

3.1 Preparing Alignment and Variant Files

For phasing, WhatsHap requires a variant file in BCF or VCF format (which can be bgzip compressed), one or more alignment files in BAM or CRAM format (containing aligned reads), and a genome reference in FASTA format. All files must be accompanied by a matching index file. For best results, two sets of reads are recommended: High-quality reads for variant calling (such as Illumina) and long reads for phasing (such as Oxford Nanopore or PacBio, see Note 1).

When starting from raw reads, read mapping and variant calling are necessary. This can be done in many different ways and is not needed if mapped reads and variant files are already available. If they are not, we provide here a minimal, but functional example workflow.

  1. 1.

    Install FreeBayes and BWA:

    conda install 'freebayes>=1.3.5' bwa

  2. 2.

    Index the reference:

    bwa index ref.fa

  3. 3.

    Map Illumina short reads (assuming paired-end data in R1.fastq and R2.fastq):

    bwa mem -R '@RG\tID:1\tSM:samplename' ref.fa R1.fastq R2.fastq | \

    samtools sort --write-index -o illumina.bam -

  4. 4.

    Call variants (creates variants.bcf):

    freebayes -f ref.fa illumina.bam | bcftools view -o variants.bcf -

  5. 5.

    Map long reads (creates longreads.bam), ensuring correct metadata (see Note 2):

    bwa mem -R '@RG\tID:1\tSM:samplename' ref.fa longreads.fastq | \

    samtools sort --write-index -o longreads.bam -

3.2 Phasing Diploid Samples

Ensure you have a genome reference in FASTA format, a variant file with high-quality variant calls, and one or more alignment files with long reads. Variant and alignment files can contain multiple samples (see Note 3).

  1. 1.

    Index the reference file (skip if the .fai file already exists):

    samtools faidx ref.fasta

  2. 2.

    Index the variants file (skip if a .csi or .tbi file already exists):

    bcftools index variants.bcf

  3. 3.

    Index the alignment file (skip if a .csi or .bai file already exists):

    samtools index longreads.bam

  4. 4.

    Run read-based phasing (see Note 4):

    whatshap phase --indels --reference=ref.fasta -o phased.bcf variants.bcf longreads.bam

    WhatsHap augments variants in variants.bcf with phasing information for all samples (see Notes 5 and 6) by setting the GT (genotype) fields of heterozygous variants to either 0|1 or 1|0 and adding a PS (phase set) tag (see Note 7). The result is written to phased.bcf. Variants that could not be phased (see Note 8) are left unchanged. Because the --indels option is used (see Note 9), indels and multi-nucleotide variants such as GAC → GTA or GAC → GGTT are phased as well.

  5. 5.

    If you get no phased variants and also encounter the message “Found 0 reads covering 0 variants,” check the metadata in both files (see Notes 10 and 11).

  6. 6.

    Continue in Subheading 3.5 to compute statistics for the created variant file.
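The way the GT and PS fields encode the phasing can be illustrated with a small Python sketch that groups phased variants by phase set. It parses plain-text VCF lines only (not BCF), and the records, positions, and the function name `phase_sets` are hypothetical and chosen for illustration; real files should be read with a proper VCF library.

```python
from collections import defaultdict

# Hypothetical single-sample VCF records (tab-separated) in the style of a
# phased output file; positions, alleles, and PS values are made up.
VCF_LINES = """\
chr1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:PS\t0|1:1000
chr1\t1500\t.\tC\tT\t50\tPASS\t.\tGT:PS\t1|0:1000
chr1\t2100\t.\tG\tA\t50\tPASS\t.\tGT\t0/1
chr1\t9000\t.\tT\tC\t50\tPASS\t.\tGT:PS\t0|1:9000
chr1\t9400\t.\tGAC\tGTA\t50\tPASS\t.\tGT:PS\t0|1:9000
""".splitlines()


def phase_sets(lines, sample_index=0):
    """Group phased variants by (chromosome, PS value).

    Returns a dict mapping (chrom, ps) to a list of
    (position, allele on haplotype 1, allele on haplotype 2).
    Unphased genotypes (written with '/') are skipped.
    """
    blocks = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue  # header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        keys = fields[8].split(":")  # FORMAT column
        values = dict(zip(keys, fields[9 + sample_index].split(":")))
        gt = values.get("GT", "")
        if "|" not in gt or "PS" not in values:
            continue  # variant was not phased
        a1, a2 = gt.split("|")
        blocks[(chrom, values["PS"])].append((pos, int(a1), int(a2)))
    return dict(blocks)


for (chrom, ps), variants in phase_sets(VCF_LINES).items():
    hap1 = "".join(str(a1) for _, a1, _ in variants)
    hap2 = "".join(str(a2) for _, _, a2 in variants)
    print(chrom, ps, hap1, hap2)
```

In this toy input, the variant at position 2100 keeps its unphased 0/1 genotype and therefore belongs to no phase set, while the other four variants fall into two phase sets of two variants each.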

3.3 Pedigree Phasing

To phase multiple related individuals, such as trios, WhatsHap implements a pedigree-aware phasing mode [5] that combines read-based phasing with phasing based on the Mendelian rules of inheritance (genetic phasing). Highly accurate results are thus possible even at low read coverage (2X). Even if no reads are available in a region, a pure pedigree-based phasing can still be computed.

  1. 1.

    Prepare a PLINK-compatible .ped file (https://zzz.bwh.harvard.edu/plink/data.shtml#ped) that describes the family relationship between the samples. Each row describes one father/mother/child trio, so only one row is needed to phase a trio. An example trio.ped file:

    Family1 child father mother 0 1

    The six fields are: family ID, individual ID, father ID, mother ID, sex, phenotype, but WhatsHap ignores all except individual, father and mother ID. These IDs must match the sample names in the variant and alignment files.

  2. 2.

    The input variant file needs to contain the variants for all samples to be phased. If necessary, merge single-sample variant files into a multi-sample one:

    bcftools merge -o trio.bcf father.bcf mother.bcf child.bcf

  3. 3.

    Run pedigree-aware phasing (see Note 12) by providing the --ped option, ensuring you provide reads for all samples (see Note 6). This assumes a constant recombination rate across the chromosome (see Notes 13 and 14):

    whatshap phase --indels --reference=ref.fasta --ped=trio.ped -o phased.bcf variants.bcf mother.bam father.bam child.bam

    The haplotypes in the resulting variant file are reported as paternal|maternal in the GT tag. That is, the first allele is the one inherited from the father and the second one is the allele inherited from the mother (see Note 7).

  4. 4.

    Continue in Subheading 3.5 to compute statistics for the created variant file.
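The genetic-phasing component can be illustrated for a single biallelic variant with the Python sketch below. This is not WhatsHap's pedigree algorithm, which additionally uses reads and recombination costs; the function name `phase_child_by_inheritance` and the tuple encoding of genotypes are assumptions made for illustration. The sketch only shows when Mendelian inheritance alone determines the paternal|maternal order of a child's alleles.

```python
def phase_child_by_inheritance(child, father, mother):
    """Phase one biallelic child genotype using only Mendelian inheritance.

    Each genotype is an unordered tuple of two alleles (0 or 1).
    Returns the phased genotype as 'paternal|maternal', or None if the
    genotypes are uninformative (for example, all three heterozygous)
    or inconsistent with Mendelian inheritance.
    """
    candidates = set()
    for pat in set(father):      # allele the father could have passed on
        for mat in set(mother):  # allele the mother could have passed on
            if sorted((pat, mat)) == sorted(child):
                candidates.add((pat, mat))
    if len(candidates) != 1:
        return None
    pat, mat = candidates.pop()
    return f"{pat}|{mat}"


# Father is homozygous reference, so the child's 1 allele must be maternal.
print(phase_child_by_inheritance((0, 1), (0, 0), (0, 1)))
# All three heterozygous: inheritance alone cannot decide the phase.
print(phase_child_by_inheritance((0, 1), (0, 1), (0, 1)))
```

The uninformative case, in which all three individuals are heterozygous, is where sequencing reads connecting the variant to neighboring, informative variants become necessary.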

3.4 Polyploid Phasing

The “polyphase” subcommand is available for phasing samples with ploidy greater than two [6] (see Note 15).

  1. 1.

    Follow the instructions for diploid phasing in Subheading 3.2, but leave out the actual phasing step. Also ensure that the variant caller uses the correct ploidy. For example, for tetraploids, the GT field needs to be of the form 0/0/1/2 (four alleles).

  2. 2.

    Run the polyploid phasing command (see Notes 16 and 17), providing the correct ploidy. For example, for a tetraploid sample:

    whatshap polyphase --indels --ploidy=4 --reference=ref.fasta -o phased.bcf variants.bcf longreads.bam

  3. 3.

    Unlike the diploid phasing command, output genotypes can differ from input genotypes if the distribution of alleles observed in the reads deviates too much from what is reported in the input variant file.

3.5 Computing Statistics of Phased Variant Files

  1. 1.

    Run whatshap stats phased.bcf to print statistics about a phased variant file. Statistics will be broken down by chromosome. Aggregate statistics for all chromosomes are printed last.

  2. 2.

    Alternatively, for visualizing the haplotype block structure in a genome browser such as the Integrative Genomics Viewer (IGV, [7]), use this version of the command to also produce a Gene Transfer Format (GTF) file (see Note 18):

    whatshap stats --gtf=phased.gtf phased.bcf

  3. 3.

    Statistics are shown only for the first sample in the variant file. If necessary, re-run the above command and add --sample=samplename to show statistics for a different sample.

  4. 4.

    To additionally produce output in tab-separated value (TSV) format, add the --tsv option:

    whatshap stats [other options] --tsv=stats.tsv phased.bcf

    The stats.tsv file will be detected by recent MultiQC [8] versions as containing WhatsHap statistics and will thus be included in its report.

  5. 5.

    When interpreting the output, keep in mind that the term block is used as a synonym for phase set. Singletons are phase sets with only one member. Their variants are considered unphased. The sum of the phased, unphased, and singleton variant counts is equal to the number of heterozygous variants. Block statistics (see Note 19) are given in terms of the number of variants in a block and also in terms of the number of bases a block covers on the reference (from its first to last variant).
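The relationship between these counts can be sketched in a few lines of Python. The function name `block_summary`, the input encoding, and the toy positions are hypothetical. The block length in bases is taken here as the last variant position minus the first, which is an assumption; whatshap stats may count the interval slightly differently.

```python
from collections import defaultdict


def block_summary(het_variants):
    """Summarize phase sets for one chromosome.

    het_variants: list of (position, phase_set) for heterozygous variants,
    where phase_set is None for variants without a PS tag.
    """
    blocks = defaultdict(list)
    unphased = 0
    for pos, ps in het_variants:
        if ps is None:
            unphased += 1
        else:
            blocks[ps].append(pos)
    # Phase sets with a single member are singletons, not real blocks.
    singletons = sum(1 for positions in blocks.values() if len(positions) == 1)
    real_blocks = {ps: sorted(p) for ps, p in blocks.items() if len(p) > 1}
    return {
        "heterozygous": len(het_variants),
        "phased": sum(len(p) for p in real_blocks.values()),
        "unphased": unphased,
        "singletons": singletons,
        "block_sizes_variants": [len(p) for p in real_blocks.values()],
        # Assumption: span measured as last minus first variant position.
        "block_sizes_bp": [p[-1] - p[0] for p in real_blocks.values()],
    }


toy = [(1000, 1000), (1500, 1000), (2100, None), (9000, 9000),
       (9400, 9000), (9900, 9000), (15000, 15000)]
summary = block_summary(toy)
print(summary)
```

In the toy input, five variants are phased in two blocks, one variant is unphased, and one is a singleton, which together account for all seven heterozygous variants.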

3.6 Assigning Reads to Haplotypes and Visualization in IGV

The whatshap haplotag subcommand takes an alignment file and, based on a phased variant file, tags each read with the haplotype it belongs to. The alignments being haplotagged do not have to be the ones that were used to obtain the phasing. We call this procedure haplotagging and show how to use it for visualization of phasing results in IGV.

  1. 1.

    Run the command to add the HP (haplotype), PC (quality), and PS (phase set) tags where possible to the alignments in an alignments.bam file, writing the result to a new file (see Notes 20–23):

    whatshap haplotag --reference=ref.fasta --output-haplotag-list=table.tsv -o tagged.bam phased.bcf alignments.bam

    Alignments that cannot be tagged will be written unchanged. The output file can therefore be used in downstream analyses instead of the original file without losing information.

  2. 2.

    Alternatively, process only a specific region by adding option --region (see Note 24):

    whatshap haplotag [other options] --region=chrom:start-end [...]

  3. 3.

    The --output-haplotag-list option causes haplotag to write information about how reads were tagged to a separate tab-separated value (TSV) file. The file contains columns read name, haplotype, phase set identifier, and contig name. The haplotype is H1, H2, or none in case the read could not be haplotagged. The file can be opened in a spreadsheet application or processed further on the command line. For example, use this command to see the total number of reads assigned to each haplotype:

    sed 1d table.tsv | cut -f 2 | sort | uniq -c

    Here, sed skips the first line, cut extracts the second column, and uniq -c counts how often each haplotag occurs. This prints a summary like this:

    132708 H1

    133225 H2

    21285 none

  4. 4.

    Index the created alignment file: samtools index tagged.bam

  5. 5.

    Open the reference and tagged.bam in IGV.

  6. 6.

    Right-click on the alignment track, choose “Color alignments by,” then “tag” and enter “HP.” The reads are now colored (blue and red in case of diploid samples) depending on the haplotype.

  7. 7.

    Open the same dialog, but enter “PS.” The reads are now colored depending on the phase set they are in.

3.7 Comparing Phased Variant Files

The compare subcommand compares two or more phased variant files. The key metric is the number of switch errors. A switch error occurs when the haplotype of a block no longer matches the haplotype in the other file it is compared to, but jumps to the other haplotype from that position onwards. For example, if the variants in a block in the first file have an encoded genotype of 0|1, 0|1, 1|0 and 0|1, then the first haplotype can be written as h1=0010 and the second one as h2=1101. When comparing this to the haplotype g1=0101 in another file, one switch error is counted because the first position in g1 matches the first position in h1, but the remaining alleles are switched so that they match the second haplotype h2.

Two consecutive switch errors are called a flip error. For example, comparing haplotype 0110 to 0010 gives one flip error because there are switch errors between positions 1 and 2 and between 2 and 3. The switch/flip decomposition in the whatshap compare output shows the number of switch and flip errors with the switch errors adjusted so that those contributing to flip errors are excluded.
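The two examples above can be reproduced with a short Python sketch. The function names and the greedy left-to-right pairing of adjacent switch errors into flips are assumptions made for illustration; the implementation in whatshap compare may treat edge cases differently.

```python
def switch_positions(truth, test):
    """Indices i where the phase relation changes between variant i-1 and i.

    truth and test are equal-length strings of 0/1, each giving one haplotype
    over the same heterozygous variants. Complementing either string leaves
    the result unchanged.
    """
    # same[i] is True if both haplotypes carry the same allele at variant i.
    same = [a == b for a, b in zip(truth, test)]
    return [i for i in range(1, len(same)) if same[i] != same[i - 1]]


def switch_flip_decomposition(truth, test):
    """Return (switches, flips), pairing adjacent switch errors into flips."""
    positions = switch_positions(truth, test)
    switches = flips = 0
    i = 0
    while i < len(positions):
        if i + 1 < len(positions) and positions[i + 1] == positions[i] + 1:
            flips += 1  # two adjacent switches isolate one misphased variant
            i += 2
        else:
            switches += 1
            i += 1
    return switches, flips


print(switch_flip_decomposition("0010", "0101"))  # h1 versus g1 from the text
print(switch_flip_decomposition("0110", "0010"))  # the flip error example
```

The first call reports one switch error and no flip error; the second reports no remaining switch error and one flip error.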

  1. 1.

    Compute switch errors and other metrics:

    whatshap compare --tsv-pairwise=eval.tsv truth.vcf phased.vcf

  2. 2.

    Inspect the output printed to the terminal or open the tab-separated value file eval.tsv in a spreadsheet program (see Notes 25–27).

4 Notes

  1. 1.

    Because the average distance between variants in the human genome is on the order of 1 kbp, using short paired-end reads instead of long reads will not produce good phasing results. The longer the reads, the better.

  2. 2.

    Correct metadata (in particular correct sample names) are required by many WhatsHap tools. Setting these already at the mapping stage (here with the -R option to bwa mem) is recommended. Ensure that the chosen sample name is unique for each sample.

  3. 3.

    Variant files and alignment files can contain multiple samples. For WhatsHap to be able to match these to each other, the sample names (metadata) in the variant file need to match the ones used in the alignment file. If unsure, run bcftools view -h in.vcf|grep '^#CHROM'|cut -f10- to see the sample names in the variant file, and samtools view -H in.bam|grep '^@RG' to see the read group (RG) headers in the alignment file. The text after “SM:” is the sample name.

  4. 4.

    A command-line option --no-reference is available to avoid having to provide the reference FASTA, but its use should generally be avoided as it makes results worse and cannot be combined with indel phasing.

  5. 5.

    WhatsHap phases all samples found in the variant file. Specify --sample=name to phase only one sample.

  6. 6.

    Input reads can be distributed over multiple alignment files as long as the read group (RG) headers set correct sample names. Use this to provide reads for multiple samples in separate files:

    whatshap phase [other options] variants.bcf sample1.bam sample2.bam sample3.bam

    Another use is to provide multiple files (such as from different sequencing technologies) for a single sample:

    whatshap phase [other options] variants.bcf pacbio.bam nanopore.bam mate-pairs.bam

    If a single pre-merged alignment file is available, it can be used instead.

  7. 7.

    In an unphased, diploid variant file, the genotype of a variant is stored as “0/1” in the GT tag of a sample. 0 encodes the reference (ref) allele and 1 encodes the first alternative (alt) allele. 0/1 means that the ref and first alt alleles were observed, but it is unknown which haplotype each is on. For phased diploid variants, the GT tag is “0|1” if the reference allele is on the first haplotype or “1|0” if it is on the second. The PS (phase set) tag is added to each phased variant. All variants with the same PS value belong to a set of variants (also called phase block) that are phased relative to each other. Within a phase set, which haplotype is designated the first or second is usually arbitrary: Swapping all 0|1 with 1|0 would represent the same phasing, except in pedigree phasing mode.

  8. 8.

    Multi-allelic variants in the input variant file (any variants with a comma in the ALT column, such as REF A and ALT C,G) are not supported and will currently be ignored. Use bcftools view -m 2 -M 2 -g het -v snps -s samplename input.bcf to show SNVs usable by whatshap phase.

  9. 9.

    To phase only single-nucleotide variants (SNVs), omit the --indels argument to whatshap phase.

  10. 10.

    Metadata in variant files can be corrected with bcftools reheader. Metadata in alignment files can be corrected with samtools addreplacerg. If the input alignment file contains only one sample, it is easier to provide WhatsHap with the option --ignore-read-groups instead of correcting metadata. If the variant file contains multiple samples, --sample=name is also required to specify which of the samples to work on.

  11. 11.

    Only a subset of reads is used when phasing: High coverage is automatically reduced to a subset of reads that are most promising to contribute to good phasing results. Reads not covering at least two heterozygous variants and low-quality reads are ignored.

  12. 12.

    Pedigree phasing is intended for pedigrees with up to five trio relationships (equivalent to the number of children) in the pedigree. For larger pedigrees, runtimes would be excessive. Instead, genetic haplotyping methods such as MERLIN [9] can be used and will yield highly accurate results.

  13. 13.

    Phasing in pedigree mode requires costs for recombination events. By default, a constant recombination rate across the chromosome is assumed. The recombination rate (in cM/Mb) can be changed with option --recombrate. The default value is 1.26 cM/Mb and is suitable for human genomes.

  14. 14.

    An alternative way to specify recombination rates is to provide region-specific rates through a genetic map file via option --genmap, formatted like this:

    position COMBINED_rate(cM/Mb) Genetic_Map(cM)

    568527 0 0

    721290 2.689 0.4103

    723819 2.822 0.4174

    The three columns give the position in bp, the local recombination rate between the given position and the position given in the previous row (in cM/Mb), and the cumulative genetic distance from the start of the chromosome (in cM). See the SHAPEIT [10] documentation (https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#gmap), from where this example was taken. As a genetic map file is chromosome-specific, --genmap has to be combined with --chromosome, which will phase only the specified chromosome.

  15. 15.

    Polyphase works with --ploidy=2, but the regular algorithm (the “phase” subcommand) should be preferred in this case.

  16. 16.

    Ploidy greater than six or read coverages exceeding 120X will lead to impractically long runtimes.

  17. 17.

    “Polyphase” supports many of the options that “phase” supports, including those for dealing with incorrect or incomplete metadata. However, polyploid phasing cannot be combined with pedigree phasing.

  18. 18.

    In the GTF, haplotype blocks are represented as “genes,” and interleaved blocks will appear as multiple “exons” (connected by thin horizontal lines in IGV).

  19. 19.

    The “Block NG50” statistic can only be computed if chromosome lengths are available, either from the variant file header or from a separate file specified via option --chr-lengths. Under some circumstances, for example, if mate-pair reads are used for phasing, phased blocks can be interleaved. These are cut before computing the NG50 in order to avoid artificially inflating it.

  20. 20.

    Chimeric reads, i.e., reads that do not align in one contiguous piece to the reference, may be represented by multiple records in the alignment file, only one of which is the main one and tagged by haplotag. Add the option --tag-supplementary to the haplotag subcommand to also tag the other (supplementary) read alignments. They will be assigned the same haplotype as the primary read alignment.

  21. 21.

    If the reads were generated using the “linked-read” technology by 10X Genomics and the alignment file contains the respective barcodes in the BX tags, WhatsHap haplotag will automatically detect and use that information. For reads that cannot be haplotagged on their own, the information that reads that are close and have identical BX tags likely come from the same haplotype will be leveraged to infer the correct haplotag. BX tags can be ignored with --ignore-linked-reads.

  22. 22.

    The distance cutoff used to merge reads with identical BX tags into the same “read cloud” can be adjusted with --linked-read-distance-cutoff=N. The default is N=50,000. Reads further apart than this will be considered to be in different read clouds even if they have the same BX tag value.

  23. 23.

    The haplotag command also needs correct metadata (see Notes 2 and 3). Like some other subcommands, it supports the options --ignore-read-groups and --sample=samplename (see Note 10) to deal with incorrect or missing sample names.

  24. 24.

    To tag reads on a single chromosome, use --region=chrom (leaving out start and end coordinates). Haplotagging can also be limited to a specific region by using a variant file that contains only the region(s) of interest and setting the option --skip-missing-contigs to prevent WhatsHap from raising an error about contigs/chromosomes missing from the VCF.

  25. 25.

    Comparison is restricted to variants that occur in both inputs and that are heterozygous in both.

  26. 26.

    Blocks of both files are split up such that each block covers the same set of variants in both files. Further statistics are computed for these intersection blocks. Blocks with a single variant (singletons) are counted but excluded from other computations.

  27. 27.

    When comparing blocks, complemented haplotypes (swapping all 0s with 1s and vice versa) are allowed. That is, haplotypes 0110 and 1001 are considered to be identical.