Erratum

During typesetting of the final version of the article [1], some of the additional files were swapped, although their legends remained correct:

  • Additional File 1 was published under the Additional File 3 hyperlink;

  • Additional File 2 was published under the Additional File 1 hyperlink;

  • Additional File 3 was published under the Additional File 2 hyperlink;

  • Additional Files 4–9 are all correct.

The editors apologize for the clerical mistake that led to the mismatch of the additional files.

Additionally, based on input from readers, the authors wish to expand some of the descriptions of the experimental procedures in Additional File 1. In the section describing the mutation status analysis, the authors added the following:

“For the alignment, pairs of FASTQ files (i.e. R1 & R2) sequenced from the same sample were aligned separately using bwa aln and bwa sampe (default parameters) to the hg19 (GRCh37) reference. Each pair of FASTQ files generated a single BAM file. Individual BAM files from the same sample were merged to generate a single BAM file representing all reads from the sequencing run. Using the GATK routine CountCovariates, the merged BAM file was subsequently analyzed to generate the covariates necessary to perform base quality recalibration. Briefly, this routine searches for mismatching bases in reads that do not overlap known heterozygous sites (1000 Genomes + dbSNP) and collects information on the mismatching base’s quality and a series of other covariates (e.g. base quality, read group, neighboring bases, sequencing cycle). Using the GATK routine TableRecalibration, the recalibration metrics obtained from CountCovariates were used to recalibrate all base qualities from the BAM file. This step is necessary as the base qualities generated by the sequencer often inaccurately reflect the true frequency of mismatching bases. The BAM files with base quality recalibration are the files used in all post-processing steps.

For mutation calling, allele counts and their associated base qualities were collected for each individual cell line. Only alleles fulfilling the following criteria were used in subsequent steps: base quality (BQ) ≥ 10; neighborhood base quality (NBQ) ≥ 10; mapping quality of associated read (MQ) ≥ 20; and its associated read is not a duplicate. Any base quality exceeding the read’s mapping quality is reduced to the read’s mapping quality. Positions with fewer than two reads supporting any non-reference allele were deemed homozygous reference and excluded from further analyses. The likelihoods of all possible genotypes (AA, AT, AC, etc.) given the allelic data collected for the cell line were computed using the MAQ error model originally defined in (11) and now available in the samtools source code. The genotype likelihoods were then used in a Bayesian model incorporating a prior probability on the reference and the heterozygous rate of the human genome. The genotype with the highest likelihood given the data was chosen as most likely. No further analysis was performed at this position for a homozygous reference genotype.
Otherwise, the following metrics were computed at the variant position and used for post-processing filtering of all putative variants: DP: Total read depth; AD: Depth or coverage for all alleles, including alleles not in genotype; BQ: Average base quality of each allele; MQ: Average mapping quality of reads supporting each allele; MQ0: Number of mapping quality zero reads overlapping position; MQL: Number of ‘low’ mapping quality reads overlapping position; NAHP: Average number of adjacent homopolymer runs on either side of each allele in genotype; MAHP: Longest adjacent homopolymer run on either side of each allele in genotype; AMM: Average number of mismatches in reads supporting each allele; MMQS: Average sum of the base qualities for all mismatching bases; DETP: Average effective distance to 3’ end of read for each allele, normalized by read length; LD/MD/RD: Number of reads supporting each allele where the allele is located in the left-most third of read, middle-third of read, or right-most third of read, respectively; LDS/MDS/RDS: Strand-aware version of above; SB: Number of reads supporting each allele aligned to the forward strand; and PN/NN: Previous and next nucleotides in reference.

Since no normal control is available for our cell lines, all variants were considered germline and the genotype’s log-likelihood was used to compute a Phred-scaled quality/confidence of the germline variant. All putative variants and associated metrics were converted to the VCF format, with the following filters applied to each variant: conf: Genotype quality ≥ 100; dp: Total depth ≥ 8; mdp: Maximum depth < 800; mq0: MQ0 < 5; mql: MQL < 5; sb: Mutant allele strand bias p-value > 0.005 (Binomial test); mmqs: MMQS ≤ 20; amm: AMM ≤ 1.5; detp: 0.2 ≤ DETP ≤ 0.8; ad: AD of mutant allele ≥ 4; and ma: More than two alleles have read support ≥ 2. Variants that passed all filters were marked PASS in the FILTER column of their VCF record. Otherwise, the names of all filters that the variant failed were recorded in the FILTER column.

Read coverage was calculated using a dynamic windowing approach that expands and contracts the window’s genomic width according to the local read density in the sample’s sequence. When the window’s read count exceeds a user-defined threshold, the window’s size and location, the raw read count, N, and the average coverage of the window, N / window size, were recorded.”
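
For readers who want to see how the variant filters listed in the quoted description combine into a FILTER value, the following is a minimal Python sketch and not the authors' code. The class `VariantMetrics`, its field names, and the function `filter_column` are hypothetical names introduced here for illustration. The sketch assumes that the strand bias p-value has already been computed by a binomial test, that "maximum depth" refers to the total read depth, and that failed filter names are joined with semicolons as in the VCF convention. The `ma` filter is left out because the quoted text does not make its pass/fail direction unambiguous.

```python
from dataclasses import dataclass


@dataclass
class VariantMetrics:
    """Hypothetical container for the per-variant metrics used by the filters."""
    genotype_quality: float    # Phred-scaled genotype quality
    total_depth: int           # DP: total read depth
    mq0: int                   # MQ0: reads with mapping quality zero
    mql: int                   # MQL: reads with 'low' mapping quality
    strand_bias_pvalue: float  # assumed precomputed with a binomial test
    mmqs: float                # MMQS: average sum of mismatching base qualities
    amm: float                 # AMM: average mismatches in supporting reads
    detp: float                # DETP: normalized distance to 3' end of read
    mutant_allele_depth: int   # AD of the mutant allele


# Each entry maps a filter name to a predicate that is True when the variant
# satisfies the threshold quoted in the text. The 'ma' filter is omitted.
FILTERS = {
    "conf": lambda v: v.genotype_quality >= 100,
    "dp": lambda v: v.total_depth >= 8,
    "mdp": lambda v: v.total_depth < 800,  # assumption: applies to total depth
    "mq0": lambda v: v.mq0 < 5,
    "mql": lambda v: v.mql < 5,
    "sb": lambda v: v.strand_bias_pvalue > 0.005,
    "mmqs": lambda v: v.mmqs <= 20,
    "amm": lambda v: v.amm <= 1.5,
    "detp": lambda v: 0.2 <= v.detp <= 0.8,
    "ad": lambda v: v.mutant_allele_depth >= 4,
}


def filter_column(variant: VariantMetrics) -> str:
    """Return 'PASS' or the semicolon-joined names of the failed filters."""
    failed = [name for name, passes in FILTERS.items() if not passes(variant)]
    return "PASS" if not failed else ";".join(failed)


if __name__ == "__main__":
    good = VariantMetrics(150, 40, 0, 1, 0.4, 12.0, 0.8, 0.5, 10)
    poor = VariantMetrics(50, 5, 0, 1, 0.4, 12.0, 0.8, 0.9, 3)
    print(filter_column(good))
    print(filter_column(poor))
```

Keeping the thresholds in a single ordered mapping means the FILTER string lists failed filters in the same order as the quoted description, which makes individual records easy to check against the text.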

The correct Additional Files 1, 2 and 3, which include the expanded description of the methods, are published below.