Detecting clinically actionable variants in the 3′ exons of PMS2 via a reflex workflow based on equivalent hybrid capture of the gene and its pseudogene
- 1k Downloads
Hereditary cancer screening (HCS) for germline variants in the 3′ exons of PMS2, a mismatch repair gene implicated in Lynch syndrome, is technically challenging due to homology with its pseudogene PMS2CL. Sequences of PMS2 and PMS2CL are so similar that next-generation sequencing (NGS) of short fragments—common practice in multigene HCS panels—may identify the presence of a variant but fail to disambiguate whether its origin is the gene or the pseudogene. Molecular approaches utilizing longer DNA fragments, such as long-range PCR (LR-PCR), can definitively localize variants in PMS2, yet applying such testing to all samples can have logistical and economic drawbacks.
To address these drawbacks, we propose and characterize a reflex workflow for variant discovery in the 3′ exons of PMS2. We cataloged the natural variation in PMS2 and PMS2CL in 707 samples and designed hybrid-capture probes to enrich the gene and pseudogene with equal efficiency. For PMS2 exon 11, NGS reads were aligned, filtered using gene-specific variants, and subject to standard diploid variant calling. For PMS2 exons 12–15, the NGS reads were permissively aligned to PMS2, and variant calling was performed with the expectation of observing four alleles (i.e., tetraploid calling). In this reflex workflow, short-read NGS identifies potentially reportable variants that are then subject to disambiguation via LR-PCR-based testing.
Applying short-read NGS screening to 299 HCS samples and cell lines demonstrated >99% analytical sensitivity and >99% analytical specificity for single-nucleotide variants (SNVs) and short insertions and deletions (indels), as well as >96% analytical sensitivity and >99% analytical specificity for copy-number variants. Importantly, 92% of samples had resolved genotypes from short-read NGS alone, with the remaining 8% requiring LR-PCR reflex.
Our reflex workflow mitigates the challenges of screening in PMS2 and serves as a guide for clinical laboratories performing multigene HCS. To facilitate future exploration and testing of PMS2 variants, we share the raw and processed LR-PCR data from commercially available cell lines, as well as variant frequencies from a diverse patient cohort.
KeywordsPMS2 PMS2CL Lynch syndrome Hereditary cancer screening Reflex testing
Genome in a Bottle
Hereditary cancer screening
Short insertions and deletions
Multiplex ligation-dependent probe amplification
New England Biolabs
Reverse transcription PCR
Turn around time
Variant of uncertain significance
Individual genomic variants inherited through the germline account for approximately 5% to 10% percent of cancer [1, 2, 3]. This heritable component can increase risk for malignancies across a range of tissues [4, 5]—such as breast, colorectal, pancreatic, and prostate—and is associated with pathogenic variants in >100 genes . To assess patients’ risk for such cancers, hereditary cancer screening (HCS) typically uses targeted next-generation sequencing (NGS) to detect relevant variants in the coding regions and select noncoding regions on a multigene testing panel.
In most genomic regions interrogated by HCS panels, NGS alone is sufficient to yield high sensitivity and specificity [7, 8]; high accuracy is critical for HCS because test results prompt patients to alter their clinical-management decisions [9, 10]. In a minority of regions, however, standard NGS strategies that use hybridization to capture and sequence short DNA fragments could incorrectly identify genotypes. Genes that pose particular challenges often have homologous sequences (e.g., pseudogenes) elsewhere in the genome that are captured and sequenced along with the gene itself, complicating alignment and the identification of variants specific to the gene.
Multiple testing strategies exist that can achieve high sensitivity and specificity in the last five exons of PMS2 [18, 19, 20, 22, 25, 26]. Performing LR-PCR, MLPA, and hybrid-capture NGS on each screened sample was presented previously on a small cohort , but applying this combination to a large patient population would be resource intensive and complicate workflow logistics. Herman et al. recently presented a method for identifying CNVs (but neither SNVs nor indels) in the terminal exons of PMS2 or PMS2CL . The method identified samples for follow-up LR-PCR testing to definitively localize the CNV to the gene or pseudogene. The authors noted a CNV false positive rate of 6.8%, meaning that a significant portion of CNV-negative samples would unnecessarily undergo follow-up testing.
Here we present a reflex strategy for detection of SNVs, indels, and CNVs in the last five exons of PMS2. Our aim was to have the workflow’s initial testing phase (i.e., upstream of reflex) be sensitive enough to maximize detection of PMS2 variants and sufficiently specific to minimize reflex burden. The proposed workflow applies hybrid-capture NGS to all samples and LR-PCR/MLPA only as a reflex assay. As the validity of LR-PCR in the last five exons of PMS2 is established [20, 21], we sought primarily to evaluate the performance of the hybrid-capture NGS assay via comparison to LR-PCR results from 299 clinical and cell line samples. We found that the workflow has high analytical accuracy while requiring reflex testing for only 8% of samples. Because our development of this workflow (schematized in Additional file 1: Figure S1) required collection of sequencing data and calculation of variant frequencies from a complicated genomic region with important impact on human health, we have made this information publicly available.
This study was reviewed and designated as exempt by Western Institutional Review Board and complied with the Health Insurance Portability and Accountability Act (HIPAA).
Additional file 2: Table S1 indicates which sample sets were used for particular assays and analyses. Cell-line DNA was purchased from Coriell Cell Repositories (Camden, NJ) (Additional file 2: Table S2). Patient sample DNA was extracted from de-identified blood or saliva samples from patients who underwent Counsyl HCS testing. DNA samples with known positives were a gift from the Invitae Corporation.
DNA was extracted and underwent an additional cleanup via incubation with 1× SPRI beads followed by 80% ethanol wash and elution into TE (10 mM Tris-HCl, 1 mM EDTA, pH 8.0). Approximately 300 ng of eluted DNA served as the template in separate gene- and pseudogene-specific LR-PCR reactions with the following final concentrations: 1× LongAmp Taq Reaction Buffer (New England Biolabs, NEB), 0.3 mM dNTPs, 1 μM of a gene- or pseudogene-specific forward primer, 1 μM of common reverse primer LRPCR_Unv_R (all primer sequences in Additional file 2: Table S3), 0.25% Formamide, and 5 units LongAmp Hot Start Taq DNA Polymerase (NEB). Reactions including the gene-specific forward primer PMS2_LRPCR_F yielded a ~ 17 kb amplicon spanning PMS2 exons 11–15 (the forward primer targets exon 10), whereas use of the pseudogene-specific forward primer PMS2CL_F amplified ~ 18 kb from PMS2CL (spans region upstream of PMS2CL through exon 6). Thermal-cycling involved initial denaturation at 94 °C for 5 min followed by 30 cycles of 94 °C for 30 s and 65 °C for 18.5 min. Final elongation was 18.5 min at 65 °C, followed by a 4 °C hold. Quality of LR-PCR amplicons was assessed using 0.5% agarose gel electrophoresis and quantification with the broad range Qubit assay kit (Thermo Fisher).
Two different library-prep strategies were used to prepare LR-PCR amplicons for NGS. In the first, applied to patient samples, LR-PCR amplicons were fragmented by adding 2 μL NEBNext dsDNA Fragmentase and NEBNext dsDNA Fragmentase Reaction Buffer v2 (1× final, NEB) to the remaining LR-PCR reaction volume, and then incubated at 37 °C for 25 min. Addition of 100 mM EDTA stopped the reaction, which underwent cleanup with 1.5× SPRI beads, followed by 80% ethanol wash and elution in TE. Fragmentation quality was assessed via Bioanalyzer (Agilent) with the High Sensitivity DNA kit. NGS library prep included end repair, A-tailing, and adapter ligation. Samples were PCR amplified with KAPA HiFi HotStart PCR Kit (Kapa Biosystems) for 8–10 cycles with barcoded primers with the following thermal cycling: initial denaturation at 95 °C for 5 min followed by cycles of 98 °C for 20 s, 60 °C for 30 s, and 72 °C for 30 s. The last elongation was 5 min at 72 °C, followed by 4 °C hold. Library quality was verified via Bioanalyzer with a High Sensitivity DNA kit and the concentration was measured with absorbance via a microplate reader (Tecan Infinite M200 PRO).
The second approach to prepare LR-PCR amplicons for NGS—applied to the 155 cell-line samples—entailed fragmenting and inserting adapters into LR-PCR amplicons via tagmentation. Two duplex adapters were created by annealing single-stranded oligonucleotides: one duplex adapter had the Unv_Tn5_oligo (all primer sequences in Additional file 2: Table S3) annealed to Oligo A; the other duplex adapter had the Unv_Tn5_oligo annealed to Oligo B. The two separate annealing mixes included 25 μM of each oligonucleotide in the duplex plus 1× annealing buffer (10 mM Tris-HCl, 50 mM NaCl, 1 mM EDTA, pH 8.0). The reaction was denatured at 95 °C for 2 min, incubated at 80 °C for 60 min, stepped down in temperature by 1 °C every minute until reaching 20 °C, and then held at 4 °C. Adapters were loaded into the Tn5 enzyme during a 30 min incubation at 37 °C with 0.15 units of Robust Tn5 Transposase (kit from Creative Biogene), 1.25 μM of each adapter, and 1× TPS buffer. LR-PCR amplicons were subjected to tagmentation with the Tn5-adapter construct. Tagmentation reactions occurred at 56 °C for 10 min in 1× LM Buffer, with 0.5 μL of loaded Tn5 and 1–2 ng of DNA from each LR-PCR reaction. After incubating, SDS (0.02% final) was added to each reaction and incubated for 5 min to dissociate Tn5 from the DNA. Tagmention cleanup with 1× SPRI beads preceded molecular barcoding and amplification via PCR to generate NGS libraries. The PCR reaction included 1 unit Kapa HiFi Polymerase (Kapa Biosystems), 1× HiFi Buffer, 375 μM dNTPs, 0.5 μM of each primer, and the cleaned-up tagmented sample. Cycling started with gap-filling at 72 °C for 3 min and followed with 10 cycles of denaturation at 98 °C for 30 s, annealing at 63 °C for 30 s, and extension at 72 °C for 3 min. Cleanup of NGS libraries was performed with 1× SPRI beads.
For patient samples, LR-PCR libraries were sequenced on a HiSeq 2500 (Illumina) in rapid run mode (paired reads, 150 cycles each). For cell line samples, LR-PCR libraries were sequenced on a NextSeq 550 (Illumina) to a minimum depth of 500 reads (single read, 150 cycles).
Hybrid capture and sequencing
Targeted NGS was performed as described previously [7, 8]. Briefly, DNA from a patient’s blood or saliva sample was isolated, quantified by a dye-based fluorescence assay, and then fragmented to 200–1000 bp by sonication. Fragmented DNA was converted to an NGS library by end repair, A-tailing, and adapter ligation. Samples were then amplified by PCR with barcoded primers, multiplexed, and subjected to hybrid capture-based enrichment with 40-mer oligonucleotides (Integrated DNA Technologies) complementary to regions common between PMS2 and PMS2CL. NGS was performed on a HiSeq 2500 with mean sequencing depth of ~ 500× for the whole panel (coverage in PMS2 is ~ 1000×). All target nucleotides are required to be covered with a minimum depth of 20 reads.
For hybrid-capture data, in order to aggregate PMS2- and PMS2CL-originating reads at the PMS2 locus in the reference genome, paired-end NGS reads were first aligned to the hg19 human reference genome using BWA-MEM (version 0.17) . The alignment at PMS2 exon 11 was filtered to only include reads that overlapped with a site of known difference between gene and pseudogene. Reads that aligned to PMS2 exons 12–15 and reads that aligned to PMS2CL exons 3–6 were partitioned into a BAM file using samtools . The BAM file was converted into two unaligned FASTQ files (each member of the read pair parsed to one of the two files) using Picard (Broad Institute). Each single-end FASTQ file was separately realigned to the hg19 genome using BWA-MEM allowing for ambiguous alignments. In order to ensure each read aligns to both PMS2 and PMS2CL, the BWA-MEM parameters were set to the following values for 115 bp reads: “-a” to emit all alignments; seed length of 11; gap open penalty of 2; mismatch penalty of 1; and an alignment score threshold of 20. The resulting single-end alignments were used to generate a paired-end alignment in the following manner: 1) both single-end reads had the same read name; 2) both single-end reads mapped to the region spanning PMS2 exons 12–15; 3) both single-end reads aligned within 1000 bp of each other, and 4) when multiple putative pairs met the above conditions for a given read name, the pair with the highest alignment score was chosen. Reads that could not form proper pairs as described above were discarded. The resulting paired-end BAM file contained reads originating from both PMS2 and PMS2CL mapped to the PMS2 sequence.
For RT-PCR data (described below) and LR-PCR data, NGS reads were aligned to the hg19 genome sequence in which the PMS2CL sequence was removed, thereby aggregating genic and pseudogenic reads in PMS2.
SNV and Indel calling
For the PMS2 region into which reads from PMS2 and PMS2CL were mapped (see above), SNVs and short indels were identified using GATK 4.0 HaplotypeCaller  with the sample-ploidy option set to four, the max-reads-per-alignment-start option off, and the min-pruning option set to one. For the diploid PMS2 exon 11 region, SNVs and short indels were identified using GATK 1.6  and FreeBayes . For diploid SNV calling in the LR-PCR data, GATK 1.6 was similarly used. For the LR-PCR sample in which we suspected allelic dropout (see Discussion), AB was determined by visual inspection of the NGS data in the Integrative Genomics Viewer .
For short-read NGS data of hybrid-captured fragments, CNVs in PMS2 exon 11 were determined by measuring the relative NGS read depth at target positions using the algorithm described previously . To call CNVs in PMS2 exons 12–15 from BAM files in which PMS2- and PMS2CL-originating reads were positioned in the PMS2 sequence (see “Read Alignment” above), two modifications to the CNV calling algorithm were made: 1) the expected wild type copy number was changed from two to four copies, and 2) pCNV, the parameter determining how likely the HMM is to transition from a wild-type to a CNV state, was set to 0.01 to obtain high CNV sensitivity and specificity from empirical data.
For CNV calling from LR-PCR data, read depth was counted in equal-sized bins (50 bp) that tile the amplicon. Bin counts for each sample were normalized by the median bin depth of the sample; next, each bin’s values were normalized by the median of the bin. The same bins were used for corresponding regions of PMS2 and PMS2CL. The resulting binned and normalized data were searched for CNVs using the algorithm described previously . CNV no-calls were manually reviewed to resolve status as positive or negative.
Single-copy duplications and deletions were introduced by modifying the number of observed reads in one of the CNV-negative samples in a given batch of samples, as described previously . For PMS2 exons 12–15, where baseline copy-number was four, single-copy deletions and duplications were introduced by subsampling reads to 75% or scaling read number by 125%, respectively. Simulated CNVs were created for every possible contiguous combination of exons in the last 4 exons in PMS2. For each CNV size and position, 2186 samples were simulated and tested via the CNV calling algorithm, and sensitivity was calculated as the percentage of the synthetic CNVs that were correctly detected. CNVs were simulated separately in PMS2 exon 11, which had a baseline copy number of two, because pseudogenic reads were filtered from the genic sequence.
Tetraploid Indel simulations
Indels in a tetraploid background (relevant for exons 12–15 of PMS2, where gene- and pseudogene-originating reads were remapped) were simulated to better test indel-calling sensitivity using GATK4. Two diploid alignments, at least one of which was previously determined via HCS testingto contain an indel, were merged to create a tetraploid alignment. If one of the samples had more reads than the other in the 100 bp region centered on the indel, reads were binomially downsampled such that each merged diploid sample had approximately the same number of aligned reads. Indels were then called from these synthetic tetraploid alignments using GATK4 as described in section SNV and Indel Calling above.
For all variants in the last five exons of PMS2, variant interpretation was performed in accordance with American College of Medical Genetics and Genomics (ACMG) criteria using a 5 tier classification category system (Benign, Likely Benign, Variant of Uncertain Significance, Likely Pathogenic, Pathogenic) . Classifications were made using evidence available in the published literature and publicly available databases. Allele-frequency based rules were not used because of potentially inaccurate PMS2 variant identification in population databases. Variant classifications were reviewed and approved by board-certified laboratory directors.
MLPA was performed according to manufacturer’s protocol (MRC Holland, probemix P008-C1 PMS2 protocol issued 12/11/17 and MLPA General Protocol issued on 3/23/18). Generally, genomic DNA was covered with mineral oil to reduce evaporation during hybridization and ligation; next, DNA was denatured for 5 min at 98 °C and then held at 25 °C. Hybridization reagents and probemix were added to the samples and incubated at 95 °C for 1 min followed by 16–20 h at 60 °C. Probe pairs that bind target DNA at adjacent positions were ligated for 15 min at 54 °C and then amplified via PCR for 35 cycles. Amplified probes were mixed with ROX ladder and formamide and then separated on a capillary electrophoresis instrument. Coffalyser software (MRC Holland) normalized PMS2 probe intensities to those of the reference probes first within each sample and then among samples. Normalized probe intensities of each sample were compared to the average intensities of the reference samples; Coffalyser emitted CNV calls in the region.
Reflex rate estimate
The reflex rate was estimated using SNV-, indel-, and CNV-specific reflex rates from the LR-PCR and hybrid-capture data and subsequently extrapolating to a large cohort size using Markov Chain Monte Carlo simulations with pymc .
Distinguishing Base analysis
NGS reads from LR-PCR amplicons from PMS2 and PMS2CL were aligned to PMS2, and variants were called with GATK UniversalGenotyper as described in section SNV and Indel Calling above. Sites were considered reliable if variants were homozygous for the reference allele in the PMS2-specific amplicon and homozygous for an alternate allele in the PMS2CL-specific amplicon (as aligned to PMS2) in 100% of samples.
RNA extraction and reverse transcription
RNA was extracted from 33 samples with the Agencourt RNAdvance Blood kit (Beckman Coulter) from 400 μL of whole blood following the manufacturer’s instructions. RNA was extracted from blood tubes no more than seven days after blood draw was performed. Extraction quality was assessed with the RNA 6000 Nano kit (Agilent). RNA was quantified with Qubit HS RNA Assay kit (Thermo Fisher).
RNA was reverse transcribed using Superscript II Reverse Transcriptase with oligo-dT and random hexamers as primers (kit from Thermo Fisher). Reactions were performed as follows: 0.1–1.0 μg total RNA, 1.25 μM of both random hexamers and oligo-dT primer, 0.8 mM dNTPs, and water up to a final volume of 12 μL. Reactions were heated at 65 °C for 5 min and then chilled on ice for 5 min. 1× first-strand buffer and 0.01 M DTT were added to each reaction and incubated at 42 °C for 2 min. 10 U/μL Superscript II Reverse Transcriptase was added to each reaction and incubated at 42 °C for 50 min, then heat inactivated at 72 °C for 15 min. A positive control of pooled mRNA (Stratagene, Catalog #750500–41) was used with each reverse transcription reaction.
Following reverse transcription, RNA was hydrolyzed with 2 μL 1 N NaOH and heated at 95 °C for 5 min. 4 μL of 1 M Tris-HCL pH 7.5 was used to neutralize the reaction for downstream processing. Qubit ssDNA Assay kit (Thermo Fisher) was used to quantify cDNA.
For each sample, two reactions were set up: 1) forward primer PMS2_RNA_F and reverse primer RNA_Unv_R amplified 1.5 kb of PMS2 from cDNA and 2) forward primer PMS2CL_F and reverse primer RNA_Unv_R amplified 1.5 kb of PMS2CL from cDNA (primer sequences in Additional file 2: Table S3). PCR reactions contained 1× LongAmp Taq Reaction Buffer (NEB), 0.3 mM dNTPs, 1 μM of each forward and reverse primer, 20–70 ng cDNA, 0.1 U/μL LongAmp Taq DNA polymerase (NEB), and water up to 25 μL. Thermocycling was as follows: 94 °C for 5 min, 30 cycles of 94 °C for 30 s, annealing at 52 °C for PMS2 and 55 °C for PMS2CL, 65 °C for 2 min, followed by a final extension at 65 °C for 10 min and then a 4 °C hold. PCR products were cleaned with 1.2× SPRI beads. Amplicons were visualized with a 2% agarose gel or with the DNA 7500 kit (Agilent).
50–100 ng of each amplicon were fragmented in 50 μL volumes with a Bioruptor (Diagenode) for 12 cycles, 30 s on and 90 s off. Fragmentation was visualized with High Sensitivity DNA kit (Agilent). All fragmented material was used as input for library preparation. KAPA Hyper Prep kit (Kapa Biosystems) was used for library preparation, and manufacturer instructions were followed. Adapters were diluted to 15 μM for PMS2 and 3 μM for PMS2CL. Nine cycles of enrichment PCR were performed. Samples were quantified using absorbance measurements (Tecan M200), normalized to 10 nM, and consolidated into one reaction. The final library was quantified with qPCR using KAPA Library Quantification Kit (Kapa Biosystems) and sequenced on the NextSeq 550 System (Illumina) for 75 cycles single read with dual indexing.
Basecall files were converted to FASTQ files using bcl2fastq (Illumina). FASTQ files were aligned using STAR .
Metrics were defined as follows: Sensitivity = TP/(TP + FN); Specificity = TN/(TN + FP). The CIs were calculated by the method of Clopper and Pearson . For SNVs and indels, true negatives were defined as concordant negative results observed at sites found to be polymorphic in our cohort (positions at which we observed non-reference bases in at least one sample).
Zero nucleotides can reliably distinguish exons 12–15 of PMS2 from PMS2CL
NGS of short DNA fragments would only be able to identify PMS2-specific variants in the last five exons if the fragments themselves could be unambiguously aligned to the gene or pseudogene. To overcome pseudogene interference, unique mapping would rely on the bases that differ between PMS2 and PMS2CL. In the hg19 reference genome, these distinguishing bases are scarce (Fig. 1d, light bars): sequence identity in each of the last five exons of PMS2 (padded with 20 nt of intronic sequence) exceeds 97%, and the differences comprise only 26, 0, 1, 1, and 0 bases in exons 11 through 15, respectively. Further, previous reports noted that natural variation may suppress the reliability of these distinguishing bases represented in the reference genome [17,18].
Summary of samples
LR-PCR + NGS
718 patient samples
155 cell-line samples
Hybrid capture + NGS
144 patient samples
155 cell-line samples
3 known positives
4 patient samples
4 cell-line samples
RT-PCR + NGS
Reflex workflow to disambiguate variants discovered with short-read NGS
The workflow employs different bioinformatics strategies for PMS2 exon 11 and for the group of exons 12–15 (Fig. 2b, blue box). For exon 11, we identified PMS2-specific variants by tailoring the read-alignment software to partition reads to PMS2 or PMS2CL based on the gene- and pseudogene-distinguishing bases. By contrast, for PMS2 exons 12–15, reads are aligned with permissive settings such that each read will align to both its best genic location and its best pseudogenic location (see Methods). For the typical sample with two copies each of PMS2 and PMS2CL, this approach effectively provides read depth in each location corresponding to four copies. To identify SNVs, indels, and CNVs, we adjust the variant calling software such that it anticipates a baseline ploidy of two in exon 11 and four in exons 12–15 (Fig. 2b, blue and green boxes).
Disambiguation via reflex testing is only required for a subset of variants based on their type and clinical interpretation (Fig. 2b, orange box). As such, variant interpretation is performed prior to reflex testing. Benign variants are not reflex tested nor reported to patients. Samples with CNVs in any of the last five exons of PMS2 that are classified as pathogenic, likely pathogenic, or variants of uncertain significance (VUS) undergo reflex testing for disambiguation. Samples with non-benign SNVs or indels in exons 12–15 are reflex tested for disambiguation, but samples with such variants in exon 11 are simply reported without reflex due to unique read mapping in that exon. Disambiguation testing for SNVs, indels, and CNVs can be performed via LR-PCR followed by sequencing to determine if the variant came from PMS2 or PMS2CL; MLPA can assist resolution of CNVs .
Executing the proposed workflow resolves cancer risk associated with the last five exons of PMS2 for the majority of samples with short-read NGS alone. For each of the 707 patient samples that underwent LR-PCR (Table 1), we performed variant classification on the results and found that nearly 93% could forgo reflex testing. The remaining ~ 7% would have required subsequent testing to yield confident PMS2 screening results (Fig. 2a). The SNV- and indel-specific component of this reflex rate was 41/707 (5.8%), and the reflex rates due to CNV calls and no-calls were 2/707 (0.3%) and 1/144 (0.7%), respectively. Using simulations (see Methods), we estimated the reflex rate on a larger cohort of 13,000 patients to be 7.7% (95% CI: 5.4–10.7%). We expect the 0.7% contribution to the reflex rate from samples with CNV no-calls to be an upper-bound estimate because our standard practice of retesting such samples at least once on short-read NGS typically yields a confident negative call (data not shown), thereby avoiding reflex testing. Therefore, we anticipate the overall reflex rate of the proposed workflow would be less than 8%.
Short-read NGS accurately identified samples needing reflex testing for SNVs and indels
In sum, our comparison of SNV and indel calls between LR-PCR and short-read NGS suggests the pre-reflex step of our proposed workflow achieves sufficient analytical sensitivity and specificity to be considered for clinical use.
Accurate detection with short-read NGS of samples needing CNV reflex testing
CNV simulations demonstrate high analytical sensitivity
High sensitivity for CNVs must not come at the expense of low specificity, prompting us to measure the CNV false-positive rate in our large cohort. In our hybrid capture cohort of 302 samples, there was one no-call, which we treat as a false positive. Therefore, sample-level specificity is 99.7% (95% CI: 98.2–100%).
Based on these analyses, we conclude that short-read NGS—as optimized in our described workflow—can achieve >96% sensitivity and >99% specificity for detecting samples with CNVs in the terminal five exons of PMS2.
Gene- and pseudogene-specific variant information for common cell lines
Reference cell lines with known genotypes facilitate development and validation of novel molecular diagnostic methods, yet samples with high-quality genotypes in the PMS2 region are generally unavailable due to the region’s complicated nature. In the course of developing and testing the workflow characterized above, we performed NGS of both hybrid-capture fragments and LR-PCR amplicons on cell lines where high-quality genome assemblies were publicly available from whole-genome sequencing with ~ 30× depth (Illumina Polaris 1 diversity panel) or from the genome in a bottle (GIAB) consortium [38, 39]. Importantly, Additional file 4: Figure S2 shows that the gene-specific genotypes we observed differed from the Polaris and GIAB data (including phased data on GIAB samples; Additional file 4: Figure S2C). In principle, such differences could arise due in part to errors in either dataset, for example through biological contamination, non-specific amplification, non-specific sequence alignment, or technical processing errors by the chosen genotyping software. However, the concordance between orthogonal hybrid-capture and LR-PCR assays suggests that the genotypes we report here are correct. Further, as a third orthogonal method, we also genotyped PMS2 and PMS2CL from RNA extracted from 33 of the LR-PCR samples (see methods). The RNA-derived genotypes were concordant with the LR-PCR data (Additional file 5: Figure S3), strongly suggesting that we elucidated correct gene- and pseudogene-specific genotypes. To aid scientific research and clinical development of PMS2 and its role in lynch syndrome, we share the gene- and pseudogene-specific variant information. For patient samples, to share valuable data while being mindful of patient consent and PHI compliance, we provide variant frequencies (Additional file 3: Table S4). For cell lines, we share variant frequencies, as well as BAM and VCF files for the LR-PCR amplicons spanning the last five exons of PMS2 and PMS2CL (Additional file 6: Table S5 and in ENA accession #PRJEB27948)
Here we show that a reflex workflow starting with short-read NGS and reflexing to LR-PCR and/or MLPA can detect sequence variants in the last five exons of PMS2 with high analytical sensitivity (>99% for SNVs/indels; >96% for CNVs) and specificity (>99% for SNVs/indels/CNVs). In isolation, short-read sequencing would be incapable of attributing variants to PMS2 or PMS2CL, but it is proficient both at resolving SNVs and indels in PMS2 exon 11 and at flagging samples with other variants whose origin requires disambiguation via reflex testing. In addition to presenting and testing a comprehensive and plausible workflow, we resolved and have shared the PMS2- and PMS2CL-specific genotypes and allele frequencies of many hundreds of clinical and cell line samples. Together, the contributions described herein may advance understanding of PMS2 and facilitate routine screening for Lynch syndrome in HCS offerings.
A high reflex rate after short-read NGS testing (e.g., >10%), while acceptable for the accuracy of a patient’s report, may exert unmanageable logistical overhead on the testing laboratory. The reflex rate has two components—one biological and one technical—each with different sources and constraints. The biological component serves as the floor of the reflex rate: if the assay had perfect analytical specificity (i.e., zero false positives) and clinical accuracy (i.e., correct classifications with no VUSs), then there would nevertheless be a nonzero reflex rate due to the presence of pathogenic variants in PMS2 exons 12–15 and the corresponding PMS2CL regions that need disambiguation. This biological component would, therefore, reflect primarily the integrated population frequency of pathogenic variants across the ambiguous region. The technical component of the reflex rate, by contrast, arises from imperfect analytical specificity and incomplete knowledge of variant pathogenicity. Though higher in our study (99.7%), analytical specificity for CNVs was 93.7% in Herman et al. , meaning that the technical component of the reflex rate in that study was at least 6.3% (highlighting the variable nature of the technical component). Also, technical reflex due to VUSs in our workflow was required in 4% of samples, a share that is expected to drop with further screening of PMS2 and the resulting ability to reclassify VUSs.
There are several laboratory strategies that can yield accurate results for the last five exons of PMS2, though each requires quality-control monitoring. These approaches include LR-PCR with Sanger sequencing, LR-PCR with NGS, MLPA (often requires LR-PCR with sequencing to disambiguate if CNV in gene or pseudogene), and a reflex workflow built upon short-read NGS as presented here. A risk of performing LR-PCR alone on all samples is allelic dropout, where one of the alleles amplifies poorly or not at all (e.g., due to a SNV under the LR-PCR primer). Appropriate quality control to mitigate this risk could include examining allele balance at sites across the amplicon: allelic dropout is likely if no heterozygous sites are observed, or if all such sites have allele balance significantly less than 50%. We observed one such sample in our cohort with allele balance of ~ 7% at SNVs across the amplicon (note that NGS but likely not Sanger sequencing would be able to identify these low-allele-balance SNVs that may be below the Sanger sequencing detection limit); inspection of the hybrid-capture data revealed a SNV under the PMS2 exon-10 LR-PCR primer (PMS2_LRPCR_F in Additional file 2: Table S3), which we broadly observed in 0.13% of patients (i.e., 1 in 769). A shortcoming of monitoring allele balance to flag allelic dropout is that a sample that simply lacks SNVs in the amplified portion of the genome could undergo extensive follow-up characterization without actually being spurious. Ultimately, an asset of a workflow incorporating multiple orthogonal methods—e.g., both LR-PCR and hybrid-capture-based NGS (in parallel or in a reflex arrangement)—is that the different datasets facilitate reconciliation of complicated genotypes.
We attempted to mitigate several potential limitations of our proposed workflow. For instance, our short-read NGS approach in PMS2 exons 12–15 operates variant-calling software with the assumption of a tetraploid background, obviously unusual in a human clinical genomics setting (GATK supports tetraploid calling, but reports of its efficacy in this mode are scarce). Importantly, there was high concordance between short-read-generated SNV calls in the four-copy background and the combined genotypes detected using gene- and pseudogene-specific LR-PCR. A dearth of cell-line or patient samples with indels or CNVs in exons 12–15 also complicated the ability to assess performance of detecting these important variants. In silico simulations enabled generation of 940 indel-positive and 2186 CNV-positive samples in a tetraploid background, and variant calling on these simulated samples revealed high sensitivity. Finally, despite examination of the workflow’s variant-calling accuracy on hundreds of samples, the assay would still require validation before being demonstrated suitable for clinical use.
Although we measured the sensitivity and specificity of the proposed workflow, its potential impact on cost and turnaround time (TAT) of the test were not explored here. The impact on cost and TAT depends greatly on how a laboratory decides to implement the reflex-testing workflow. For instance, TAT could be minimized by running the LR-PCR reactions as soon as samples arrive and then only perform NGS of the amplicons upon flagging by the short-read NGS analysis. But, this approach would incur the cost of generating amplicons in >90% of samples that would not need further testing. By contrast, cost is minimized by only doing LR-PCR with NGS on relevant samples after the short-read NGS step, but this approach could increase TAT for those samples. These considerations are ultimately important because TAT and cost impact the utility and accessibility of HCS. It will be exciting to see if future technical developments enable targeted long-read sequencing, as this advance would facilitate clinical-grade testing of highly homologous regions of the genome.
Screening for pathogenic variants in the last five exons of PMS2 is technically challenging. High homology between PMS2 and PMS2CL complicates identification of gene-specific variants with short-read NGS alone, and reference cell lines that typically aid assay development and validation are not accurately genotyped in public databases. To help overcome these limitations, we have characterized a reflex workflow that achieves high accuracy and publicly shared the gene- and pseudogene-specific raw data, genotypes, and variant frequencies in widely available cell lines.
We are grateful to Kaylene Ready, Kristin Price, Leslie Bucheit, Peter Kang, Kevin Haas, Kevin Iori, Eric Evans, Krista Moyer, Becca Mar-Heyming, Kerri Hensley, Megan Judkins, Christine Lo, Eric Olson, Kyle Beauchamp, Kristjan Kaseniit, Jeffrey Tratner, Henry Lai, Carly Paul, Pranav Sharma, Victoria Brewster, Irina Ridley, Harris Naemi, and Gabor Brasnjo for support of this manuscript. We thank Invitae for providing CNV-positive samples.
Counsyl provided support for the design of the study, data collection, data analysis, data interpretation, and writing.
Availability of data and materials
BAMs for the cell-line LR-PCR data can be found in the European Nucleotide Archive (accession #PRJEB27948). Corresponding VCFs of these samples are included as a supplemental file.
GMG, PVG, MRT, DHH, CSC, JRM, GJH, and DM designed the study. GMG, PVG, MRT, LS, IEW, LMM, and RGC collected and analyzed the data. GMG, PVG, MRT, LS, IEW, LMM, GJH, and DM wrote the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
The protocol for this study was reviewed and designated as exempt by Western Institutional Review Board.
Consent for publication
All authors are current or former employees and equity holders of Counsyl.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 30.Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing [Internet]. arXiv [q-bio.GN]. 2012. Available: http://arxiv.org/abs/1207.3907
- 34.Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med. 2015;17:405–24.CrossRefGoogle Scholar
- 35.Salvatier J, Wiecki TV, Fonnesbeck C. Probabilistic programming in Python using PyMC3. PeerJ Comput Sci. PeerJ Inc. 2016;2:e55.Google Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.