Background

For some cancer types, the wide-spread adoption of cancer screening has increased the detection of pre-malignant lesions (PML) [1]. Today, breast ductal carcinoma in situ (DCIS) comprises nearly ~ 25% of all breast cancer diagnoses in the United States [2]. In the case of DCIS, disease-specific guidelines recommend surgical excision and radiation, and endocrine risk reducing therapy. While treatment prevents rate of second events, it has not translated into increase survival rates, which are very high for most patients with DCIS, suggesting overtreatment of PML and highlighting a critical need to improve risk models and identify prognostic markers [1, 3, 4]. Current PML risk models rarely account for molecular biomarkers such as mutations or copy number alterations, which are seldom profiled. This is in large part due to technical challenges in profiling PML: specimens from PML biopsies are typically formalin fixed and paraffin embedded (FFPE) in their entirety, to verify absence of any invasive component. As a consequence, no fresh or frozen material is available for research. Moreover, many PMLs observed in absence of invasive lesions are very small or have low overall cellularity, sometimes less than a millimeter in diameter or containing fewer than 1000 cells. Hence, while FFPE specimens have successfully been used in high-throughput sequencing, the challenges posed by excessive formalin-induced deoxyribonucleic acid (DNA) damage—detailed below—are typically overcome by an increase in DNA input quantity [5,6,7,8,9], a solution not available for archival PML profiling. Thus, small FFPE PML specimens pose significant challenges in the generation of high throughput sequencing libraries and preclude the investigation of genetic biomarkers. To overcome these limitations, previous studies have been performed in fresh PML from areas adjacent to invasive disease instead of on pure PML, ignoring the vast majority of PML that are less likely to progress [10,11,12,13,14,15]. Profiling of pure PML in the absence of invasive disease is required to avoid such biases and thus necessitates methodology to work with archival FFPE specimens.

One of the main challenges of library preparation from damaged, low input DNA samples is to preserve the library complexity: the faithful and unbiased representation of all fragments in the starting DNA sample. Indeed the multiple steps of the library preparation, including the repair of the input DNA, the ligation of adapter, target enrichment and the multiple rounds of purification and PCR amplification can all act as bottlenecks, and introduce strong skews that will reduce the library complexity and eventually impact precision and recall of variant calling. Moreover, formalin is known to create adducts in the DNA and lead to spurious substitutions, which can be difficult to distinguish from true somatic variants, especially at low allelic fractions [16, 17]. Finally, the most insightful prognostic biomarker studies of PML progression require long follow up to rely on actual outcome (recurrence, second events, survival) rather than proxy risk markers (grade, subtype, histological markers). As a consequence, most studies are retrospective and rely on old archived material without matching germline DNA sample, rendering the identification of high confidence somatic mutations more difficult. Hence, both technical and experimental challenges are hampering progress in PML mutational profiling.

Here we present the development of DNA library preparation and variant calling strategies specifically optimized for low abundance, damaged DNA, commonly extracted from PMLs. Using highly damaged DNA, we compared the effect of the input amount, the size of the captured genomic region, and library preparation strategy on the quality of coverage depth and breadth. We determined that library preparation using blunt-end (BE) adapter ligation strategy maximizes the library complexity down to 3 nanograms (ng) of input DNA and is compatible with whole exome capture. Using a set of DNA variants called from a frozen mirrored tissue specimen, we optimized the variant calling strategy to maximize its accuracy. We further demonstrated its validity on 10 DNA samples extracted from laser capture micro-dissected regions of PML or adjacent normal from DCIS of 3 independent patient FFPE specimens. We illustrate the utility of the approach to identify somatic mutations in candidate genes and characterize PML clonal heterogeneity within a specimen.

Methods

Sample collection and preparation

Test specimen

Mirrored frozen-FFPE tissue specimen from a HER2 positive invasive breast cancer was obtained from Asterand Biosciences (Detroit, MI). The length distribution of the DNA fragments was measured by capillary electrophoresis (Agilent BioAnalyzer) and used to calculate the DNA integrity number (DIN) between 1 (very degraded) to 10 (intact genomic DNA).

PML specimen

FFPE blocks were obtained from UCSD Health Anatomic Pathology after surgical biopsy, excision or mastectomy. The UCSD institutional review board approved the retrospective study and waived the requirement for consent. Consecutive sections of the blocks were used for Hematoxylin–Eosin staining (N = 1; 4 µM glass slide) then for Laser Capture Microdissection (LCM; N = 3; 7 µM glass slide coated with polyethylene naphthalate—ThermoFisher #LCM0522). The slides were stored at − 20 °C in an airtight container with desiccant until ready for dissection (1 day to 3 months). The LCM sections were thawed and stained with eosin, sections were kept in xylene and dissected within 2 h of staining. Laser Capture Microdissection was performed using the Arcturus Laser Capture Microdissection System. Matching regions from 3 adjacent sections were collected on one Capsure Macro Cap (Thermofisher), region size permitting. Post-dissection, all caps were covered and stored at − 20 °C.

DNA extraction and QC

The DNA was extracted from FFPE tissue using the QIAamp DNA FFPE tissue kit and QIAamp DNA Micro Kit (Qiagen) for the test specimen or LCM specimen, respectively. For the LCM sample, the membrane and adhering tissue were peeled off the caps using a razor blade and the peeled membrane was incubated in proteinase K digestion reaction overnight for 16 h at 56 °C to maximize DNA yield after cell lysis and the elution was done in 20 µL. The extracted DNA was quantified by fluorometry (HS dsDNA kit Qbit—Thermofisher). All samples used in the study are described in Additional file 1: Table S1.

Targeted sequencing

DNA fragmentation

DNA was sheared down to 200 base pairs (bp) using Adaptive Focused Acoustics on the Covaris E220 (Covaris Inc) following manufacturer recommendations with the following modifications: 50 µL of Low TE buffer in microTUBE-130 tubes (AT libraries) or 10 μL Low EDTA TE buffer supplemented with 5 μL of truSHEAR buffer using a microTUBE-15 (SS and BE libraries).

Library preparation

AT libraries were prepared with the SureSelect XT HS protocol (Agilent Technologies) extending the adapter ligation time to 45 min (min). After ligation, excess adapters were removed using a 0.8 × SPRI bead clean up with Agencourt AMPure XP beads (Beckman Coulter), then eluted into 21 μL of nuclease-free water. SS libraries were prepared using the Accel-NGS 1S Plus DNA Library Kit (Swift Biosciences). Prior to the single-strand ligation protocol, 15 μL of fragmented DNA was denatured at 95 °C for 2 min, then set on ice. The adaptase and extension steps were performed by kit specifications followed by a purification step using 1.2 × AMPure XP, eluted into 20 μL of nuclease free water. The subsequent ligation step incorporates SWIFT-1S P5 and SWIFT-1S P7 adapters, followed by a 1 × AMPure XP bead clean-up and elution into 20 μL of nuclease free water. BE libraries were prepared using the Accel-NGS 2S PCR-Free DNA Library Kit (Swift Biosciences). Repair I was followed by a 1 × AMPure bead cleanup, Repair II was followed by a 1 × PEG NaCl cleanup, and Ligation I (P7 index adapter) and Ligation II (P5 UMI adapter) were followed by a 0.85 × PEG NaCl cleanups. Only Ligation II cleanup was eluted into 20 μL Low EDTA TE, the other cleanups proceeded directly into the next reaction. Adapters used in the study are summarized in Additional file 1: Table S2.

Pre-capture PCR amplification

Ligated and purified libraries were amplified using KAPA HiFi HotStart Real-time PCR 2X Master Mix (KAPA Biosystems). AT libraries were amplified with 2 μL of KAPA P5 primer and 2 μL of SureSelect P7 Index primer. SS libraries were amplified with 5 μL of SWIFT-1S P5 Index and P7 Index primers. BE samples were amplified with 5 μL of KAPA P5 and KAPA P7 primers. The reactions were denatured for 45 s (s) at 98 °C and amplified 13–15 cycles for 15 s at 98 °C, for 30 s at 65 °C, and for 30 s at 72 °C, followed by final extension for 1 min at 72 °C. Samples were amplified until they reached Fluorescent Standard 3, cycles being dependent on input DNA quantity and quality. PCR reactions were then purified using 1 × AMPure XP bead clean-up and eluted into 20 μL of nuclease-free water. The resulting libraries were analyzed using the Agilent 4200 Tapestation (D1000 ScreenTape) and quantified by fluorescence (Qubit dsDNA HS assay). Primers used in the study are summarized in Additional file 1: Table S2.

Targeted capture hybridization and post-capture PCR

Samples were paired and combined (12 μL total) to yield a capture “pond” of at least 350 ng, and supplemented with 5 μL of SureSelect XTHS and XT Low Input Blocker Mix. The baits for target enrichment consisted of either Agilent SureSelect Clinical Research Exome panel (S06588914), Human All Exon V7 panel (S31285117) or Cancer All-In-One Solid Tumor (A3131601). The hybridization and capture was performed using Agilent SureSelect XT HS Target Enrichment Kit following manufacturer’s recommendations. Post-capture amplification was performed on the beads in a 25 μL reaction: 12.5 μL of nuclease-free water, 10 μL 5 × Herculase II Reaction Buffer, 1 μL Herculase II Fusion DNA Polymerase, 0.5 μL 100 mM (mM) dNTP Mix and 1 μL SureSelect Post-Capture Primer Mix. The reaction was denatured for 30 s at 98 °C, then amplified for 12 cycles of 98 °C for 30 s, 60 °C for 30 s and 72 °C for 1 min, followed by an extension at 72 °C for 5 min and a final hold at 4 °C. Libraries were purified with a 1 × AMPure XP bead clean up and eluted into 20 μL nuclease free water in preparation for sequencing. The resulting libraries were analyzed using the Agilent 4200 Tapestation (D1000 ScreenTape) and quantified by fluorescence (Qbit—ThermoFisher).

RNA-sequencing

Expression profiling was performed on select dissected PML regions: 1A, 2A2, 2B and 3C. RNA library preparation was performed with SMART-3Seq, a 3′ tagging strategy specifically designed for degraded RNA directly from FFPE LCM specimen [18]. Read count data was obtained using a dedicated analysis workflow https://github.com/danielanach/SMART-3SEQ-smk. Count data was then normalized for read depth and scaled by a million to give transcripts per million (TPM) counts.

Sequencing

All libraries were sequenced using the HiSeq 4000 sequencer (Illumina) for 100 cycles in Paired-End mode. Libraries with distinct indexes were pooled in equimolar amounts. The sequencing and capture pools were later deconvoluted using program bcl2fastq [19].

Data analysis

Sequencing reads processing and coverage quality control

Sequencing data was analyzed using bcbio-nextgen (v1.1.6) as a workflow manager [20]. Samples prepared with identical targeted panels were down-sampled to have equal number of reads using seqtk sample (v1.3) [21]. Adapter sequences were trimmed using Atropos (v1.1.22), the trimmed reads were subsequently aligned with bwa-mem (v0.7.17) to reference genome hg19, then PCR duplicates were removed using biobambam2 (v2.0.87) [22,23,24]. Additional BAM file manipulation and collection of QC metrics was performed with picard (v2.20.4) and samtools (v1.9) [24]. The summary statistics of the sequencing and coverage results are presented in Additional file 1: Table S3.

Copy number analysis

Copy number alterations (CNAs) were called using CNVkit (v0.9.6) [25] using equal sized bins of ~ 250 bp. Any bins with log2 copy ratio lower than − 15, were considered artifacts and removed. Breakpoints between copy number segment were determined using the circular binary segmentation algorithm (p < 10−4) [26]. Low quality segments were removed from downstream analysis (less than 10 probes, biweight midvariance more than 2 or log2 copy ratio confidence interval contains 0). Copy number genomic burden was computed as the sum of sizes of segments in a gain (log2(ratio) > 0.3) or loss (log2(ratio) < − 0.3) over the sum of the sizes of all segments. The summary statistics of CNA calling on the test specimen are reported in Additional file 1: Table S4. Chromosomal arm gains and losses were called when more than half of their total length was involved in a gained and lost segment, respectively. Gene copy number estimates were assigned based on the segment that covered the gene. For the test specimen, if more than one segment covered a gene then the higher confidence segment was used. For the DCIS specimen, copy number alterations were determined for all autosomal genes containing at least 3 bins whose segments were covered by at least 110 probes (N = 17,750 total). Additionally, due to the imprecision of segmentation breakpoints, any genes with a breakpoint identified in one region of a patient, were removed from the comparison to other regions. Loss of heterozygosity (LOH) was called for segments with B-allele frequencies lower than 0.3 or greater than 0.7.

Variant calling and initial filtering

Single nucleotide variants (SNVs) and short insertions and deletions (indels) were called with VarDictJava (v1.6.0), and Mutect2 (v2.2) [27, 28]. Variants were required to fall within a 10 bp boundary of targeted regions that overlapped with RefSeq genes (v 109.20190905). A publicly available list of variants observed in a pool of normal DNA exome sequencing was obtained from the GATK resource (https://console.cloud.google.com/storage/browser/details/gatk-best-practices/somatic-b37/Mutect2-exome-panel.vcf) and used to eliminate artifacts and common germline variants. Only variants called by both algorithms were considered (ensemble calling). These ensemble variants were then subjected to an initial filtering step with default bcbio-nextgen tumor-only variant calling filters listed in Additional file 1: Table S5. Functional effects were predicted using SnpEff (v4.3.1) [29]. The resulting variants are referred to as raw ensemble variants.

Germline variant filtering

In absence of a matched normal control for both test (frozen and FFPE) and DCIS specimens, somatic mutations were prioritized computationally using the approach from the bcbio-nextgen tumor-only configuration then additionally subjected to more stringent filtering [30]. Briefly, common variants (MAF > 10−3) present in population databases—1000 genomes (v2.8), ExAC (v0.3), or gnomAD exome (v2.1)—were removed unless in a tier 1 gene from the cancer gene consensus and present in either COSMIC (v68) or clinvar (20190513) [20, 31,32,33,34,35]. Additionally, variants were removed as likely germline if found at a variant allelic fraction (VAF) greater or equal to 0.9 in non-LOH genomic segments—as determined by CNA analysis (above). The remaining variants are referred to as candidate somatic mutations.

Analysis specific filtering of candidate somatic mutations

Additional filtering was implemented in a context specific to the analysis presented. (1) Calling “gold standard” mutations from the frozen test specimen: the candidate somatic mutations in the DNA of the frozen test specimens were filtered for high-quality variants: ensemble quality score greater than 175, average number of read mismatches less than 2.5, position covered by at least 25 reads, mean position in read greater than 20, microsatellite length less than 5, and VAF more than 0.14. This resulted in 247 SNVs and 10 indels. (2) Benchmarking mutations in FFPE test specimens: mutations in the DNA of the FFPE test specimens required specific filtering due to the abundant low-frequency damage as well as lower coverage depth. The following parameters were used: position covered by at least 5 reads, mapping quality more than 45, mean position in read greater than 15, number of average read mis-matches less than 2.5, microsatellite length less than 5, tumor log odds threshold more than 10, Fisher strand bias Phred-scaled probability less than 10 and VAF more than 0.14. The accuracy of resulting DNA variants from the test FFPE specimen was measured against the set of “gold standard” variants from the mirrored frozen specimen using vcfeval by RTG-tools (v3.10.1), using variant ensemble quality as the score [36]. The results of the benchmarking analysis are reported in Additional file 1: Table S6. (3) Profiling dissected regions from the DCIS specimen: in addition to the filtering of FFPE candidate somatic mutations presented above, the following steps were implemented. Any variants found at high VAF (> 0.9) in non-LOH segments in one region were also excluded from the variants from all other regions of same patient. Candidate somatic mutations with ensemble quality score lower than 115 were excluded, corresponding to the optimal F-score obtained for low-input BE libraries in the benchmarking analysis. To the exception of few well-described hotspot mutations in breast cancer (PIK3CA, TP53, GATA3), somatic mutations identified in more than one patient were removed.

Clonality analysis

To allow the analysis of clonal relationships between regions of the same patient, the coverage depth of each allele at any remaining mutated position in any region was extracted using Mutect2 joint variant caller on the sets of aligned reads from each region. In order to call a mutation either absent or present in a region, we used a Bayesian inference model specifically designed for multi-region variant calling [37]. Treeomics (v1.7.10) was run with the default parameters except for e = 0.02.

Results

Evaluation of targeted sequencing approaches for low input FFPE DNA

Regions of PMLs on a histological section can contain as few as 500 cells, corresponding to 3.3 ng of haploid DNA. The expected reduced DNA extraction yield can be mitigated by combining regions matched across sequential sections. Hence, optimizing targeted sequencing down to 3 ng of input DNA is a reasonable objective to identify high confidence mutations in PML. To develop the methodology, a test DNA sample was extracted from a 4 year-old FFPE HER2 positive breast invasive carcinoma which showed significant fragmentation (DNA integrity number of 2.4), likely representative of DNA extracted from old archived specimens. High-throughput sequencing libraries were prepared using traditional A-tailing adapter ligation protocol optimized for low-input, damaged DNA (referred to as AT method—Fig. 1a) with decreasing amount of input DNA from 200 down to 10 ng. After capture of the whole exome by hybrid selection, the DNA libraries were amplified and sequenced. The performance was evaluated in comparison to whole exome data generated from 200 ng of DNA extracted from a mirrored frozen tissue specimen [38]. Libraries generated from 200 or 50 ng FFPE DNA achieved reasonable coverage, with nearly all targeted bases covered at least 20-fold (referred to as Cov20). In contrast, the sequencing libraries from 10 ng FFPE DNA lacked complexity (79% read PCR duplicates), resulting in 17% Cov20 after elimination of the duplicate reads (Fig. 1b, c). Such poor performance at 10 ng, precluded us from further decreasing the DNA input and suggested that perhaps a smaller capture panel, restricted to cancer genes (710 kb total size) would elicit the goal of 3 ng input. Unfortunately, 3 ng AT libraries sequencing with a cancer panel had a high percentage of duplicate reads (83%) and 1.6% Cov20 (Fig. 1d, e). Hence, the consistently poor performance of both low input strategies (3 ng and 10 ng) irrespective of the capture panel size suggests that the bottleneck reducing the library complexity originates upstream of the targeted capture, which led us to examine the initial ligation of the sequencing adapters.

Fig. 1
figure 1

Benchmarking results for sequencing performance. a Experimental design for performance evaluation using a test DNA specimen (* Exome and † Cancer panel). b, c Fraction of targeted bases covered by a minimum of 20 reads (b) and fraction of PCR duplicates (c) observed in whole exome sequencing. d, e Fraction of targeted bases covered by a minimum of 20 reads (d) and fraction of PCR duplicates (e) observed in cancer panel sequencing. All error bars represent standard deviation

Library preparation methods which utilize alternate ligation techniques have previously been shown to increase the complexity of the library notably for the analysis of cell free DNA [39, 40], ancient DNA [41,42,43] and other applications [44, 45]. We therefore evaluated blunt-end ligation (BE) and single strand ligation (SS) library preparation strategies to increase the number of input DNA fragments incorporated in the library and enhance the resulting complexity and usable sequence coverage (Fig. 1a, Additional file 2: Fig. S1). The cancer panel capture of BE libraries prepared from 3 ng of test DNA showed a reduced percentage of duplicate reads compared to AT libraries with 10 ng input (73% vs. 83% respectively Fig. 1e), leading to a dramatic increase in Cov20 (99.7% vs. 2%) and bringing it to levels comparable to high input (50 ng) AT libraries (Fig. 1d), albeit with higher duplicate rates. In turn, the SS library offered a lesser, but measurable, improvement over AT library for low input (Fig. 1d, e). The superiority of the BE and SS strategies were further confirmed using whole exome capture with BE and SS libraries resulting in 84% and 63% Cov20 at low input (3 ng), respectively, higher than 17% observed for low input (10 ng) AT libraries (Fig. 1b). Compared to the cancer panel capture, the improvements of these alternative ligation strategies on whole exome sequencing were milder but remained remarkable and, in the case of BE strategy, likely to support more sensitive mutational profiling of archived PMLs.

Somatic mutation profiling from low input FFPE DNA

At a minimum, both SNVs and indels are necessary to evaluate the mutational landscape of PML. Unfiltered SNVs identified in FFPE DNA showed both a high overall abundance (422,322) of low variant allelic fraction substitutions (VAF < 5%) and bias of C to T substitutions (53%), which is expected from the cytosine deamination resulting from formalin fixation [16, 17]. In contrast, low VAF variants from frozen DNA were much lower in abundance (175,364) but contained a high-prevalence (52%) of C to A substitutions consistent with previously reported 8-oxoguanine damage observed in frozen samples (Additional file 2: Fig. S2) [46]. We hypothesized that we could discriminate against artifactual FFPE variants in the test specimens using stringent filtering criteria including high strand bias, low allelic fraction and poor concordance between multiple variant callers in order to call accurate somatic variants in FFPE preserved PML (Methods).

First, we established a set of benchmarking somatic mutations from the test DNA extracted from a mirroring frozen tissue specimen [38]. Its whole exome sequence resulted in 247 SNVs and 10 indels that were used to measure performance of the variant calling from the FFPE libraries generated above. Prior to filtering, the analysis of variants from low input AT libraries resulted in an average of 7,475 false positive somatic mutations. In contrast, the analysis of variants from BE and SS libraries resulted in an average of 1,967 and 3,137 false positive mutations, respectively (Fig. 2a). We developed additional filtering criteria, trained on variants from the high-input (200 ng) FFPE library and used variants called from a publicly available panel of normal samples to remove additional artifacts. This approach considerably reduced the fraction of false positives (Fig. 2b), increasing precision from less than 20% to 84%, 93% and 91% for the AT, BE and SS low input libraries, respectively (Fig. 2c, d). The variant recall increased from 75% in the AT low input library to 85% and 80% for the BE and SS low-input libraries, respectively, consistent with differences observed in Cov20 (Additional file 1: Table S6). Importantly, these values were similar to the theoretical maximum values obtained from down-sampling the frozen sample itself to an equivalent number of reads (90% precision, 88% recall), indicating that the differences mainly comes from sampling rather than from technical artifacts. The improvement in accuracy was similar for the small number of indels (Additional file 1: Table S6). These data suggest that FFPE specific filtering of ensemble variant calls paired with BE library preparation enables accurate clonal somatic SNV and indel variant calling of whole exome sequencing data from 3 ng FFPE test DNA.

Fig. 2
figure 2

Benchmarking results for variant calling. a, b Count of total variants from whole exome sequencing, separated by false positives (red), false negatives (black) and true positives (grey), before (a) and after (b) PML specific filtering. c, d Exome variant calling precision for various library preparation strategies and amount of input DNA (x-axis) before (c) and after (d) PML specific filtering. e, f Fraction of the genome involved in a copy number alteration (CNA burden—y axis) for all exome (e) and cancer panel (f) library preparation strategies and DNA input amounts. All error bars represent standard deviation

In contrast to SNVs and indels, Copy Number Alterations (CNA) can be accurately identified using lower coverage depth whole genome sequencing data, though has been more challenging in targeted sequencing [47, 48]. We evaluated the ability of all input quantity and library preparation methods to reliably identify CNA (Methods). The genome-wide CNA burden obtained from low input SS and BE library was consistent with the one of high-input (200 ng) AT libraries (8.1%, 7% and 8.4% respectively—Fig. 2e and Additional file 1: Table S4), and consistent with the results of the cancer panel sequencing and the lower CNA burden observed in the 10 and 50 ng AT libraries are still within the expected confidence interval (Fig. 2f, Additional file 2: Fig. S3). The resulting level of copy number gains and losses estimated for each chromosome arms and more focal areas were highly reproducible between all tested library preparation strategies (Additional file 2: Fig. S4, S5). Additionally, the copy number status of known cancer genes was also consistent between exome and cancer panel, including the expected ERBB2 amplification which was correctly determined in all cases. In a few instances, the denser tiling of the exome probes helped identify a copy number breakpoint missed in the cancer panel, resulting in discordant copy number estimate for a few genes (Additional file 2: Fig. S6). This suggests that the input amount and quality of the sample have little impact on the accuracy of the copy number profiling, albeit this observation was limited to a specimen with few CNAs.

Mutational profiling of breast PML

To validate the optimized targeted sequencing and somatic variant calling on PMLs, we collected archived tissue specimens from 3 patients diagnosed with breast high grade ductal carcinoma in situ (DCIS—1 low grade, 2 high grade) without evidence of invasive disease. A total of 8 PML regions and one normal breast epithelium region (patient 3, region 3 N) were isolated by laser capture microdissection (LCM—Fig. 3a). The dissected regions had a mean area of 5 mm2 and, combined over three adjacent sections, contained an average of 80,000 cells (Fig. 3b). For one large DCIS region, adjacent sections (patient 2, regions 2A1 and 2A2) were processed independently for replication. For each region, between 1.4 and 21 ng of DNA were extracted and used to prepare exome libraries using the BE method. The rate of duplicate reads was between 32 and 82%, and, as expected, inversely correlated with the amount of input DNA (Additional file 2: Fig. S7). The resulting mean coverage depth was between ~ 2 and 45 fold, which is sufficient for accurate detection of CNA, but likely limiting the sensitivity to identify mutations, particularly in patient 2.

Fig. 3
figure 3

Overview of the PML regional sequencing strategy. a Overall experimental and analytical workflow of the validation study. b Images showing the Hematoxylin and Eosin stained sections from the three DCIS patient studied. Dissected PML regions are highlighted in color to the exception of region 3N consisting of multiple areas of normal epithelium outside the selected field of view

Between 7 and 43% of the regions’ genomes were involved in CNAs, predominantly through copy number losses (Fig. 4a). With the exception of region 2B, the CNA burden was consistent between regions of the same patient and minimal in normal breast epithelium (< 5%). A total of 18 chromosome arms were affected by copy number changes in at least one DCIS region. None of these were altered in the normal breast epithelium (Fig. 4b). Some of the chromosome arm losses identified, such as 6q, 8p, 16q, 17p or 22q are frequent in DCIS [49]. Within patient 1 and 3, all regions have consistent CNA suggesting a common clonal ancestor. In contrast, regions 2A and 2B have a different number of arms altered (5 vs. 13, respectively) with only 3 in common. Both regions featured high nuclear grade, but only 2B showed comedo-necrosis, a marker of more advanced and worse prognosis DCIS. Region 2B was the only region affected by chromosome arm gains at 8q, 20q and 21q. The absence of these gains in 2A as well as its additional losses of 4p and 9p, suggest the independent clonal evolution of 2A and 2B. Finally, all PML regions from patient 2 and 3 displayed a loss of 17p, generally associated with TP53 loss of heterozygosity (LOH) frequently observed in high-grade specimens. At a higher resolution, out of 98 cancer genes evaluated, 38 had a CNA in at least one region (Fig. 4c). The most notable ones were the amplification of ERBB2 in all regions of patient 2, amplification of FGFR1 in all regions of patient 3 and loss of TP53 in patient 2 and 3. This latter observation was consistent with 17p LOH and confirmed by change in B-allele frequency at heterozygous SNPs (Additional file 2: Fig. S8). The relative gene expression levels measured for 3 genes in 4 matching regions were consistent with their copy number in 10/12 cases, with possible discrepancies due to spatial variation in histology or transcriptional regulation (Additional file 2: Fig. S9) [18]. Similar to chromosome arms, none of the genes were altered in the normal epithelium and most had consistent copy numbers between regions of the same patient, with the exception of 12 genes distinguishing regions 2A and 2B and further supporting separate clonal evolution.

Fig. 4
figure 4

Mutational profile and clonal analysis from multi-region DNA sequencing of DCIS patients. a Fraction of genome involved in copy number losses (blue) or gains (red) for each sequenced region. b Chromosome arm copy number status in each sequenced region: lost (blue) or gained (red). c Cancer gene copy number (log2 ratio—blue red gradient). Genes from the cancer panel with copy number gain (log2(ratio) > 0.4) or loss (log2(ratio) < -0.6) in at least one region, indicated with an asterisk, are displayed. d, e Bayesian probabilistic variant classification of selected high confidence somatic variants (represented by their cognate gene—rows) across all dissected regions of same patient (columns). Variants are shown for patient 1 (d) and patient 3 (e). The color gradient indicates the posterior probability of mutation presence in each region. Genes from the Cancer Gene Census are indicated in red font. f Maximum likelihood tree generated using somatic mutations identified in patient 3’s regions

We next identified somatic mutations in each region of patient 1 and 3’s specimens. After additional quality filtering using cross-patient information, we identified between 18 and 154 somatic mutations per PML region, of which 14 to 108 were non-silent (Additional file 1: Table S7). The resulting mutational burden (0.46–3.97 mutations/Mb) is a range similar to what has been observed in invasive breast cancer [50]. The somatic mutations were then used to characterize the clonal relationships between regions. To account for uneven sequencing coverage between regions and the possibility of random allelic dropout, we used a maximum likelihood approach, comparing allelic fraction and total read depth of any mutated position in all regions of patients 1 and 3 (Fig. 4d, e) [37]. Mutations in patient 1 were evaluated at 155 mutated positions, of which 141 were confidently identified in all 3 regions, and 14 were either missing or absent from one or more regions (Additional file 1: Table S8). While a portion of these may be shared mutations may be germline variants, we identified a GATA3 splice-site deletion in regions 1A and 1B previously observed in DCIS studies, disrupting a canonical splice site and for which the resulting transcript has been shown to lead to an abnormally high number of neoantigens [51, 52]. Similarly, mutations in patient 3 were evaluated at 183 positions, 160 of which were shared by all 4 regions, including normal, and likely residual germline variants. The results were used to build a phylogenic tree illustrating the clonal relationship between regions (Fig. 4f). Interestingly, region 3 N is mutated at 4 positions not found in the PML regions. These could represent mutations acquired in aging normal tissue or residual germline variants lost in the PMLs [53,54,55]. The 3 PML regions gained an additional 8 mutations before diverging including an ERBB3 Ile763Leu substitution was exclusively observed in all 3 PML regions of patient 3. This mutation is predicted to be deleterious, possibly activating this uncommon driver of breast cancer [56, 57] and may have contributed, in concert with FGFR1 gain, to the clonal expansion observed in this patient. Region 3A gained an additional 2 mutations, while 3C and 3B shared an additional 4 and each gained between 1 and 3 mutations that were not shared. Overall, our analysis suggests that even in absence of normal DNA, and akin to experiments in large tumors, variants from multi-region sequencing can be used to trace evolutionary relationships between areas of pre-malignancy [37, 58].

Discussion

The results presented here demonstrate our ability to perform comprehensive mutational profiling from some of the most challenging clinical specimens with DNA in limited quantity—down to 3 ng—and of poor quality—highly degraded and chemically altered. In particular, this demonstration relied on a thorough benchmarking study using DNA from mirrored matching frozen versus FFPE specimens, which provided a real-world experimental framework to guide the process development.

We demonstrated that by utilizing a sequencing library preparation that uses a non-standard adapter ligation, we can drastically improve sequencing performance from these challenging specimens. A-tailing, together with transposon-mediated construction, is one of the most popular methods to prepare high-throughput sequencing libraries. While the latter necessitates longer DNA fragments and is not-suitable for highly degraded DNA, A-tailing has been broadly used in library preparation and is compatible with highly degraded specimens so long as DNA input is increased. We illustrated this limitation, analyzing targeted sequencing from limited dilutions and saw a drop in coverage and variant calling accuracy below 50 ng DNA input. Target coverage cannot be rescued by sequencing of a smaller panel, since the bottleneck resulting in lack of library complexity occurs prior to capture. For clinical reasons, to allow the thorough inspection of all tissue parts to formally exclude invasive disease, all pure PMLs are fixed in formalin and archived. Furthermore, areas of PML are typically small, limiting the quantity of material available for analysis. For the largest lesions, a labor-intensive dissection and pooling can increase the amount of DNA extracted but would preclude the study of their genetic heterogeneity. Similar to the benefit demonstrated on ancient DNA [41, 43], we showed that an alternative library preparation using either single-strand and even more so for blunt-end ligation strategies considerably improved both coverage and consequently variant calling performances. Despite the recognized high efficiency of sticky-end ligation [59], end-repair and addition of an overhanging adenosine is likely the rate-limiting step on highly damaged DNA.

Another challenge we faced and addressed was calling mutations in absence of a matching normal DNA in both the test and PML DNA, likely leading to some ambiguous mutation classification [60]. The most useful PML specimens for biomarker studies are the ones with long follow-up, and aside from logistical challenges of collecting matching blood or saliva as part of routine clinical workflow, these traditional sources of normal DNA are generally not available for archival PML. The dissection of adjacent normal tissue, performed for one specimen in this study, can be sometimes used. While sufficient material can be found at the margin of the surgical specimen, the cellularity of the normal tissue will vary greatly between organs and histological context leading to even smaller quantities of DNA. In the breast, the normal ductal tree is poorly cellular in comparison to dysplastic and in situ proliferative lesions. In some instances, an area of high lymphocytic infiltration, for example at the location of a previous biopsy, can be used. Such histologically normal tissue are also formalin fixed and their mutational profiling presents similar, if not more, challenges than for PMLs. Furthermore, and as demonstrated elsewhere, some histologically normal specimens will contain a few somatic mutations at low-allelic fraction, resulting from early clonal selection, such as the few private mutations identified in sample 3N [53,54,55]. Nevertheless, the inclusion of such a matched normal DNA in the analysis and interpretation would greatly aid in the removal of residual germline DNA and additional sequencing artifacts. In absence of matching normal DNA, the parallel sequencing of a panel of unmatched normal DNA, from the same ethnic background and processed using the exact same protocol and analysis is recommended, especially in a clinical setting [60]. This was not available for this benchmarking study. Instead, we combined the use of a publicly obtained pool of normal with filtering for common variants using public databases following recommendation provided elsewhere [30, 61]. This approach will miss rare germline variants, especially present in rare, or under-studied ethnicities. Previous studies have observed that tumor-only exomes may lead to ~ 300 residual germline variants after careful filtering [60, 62]. Importantly, coding and deleterious germline variants could be a source of false positives. In our study, multi-region sampling provided additional information to help us classify mutations, and we determined that 155–160 mutations were shared and likely represented residual germline variants. This approach would however not remove the ambiguity for shared mutations and, in the event this mutation is an oncogenic driver common to all regions, additional validation steps may be required, including sequencing from a more distant region located on a separate tissue block, or comparison to its allelic fraction and copy number status in the bulk DNA.

Independent of the possible residual germline variants, we took specific steps to benchmark variant calling in highly degraded FFPE DNA, comparing the results to high quality variants called from the DNA of an adjacent frozen specimen. After carefully down-sampling the data to obtain the same number of raw sequencing reads—including in silico “replicates” from the deeper reference (frozen tissue) dataset itself—we identified two main mode of errors. First, a large number of false negative variants, leading to lower recall, were directly associated to the uneven and lower coverage depth, particularly in the low input FFPE AT library. The high and more even coverage observed in the BE libraries, for the same number of raw reads, remediated this issue, resulting in recall similar to the one expected from sampling bias observed in other studies [63]. Second, the false positives were mostly due to C to T substitutions as a consequence of formalin fixation, as previously observed. Such substitutions however remained at low allelic fraction and displayed strong strand bias, which could be remediated using stringent heuristic filtering. Alternate solutions have been described elsewhere to remove same or similar sequencing artifacts using machine learning [64] or relying on the precise substitution signature of the artifacts [65, 66]. These would likely yield similar or superior results. DNA damage can also be repaired prior to library preparation altogether using cocktails of DNA repair enzymes, such as UDG and Fpg [67, 68]. This strategy would decrease false positive mutations [69] but typically require more than 5 ng FFPE DNA and the extra enzymatic step would likely add bottleneck and decrease library complexity.

Of the 3 patients studied, all displayed chromosomal copy number changes previously observed in DCIS, leading to losses of known tumor suppressors (TP53) or gains of known oncogenes (ERBB2 or FGFR1). In 2/3 patients, we observed a comparable number of somatic mutations to pure DCIS previously studies [70, 71]. The identification of only two known or likely breast cancer driver mutations (GATA3 and ERBB3) in two of the specimen was not surprising, as these specimens represent pure PML lacking any invasive component, and thus should bear less genomic resemblance to breast cancer [70]. Interestingly GATA3 has been reported as mutated at a higher frequency in pure DCIS than in invasive cancer, suggesting a negative selection during transition to invasive cancer [52]. This particular splice-site mutation produces an alternative transcript resulting in a high numbers of neoantigens, perhaps subjecting mutated lesions to a more effective immune-surveillance [51]. The relative contribution of gene copy number alterations versus somatic mutations to cellular proliferation and clonal selection in normal and pre-malignant tissue is an active field of study [53,54,55] and progress in this field will require multi-modal molecular profiling approaches compatible with small amounts of archived tissue, such as the one described here.

Beyond the identification of drivers of growth and proliferation, the proposed approach can help measure the genetic heterogeneity in PML lesions. In breast DCIS, phenotypic heterogeneity associated with subtype and grade can co-exist within a specimen [72]. Additionally, immunostaining of key markers has revealed spatial heterogeneity within a duct and between ducts of a patient [73]. But the mutational landscape underlying such heterogeneity has not been thoroughly studied in pure DCIS at a genome-wide level for the technical reasons mentioned herein. Multi-region assessment of karyotypes, select gene copy number [74] and mitochondrial mutations [75] suggested significant heterogeneity in DCIS but its clinical significance and association with other progression risk factors has not been assessed. Copy-number heterogeneity has also been observed via single-cell sequencing from frozen DCIS patient specimens, albeit in the presence of an invasive lesion [15]. A similar approach has been developed in archived tissue specimens, but none have been used in large studies or in pure DCIS [76]. While single-cell sequencing may be able to scale up—both number of cells and number of samples, it cannot yet identify point mutations with high sensitivity. Hence, in the context of a clinical study, multi-region sequencing enabled by LCM may be preferable as it would increase the accuracy of the clonal evolution and enable the identification of driver mutations and mutational signatures [77]. Moreover, the results of multi-region genomic profiling would enable us to place somatic mutations and copy number alterations in the context of the surrounding extra-cellular matrix or stromal composition obtained via imaging and morphological studies, providing a granular view of the premalignant landscape as aspired by the pre-cancer atlas [78].

Conclusions

We present both methodological and analytical advances enabling targeted sequencing from small, archived PML lesions. Specifically, we showed that blunt-end ligation of the sequencing adapter to the fragmented and damaged DNA was key to increasing the quality of the exome sequencing from a minute amount of DNA extracted from archived specimens. A subsequent multi-caller analysis strategy, complemented by dedicated filters, identified somatic mutations, eliminating false-positive and residual germline mutations—two critical hurdles in the mutational profiling of PML. Importantly, the approach was developed on real world samples using a well-characterized test sample and further validated on micro-dissected PML regions. These advances now enable mutational profiling of PML lesions, unlocking the untapped registry of archived PML specimens to comprehensively investigate their molecular heterogeneity and assess the contribution of somatic DNA alterations to cellular proliferation and progression to invasive cancer. These tools may provide clues as to which PML should be treated to prevent invasive cancer and which can be left alone.