Rapid-CNS2: rapid comprehensive adaptive nanopore-sequencing of CNS tumors, a proof-of-concept study

Areeba Patel1,2*, Helin Dogan1,2*, Alexander Payne3, Elena Krause1, Philipp Sievers1,2, Natalie Schoebe1,2, Daniel Schrimpf1,2, Christina Blume1,2, Damian Stichel1,2, Nadine Holmes3, Philipp Euskirchen4, Jürgen Hench5, Stephan Frank5, Violaine Rosenstiel-Goidts6, Miriam Ratliff7, Nima Etminan7, Andreas Unterberg8, Christoph Dieterich9, Christel Herold-Mende8, Stefan M Pfister10,11,12, Wolfgang Wick14, Matthew Loose3, Andreas von Deimling1,2, Martin Sill10,11*, David TW Jones10,13*, Matthias Schlesner15*, Felix Sahm1,2,10*

https://github.com/areebapatel/Rapid-CNS2). A 10 kb flank was added on either side of the target sites to ensure optimal targeting by ReadFish (155 Mb). ReadFish was run with Guppy 4.2.2 in fast basecalling mode (config dna_r9.4.1_450bps_fast). Twelve samples were sequenced using a shorter panel (Rapid_CNS_B: neuropathology gene panel flanked by 10 kb on either side, 15 Mb total) with MinKNOW's built-in adaptive sampling protocol on the GridION. Three samples were sequenced in one run on the GridION. Live basecalling was turned off for all runs.
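The flanking step described above can be sketched as a small helper that pads each target interval by 10 kb and clips at chromosome bounds. This is an illustrative sketch, not the pipeline's actual code; the BED lines, function name, and chromosome-size handling are assumptions (in practice, overlapping flanked intervals would also be merged before use).

```python
# Sketch: pad each adaptive-sampling target by 10 kb on both sides.
# Assumes simple 3-column BED input; names and data are illustrative.

FLANK = 10_000  # 10 kb on either side, as in the panel design

def flank_bed(lines, flank=FLANK, chrom_sizes=None):
    """Expand BED intervals by `flank` bp, clipping at chromosome bounds."""
    out = []
    for line in lines:
        chrom, start, end = line.split()[:3]
        start = max(0, int(start) - flank)       # never go below position 0
        end = int(end) + flank
        if chrom_sizes:                          # optional right-edge clipping
            end = min(end, chrom_sizes[chrom])
        out.append((chrom, start, end))
    return out

targets = ["chr2 209113112 209113192", "chr10 131265454 131265575"]
print(flank_bed(targets))
```

The padded intervals give ReadFish enough sequence context on both sides of each target for a confident accept/reject decision early in each read.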
Variant and copy-number calling
Samples were analyzed using a custom bioinformatics pipeline (areebapatel/Rapid-CNS2.git).
To classify nanopore-sequencing-derived DNA-methylation profiles of central nervous system tumors, an adapted ad-hoc random forest classifier based on nanoDx was established [5,6]. It was trained on the publicly available 450k methylation array reference data set of the Heidelberg methylation classifier version 11 (GSE90496) [3]. This data set was preprocessed as described in [3]. For each nanopore sample, methylation calls overlapping the top 100,000 probes (ranked by mean decrease in accuracy) derived from the Heidelberg methylation classifier were selected. These probes were variance-filtered to retain the 10,000 most variable probes. A random forest model with 20,000 trees was trained using ranger [18]. Recalibration was performed by training one-vs-all generalized linear models for each class. These models were used to obtain a prediction and confidence score for the sample. Ground truth for archival samples was inferred from EPIC array data (from FFPE samples) using Heidelberg methylation classifier v11b4 predictions.
Methylation families were inferred by aggregating over methylation subclasses from the reference set.
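The per-sample training loop above can be sketched as follows. This is a minimal illustration with simulated data, using scikit-learn in place of R's ranger/glm; the matrix shapes, tree count, and in particular the choice to recalibrate the one-vs-all logistic models on the forest's class votes are assumptions, not the published implementation.

```python
# Sketch of the ad-hoc classifier: restrict to covered probes, variance-filter,
# train a random forest, then recalibrate with one-vs-all logistic models.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy stand-ins for the 450k reference matrix (samples x probes) and labels.
X_ref = rng.random((60, 500))
y_ref = rng.integers(0, 3, 60)           # methylation (sub)class labels
sample_calls = rng.random(500)           # one nanopore sample's methylation calls
covered = rng.random(500) > 0.3          # probes actually covered by this run

# 1) Restrict reference to probes covered in this sample; 2) keep the most
#    variable ones (10,000 in the paper; 100 here for the toy data).
X = X_ref[:, covered]
top = np.argsort(X.var(axis=0))[::-1][:100]
X, x_new = X[:, top], sample_calls[covered][top]

# 3) Random forest (20,000 trees in the paper; fewer here for speed).
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y_ref)
votes = rf.predict_proba(x_new.reshape(1, -1))

# 4) Recalibration (assumed form): one-vs-all logistic models fitted on the
#    forest's class votes over the reference set.
ref_votes = rf.predict_proba(X)
scores = [LogisticRegression().fit(ref_votes, (y_ref == c).astype(int))
          .predict_proba(votes)[0, 1] for c in rf.classes_]
pred = rf.classes_[int(np.argmax(scores))]
print(pred, max(scores))
```

Because the covered probe set differs per sample, the forest is retrained ad hoc for every case, which is why the probe pre-selection described below matters for runtime and memory.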

MGMT promoter methylation
Ground truth for MGMT promoter methylation status was inferred from EPIC array analysis and pyrosequencing (6 prospective diagnostic samples). A total of 59 samples (47 Rapid_CNS_A and 12 Rapid_CNS_B) were split into 70% training and 30% validation data. In the training data, each of the 212 CpG sites in the MGMT promoter region was subjected to a Student's t-test to assess its predictive value. 137 sites with p-value <0.01 were selected. The average methylation over these sites was used to train a binomial logistic regression classifier, which was then tested on the validation samples.
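The site-selection and classification steps can be sketched with simulated data. scipy and scikit-learn here stand in for whatever the pipeline uses internally; the group sizes, beta-value distributions, and thresholds in this toy example are assumptions.

```python
# Sketch: per-site t-test between methylated and unmethylated training samples,
# then logistic regression on the average over the selected sites.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_sites = 212                                # CpG sites in the MGMT promoter region
# Simulated training groups (methylation values clipped to [0, 1]).
meth = np.clip(rng.normal(0.6, 0.2, (20, n_sites)), 0, 1)
unmeth = np.clip(rng.normal(0.1, 0.2, (25, n_sites)), 0, 1)

# Per-site t-test between groups; keep sites with p < 0.01 (137/212 in the paper).
pvals = ttest_ind(meth, unmeth, axis=0).pvalue
keep = pvals < 0.01

# Average methylation over the selected sites -> one predictor per sample.
x = np.concatenate([meth[:, keep].mean(axis=1), unmeth[:, keep].mean(axis=1)])
y = np.array([1] * len(meth) + [0] * len(unmeth))

clf = LogisticRegression().fit(x.reshape(-1, 1), y)
print(clf.predict([[0.45], [0.05]]))   # high vs low average promoter methylation
```

Collapsing the selected sites to a single average makes the final classifier a one-dimensional logistic regression, which is robust to the modest training set size.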
Shorter sequencing time and flowcell reuse
Five Rapid_CNS_B libraries (PANEL_B_01, 02, 03, 05, 06) were split based on reads generated in the first 24 h (before flushing and reloading) and those generated in the next 48 h (after flushing and reloading). These 10 sub-libraries were independently analyzed using the aforementioned pipeline.
Since flushing and reloading could also enable loading a new sample, we used this analysis to assess potential for flowcell reuse.
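Splitting a run's reads by acquisition time can be sketched as below. This assumes the FASTQ headers carry a `start_time=<ISO 8601>` tag as MinKNOW emits; the header string, function name, and timestamps are illustrative.

```python
# Sketch: assign each read to the first-24h or subsequent-48h bin using the
# start_time tag in its FASTQ header.
from datetime import datetime, timedelta, timezone

def in_first_window(header, run_start, cutoff_h=24):
    """True if the read was acquired within the first `cutoff_h` hours."""
    tag = next(f for f in header.split() if f.startswith("start_time="))
    t = datetime.fromisoformat(tag.split("=", 1)[1].replace("Z", "+00:00"))
    return t - run_start <= timedelta(hours=cutoff_h)

run_start = datetime(2021, 3, 1, 9, 0, tzinfo=timezone.utc)
hdr = "@read1 runid=abc start_time=2021-03-01T15:30:00Z"
print(in_first_window(hdr, run_start))  # read acquired 6.5 h into the run
```

Partitioning by timestamp rather than re-sequencing keeps library preparation and flowcell identical between the two bins, isolating the effect of run time.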

Analysis pipeline
The Rapid-CNS2 pipeline is available as a git repository (https://github.com/areebapatel/Rapid-CNS2). It requires a folder containing FAST5 files or a basecalled FASTQ file as input and generates SNV calls, CNV calls, MGMT promoter status and methylation classification as output (Suppl Fig 7). It can be run on an HPC cluster or a GPU workstation. Basecalling followed by SNV and CNV detection completes within 8 hours with longshot (12 hours with PEPPER-Margin-DeepVariant), while methylation calling and classification requires up to 24 hours, depending on the amount of data generated.
Live methylation calling has the potential to substantially reduce analysis time. Owing to its smaller target size, the 12 libraries sequenced using Rapid_CNS_B displayed higher on-target coverage than those with Rapid_CNS_A (Suppl Fig 1b). Key genes such as IDH1, TERT and BRAF mirrored this increase in median coverage (Suppl Fig 1c). This could also be attributed in part to the higher sequencing capacity of the GridION.

Mutational analysis
On re-evaluation of samples with discrepant mutations, it was found that the nanopore libraries PANEL_A_01 and PANEL_A_25 were generated using DNA from frozen sections with infiltration zone and low tumor cell content, whereas the corresponding NGS data were derived from bulk tumor tissue. For PANEL_A_12, the TERTp mutation was present in the nanopore data (detected by mpileup) but was filtered out by PEPPER-Margin-DeepVariant and longshot. Owing to major differences in technology (short-read NGS vs long-read nanopore sequencing), variant calling tools (mpileup/Platypus for NGS vs DeepVariant/longshot for nanopore sequencing) and tissue type (FFPE tissue for NGS vs cryoconserved tissue for nanopore sequencing), we restricted our validation to pathognomonic alterations.

Copy number analysis
Copy number plots obtained using the Rapid-CNS2 pipeline demonstrated higher resolution and clearer visualization of copy number levels than NGS panel sequencing (Suppl. Fig. 2, left and centre). Copy number variations detected by calculating the depth of mapped reads were comparable to EPIC array results (Suppl. Fig. 2, left and right). Additionally, using smaller bin sizes (1 kb, 10 kb), the genes covered by copy number variations and their zygosity could also be accurately annotated. Resolution for copy number profiling was maintained across both panels (Rapid_CNS_A and Rapid_CNS_B) (Suppl Fig 2).
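The depth-based copy-number idea can be sketched as counting mapped reads per bin and expressing each bin as a log2 ratio over the sample median. This is an illustrative sketch with simulated reads; the bin size, pseudocount, and gain region are assumptions and the pipeline's actual tooling may differ (e.g. with GC correction and segmentation on top).

```python
# Sketch: binned read-depth copy-number estimation as log2(depth / median depth).
import numpy as np

def log2_ratios(read_starts, chrom_len, bin_size=10_000):
    """Bin read start positions; return per-bin log2 ratio vs the median bin."""
    n_bins = int(np.ceil(chrom_len / bin_size))
    counts = np.bincount(np.asarray(read_starts) // bin_size, minlength=n_bins)
    counts = counts.astype(float) + 0.5          # pseudocount avoids log2(0)
    return np.log2(counts / np.median(counts))

rng = np.random.default_rng(2)
# Simulate a 1 Mb chromosome with a regional gain between 500 kb and 800 kb.
starts = rng.integers(0, 1_000_000, 5_000).tolist()
starts += rng.integers(500_000, 800_000, 2_500).tolist()   # extra reads = gain
ratios = log2_ratios(starts, 1_000_000)
print(ratios[55], ratios[10])   # a bin inside the gain vs a neutral bin
```

Shrinking `bin_size` trades noise per bin for positional resolution, which is how smaller bins let individual genes and their zygosity be annotated.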

Methylation analysis
Owing to non-uniform methylation calling, Bady's MGMT-STP27 method could not be used directly for MGMT promoter assessment [2]. Comparing the average methylation over each site in methylated vs unmethylated samples revealed that a number of sites had poor predictive value (Suppl. Fig. 3a).
Averaging the methylation values over the 137 selected CpG sites showed a clear difference between methylated and unmethylated samples (Suppl. Fig. 3b). The test samples (yellow points) were concordant with the threshold derived from the training data (25% cutoff). The logistic regression classifier trained on the training samples made accurate predictions for all testing samples (72 h runs) (Suppl Fig 3c).
Of the 10,000 probes selected for training in each sample, only ~1,400 probes were common to all samples (Suppl Fig 3d). The out-of-bag error of the ad-hoc classifiers ranged from 0.18 to 0.20 across samples (Suppl Fig 3e).
Data obtained from a 72 h run usually covered >300,000 probes from the 450k array. Loading the entire overlapping probe set from the training data for re-training required considerable memory. Methylation classification in Rapid-CNS2 therefore infers CpG importance from the Heidelberg methylation classifier, selecting sites from its top 100,000 probes to re-train the ad-hoc classifier.
This considerably reduced the time and memory required for methylation classification. On average, methylation classification (including I/O) took 10 minutes with 16 threads.
Complete scores and comparison with EPIC array analysis are reported in the Suppl. Table 2.

Diagnostic samples
Six prospective samples were run using Rapid-CNS2. As shown in Suppl. Table 4, five of these were glioblastomas, while one was classified as an IDH-mutant, 1p/19q-codeleted oligodendroglioma. Histopathological diagnoses for all samples were confirmed by their Rapid-CNS2 molecular profiling.
Rapid-CNS2 also accurately profiled PANEL_B_08 as an ATRT-SHH tumor, which had been reported as a glioma by histology (Suppl. Fig 4).
Flowcell reuse and shorter sequencing times
Five libraries run using Rapid_CNS_B were each split into two sub-libraries, containing reads generated in the first 24 h and those generated in the subsequent 48 h after flushing and reloading, respectively.
Flushing the flowcell depletes the previously loaded sample. We therefore considered data generated after reloading the flowcell to be equivalent to loading a new library on a previously used flowcell. Since the same sample was loaded again, this avoided any sample- or flowcell-related bias. As shown in Suppl. Fig. 5a, each split library contained >5 million reads. While there was no clear trend in the number of reads generated in the first 24 h vs the next 48 h, mean on-target coverage for all libraries was similar to that observed with Rapid_CNS_A libraries sequenced for 72 h (Suppl. Fig. 6c, 1a). As shown in Suppl. Fig. 6b and d,