Background

In December 2019, a novel viral pneumonia emerged from a seafood market in Wuhan, China, and was later found to be caused by a new type of coronavirus, now known as Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) [1, 2]. On 11 March 2020, after approximately 118,000 cases had been reported globally, the World Health Organization (WHO) declared the outbreak a global pandemic [3, 4]. The pandemic is ongoing and requires continuous surveillance, with 270,031,622 cases confirmed globally as of 14 December 2021 [3, 5].

Sequencing of SARS-CoV-2 allowed for the rapid identification of the virus and the development of diagnostic tests and other tools for a rapid response to the pandemic [6]. Sequencing provides genotypic information about a patient’s infection, which can be used to gain knowledge on the specific infecting strain, assist in identifying transmission within communities, and advance the development of new diagnostic methods, vaccines, and antivirals [7]. Multiple sequencing technologies have been used for SARS-CoV-2, including Sanger sequencing and next generation sequencing (NGS) platforms such as Illumina, Ion Torrent, and Oxford Nanopore Technologies (ONT) [8]. However, Illumina sequencing remains the most commonly used technology [9]. As of 05 November 2021, 4,892,742 SARS-CoV-2 consensus genomes had been deposited into the Global Initiative on Sharing All Influenza Data (GISAID), with over 65% from Illumina and approximately 25% from ONT [10].

A major challenge with whole-genome sequencing (WGS) is obtaining whole viral genomes from clinical samples promptly [11]. Illumina SARS-CoV-2 sequencing is generally limited by long sequencing times and by the high cost and labour associated with library preparation for high-throughput sequencing [12]. Another limitation is its relatively short reads (2 × 300 bp): genomes generally contain multiple repeated sequences, known as tandem repeats, that may be longer than the NGS reads and may result in gaps and misassemblies [13]. The large footprint of most sequencers also limits portability, which is unfortunate as there is generally a large distance between sample collection sites and sequencing laboratories [14]. Nanopore sequencers overcome these challenges: they sequence in real time, produce long reads, are portable, and require a relatively low initial investment in sequencing equipment, with the MinION costing $1000 [15]. ONT sequencing is, however, limited by a high number of false negatives and low sensitivity [16].

Short-read sequencing technologies are useful for population-level genetic analysis and clinical variant discovery as they provide low-cost, high-accuracy data when done in large batches. Long-read sequencing approaches, however, are well suited for de novo genome assembly, sequencing of genomes with long repetitive regions, copy number alterations, and complex structural variations [17]. Several studies have compared the sequencing of SARS-CoV-2 between Illumina and ONT platforms and have shown that despite the high error rates observed with ONT sequencing, highly-accurate SARS-CoV-2 consensus genomes can be achieved [18]. ONT sequencing, however, failed to detect short indels identified by Illumina sequencing [18]. There has also been a lower raw-read accuracy with nanopore sequencing when compared to Illumina sequencing [18, 19].

A comparison of SARS-CoV-2 WGS genomic coverage and variant detection between Illumina and Nanopore sequencing is necessary as it allows us to determine whether SARS-CoV-2 genomes produced by Nanopore sequencing can be reliably used for genomic surveillance and the development of diagnostic measures. As SARS-CoV-2 lineages differ by geographic location, this study aimed to determine whether Nanopore sequencing is a viable alternative to Illumina sequencing for rapidly identifying SARS-CoV-2 variants found within African countries. We hypothesize that Nanopore sequencing will produce consensus genomes that are comparable to consensus genomes produced by Illumina sequencing at a faster rate. SARS-CoV-2 sequencing results, for multiple runs, from the Illumina MiSeq and the ONT GridION were compared and although Nanopore sequencing was able to produce complete SARS-CoV-2 genomes, the quality observed was not as good as those obtained with Illumina sequencing. The ONT GridION can sequence up to 5 flowcells with 96 samples in a single run and is cheaper than sequencing with the Illumina MiSeq. These advantages can allow for more clinical facilities to sequence SARS-CoV-2 allowing for a greater response to the COVID-19 pandemic.

Results

Comparison of sequencing performance

To compare sequencing performance and runtime between the MiSeq and the GridION, Run116 was sequenced on both platforms (Table 1). A total of 93 samples were sequenced and 93 consensus genomes were produced after assembly using Genome Detective. The sequencing runtime for the MiSeq was 36 h, whilst the GridION had a runtime of 21 h. The MiSeq had an overall higher average coverage than the GridION, having coverages of 94.34 and 72.96%, respectively. There was also a higher number of consensus genomes that passed the QC used for GISAID submissions (> 80% genome coverage) from the MiSeq, 83 (89.2%), than the GridION, 29 (27.9%). The average coverage across the genome for the GridION (Fig. 1-A) was less uniform than that of the MiSeq (Fig. 1-B).
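The summary statistics reported above (average coverage and the share of genomes passing the > 80% coverage threshold) can be reproduced from per-sample coverage values along the following lines. This is a minimal sketch: the function name `gisaid_qc_summary` and the example values are hypothetical and are not data from this study.

```python
def gisaid_qc_summary(coverages, threshold=80.0):
    """Return (mean coverage, number passing, percent passing).

    A genome passes when its coverage is strictly greater than the
    threshold, mirroring the "> 80% genome coverage" rule in the text.
    """
    if not coverages:
        raise ValueError("no coverage values supplied")
    passed = sum(1 for c in coverages if c > threshold)
    mean_cov = sum(coverages) / len(coverages)
    return mean_cov, passed, 100.0 * passed / len(coverages)


# Hypothetical per-sample genome coverages in percent (illustration only).
example = [99.1, 95.4, 82.0, 80.0, 61.3, 35.2]
mean_cov, n_pass, pct_pass = gisaid_qc_summary(example)
print(f"mean={mean_cov:.2f}% passed={n_pass} ({pct_pass:.1f}%)")
```

Note that a sample at exactly 80.0% does not pass under a strict "greater than" reading of the threshold.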

Table 1 Comparison of sequencing Run116 on both the MiSeq and the GridION
Fig. 1

Comparison of GridION and MiSeq gene mapping for RUN116: Sequencing files from both the Illumina MiSeq and the ONT GridION were assembled using Genome Detective and average coverage across the 15 known genes was calculated to determine the sequencing coverage across the genome

Comparison of consensus genome quality of Nanopore and Illumina sequencing

Consensus genomes produced by the GridION and the MiSeq were uploaded to Nextclade to determine the genome quality. Nextclade classifies genomes as either good, mediocre, or bad, based on the amount of missing data, and the number of mixed sites, private mutations, clustered mutations, frameshifts, and misplaced stop codons. Both the GridION and the MiSeq had a total of 14 runs with 1255 and 1183 consensus genomes, respectively. The total number of consensus genomes produced by the GridION and the MiSeq was significantly different (p = 0.0053). The number of genomes the two platforms classified as good (p = 0.00280), mediocre (p = 0.00250), and bad (p = 0.00037) also differed significantly (Fig. 2).

Fig. 2

Comparison of consensus genome quality obtained from the GridION and the MiSeq and analyzed on Nextclade: To compare the quality of consensus genomes obtained from the GridION and the MiSeq, consensus genomes from both platforms were uploaded to Nextclade and the results plotted on a double bar graph. Genome quality was broken down into three groups; good, mediocre, and bad, with the GridION represented in blue and the MiSeq represented in orange. Statistical significance (Wilcoxon rank sum tests) is represented by “*” (**: p < 0.01, ***: p < 0.001). Sequencing scores ranging between 0 and 29 are classified as good, 30 – 99 are classified as mediocre, whilst 100 and above are classified as bad
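The score bands given in the caption (0 to 29 good, 30 to 99 mediocre, 100 and above bad) translate directly into a small classifier. The sketch below assumes the overall QC score has already been exported for each genome; the function name and the example scores are hypothetical.

```python
from collections import Counter


def classify_qc_score(score):
    """Map an overall QC score to the band described in the caption."""
    if score < 0:
        raise ValueError("QC score cannot be negative")
    if score < 30:
        return "good"
    if score < 100:
        return "mediocre"
    return "bad"


# Hypothetical scores for one run (illustration only).
scores = [0, 12.5, 29, 30, 64, 99, 100, 350]
tally = Counter(classify_qc_score(s) for s in scores)
print(dict(tally))
```

Tallying the bands per run in this way gives the per-run counts that the Wilcoxon rank sum tests in the figure compare between platforms.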

Comparison of genome coverage generated by the GridION and MiSeq

Identical samples (RUN116) were sequenced on both the GridION and the MiSeq and the genomic coverage was compared to determine the effect of sample quality on sequencing (Fig. 3-A). All the runs for both platforms were then compared (Fig. 3-B). A total of 86 consensus genomes were used from RUN116 after removing genomes with more than 100 mutations. Samples run on the MiSeq had a significantly greater genome coverage than the GridION (p = 8.1e-16). GridION genomes ranged from 35 to 100%, whilst MiSeq genomes ranged from 80 to 100%. The consensus genome coverage for all runs, 2351 genomes, was then compared. There was a significantly higher overall genome coverage observed with the MiSeq than with the GridION (p < 2.2e-16).

Fig. 3

Comparison of GridION and MiSeq genome coverage: Fastq files for RUN116 from both the MiSeq and the GridION were assembled using Genome Detective and the consensus genome coverage was compared (A). The same was done for all genomes for both platforms (B). GridION samples are presented in purple, whilst Illumina MiSeq samples are presented in red. Statistical significance (Wilcoxon rank sum tests) is represented by “*” (****: p < 0.0001)

Comparison of ORF1ab- and S-gene coverage for GridION and MiSeq sequencing

To compare the depth of coverage of the ORF1ab- and S-gene for the GridION and the MiSeq, fastq files produced from both platforms were assembled on Genome Detective to produce consensus genomes. The results for each consensus genome were obtained and the coverages for the ORF1ab-gene (Fig. 4-A) and S-gene (Fig. 4-B) were compared. All 14 runs for each platform were compared and Wilcoxon rank sum tests were performed. The ORF1ab-gene coverage ranged from 35 to 100% for the GridION and 80 – 100% for the MiSeq. The S-gene coverage ranged from 25 to 100% for the GridION and 80 – 100% for the MiSeq. There was a statistically significant difference in coverage for both genes on the GridION and the MiSeq with p = 1.2e-15 (RUN116) and p = 1.7e-15 (all genomes).

Fig. 4

Comparison of ORF1ab- and S-gene coverage on the GridION and the MiSeq: Fastq files produced by both platforms were assembled on Genome Detective and the coverage for the ORF1ab- (A) and S-gene (B) was compared. Consensus genomes from the GridION are represented in orange and genomes from the MiSeq are represented in blue. Statistical significance (Wilcoxon rank sum tests) is represented by “*” (****: p < 0.0001)

Effect of Ct score on sequencing using the GridION and MiSeq

A correlation analysis was performed to determine the effect of Ct score on genome coverage (Fig. 5) and on the number of reads produced by the GridION and the MiSeq during sequencing (Fig. 6). Because Ct scores were available for only some runs, three runs were used for each platform. Run101 (35 samples), Run111 (91 samples), and Run123 (64 samples), represented by graphs A, B, and C, respectively, were used for the GridION. Run100 (68 samples), Run109 (54 samples), and Run122 (88 samples), represented by graphs D, E, and F, respectively, were used for the MiSeq. A negative correlation was observed between Ct score and genome coverage for all six runs. The GridION’s Runs 101, 111, and 123 had correlation coefficients of R = −0.88 (p = 4.5e-12), R = −0.45 (p = 7.2e-06), and R = −0.31 (p = 0.012), respectively. The MiSeq’s Runs 100, 109, and 122 had correlation coefficients of R = −0.35 (p = 0.0039), R = −0.19 (p = 0.18), and R = −0.33 (p = 0.0017), respectively. For the number of reads, a strong and significant negative correlation with Ct score was observed for all GridION runs, whereas among the MiSeq runs only Run122 showed a significant negative correlation; Run100 and Run109 showed non-significant correlations.
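The figure captions state that Spearman's rank correlation was used for these comparisons. As an illustration of how such a coefficient is obtained, the sketch below ranks both variables (averaging ranks for ties) and computes the Pearson correlation of the ranks. It uses only the Python standard library, does not compute a p-value, and the Ct and coverage values are hypothetical rather than taken from this study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical Ct scores and genome coverages (percent) for five samples.
ct = [18.2, 22.5, 25.0, 29.7, 33.1]
coverage = [99.5, 98.0, 91.2, 74.0, 40.3]
rho = spearman_rho(ct, coverage)
```

In this toy example coverage falls monotonically as Ct rises, so the coefficient is at its negative extreme; real runs would give intermediate values such as those reported above.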

Fig. 5

Correlation between genome coverage and Ct score for samples sequenced on the GridION and MiSeq: A correlation was performed to determine the effect of Ct score on the consensus genome coverage obtained from the GridION and the MiSeq. Genome coverage was plotted on the y-axis, whilst the sample’s average Ct score was plotted on the X-axis. GridION runs are represented by graphs A (Run101), B (Run111), and C (Run123), which are represented as green, blue, and red, respectively. MiSeq runs are represented by graphs D (Run100), E (Run109), and F (Run122) and are represented as black, purple, and gold, respectively. Statistical significance (Spearman’s rank correlation test) is represented by “*” (ns: non-significant, *: p < 0.05, **: p < 0.01, ***: p < 0.001, ****: p < 0.0001). For both platforms, as the Ct score increased, there was a decrease in genomic coverage

Fig. 6

Correlation between the number of reads produced during sequencing and sample Ct Score: A correlation was performed for the number of reads produced by the GridION and the MiSeq and Ct score for SARS-CoV-2 samples. The number of reads was plotted on the Y-axis, whilst each sample’s average Ct score was plotted on the X-axis. GridION runs are represented by graphs A (Run101), B (Run111), and C (Run123) and are shown as green, blue, and red, respectively. MiSeq runs are represented by graphs D (Run100), E (Run109), and F (Run122) and are shown as black, purple, and gold, respectively. Statistical significance (Spearman’s rank correlation test) is represented by “*” (ns: non-significant, ****: p < 0.0001). An increase in Ct score resulted in a decrease in the number of reads produced for all GridION runs and 1 Illumina MiSeq run (Run122)

Mutation analysis

To determine whether the number of mutations detected by the GridION and the MiSeq differed significantly, the number of mutations detected for each sample was compared for Run116 (Fig. 7-A) and for all the runs (Fig. 7-B). The total numbers of insertions, deletions, and substitutions detected by both platforms were also compared for Run116 (Fig. 7-C) and for all the runs (Fig. 7-D). A total of 181 consensus genomes obtained from the GridION and the MiSeq for Run116 were analyzed, and a significant difference was noted in the number of mutations detected by each platform (Wilcoxon, p = 3.7e-08), with a greater number of mutations detected by the MiSeq (8 – 96 mutations) than by the GridION (6 – 56 mutations). We also noted a significant difference (Wilcoxon, p = 1.5e-09) between the number of mutations detected in the genomes obtained from the MiSeq (1183 genomes) and the GridION (1255 genomes). There was a significant difference in the number of insertions (Wilcoxon, p = 8.2e-04) and substitutions (Wilcoxon, p = 5.3e-06) detected by the two platforms for Run116. However, when all runs were analyzed, only the number of insertions was significantly different between the two platforms (Wilcoxon, p = 7.5e-15).

Fig. 7

Analysis of mutations in samples sequenced on the GridION and the MiSeq: Consensus genomes produced by Genome Detective were uploaded to Nextclade and the results were analyzed. Run116 was run on both platforms and the number and type of mutations detected by each platform were compared using a Wilcoxon rank sum test (panels A and C). A consensus file for all runs, for each platform, was produced and uploaded to Nextclade, and a Wilcoxon rank sum test was performed to compare the number and type of mutations detected by both platforms (panels B and D). GridION samples are represented in yellow, whilst MiSeq samples are represented in green. Deletions, insertions, and substitutions are represented in pink, green, and blue, respectively. Statistical significance (Wilcoxon rank sum tests) is represented by “*” (ns: non-significant, ***: p < 0.001, ****: p < 0.0001)

Phylogenetic analysis

To determine whether there was a difference in the phylogenetic inference between consensus genomes generated by the GridION and the MiSeq, Run116 samples were sequenced on both platforms. A total of 93 consensus genomes from both the GridION and the MiSeq were uploaded to Nextclade and the results were compared. Of the 93 samples, 27 samples were classified within different clades (Table 2). A phylogenetic tree of the 27 samples was then created using IQTREE and visualized using FigTree (Fig. 8). Of the 27 samples, only one sample, highlighted in blue, was grouped on the same branch.

Table 2 Comparison of the genome coverage and assigned clade for run116 samples on Nextclade
Fig. 8

Phylogenetic comparison between identical samples sequenced using both the GridION and MiSeq: A phylogenetic tree was created using IQTREE and visualized using FigTree for samples from Run116 sequenced on both the GridION and the MiSeq but classified in different clades by Nextclade. Only one of the 27 samples, represented in blue, clustered on the same branch. GridION genomes are annotated as ‘barcode*’, whilst MiSeq genomes are annotated as ‘K0*’

Table 2 highlights the 27 samples that were sequenced on both the MiSeq and the GridION but were classified in different clades by Nextclade. Clades identified by the GridION include 20A (n = 1), 20C (n = 22), and 20H (Beta, V2) (n = 4). Clades identified by the MiSeq include 20A (n = 20), 20C (n = 3), 20D (n = 1), and 20H (Beta, V2) (n = 3). There was also an overall higher genomic coverage for sequences from the MiSeq when compared to the GridION.

Discussion

SARS-CoV-2 has caused a global health crisis as it is highly infectious and risks mutations that could result in more lethal variants [1, 20]. A major factor in helping curb the spread of the virus and decreasing the infection rate is rapidly sequencing the virus to detect new strains and identify transmission chains [7]. The sequencing runtime on the MiSeq for Run116 was 36 h, whilst on the GridION it was 21 h. This 15-h decrease in sequencing time, together with the GridION's capacity of 5 flowcells with 96 samples each, allows up to 480 samples to be sequenced in a 21-h run on the GridION, in comparison to the 96 that can be sequenced on the MiSeq every 36 h. This is in agreement with reports that nanopore sequencing takes approximately 20 h, as a rapid library prep kit supplied by ONT can be used [21, 22]. The lack of an image analysis step during nanopore sequencing facilitates real-time base-calling, which allows for the rapid detection of DNA for pathogen screening from clinical samples [23].
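The throughput comparison can be made explicit with a short calculation. This is a back-of-the-envelope sketch that uses only the capacities and runtimes quoted in this paper; it assumes runs are started back to back and ignores library preparation and instrument turnaround time, which the paper does not quantify.

```python
def samples_per_hour(samples_per_run, run_hours):
    """Sequencing throughput for one instrument, in samples per hour."""
    if run_hours <= 0:
        raise ValueError("run_hours must be positive")
    return samples_per_run / run_hours


# GridION: up to 5 flowcells with 96 barcoded samples each, 21 h per run.
gridion_per_run = 5 * 96
gridion_rate = samples_per_hour(gridion_per_run, 21)

# MiSeq: 96 samples per run, 36 h per run.
miseq_rate = samples_per_hour(96, 36)

print(f"GridION: {gridion_per_run} samples/run, {gridion_rate:.1f} samples/h")
print(f"MiSeq:   96 samples/run, {miseq_rate:.1f} samples/h")
print(f"Ratio:   {gridion_rate / miseq_rate:.2f}x")
```

Under these assumptions the difference comes mostly from running five flowcells in parallel rather than from the shorter runtime alone.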

Studies have shown that Illumina sequencing may still be the most accurate way to sequence viruses [24]. The majority of errors noted between Nanopore and Illumina consensus genomes have been attributed to Nanopore sequencing errors [25]. Run116 samples were sequenced on both platforms to determine whether there was a significant difference in the sequencing coverage regardless of the sample. Consensus genome coverage was significantly greater with the MiSeq when compared to the GridION and this result was also observed when comparing all sequence runs. Genomic coverage can be affected by sequencing time and thus GridION coverage may have increased if left to sequence for longer. We also note a statistically significant higher sequencing coverage for the S-gene and ORF1ab-gene with the MiSeq than with the GridION. Nanopore technology has been shown to provide lower per-read sequencing coverage when compared to short-read sequencing [26]. Coverage biases seen with ONT’s sequencing protocol can be a result of truncated reads caused by pore blocking or fragmentation during library prep as transcripts are sequenced from the 3′ to 5′ end [27]. ONT has made error correction tools such as Nanopolish available to try and reduce the error rate observed with Nanopore sequencing [28]. In this study, variant calling was achieved using Nanopolish but we still note a significantly lower genome quality obtained from the GridION than the MiSeq. These low-quality genomes cannot be used to confidently acquire information on the infecting viral strain and are generally removed through a series of quality control checks [29]. Although more consensus genomes can be produced using the GridION than the MiSeq, the low-quality genomes which are removed would eliminate the advantage of having a large number of consensus genomes produced. 
It should be noted that the quality and coverage of consensus genomes from the ONT GridION can be increased by pooling fewer samples per flowcell, as the reads and data produced will then be shared across a smaller group of samples.

Although Bull et al. 2020 showed that Nanopore sequencing was able to produce high-quality consensus genomes, the SARS-CoV-2 variants available for their analysis may not have been as diverse as the variants analysed in this study. This may have been due to both the number and the diversity of the samples: 157 samples were used in that study, all of which came from Wales and Metropolitan Sydney. Furthermore, the samples were collected between March and April 2020, when the variants in circulation were likely less diverse than those in samples collected from different African regions over a 1-year time frame, as in this study.

Higher genomic coverage for the Illumina MiSeq has been associated with lower Ct scores [30]. Ct score is a value that refers to the number of cycles required to amplify viral RNA to a detectable level. There is therefore an inverse relationship between Ct score and viral load [31]. In this investigation, we also noted an inverse relationship between Ct score and genome coverage for both GridION and MiSeq sequencing. There is, however, a significantly stronger negative correlation seen with the GridION than the MiSeq, which may imply that the MiSeq’s sequencing capabilities are less affected by sample Ct score and as a result, can be used for sequencing of samples within the early stages of infection when viral load is still low. This was, however, limited by not having the same runs to compare between the GridION and the MiSeq. Further analysis is required as the number of samples analyzed for each run was low and inconsistent due to the availability of Ct scores received with sample metadata. Additional analyses should be conducted to understand characteristics such as coverage bias, sequence biases, and reproducibility for the GridION sequencing platform [26]. Sample quality may also have an effect on sequencing and thus it is very important to maintain a cold chain during storage of swabs and RNA.
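The inverse relationship between Ct score and viral load can be illustrated numerically. The sketch below assumes idealized PCR in which the template doubles every cycle (100% efficiency) and that both samples were run on the same assay; neither assumption is verified in this study, which used three different RT-PCR assays, so the function is illustrative only.

```python
def relative_template(ct_a, ct_b, efficiency=1.0):
    """Fold difference in starting template of sample A relative to sample B.

    With efficiency = 1.0 the template doubles each cycle, so a sample that
    crosses the detection threshold n cycles earlier started with 2**n times
    more template.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return (1 + efficiency) ** (ct_b - ct_a)


# A sample with Ct 20 compared with a sample with Ct 30 (hypothetical values).
fold = relative_template(20, 30)
print(f"Ct 20 vs Ct 30: about {fold:.0f}-fold more starting template")
```

A ten-cycle difference therefore corresponds to roughly a thousand-fold difference in input under these assumptions, which is consistent with high-Ct samples yielding fewer reads and lower coverage.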

Identifying mutations involves aligning a consensus genome to a reference genome and identifying changes within the consensus genome. This is important, as it allows us to identify gene variants that may play a major role in the diagnosis of diseases [32]. It has been shown that long-read sequencing platforms have a high error rate, which is mostly indels that are assumed to be randomly distributed within each read [33, 34]. Prediction and interpretation of protein sequences may, therefore, be critically affected due to frameshifts and premature stop codons that may be introduced by the indels [35].

There was a significantly greater number of mutations detected by the MiSeq than the GridION for identical samples sequenced on both platforms. Although Nanopore platforms have been shown to make a large number of indel errors, in this study the MiSeq had a significantly higher number of insertions than the GridION. Paired-end sequencing, utilized by Illumina MiSeq, produces twice the number of reads, for the same sample and library preparation efforts, as single-end sequencing. This allows for a more accurate read alignment and detection of indel variants [36]. Short read lengths have been shown to hinder the assignment of reads to parts of the genome that are complex, phasing of variants, resolving regions that are repeated, and the introduction of gaps and ambiguous regions in de novo assemblies. Longer reads can be used for sequencing of extended repetitive regions, allowing for the identification of mutations that are generally associated with disease [37]. The higher number of indels noted with GridION sequencing highlights that genomic surveillance using Nanopore sequencing should be conducted cautiously as incorrect information on a viral strain can be obtained.

The rapid increase in COVID-19 cases has been linked to different SARS-CoV-2 viral lineages [38]. Viral lineages are separated based on the number and type of mutations they contain that differ from the parent strain [39]. From the 93 consensus genomes analyzed from both platforms, 27 genomes were classified within different clades. These genomes had unique mutations and the clade differences noted between the two platforms were 20A – 20C and 20C – 20H(Beta, V2). As the number of indels and substitutions produced by the MiSeq and the GridION were significantly different, we can expect there to be differences in clade classifications as viral clades are subject to viral-defining mutations [20]. Table 2 shows that genomes from the GridION have lower coverages than genomes from the MiSeq. This may be one of the factors causing a difference in the clade assignment as errors arising from the amplification and sequencing process may result in incomplete genome coverage, which affects phylogenetic inference [40]. Rambaut et al., 2020 suggests that new lineages should only be proposed if the genome coverage exceeds 70% of the coding region. Degradation of RNA can result in the introduction of mutations, which may cause a variant change [41]. The GridION library for RUN116 was prepared simultaneously with that of the MiSeq and the amount of RNA used is also lower. Therefore, we can eliminate the possibility of RNA degradation and RNA input amount as factors that may have caused a difference in the variants called by each instrument. Lineages identified by the GridION need to be further analyzed to determine whether the mutations are valid or are a result of sequencing errors. Accurate identification of lineages can assist in identifying transmission chains and allow for the development of diagnostic methods and treatments [38].

Conclusions

The results of this study show that the ONT GridION is less suited to SARS-CoV-2 genomic surveillance than the Illumina MiSeq but can be used to produce consensus genomes from high-quality samples with low Ct scores. Healthcare facilities can, however, use ONT sequencing platforms to rapidly diagnose patients, as the GridION can sequence up to 480 samples every 21 h. This may allow for the identification and isolation of infected individuals, thus helping to stop the spread of the disease.

Methods

Study population

The study population consisted of male and female COVID-19-positive patients whose nasopharyngeal swabs were sent from routine PCR diagnostic services to the KwaZulu-Natal Research Innovation and Sequencing Platform (KRISP) for genomic surveillance. A total of 2608 COVID-19-positive nasopharyngeal swabs from 28 sequencing runs, split evenly between the GridION and the MiSeq, were used. Samples were randomized and came from South Africa, Angola, Malawi, Mozambique, and Zimbabwe.

Real-time PCR assays

Sample Ct scores were taken from the metadata files accompanying the samples submitted for sequencing. Three RT-PCR assays were used for these samples: the Seegene Allplex™ 2019-nCoV Assay, the Roche Cobas® SARS-CoV-2 Qualitative assay, and the Thermo Fisher TaqPath™ COVID-19 CE-IVD RT-PCR Kit.

Total nucleic acid extraction

RNA was extracted using the NA/gDNA kit on the automated Chemagic 360 system (Perkin Elmer) as per the manufacturer’s instructions. Briefly, samples were lysed using lysis buffer and proteinase K, followed by binding to silica magnetic beads. The beads were then washed to remove unbound samples, and the RNA was eluted. Extracted RNA was stored at − 80 °C before use.

Tiling PCR

Complementary DNA synthesis was performed using SuperScript IV reverse transcriptase (Life Technologies) in combination with random hexamer primers. This was followed by gene-specific multiplex PCR using the ARTIC protocol [42]. Primers were designed using Primal Scheme (http://primal.zebraproject.org/) to cover the whole SARS-CoV-2 genome. The primers generate 400 base pair (bp) amplicons with an overlap of 70 bp, tiling the 30 kilobase (kb) SARS-CoV-2 genome. PCR products were purified using AmpureXP purification beads (Beckman Coulter, High Wycombe, UK) in a 1:1 ratio and quantified using the Qubit double-strand DNA (dsDNA) High Sensitivity Assay Kit on a Qubit 4.0 instrument (Life Technologies).
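To give a sense of scale for the tiling design, the sketch below lays out idealized 400 bp amplicons with a 70 bp overlap across a genome the length of the Wuhan-Hu-1 reference (29,903 bp). It is a simplification and not the ARTIC primer scheme itself: real schemes place primers according to sequence constraints, so amplicon lengths, overlaps, and counts vary. The function name is hypothetical.

```python
import math


def tiled_amplicons(genome_len, amplicon_len=400, overlap=70):
    """Return idealized (start, end) amplicon coordinates, 0-based half-open."""
    if not 0 <= overlap < amplicon_len:
        raise ValueError("overlap must be smaller than the amplicon length")
    step = amplicon_len - overlap
    n = max(1, math.ceil((genome_len - amplicon_len) / step) + 1)
    starts = [i * step for i in range(n)]
    # Shift the final amplicon so that it ends exactly at the genome end.
    starts[-1] = max(0, genome_len - amplicon_len)
    return [(s, min(s + amplicon_len, genome_len)) for s in starts]


tiles = tiled_amplicons(29903)
print(f"{len(tiles)} idealized amplicons; first {tiles[0]}, last {tiles[-1]}")
```

Because neighbouring amplicons overlap, every position is covered by at least one amplicon, and the loss of a single amplicon produces a localized coverage gap rather than a shifted assembly.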

Illumina MiSeq library preparation and sequencing

Sequencing libraries were generated using the amplicons generated by tiling PCR as described above. Indexed paired-end libraries were prepared using the Nextera DNA Flex Library Prep Kits (Illumina) as per the manufacturer’s instructions. Briefly, amplicons were tagmented to allow for unfragmented DNA to be cleaved and tagged. Each sample was barcoded with a unique barcode using the Nextera CD Indexes (Illumina) to enable downstream pooling of all libraries. Libraries were purified and normalized to 4 nM prior to pooling. The pooled library was denatured using 0.2 N sodium acetate and then diluted to a final concentration of 8 pM. The library was spiked with 1% PhiX Control v3 (adapter-ligated library used as a control), and the libraries were sequenced using a 500-cycle v2 MiSeq Reagent Kit on the Illumina MiSeq instrument (Illumina, San Diego, CA, USA). The full details of the amplification and sequencing have been previously published [30]. Fastq files produced from Illumina MiSeq were assembled using Genome Detective (https://www.genomedetective.com/) and the coronavirus typing tool [43]. Genome detective is a web-based application that is user-friendly and is used for the assembly of known viral genomes from NGS datasets [43]. Fastq files are uploaded to the application and read quality is visualized using FastQC. Low-quality reads are then filtered and the adapters trimmed with Trimmomatic [44]. DIAMOND, a protein-based alignment method, is used to identify candidate viral reads [45]. The Swissprot UniRef90 protein database viral subset is used to improve speed and sensitivity [43]. Short reads are sorted and placed into groups and metagenomic de novo assembly is performed on each group using SPAdes for single-ended reads or metaSPAdes for paired-end reads [46]. Each group is then identified using the taxonomy ID of the lowest common ancestor of the hits identified by DIAMOND [45]. 
Blastx and Blastn are used to search for candidate reference sequences against the NCBI RefSeq virus database. The results for all detected contigs are combined by the Advanced Genome Aligner and a score is calculated by Genome Detective at the amino acid and nucleotide level. The five best scoring references for each contig are then used for the alignment [43].

ONT GridION library preparation and sequencing

Amplicons generated using the tiling PCR were prepared for nanopore sequencing using the ONT Native Barcoding Expansion Kits as per the manufacturer’s guidelines. Libraries were multiplexed on FLO-MIN106 flowcells and run on the GridION X5. Furthermore, a no-template control from the PCR amplification step was added to each plate before running. Sequencing performance was monitored, in real-time, using the MinKNOW software app. Sequencing was terminated after 21 h and the resulting reads were base-called using Guppy (4.0.14) and aligned to the Wuhan-Hu-1 reference genome (MN908947.3) using minimap2 (2.17-r941). Primer sequences were trimmed from the termini of read alignments and sequencing depth was capped at a maximum of 400-fold coverage using the ARTIC tool align_trim. Variant candidates were identified using Nanopolish [47].

Sequence analysis

Consensus genomes produced by both platforms were uploaded to Nextclade Online Tool v1.4.2 (2021-10-26) (https://clades.nextstrain.org/) for genome clade assignments, mutation calling, quality checks, and to determine the genome position on the SARS-CoV-2 phylogenetic tree. Nextclade is built on Nextalign and consists of three tools; Nextclade Web, Nextclade CLI, and Nextalign CLI, which all share the common C++ library of algorithms. Nextclade starts by performing a pairwise alignment of the query sequence to a reference sequence using Nextalign that uses a banded local alignment algorithm with affine gap-cost that are determined through seed matching. Alignment is only performed on sequences longer than 100 nucleotides by default, but this can be changed, as alignment of shorter sequences may be unreliable. Mutation calling is achieved by comparing the aligned nucleotide sequences, one at a time, with the reference nucleotide sequence. Depending on their nature, they are reported differently. The number of missing, and ambiguous bases are also reported. Nextclade places each query sequence on the reference phylogenetic tree by comparing the mutations on the query sequence with the mutations of every node and tip in the reference tree, and finding the node which has the most similar set of mutations. Clade assignment is achieved by placing sequences on a phylogenetic tree annotated with clade definitions [48]. A Maximum-likelihood (ML) tree was constructed using IQ-TREE and was visualized using FigTree v1.4.4 (https://github.com/rambaut/figtree/releases) [49]. Data visualization and statistical analysis were performed using ggplot2 v3.3.1 package and R v.4.1.1.
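The mutation-calling and tree-placement steps described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Nextclade's implementation: it takes an already aligned reference/query pair with "-" for gaps, and it places the query on the node with the smallest symmetric difference of substitution sets, a stand-in for Nextclade's actual distance calculation. All function names, sequences, and node names are hypothetical.

```python
def call_mutations(ref_aln, query_aln):
    """Compare two equal-length aligned sequences column by column."""
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    result = {"substitutions": [], "deletions": [], "insertions": [],
              "missing": 0, "ambiguous": 0}
    pos = 0  # 1-based reference coordinate of the last reference base seen
    for r, q in zip(ref_aln.upper(), query_aln.upper()):
        if r == "-":
            if q != "-":
                result["insertions"].append((pos, q))  # inserted after pos
            continue
        pos += 1
        if q == "-":
            result["deletions"].append(pos)
        elif q == "N":
            result["missing"] += 1
        elif q not in "ACGT":
            result["ambiguous"] += 1
        elif q != r:
            result["substitutions"].append(f"{r}{pos}{q}")
    return result


def nearest_node(query_subs, node_mutations):
    """Return the node whose mutation set differs least from the query."""
    q = set(query_subs)
    return min(node_mutations, key=lambda name: len(q ^ node_mutations[name]))


# Hypothetical aligned pair and reference-tree nodes (illustration only).
calls = call_mutations("ACGT-ACGTAC", "ATGTGAC-NAC")
nodes = {"nodeA": {"C2T", "G7A"}, "nodeB": {"C2T"}, "nodeC": set()}
placed = nearest_node(calls["substitutions"], nodes)
```

The example also shows why missing data matters for clade assignment: a position reported as N contributes no mutation, so a low-coverage genome can be placed on a node that lacks a clade-defining substitution.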

Statistical considerations

The non-parametric nature of the data influenced the use of a Wilcoxon test to compare the number of consensus genomes produced by the GridION and the MiSeq classified within each category of the online Nextclade sequence analysis tool. The Wilcoxon test was also used to compare the difference in genomic coverage, number, and type of mutations detected between the GridION and the MiSeq. Statistical correlations were performed between Ct score and genome coverage and Ct score and the number of reads for both platforms.
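For readers unfamiliar with the test, the Wilcoxon rank sum test (equivalent to the Mann-Whitney U test) can be sketched as follows. This standard-library version uses the normal approximation with a tie correction and no continuity correction, so its p-values will differ slightly from those of statistical packages that apply exact methods or continuity corrections; it is not the code used for the analyses in this study, and the example values are hypothetical.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test; returns (U statistic for x, p-value)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    total = n1 + n2
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0
    i = 0
    while i < total:
        j = i
        while j + 1 < total and combined[j + 1][0] == combined[i][0]:
            j += 1
        t = j - i + 1                      # size of this group of tied values
        mean_rank = (i + j) / 2 + 1        # average 1-based rank of the group
        rank_sum_x += mean_rank * sum(1 for k in range(i, j + 1)
                                      if combined[k][1] == 0)
        tie_term += t ** 3 - t
        i = j + 1
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((total + 1) - tie_term / (total * (total - 1)))
    if var_u == 0:
        raise ValueError("test is undefined when all values are identical")
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail probability
    return u, p


# Hypothetical genome coverages (percent) for two small groups of samples.
u_stat, p_value = rank_sum_test([41.0, 55.2, 63.8], [88.1, 94.6, 99.3])
```

Because the test works on ranks rather than raw values, it makes no assumption that coverage or mutation counts are normally distributed, which is the reason given above for choosing it.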

Ethics

The University of KwaZulu-Natal Biomedical Research Ethics Committee waived the requirement for informed consent and approved the study (protocol reference no. BREC/00001195/2020; project title: COVID-19 transmission and natural history in KwaZulu-Natal, South Africa: epidemiological investigation to guide prevention and clinical care). All methods were performed in accordance with the relevant guidelines and regulations. We also used de-identified remnant nasopharyngeal and oropharyngeal swab samples from patients testing positive for SARS-CoV-2 by RT–qPCR from public health laboratories in South Africa. Informed consent for study participation was not applicable for this study because de-identified (anonymous) remnant samples, which would have been otherwise discarded, were used.