Background

During the last 20 years, there have been major infection outbreaks that have caused significant morbidity, including ‘spill-over events’ where a coronavirus has entered the human population from an animal host. The SARS-CoV-2 virus has caused the COVID-19 pandemic, with > 185 million individuals infected and > 4 million deaths across > 215 countries [1]. The small size of the viral genome (~ 30 kbp) allows for rapid and high throughput sequencing within health systems. The resulting data is being shared in the public domain (e.g. NCBI and GISAID), with large and growing numbers of genome sequences available (> 2 million) [2]. Phylogenetic analysis of sequence variation can provide insights into the spread and evolution of the virus, and a temporal analysis can reveal mutations that might affect diagnostic efficacy and lead to vaccine escape. Potential transmission networks to inform infection control can be revealed through matching genomes, and clade-defining mutations can be found. The detection of amino-acid changes leading to increased virus fitness, maintained by sweeps increasing mutation frequencies (“positive selection”), can reveal sites that are functionally important. Important genotype–phenotype associations (e.g., for virulence) may be identified through finding signatures of convergent evolution, where the causal mutation occurs independently in unrelated branches of the phylogenetic tree. Despite there being available tools for the online visualisation of SARS-CoV-2 virus genetic diversity [2, 3] and selection analysis [4], timely phylogenetic analysis is computationally difficult with large numbers of sequenced isolates. Further, infection control sequencing hubs urgently need bioinformatic pipelines to accurately process sequence data into mutations, providing insights into local circulating viruses and to inform on importation and diagnostic robustness. To enable the processing of sequencing data, we have designed command-line and webserver tools (“COVID-Profiler”) to automate the analysis, visualisation and profiling of SARS-CoV-2, thereby assisting COVID-19 control decision making.

Implementation

The COVID-Profiler command-line tool has a number of functions, including variant characterisation, isolate profiling, primer design and evaluation, and phylogenetic analysis. For profiling, fastq files (compressed with gzip or uncompressed) are trimmed using trimmomatic [5] and aligned against the reference genome (NC_045512) using bwa-mem software [6]. Variant calling is performed using bcftools software [7]. User-defined depth thresholds can filter out variants. Remaining variants are annotated with their functional effect (e.g., gene, amino acid change) using a combination of bcftools csq and custom scripts, which account for the ribosomal slippage. A final genome is generated with bcftools consensus, and positions with low coverage [8] are masked. For fasta input, sequences are aligned to the reference using minimap2 [9] and variants are called with paftools.js software. Variant outputs are in files of vcf format. Users can aggregate data across several runs to create a table of mutations in text format. Primer conservation scores (e.g., for diagnostics) are calculated using blast, and an in-house script estimates the number of mismatches. Choropleth maps are rendered using plotly software. Sequence logo plots are generated using seqlogo software (github.com/betteridiot/seqlogo). Sequences are aligned using mafft software [10] and alignments are masked on non-coding ends. Phylogenetic trees are constructed using iq-tree software [11] with the best model found by iq-tree ModelFinder. The webserver allows for the online processing and profiling of uploaded isolate sequences. It is developed using the flask micro web framework. Users upload data in fastq or fasta format. All output files are available to download, and include identified mutations presented in a table format. Alignments (“bam” files) are visualised using pileup.js [12]. Mutations in those proteins that have been structurally characterised are visualised using a bio-pv.js (biasmv.github.io/pv/) library. Flowcharts depicting the bioinformatic pipeline and its inputs and outputs can be found in Additional file 1: Figure S1.

Results

To highlight the utility of the tool, we sought to characterize the diversity in a large dataset of 561,002 sequences from the GISAID database. Sequences were selected using a random sample of isolates between 24th December 2019 and 6th July 2021 allowing a maximum of 20,000 isolates per country. Using COVID-Profiler, the sequences were aligned, and 59,940 SNPs were identified (42,734 non-synonymous; 17,206 synonymous changes). We then aimed to characterize whether this diversity affected the performance of PCR diagnostics used by public health authorities, and whether it has been influenced by selective pressure.

Diagnostic performance

Rapid and accurate detection of infections through diagnostics is central to infection control. We sought to characterise the conservation of primer binding sites (PBSs) of 12 PCR diagnostics applied worldwide (Additional file 1: Table S1). Analysis by COVID-Profiler revealed that eleven diagnostics had high conservation of PBSs, but one had a high proportion (75.1%) of isolates with >  = 1 mismatch present in the forward PBS. The sequence logo plot reveals the mismatches were located at the 5’ end, and thus might not have a severe effect on binding (Fig. 1a). The number of isolates with mutations in the PBSs was observed to increase in all continents during the timeline of the pandemic (Fig. 1b). The rapid increase in allele frequency could result from selective pressure, sampling bias or genetic drift effects.

Fig. 1
figure 1

Results from the primer conservation tool. A A sequence logo plot indicating the location of mutations in the primer binding site (PBS). B Samples with mutations in PBSs across time and continent. C Convergent evolution analysis, where each genomic position is represented by a dot with the average number of origins indicated on the y-axis. The number of isolates containing the mutation is proportional to the size of the dot

Selective pressure

COVID-Profiler was used to determine whether the mutations were under selective pressure. Specifically, the dataset was used to find convergent evolution, which is defined by multiple independent origin of a mutation in phylogenetic space and is a signature of selective pressure. Across 1,000 independent subsets of the data (N = 500), we counted the number of independent origins of mutations within each subset. Most mutations (n = 49,443, 82.5%), had an average occurrence of at most one (Fig. 1c). However, several mutations had substantially higher counts with the top 10 mutations having occurrences greater than 6. These include mutations in the genes S (E484K/Q, L5F, N501T and P681H/R), N (M234I), nsp6 (L37F, F108L), orf8 (L68*, F120L) and the 3C-like proteinase (L90R). Of the mutations found in the spike protein (S), E484K and N501T have been classified as variants of concern with a high confidence of conferring an antigenic change (http://sars2.cvr.gla.ac.uk/cog-uk/). Interestingly, the Nucleocapsid protein (N) M234I mutation is located in a position which has been found to have immunological significance both in epitope prediction studies as well as in-vitro epitope mapping experiments[13, 14] as well being linked to isolates with false negative antigenic tests [15]. These observations demonstrate the utility in the detection of convergent evolution and more broadly the analysis of genomic diversity and tools which support this analysis.

Conclusions

The online COVID-Profiler tool will assist with the important analysis and profiling of SARS-CoV-2 viral sequences within clinical and infection control settings. With the vaccine rollout, it is important to use tools like COVID-Profiler to monitor mutations and the selective pressure driving their patterns of evolution. The tool can be extended to include additional functionality (e.g., assign strain types), as well as be modified for the analysis of other (re-) emerging pathogens.