Introduction

The coronavirus disease 2019 (COVID-19) pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) will soon pass its three-year point. Despite mass vaccination with boosters in various countries, vaccines seem ineffective in curbing community transmission, as vaccine breakthrough is a common phenomenon observed in many parts of the world [1,2,3]. It is believed to occur due to the rapid evolution of SARS-CoV-2, which has led to the generation of many variants. The most recent and dominant variant is Omicron [1, 4, 5]. This variant has swiftly diverged into at least eight clades based on the data available in GISAID (https://nextstrain.org). According to the WHO Technical Advisory Group on SARS-CoV-2 Virus Evolution, some lineages, such as XBB and BQ.1, are a cause of concern because of their potential to bring on a new wave of cases and fatalities (www.who.int).

To predict the probable impact of these genetic variants, comparisons of whole-genome sequences of representatives of each clade or lineage are of global interest. Because variation between clades and lineages is to be expected, the insertions/deletions (indels) and substitutions that are conserved relative to the original SARS-CoV-2 strain are the important genetic features of new clades/lineages. The genome organization of SARS-CoV-2, based on the open reading frame (ORF) annotation of the original Wuhan-Hu-1 isolate and including the 5’ and 3’ untranslated regions (UTRs) and intergenic sequences (IGSs), is as follows: 5’UTR-ORF1AB-IGS-Spike-IGS-ORF3A-ORF3B-IGS-Protein E-IGS-membrane (MA)-IGS-ORF6-ORF7A-ORF7B-ORF8-IGS-Nucleoprotein NP-ORF-10-3’UTR [6, 7]. The IGSs have been identified previously [8]. Transcription regulatory sequences (TRSs) have been identified at the junctions between these ORFs as well as at the 5’ end of the genomic RNA downstream of the leader sequence [9]. Therefore, the 3’ end of any SARS-CoV-2 gene can be critical for the translation of the next coding region. Accessory proteins should also be examined because they contribute to the pathogenesis of SARS-CoV-2 [10, 11]. The 5’- and 3’-UTRs should be examined as well, as they have been demonstrated to play important roles in viral fitness and pathogenesis [12].

Here, we compared the genome sequences of members of various clades of the Omicron variant as well as the XBB and BQ.1 lineages. The objective of this study was to identify unique consensus insertions/deletions and amino acid variations in all coding and noncoding regions of the SARS-CoV-2 Omicron clades, including the XBB and BQ.1 lineages.

Methods

Fifteen to 25 sequences of each of clades 21K, 21L, 22A, 22B, 22C, 22D, 22E, and 22F, as well as of the XBB and BQ.1 lineages, were randomly selected from the GISAID Nextstrain phylogeny on October 31, 2022 and downloaded. The total number of sequences in the dataset was 203. The whole-genome sequences were aligned with the original SARS-CoV-2 sequence of the Wuhan-Hu-1 isolate (GenBank accession no. NC_045512) using Clustal Omega, available online at EMBL-EBI (www.ebi.ac.uk). In whole-genome sequence alignments, sequences that caused long gaps due to the presence of a long run of Ns were excluded. Individual coding regions were selected based on the Wuhan-Hu-1 coding DNA sequences (CDSs) using MEGA11 software [13]. In comparisons of coding regions or open reading frames (ORFs), sequences with a run of more than two unidentified nucleotides (NNN or longer) were excluded. Therefore, the final number of sequences of each clade and/or lineage varies, as shown in Tables 1, 2, and 3. The sequences were translated prior to alignment. Polymorphic amino acids as well as gaps were tabulated manually. To assess the genetic relatedness of clades and lineages, three representatives of clades and lineages from different countries were randomly selected from the dataset as above and aligned using Clustal Omega. The 5’ and 3’ ends were trimmed to produce sequences of equal length, with a total of 28,934 positions in the final dataset. The evolutionary history was inferred using the maximum-likelihood method and the Kimura 2-parameter model in MEGA 11 software [13]. The phylogeny was tested by the bootstrap method with 100 replications.
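The exclusion and tabulation steps above were performed with MEGA11 and by hand. As a rough illustration only, the same logic could be sketched in Python as follows. The function names, the treatment of "X" as an unknown residue, the insertion label format, and the majority threshold used to call a change "consensus" are all assumptions made for this sketch; the study does not state a numerical threshold.

```python
import re
from collections import Counter


def has_long_n_run(seq, max_run=2):
    """Return True if seq contains more than max_run consecutive Ns."""
    return re.search(r"N{%d,}" % (max_run + 1), seq.upper()) is not None


def aa_differences(ref_aln, qry_aln):
    """List differences between two aligned protein sequences.

    Positions are numbered along the ungapped reference; '-' marks a gap.
    Insertions are labelled ins<ref position>.<offset><residue> (a
    hypothetical label format chosen for this sketch).
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    diffs = []
    ref_pos = 0   # position in the ungapped reference
    ins_run = 0   # offset inside a run of reference gaps
    for r, q in zip(ref_aln, qry_aln):
        if r == "-":
            ins_run += 1
        else:
            ref_pos += 1
            ins_run = 0
        if r == q or q == "X":
            continue  # identical, shared gap column, or unknown residue
        if r == "-":
            diffs.append(f"ins{ref_pos}.{ins_run}{q}")
        elif q == "-":
            diffs.append(f"{r}{ref_pos}del")
        else:
            diffs.append(f"{r}{ref_pos}{q}")
    return diffs


def consensus_differences(ref_aln, qry_alns, min_fraction=0.5):
    """Return changes carried by more than min_fraction of the queries."""
    counts = Counter()
    for q in qry_alns:
        counts.update(set(aa_differences(ref_aln, q)))
    n = len(qry_alns)
    return sorted(d for d, c in counts.items() if c / n > min_fraction)
```

With a toy reference "MKT-LV" and queries "MRTALV", "MRT-L-", and "MKT-LV", only K2R is carried by a majority of the queries and is returned as the consensus change.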

Table 1 Consensus amino acid substitutions/deletions in the ORF1AB protein of Omicron variant clades/lineages compared to Wuhan-Hu-1*
Table 2 Consensus amino acid substitutions/deletions in the spike protein of Omicron variant clades/lineages compared to Wuhan-Hu-1*
Table 3 Consensus amino acid substitutions/deletions of ORF3A, envelope protein, matrix, ORF6, ORF7B, ORF8, and nucleoprotein of Omicron variant clades/lineages compared to Wuhan-Hu-1*

Results

The sequence dataset is available at GISAID with the identifier EPI_SET ID: EPI_SET_230327ca and https://doi.org/10.55876/gis8.230327ca. The number of sequences in the dataset for each clade and for the XBB and BQ.1 lineages, and the number of sequences in each clade and lineage bearing consensus deletions and insertions, are presented in Supplementary Material 1. The consensus indel pattern in all Omicron clades or lineages includes del-11260-11268 in ORF1AB and del-28346-28354 in NC. Other indels are unique to a clade or a lineage or shared between clades and/or lineages. The 21K clade has two unique deletions and an insertion (spike del-21960-21965, del-22167-22169, and ins-22178-22186). The deletion del-658-666 is unique to the 22A clade. Del-21606-21614 in the spike coding region and del-29732-29757 in the 3’-UTR are present in all clades except 21K. Del-21738-21743 in the spike coding region is not present in clades 21L, 22C, 22D, and 22F. The other deletion in the spike coding region, namely, del-21966-21968, is present in 21K, 22F, and XBB.
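Indel labels of the form del-start-end and ins-start-end can be derived from a pairwise alignment against the reference. The following is a minimal sketch, assuming that coordinates are 1-based alignment columns; the study does not state whether its coordinates are alignment columns or reference positions, and the function name is hypothetical.

```python
def find_indels(ref_aln, qry_aln):
    """Return indel labels in 1-based alignment-column coordinates.

    A gap in the query opposite a reference base is a deletion; a gap in
    the reference opposite a query base is an insertion.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    events = []
    kind, start = None, None  # type and first column of the current run
    for col, (r, q) in enumerate(zip(ref_aln, qry_aln), start=1):
        if q == "-" and r != "-":
            cur = "del"
        elif r == "-" and q != "-":
            cur = "ins"
        else:
            cur = None
        if cur != kind:
            if kind is not None:
                events.append(f"{kind}-{start}-{col - 1}")  # close the run
            kind, start = cur, col
    if kind is not None:
        events.append(f"{kind}-{start}-{len(ref_aln)}")  # run reaches the end
    return events
```

For example, find_indels("ACGTACGT--AC", "ACG---GTTTAC") reports one deletion spanning columns 4 to 6 and one insertion spanning columns 9 to 10.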

Amino acid residues that are characteristic of all Omicron clades and the XBB and BQ.1 lineages in ORF1AB, the spike protein, and the combined ORF3A, envelope protein, matrix, ORF6, ORF7B, ORF8, and NC are shown in Tables 1, 2, and 3, respectively. The substitutions that all of these clades and lineages have in common are I2235L, T3255I, P3395H, S3675del, G3676del, F3677del, K3833N, P4715L, and I4175V in ORF1AB; G142D, S376P, S378F, K420N, N443K, S480N, T481K, E487A, Q501R, N504Y, Y508H, D617G, H658Y, N682K, P684H, N767K, D799Y, Q957H, and N972K in the spike protein; T9I in the envelope protein; Q19E and A63T in the matrix protein; and P13L, E31del, R32del, S33del, and G204R in the NC. The amino acid substitutions/deletions present in all clades and lineages except the 21K clade are S135R, T842I, G1367S, L3207F, T3090I, R5716C, and T6564I in ORF1AB; T19I, L24del, P25del, P26del, A27S, V21G, T379A, D408N, and R411S in the spike protein; T22I in ORF3A; and S417R in the NC.

The unique substitutions in the 21K clade are K856R, L2084I, A2710T, and I3758V in ORF1AB and A67V, T95I, 143Vdel, 144Ydel, 145Ydel, N211del, L212I, ins215E, ins216P, ins217E, G499S, T550K, N859K, and L984F in the spike protein. The remaining clades resemble Wuhan-Hu-1 at those sites.

Substitutions found in both 22F and XBB include K47R in ORF1AB and V83A, Y144del, H146Q, Q183E, V213E, G255V, L371I, V448P, F489S, and F493S in the spike protein. Unique to the 22F clade and XBB lineage is the presence of a stop codon at position 8 of ORF8. Q556K, L3829F, Y4665H, M5557I, and N5592S in ORF1AB, as well as K447T in the spike protein are shared by 22E and BQ.1.
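The classification of changes as common to all clades, unique to one clade, or shared only by a pair of clades/lineages amounts to set operations. The sketch below illustrates this with a small subset of the spike changes reported above. The helper names are hypothetical, and the example dictionary is deliberately incomplete and does not reproduce Tables 1 to 3.

```python
def partition_changes(changes_by_clade):
    """Return (changes common to all clades, changes unique to each clade)."""
    clades = list(changes_by_clade)
    common = set.intersection(*(set(changes_by_clade[c]) for c in clades))
    unique = {}
    for c in clades:
        others = set().union(
            *(set(changes_by_clade[o]) for o in clades if o != c)
        )
        unique[c] = set(changes_by_clade[c]) - others
    return common, unique


def shared_only_by(changes_by_clade, group):
    """Changes present in every member of group and in no other clade."""
    inside = set.intersection(*(set(changes_by_clade[c]) for c in group))
    outside = set().union(
        *(set(v) for c, v in changes_by_clade.items() if c not in group)
    )
    return inside - outside


# Illustrative subset of spike changes taken from the text (not the full tables)
example = {
    "21K": {"G142D", "A67V", "T95I"},
    "22E": {"G142D", "T19I", "K447T"},
    "BQ.1": {"G142D", "T19I", "K447T"},
    "22F": {"G142D", "T19I", "V83A"},
    "XBB": {"G142D", "T19I", "V83A"},
}

common, unique = partition_changes(example)
```

On this subset, G142D is the only change common to all five groups, A67V and T95I are unique to 21K, and shared_only_by(example, ["22F", "XBB"]) returns V83A.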

A phylogenetic tree is presented in Figure 1. The tree shows that the Omicron clades and lineages form three separate clusters with 100% bootstrap support. Clade 21K forms a unique cluster (cluster 1), while cluster 2 consists of clades 22B and 22E as well as the BQ.1 lineage, and the other clades and lineages form cluster 3.
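The tree was inferred in MEGA 11 by maximum likelihood under the Kimura 2-parameter model. For readers unfamiliar with that model, the sketch below computes only the classical pairwise Kimura 2-parameter distance, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. It is not the maximum-likelihood procedure used for Fig. 1, and the function name and the choice to skip gaps and ambiguous bases are assumptions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Pairwise Kimura 2-parameter distance between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    n = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases (assumption)
        n += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if n == 0:
        raise ValueError("no comparable positions")
    p = transitions / n
    q = transversions / n
    # math.log raises ValueError when the sequences are too divergent
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

The model weights transitions and transversions differently, so a single A to G difference over ten sites yields a slightly larger distance than a single A to C difference over the same ten sites.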

Fig. 1 Phylogenetic tree based on full-genome sequences of three randomly selected representatives of the Omicron clades and lineages of SARS-CoV-2, rooted to the Wuhan-Hu-1 strain. The clade, country of origin, and EPI-ID are indicated in the isolate names. The evolutionary history was inferred using the maximum-likelihood method and the Kimura 2-parameter model in MEGA 11 software [13]. The phylogeny was tested by the bootstrap method with 100 replications. The tree with the highest log likelihood is shown. The percentage of trees in which the associated taxa clustered together is shown next to the branches. There were a total of 28,934 positions in the final dataset.

Discussion

Consensus indels and amino acid variations in all coding and noncoding regions of the SARS-CoV-2 Omicron variant are of global interest, as this variant is evolving rapidly and has become the globally dominant circulating variant, displacing the others. A sound scientific explanation is essential for understanding the potential threat of subvariants or lineages. Most reports on SARS-CoV-2 variants have emphasized changes in the spike protein [3, 4]. However, examination of the whole genome is important for assessing the impact of mutations in subvariants [14].

Some indels and polymorphic amino acids are specific to the Omicron variant. All Omicron clades and lineages contain del-11260-11268 in ORF1AB and del-28346-28354 in NC. However, del-11260-11268 is not unique, as it is also present in the Alpha, Beta, and Gamma variants [15].

The deletion in the NC of SARS-CoV-2 is a probable indirect signature of its attenuation. The SARS-CoV-2 NC is an abundantly expressed RNA-binding protein that is critical for viral genome packaging [16]. The presence of del-28346-28354 in the NC can potentially alter the biology of the virus. In the coronavirus mouse hepatitis virus, a deletion in the NC resulted in a small-plaque phenotype in tissue culture [17]. Plaque size is also an indicator of dengue virus attenuation [18], in which the molecular determinants of small plaque size are mutations/substitutions in NS1, NS3, and the 3’-UTR [19]. It is therefore plausible that the deletion in the NC of SARS-CoV-2 reflects attenuation of the virus.

The deletion in the 3’-UTR is another notable characteristic of the Omicron variant. This deletion is dominant in all clades/lineages, with the exception of clade 21K. This region might be important for recognition by the SARS-CoV-2 RNA-dependent RNA polymerase and cellular components for the initiation of anti-genomic (negative strand) RNA synthesis [12]. The deletion in the 3’-UTR is another probable indirect indicator of SARS-CoV-2 subvariant attenuation.

Although Omicron SARS-CoV-2 spread faster than other variants and became the dominant variant globally, it was reported to cause milder clinical signs [5]. The intensive care unit admission rates for Omicron-infected patients were much lower than those of Delta- and Delta-/Omicron-infected patients [20], suggesting that this variant has reduced virulence.

Since many dominant amino acid residues in the spike protein are uniformly divergent from Wuhan-Hu-1, individuals who have recovered from an Omicron infection might be expected to have protective immunity to all Omicron clades/lineages. Applying the template of spike protein residues and their possible functions as published previously [4], the consensus amino acid changes in Omicron clades/lineages relative to Wuhan-Hu-1 are located in the receptor-binding domain/receptor-binding site (RBD/RBS) (S376P, S378F, K420N, N443K, S480N, T481K, E487A, Q501R, N504Y, Y508H), linear epitopes (S378F, K420N, E487A, Q501R, D617G, N767K), possible conformation-dependent epitopes (N682K, P684H), the S1/S2 cleavage site (N682K, P684H), the fusion peptide (D799Y), and heptad repeat 1 (Q957H and N972K). With this pattern, it is expected that the Omicron variant has different biological characteristics than the original SARS-CoV-2 strain.

It was observed in this study that reversion of indels and mutations might have occurred in SARS-CoV-2. Here, the term "reversion" or "reverse mutation" refers to any mutational process or mutation that restores the wild-type phenotype to an organism already carrying a phenotype-altering forward mutation [21]. This phenomenon has been described for many viruses [22]. Our data show that the 21K clade has unique indels and substitutions, while the remaining sequence is homologous to that of Wuhan-Hu-1. The revertant virus evolved further with deletions and substitutions in various genome segments. The indels and substitutions in 21K seem to be unstable or to confer lower viral fitness.

In this study, we confirm the clade separation reported by Nextstrain. Clade 22F shares many amino acid substitutions with the XBB lineage, while clade 22E shares many with the BQ.1 lineage. The phylogenetic analysis (Fig. 1) likewise places the XBB lineage close to the 22F clade and the BQ.1 lineage close to the 22E clade. This analysis provides data on the position of both lineages in the SARS-CoV-2 phylogeny. It should counter public speculation that XBB and BQ.1 are de novo subvariants and demonstrates that the genetic make-up of these subvariants is similar to that of other members of their respective clades.

Another observation from this analysis is that the accessory proteins might not be critical for SARS-CoV-2 integrity. Without those proteins, SARS-CoV-2 remains viable. ORF8 is truncated in the 22F clade and XBB lineage, in which a stop codon at position 8 of ORF8 is dominant. Stop codons were also present in ORF6 and ORF7A of some strains (not shown). The accessory proteins have been described to contribute to the pathogenesis of SARS-CoV-2 [10, 11]. Deletions in ORF7 and ORF8 have been associated with milder symptoms [23]. ORF8 interferes with host immune responses in various ways, including downregulating MHC class I molecules [24], antagonizing interferon [25], activating interleukin 17, and promoting cytokine storms [26]. It is therefore reasonable to suggest that the truncated ORF8 is additional indirect evidence of lower virulence or attenuation, especially in clade 22F.

This study does not provide information for predicting the outcomes of SARS-CoV-2 infection. A scientific task force should be formed in each country to study the association of Omicron with patients’ clinical status so that the public can be aware of whether any emerging variant or subvariant warrants a change in COVID-19 prevention protocols.

In conclusion, the indels and polymorphic amino acids across the whole genome of SARS-CoV-2 Omicron clades are either clade-specific or shared among clades. Del-28346-28354 in NC is unique to Omicron. A deletion in the 3’-UTR is common to all clades/lineages except clade 21K. Clade 21K has four unique indels and substitutions in ORF1AB and 14 in the spike protein, while the remaining sequence is homologous to that of Wuhan-Hu-1, which probably represents reverted indels/substitutions. ORF8 is truncated at amino acid 8 in the 22F clade and in the XBB lineage. Three indirect lines of evidence of SARS-CoV-2 attenuation in Omicron clades were identified, namely, the deletion in NC, the deletion in the 3’-UTR, and the truncation of ORF8 in the 22F clade and XBB lineage.