Comparative genotyping of SARS-CoV-2 among Egyptian patients: near-full length genomic sequences versus selected spike and nucleocapsid regions

Several tools have been developed for severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) genotyping based on either whole genome or spike sequencing. We aimed to highlight the molecular epidemiological landscape of SARS-CoV-2 in Egypt since the start of the pandemic, to describe discrepancies between the 3 typing tools: Global Initiative on Sharing Avian Influenza Data (GISAID), Nextclade, and Phylogenetic Assignment of Named Global Outbreak Lineages (PANGOLIN) and to assess the fitness of spike and nucleocapsid regions for lineage assignment compared to the whole genome. A total of 3935 sequences isolated from Egypt (March 2020–2023) were retrieved from the GISAID database. A subset of data (n = 1212) with high coverage whole genome was used for tool discrimination and agreement analyses. Among 1212 sequences, the highest discriminatory power was 0.895 for PANGOLIN, followed by GISAID (0.872) and Nextclade (0.866). There was a statistically significant difference (p = 0.0418) between lineages assigned via spike (30%) and nucleocapsid (46%) compared to their whole genome-assigned lineages. The first 3 pandemic waves were dominated by B.1, followed by C.36 and then C.36.3, while the fourth to sixth waves were dominated by the B.1.617.2, BA, and BA.5.2 lineages, respectively. Current shift in lineage typing to recombinant forms. The 3 typing tools showed comparable discrimination among SARS-CoV-2 lineages. The nucleocapsid region could be used for lineage assignment.


Introduction
In December 2019, life-threatening viral pneumonia was reported in Wuhan, Hubei, China, leading to a high fatality rate in China that then became a worldwide pandemic [1].This viral pneumonia was called coronavirus disease 2019 (COVID- 19), which is caused by a novel coronavirus named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [2].In Egypt, the first (index) case confirmed with SARS-CoV-2 viral infection was reported on 14 February 2020 in Cairo [3].Egypt was the first African country to announce the presence of COVID-19.A month later (in mid-March), the first two whole genome sequences of SARS-CoV-2 isolated from Egyptian citizens were published [4].
On 11 March 2020, the World Health Organization (WHO) announced that the COVID-19 outbreak was a global pandemic.Since the start of the pandemic, Egypt has been ranked one of the top 5 countries reporting COVID-19 cases in Africa [5].
There are four phylogenetic groups of coronaviruses known as alpha, beta, gamma, and delta.SARS-CoV-2 is a beta-coronavirus that is further divided into 4 main lineages (A-D) [6].As SARS-CoV-2 is highly prone to multiple recombination events, mutations are introduced in its genome that could cause changes in antiviral susceptibility and viral transmission, resulting in the emergence of new recombinant variants [7].
According to the SARS-CoV-2 Interagency Group (Updated March.20,2023), there are 4 types of variants: variant of interest (VOI), variant of concern (VOC), variant of high consequence (VOHC), and variants being monitored (VBM).This classification is dynamic and always updated according to the circulating lineages.Currently, there are no SARS-CoV-2 variants described as VOI or VOHC [8].Another classification, the PANGO lineage system, depends on hierarchical evolution from a parent lineage.It consists of an alphabetical prefix and numerical suffix describing this hierarchy.
To avoid long names for lineages, another alphabetical synonymous is given such as "BA" stands for "B.1.1.529"(omicron variant) or "AY" stands for "B.In Egypt, from 3 January 2020 to 24 May 2023, there have been 516,023 confirmed cases of COVID-19 with 24,830 deaths reported to the WHO.Despite the recent announcement of the end of the pandemic on 5 May 2023, a relatively large number of SARS-CoV-2 sequences are still being submitted to GISAID.
In this work, we aimed to highlight the molecular epidemiological landscape of SARS-CoV-2 over the time since the first appearance of the pandemic in Egypt in February 2020.Furthermore, we aimed to describe discrepancies between the 3 well-known lineage assigner software tools (GISAID, Nextclade, and PANGOLIN) for SARS-CoV-2 typing and to determine their agreement according to the assigned lineage.Finally, we assessed the fitness of selected regions, such as spike (S) and nucleocapsid (N), compared to whole genome sequences for lineage typing.

Data collection
A total of 3935 sequences of SARS-CoV-2 isolated from Egypt with their associated metadata, including clade, lineage, and variant, were retrieved from the GISAID database [11] in April 2023 after applying the following filters: location: Africa/Egypt, and host: human.These sequences were either complete genome " > 29,000 bp" or partial genomic regions such as spike or nucleocapsid, which were all submitted across different governates of Egypt since the start of the pandemic in early 2020.A subset of data consisting of high-coverage complete genomic sequences (n = 1212) was utilized for lineage discrimination, and low-quality and/ or partial sequences were excluded from this analysis.We used a GISAID high coverage filter for the exclusion of lowquality sequences (Fig. 1).After that, we manually excluded a group of partial sequences (n = 81).A high coverage sequence is defined as a sequence with less than 5% ambiguous bases (NNNs) and no deletions or insertions without the verification of the submitter according to GISAID.
The 1212 sequences with a whole genome determined by GISAID were aligned against the sequence of the reference genome NC_045512.2 in BioEdit software [16] using the ClustalW multiple alignment tool.We extracted the sequences of the spike (S) region (3822 bp, 21,563-25,384) and nucleocapsid (N) region (1260 bp, 28,274-29,533) from the whole genome sequences for further analysis.

Genotyping and data analysis
The whole genome of the whole dataset (n = 3935) was subjected to genotyping by Nextclade and PANGOLIN online tools, while the genotyping data by GISAID were already available within downloaded metadata.After that, Two approaches for analysis were utilized.First, we compared agreement on lineage assignment of whole genome sequences between the 3 online freely available tools GISAID, PANGOLIN and Nextclade to detect discrepancies.After that, we investigated the agreement of lineage assignment between the S and N regions compared to the whole genome.

Data snapshot
The findings of this study are based on metadata associated with two datasets.The first dataset "EPI_SET_230521kt" (accessible at https:// doi.org/ 10. 55876/ gis8.23052 1kt) is composed of 3935 individual genome sequences.The collection dates range from 13 March 2020 to 12 March 2023, while the second dataset "EPI_SET_230415te" (accessible at https:// doi.org/ 10. 55876/ gis8.23041 5te) is composed of 1212 individual high-coverage whole genome sequences.The collection dates ranged from 9 March 2020 to 13 March 2023.Data were collected from Africa/Egypt.All sequences in these two datasets were compared relative to hCoV-19/Wuhan/WIV04/2019 (WIV04), the official reference sequence employed by GISAID (EPI_ISL_402124).

Statistical methods
We used the discriminatory power (D) (https:// insil ico.ehu.es/ mini_ tools/ discr imina tory_ power/ index.php) as a numerical index for the discrimination power between lineage assigner software tools.where N is the total number of isolates in the typing method, s is the number of distinct patterns discriminated by the tool, and nj is the number of isolates belonging to the j th pattern [17,18].
The agreement analysis was performed using the receiver operating characteristic curve (ROC) and area under the curve (AUC) method.We compared the categorical variables by the chi-square test.The alpha level was set at ≤ 0.05.We performed all statistical tests in Rstudio (R 4.2.3)[19].The pROC package [20] was used to generate the ROC/AUC curves.

Description of circulating lineages
The majority of Egyptian sequences included in the current research were submitted to the GISAID database in 2022 (1707/3935, 43.4%) followed by 2021 (1170/3935, 29.7%).The highest number of sequences submitted was in the fourth quarter in 2021 (N = 567) and the first quarter in 2022 (N = 678).
The ability of the 3 tools to assign a lineage was different.Among the whole dataset (n = 3935), a total of 427 sequences were unassigned by GISAID either due to being partial sequences or low-quality whole genome sequences.There were 516 sequences unassigned by PANGOLIN due to the inability of the tool to process the sequence.On the other hand, Nextclade was able to assign almost the whole dataset except 10 sequences.
Among all sequences (n = 3935), during the first wave, the most frequent lineage was B.1 (312 sequences), followed by C.36 (123 sequences).In the second wave, the situation was reversed, and C.36 (221 sequences) became the most common lineage, followed by B.1 (79 sequences).Despite the small number of submitted sequences during the third wave, C.36.3 (74 sequences) was the most frequent lineage, followed by C.36 (42 sequences).
In the fourth wave, the count of detected lineages was higher than in previous waves, indicating higher diversity among circulating lineages.A majority of sequences (n = 277) were unassigned, while the most frequent lineage was B.1.617.2 (140 sequences), followed by AY.122 (117 sequences).On the other hand, during the fifth wave, BA.1 (71 sequences) and BA.2 (79 sequences) became dominant compared to the previous circulating lineages.In the sixth wave, BA.5.2 (329 sequences) was the utmost circulating lineage, followed by BA.2 (145 sequences).
Lastly, in wave 7, it is noteworthy that starting from late 2022 (Q4) and early 2023 (Q1), the recombinant lineages (XBB) became predominant compared to other nonrecombinant lineages.Figure 2 illustrates the distribution of lineages according to GISAID, Nextclade, and PANGOLIN since the start of the pandemic and across the wave pattern in Egypt.

Discrimination between typing tools
Among the 1212 high-coverage whole genome sequences, the discriminatory power of the GISAID tool was 0.872 for differentiation between 72 lineages, while the discriminatory power of the Nextclade tool was 0.866 for differentiation between 62 lineages.Finally, the discriminatory power of PANGOLIN was 0.895 for differentiation between 57 lineages.

Overall agreement between tools
The 3 tools GISAID, PANGOLIN and Nextclade showed overall agreement with exact matches in 744/1212 (61.4%) of the whole genome sequences, while the remaining 468 (38.6%) sequences showed variable discrepancies among the 3 tools.Finally, the overall disagreement between the 3 lineage assigner tools was estimated to be 76/1212 (6.3%) (Table 1).

Agreement between regions
After S and N region extraction from the 1212 whole genome sequences, we compared the S and N lineage assignments to their corresponding assignments by Nextclade whole genome.There was a statistically significant difference (p = 0.0418*) between the agreement of the S region and N region with the whole genome lineage by Nextclade (Table 2).We observed that the N region agrees with whole genome with a higher percentage (46%) compared to the S region (29.9%).Hence, we concluded that the N region could be utilized as an alternative to the S region in lineage assignment for SARS-CoV-2 sequences.

Discussion
Globally, as of 25 May 2023, 15,608,522 SARS-CoV-2 genomic sequences have been submitted to GISAID.In the current work, as of 10 April 2023 (date of access), a total of 3935 SARS-CoV-2 genomic sequences submitted by Egypt since the start of the pandemic and over 3 years (March 2020-March 2023) were downloaded from the GISAID database.We aimed to highlight the shift in lineage assignment across wave patterns in Egypt.
The coronavirus genome averages approximately 29 (26)(27)(28)(29)(30)(31)(32) kb, which is identified as the largest genome size for an RNA virus [31].Indeed, whole genome sequencing is the best option for lineage assignment; however, few drawbacks, such as high cost and time consumption, exist in lowresource countries such as Egypt.Therefore, sequencing of smaller regions instead of the whole genome is considered more feasible under these circumstances.
According to the official site of the Egyptian Ministry of Health (https:// www.care.gov.eg), the first pandemic wave in Egypt was from April 2020 to September 2020, while the second pandemic wave was from October 2020 to March 2020.The peaks of the first and second waves of COVID-19 in Egypt were in mid-June and late December 2020, respectively.The first wave was dominated by B variants, especially B.1, similar to other parts in the world at that time.During the following second and third waves, a shift in lineage was observed in the C.36 and C.36.3 lineages.Then, the delta variant (B.1.617.2) and omicron variant (B.1.1.529)became dominant during the fourth and fifth waves.Interestingly, there was a current shift in prevalence of circulating lineages from dominant nonrecombinant forms such as B.1 and C.36.3 to recombinant forms such as XBB.1.9.1.These recombinant forms were circulating at low levels during the first year of the pandemic [32] Several software tools have been developed specifically for SARS-CoV-2 genotyping based on whole genome and/ or partial domain sequencing, such as GISAID, PANGO-LIN, and Nextclade.According to the European Centre for Disease Prevention and Control (ECDC), whole genome sequencing (WGS), or at least complete or partial S region sequencing, is the best method for assigning a specific lineage or variant [33].A study addressing the genomic diversity SARS-African including was conducted in 2021 [34].They analyzed a of whole sequences, which 971 highcoverage sequences were from Egypt. reported distribution lineages as C.36 (30.6%), followed by (25.2%),B.1.1 and B.1.617.2, with 5.1% each according to the PANGOLIN tool.
A previous Egyptian study reported a shift in lineage prevalence from B.1 to B.1.1.1 between wave 1 and wave 2 [35].However, we observed a shift in from B.1 to C.36 between wave 1 and wave 2 in our study.This disagreement be attributed to the current analysis being performed after end of pandemic waves.According to GISAID, the C.36 lineage was detected early during the pandemic (in May2020) in Egypt and continued to circulate within the country at variable levels.
In this study, we aimed to evaluate the discriminatory power of each tool.All 3 tools showed comparable discriminatory power: GISAID (0.872), PANGOLIN (0.895), and Nextclade (0.866).Because the 3 software tools exhibit different nomenclature and classification systems for lineage assignment, discrepancies between were expected.
Here, we can demonstrate one particular discrepancy due to the different nomenclature systems.Among 1212 sequences, AY* sublineages were detected in 184, 44 and 58 sequences according to GISAID, Nextclade, and PANGO-LIN, respectively.On the other hand, the B.1.617.2 lineage (parent lineage of AY*) was detected in 26, 166, and 152 sequences according to GISAID, Nextclade, and PANGO-LIN, respectively.This may be explained the improved ability of GISAID to classify sublineages to AY* rather than their parent lineage B.1.617.2.We confirmed this theory by ROC/AUC curves.All 3 tools showed high agreement with AUC > 85%, except in the case of lineage B.1.617.2, and the GISAID tool showed a poor AUC (57.5%) compared to PANGOLIN (94.7%).
Here, we conducted comparative analyses of COVID-19 genotyping derived based on the nucleocapsid region (28,533 in the NC_045512.2reference genome) and spike region (21,384 in the NC_045512.2reference genome) extracted from high-coverage whole genome sequences of 1212 COVID-19-infected patients from Egypt and submitted to the GISAID EpiCov database since the start of the pandemic.
In this study, we selected the Nextclade tool as the reference typing method for several reasons; it has a high ability to assign lineages (3925/3935, 99.7%) and hence can assign the majority of partial or low coverage sequences that were unassigned by other tools.Nextclade was able to assign almost the whole dataset except for 10 sequences.
Despite the presence of discrepancies in lineage assignment between the tools, all 3 agreed on assigning the most common lineage circulating per wave during the pandemic in Egypt.B1 was the most common in wave 1, C.36 was most common in wave 2, C.36.3 was most common in wave 3, B.1.617.2 was most frequent in wave 4, BA.2 was most frequent in wave 5, BA.5.2 was the most frequent in wave 6, and recombinant forms (particularly XBB.1.9.1) became predominant (Fig. 2).
We proposed that the N gene may be superior in lineage assignment compared to the S gene.A statistically significant difference (p = 0.04) was observed between S and N agreement with the whole genome, suggesting that the N region agrees with the whole genome more than the S region.Despite, the higher agreement of N region (46%) with whole genome compared to spike agreement (30%), both regions are maybe less sufficient than whole genome which is the best for lineage determination.To the best of our knowledge, this work is the first to explore the ability of another region other than the spike protein for rapid lineage assignment for SARS-CoV-2 sequences.

Conclusions
There is a current evident shift in lineage assignment toward recombinant forms (XBB), particularly XBB.1.9.1.The 3 lineage assigner tools (GISAID, Nextclade, and PANGO-LIN) showed comparable discrimination among the SARS-CoV-2 lineages.We concluded that the N region could be utilized for lineage assignment upon comparing versus spike.However, it is more important to obtain S sequences rather than N sequences since the mutations acquired in S gene are most relevant to escape from the neutralising antibodies.Therefore, for epidimological aspects, it is most

Fig. 3
Receiver operating characteristic/area under the curve (ROC/AUC) curves for lineage assignment agreement between Nextclade (reference typing method), GISAID, and PANGOLIN for  important to monitor S gene to be able to identify newly emerging variants.

Fig. 1
Fig. 1 Flowchart demonstrating inclusion criteria for the sequences according to aims of the study

Fig. 2
Fig. 2 Distribution of the top 5 SARS-CoV-2 lineages per wave according to 3 lineage assigner tools.A GISAID, B Nextclade, and C PANGO-LIN among the whole dataset (n = 3935) in Egypt.Figure was generated by Microsoft Excel version 2021

Table 2
Comparison between spike and nucleocapsid agreement with the whole genome in lineage assignment according to Nextclade among 1212 high-coverage whole genome sequencesThe bold p value indicates a significant difference between the agreement of the S and N regions to the genome at an level 0.05 whole genome 2 Chi-square test