The analysis of European and Polish coronavirus sequences confirmed that SARS-CoV-2 evolution is relatively stable. The number of derived haplotypes due to new mutations observed on single haplotype backgrounds was moderate, and isolates carrying such haplotypes were usually restricted to single populations (exemplified by G20419T in haplotypes GHI-8). This is consistent with previous reports that coronaviruses change more slowly than most other RNA viruses, probably because of the “proofreading” activity of the Nsp14 exonuclease; the SARS-CoV-2 mutation rate underlying global diversity has been estimated at ~6 × 10⁻⁴ nucleotides/genome/year (Van Dorp et al. 2020).
A relatively large part of the population-specific haplotype diversity resulted from homoplasies. Homoplasic mutations are commonly found in the SARS-CoV-2 genome (Van Dorp et al. 2020; De Maio et al. 2020). Many homoplasies resemble hot-spot mutations: new alleles are found on a large variety of haplotypes, and their presence does not contribute to the stable evolution of the sequence. On the other hand, some of the homoplasic mutations remain stably associated with a specific haplotype background and can be used to trace sequence evolution. For example, C14805T (designated ho* in Fig. 1) was found on two different but stable backgrounds, one in the S and another in the V clade. G11083T (ho**), with a well-established homoplasic character (Van Dorp et al. 2020), was stably associated with the V clade and highly recurrent among European sequences from other clades; it was not observed in any of the Polish sequences from the G superclade. The homoplasic character of other mutations in Polish haplotypes (e.g., those indicated by lowercase letters a and b in haplotype names in Fig. 1) was inferred from the analysis of the reference European dataset; their stable association with single haplotypes in Polish isolates most probably reflected founder effects resulting from local outbreaks of SARS-CoV-2 carrying these mutations.
Recombination of two sequences appeared to be the most parsimonious explanation for the structure of some Polish-specific haplotypes, e.g., GH-3/GR-3-1; GR-1/GRM; LorV/G-d; GR-6-1/GR (see Fig. 1). Inspection of the haplotype structure in over 5000 European isolates revealed that up to 17% of all the variants could be explained by recombination. While some of these variants may reflect convergent evolution (involving recurrent mutations, back mutations) (Wertheim 2020), or even sequencing errors, recombination remains a plausible scenario, especially when more than one polymorphic site is involved (as was the case for 11 haplotypes) and when the alleged parental variants are present at high frequency in the same populations. Active recombination of SARS-CoV-2 has been reported and discussed in several studies (Yi 2020; De Maio et al. 2020; Nie et al. 2020; VanInsberghe et al. 2020; Varabyou et al. 2020; Wertheim 2020). While the detailed results were not concordant, all the studies underscored the rare occurrence of recombinants. It has been suggested that the actual rate of recombination might be higher, but not detectable due to the low diversity of the SARS-CoV-2 sequences (VanInsberghe et al. 2020; Wertheim 2020). Interestingly, Koelle’s group (VanInsberghe et al. 2020), which did not confirm the recombinants reported by Yi (2020), reported five other recombinants in the analysis of 47,390 sequences grouped in 14 clades defined by 37 positions. In our study, the search for recombinants was based on the analysis of 263 haplotypes defined by 110 positions; the higher resolution could explain why more purported recombinants were revealed.
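The idea of explaining a haplotype as a recombinant of two others can be illustrated with a minimal sketch. It is not the procedure used in this study: it assumes that each haplotype is written as a string of alleles at the same ordered defining positions and that a single breakpoint is sufficient. The function names and the five-site example haplotypes are hypothetical. A match only shows that recombination is a possible explanation; it does not rule out recurrent mutation or sequencing error.

```python
from typing import Optional, Tuple


def find_breakpoint(child: str, left: str, right: str) -> Optional[int]:
    """Return the first k such that child == left[:k] + right[k:], else None.

    All three strings hold alleles at the same ordered defining positions.
    """
    if not (len(child) == len(left) == len(right)):
        raise ValueError("haplotypes must cover the same defining positions")
    for k in range(1, len(child)):
        if child[:k] == left[:k] and child[k:] == right[k:]:
            return k
    return None


def explain_as_recombinant(
    child: str, parent_a: str, parent_b: str
) -> Optional[Tuple[str, str, int]]:
    """Try both parent orders; return (5' parent, 3' parent, breakpoint)."""
    if child in (parent_a, parent_b):
        return None  # identical to a parent, nothing to explain
    for left_name, left, right_name, right in (
        ("A", parent_a, "B", parent_b),
        ("B", parent_b, "A", parent_a),
    ):
        k = find_breakpoint(child, left, right)
        if k is not None:
            return (left_name, right_name, k)
    return None


if __name__ == "__main__":
    # Hypothetical haplotypes over five defining positions.
    print(explain_as_recombinant("CCTGA", "CCTAG", "TTCGA"))
```

When the two parents share alleles around the junction, several breakpoints fit equally well and the sketch reports only the first one.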
It is worth mentioning that ~1.6% of the European sequences analyzed by us (80 isolates not included among the 5013 used for the haplotype diversity analysis) were characterized by heteroplasmy, seen as ambiguous readouts at sites involved in clade or haplotype definition (e.g., c.28881-3 on the G background, suggesting coinfection with GR; c.25350 on the G background, suggesting coinfection with G-11). While the possibility of sequencing error or contamination cannot be excluded, the presence of heteroplasmy in SARS-CoV-2, also reported in other studies (Tang et al. 2020b), implies double infection events and speaks in favor of a possible role of recombination in the emergence of some haplotype variants. While, based on our data, it cannot be excluded that a part of the existing SARS-CoV-2 diversity is due to recent recombination events, the rare occurrence of isolates carrying the purportedly recombined variants suggests that these sequences did not proliferate extensively, consistent with previous reports (VanInsberghe et al. 2020; Wertheim 2020). More data, and perhaps more time, given the slow evolution of the SARS-CoV-2 sequence, are needed to assess to what extent recombination contributes to SARS-CoV-2 evolution (Wertheim 2020). Finally, while recombination is believed to underlie evolutionary jumps, which allow viruses to change their hosts (Su et al. 2016; Luk et al. 2019), present knowledge does not allow us to assess whether recombination played any role in SARS-CoV-2 acquiring specificity for the human ACE2 receptor (Boni et al. 2020).
The frequency of certain haplotypes in different populations may change rapidly due to founder effects caused by local outbreaks, and this usually does not imply a selective advantage of such strains. While the unsupported assumption that the prevalence of any given SARS-CoV-2 strain indicates its increased virulence should be avoided, examples of the global spread of some coronavirus mutations deserve attention. The global increase of the G superclade frequency at the cost of the S/V/L lineages, also seen in the Polish dataset, has led to the conclusion that the hallmark G superclade mutation, the p.D614G substitution in the spike protein (A23403G), might be responsible for the increased virulence of the coronavirus (Brufsky 2020; Korber et al. 2020). The possible selective advantage of p.D614G, facilitating interaction with the receptor on the surface of human cells, is presently considered a plausible, albeit still not fully proven, scenario (Zhang et al. 2020; Korber et al. 2020; Plante et al. 2020; Volz et al. 2020; Grubaugh et al. 2020).
The analysis of all currently available full-length SARS-CoV-2 sequences (n = 115) from Polish isolates revealed that most of the haplotypes seen in the analyzed set are also found at varying frequencies in other European countries (Fig. 1). Coronavirus strains circulating in Poland therefore appear to originate from many independent transfers from various populations. This is consistent with the fact that the epidemic outbreak in France, Italy, Germany, UK, Finland, Belgium, and Sweden (Coronavirus update (Live) n.d.) preceded that in Poland by over two weeks, during which border restrictions were not yet imposed. By the time COVID-19 struck Poland, all major coronavirus clades were already present in Europe (Mercatelli and Giorgio 2020; Worobey et al. 2020; Yang et al. 2020; Mavian et al. 2020a; Pachetti et al. 2020). Without a rigorous epidemiological interview (history of travel, contacts of infected individuals, etc.), it is impossible to determine the country of origin of particular transmission cases.
Similar to the reference European dataset, the majority of the analyzed Polish isolates belonged to the G superclade, encompassing clades G, GH/GHI, and GR. Sequences representing the older SARS-CoV-2 lineages (L, V, and S) were sparse among Polish samples. The relative frequency and diversity of the G and GH/GHI clades in Polish data were comparable with those in the rest of Europe. The scarcity of Polish-specific haplotypes in these clades suggested that almost all isolates observed in the analyzed dataset represent direct transfers from other European countries, which did not result in extensive local transmissions, similar to early coronavirus introductions in France (Gambaro et al. 2020). In contrast, the frequency of the GR clade in Polish samples (60%), much higher than observed in the European dataset (30%), revealed a scenario consistent with the successful expansion of this clade in Poland (Fig. 2A, B). In addition, the GR clade was more diversified than in the rest of Europe, in terms of the proportion of different haplotypes. The discrepancy in the GR clade abundance between the set of 115 Polish sequences collected between March and mid-June and the European reference sequences collected until April 9th was mostly due to the contribution of Polish sequences collected after April 9th; the GR clade frequency among Polish isolates collected until April 9th was much closer to that in the European dataset. Indeed, the recent study on the SARS-CoV-2 geographical and temporal distribution in Europe, encompassing the period from January to mid-June (Alm et al. 2020), has indicated that the frequency of the GR clade in April was ~30%, consistent with our calculations based on the manual analysis of the GISAID data. However, early in June, the GR clade frequency in Europe overtook that of the other clades and has been rising steadily since then. In mid-June, the overall European frequency of the GR clade exceeded 50% (Alm et al. 2020), which is much closer to the ~60% calculated for the 115 Polish sequences collected until that date, and to the ~50% reported in Alm’s paper for the subset of 79 Polish isolates. Overall, these observations suggest that the changes in the frequency of SARS-CoV-2 clades in Poland follow a trend consistent with that observed in the rest of Europe.
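Because the Polish clade frequencies rest on a small sample, it may help to see how wide the sampling uncertainty around such a proportion can be. The sketch below computes a Wilson score interval, which is not part of the analysis reported here. The count of 69 GR isolates is an illustrative assumption, back-calculated from ~60% of 115 sequences, and the interval ignores the sampling bias of the deposited sequences.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Assumed count: 69 of 115 sequences (~60%) assigned to the GR clade.
    low, high = wilson_interval(69, 115)
    print(f"GR clade frequency: 60% (95% interval {low:.1%} to {high:.1%})")
```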
The detailed analysis of the whole SARS-CoV-2 genome allowed identification of population-specific low-frequency mutations, which defined new haplotypes and indicated the common origin of groups of isolates. Furthermore, the analysis of haplotype divergence due to the accumulation of mutations at sites not used for haplotype definition provided clues regarding their independent history. In the DNA distance–based MDS analysis (Fig. 3 and Fig. 4), some of the Polish sequences carrying the same haplotypes formed tight clusters, apparently reflecting local COVID-19 outbreaks (e.g., samples carrying the Polish-specific GR-9-a or GR-11 haplotypes), while others (e.g., those carrying the frequent European GR-3-1 or G haplotypes) were randomly spread, presumably representing independent transfers from a variety of sources. Similar clustering was revealed in the DNAML tree. It has to be emphasized that the phylogenetic tree was presented only to show the clustering of some sequences. Given that the Polish set of isolates represented an incomplete and biased fraction of coronavirus cases in Poland, no root was assigned to the phylogenetic tree and no phylogenetic inferences were made, to avoid overinterpretation of the data (Mavian et al. 2020b).
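A distance-based MDS analysis starts from a matrix of pairwise distances between aligned sequences. The sketch below shows one simple way to build such a matrix, using the proportion of differing sites (p-distance) and skipping positions where either sequence has an ambiguous base. The distance model actually used for Fig. 3 and Fig. 4 is not specified in this section, so the p-distance, the function names, and the toy sequences are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, Tuple

VALID_BASES = set("ACGT")


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites among positions unambiguous in both."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x in VALID_BASES and y in VALID_BASES:
            compared += 1
            if x != y:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def distance_matrix(seqs: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Symmetric pairwise p-distance matrix keyed by (name, name)."""
    matrix = {(name, name): 0.0 for name in seqs}
    for a, b in combinations(seqs, 2):
        matrix[(a, b)] = matrix[(b, a)] = p_distance(seqs[a], seqs[b])
    return matrix


if __name__ == "__main__":
    # Toy aligned sequences; real input would be full-genome alignments.
    toy = {"iso1": "ACGTAC", "iso2": "ACGTAT", "iso3": "ANGTTT"}
    for pair, dist in sorted(distance_matrix(toy).items()):
        print(pair, round(dist, 3))
```

Isolates from the same local outbreak would be expected to show near-zero distances to one another, which is what produces the tight clusters in an MDS plot.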
Fairly complete knowledge of the genetic diversity of SARS-CoV-2 is important for medical epidemiology, diagnostics, and prevention (Mavian et al. 2020a). Assigning virus isolates to the main European clades is the foundation for such efforts, but only whole RNA genome sequencing allows the detection of population-specific mutations and haplotypes. While these may have little value for reconstructing SARS-CoV-2 evolution on the transcontinental scale, they are essential for attempts to explain local pathways of virus spread and to identify undocumented local sources of COVID-19 outbreaks. Furthermore, recognizing the local prevalence of specific haplotypes may have a substantial impact on the accuracy of population-specific diagnostic tests.
Understanding the SARS-CoV-2 genomic variability is of particular importance for designing therapies or vaccines (Van Dorp et al. 2020; Weissmann et al. 2020), as it allows selection of evolutionarily constrained regions of the coronavirus genome, which should be preferentially targeted to avoid rapid drug and vaccine escape mutants. Here again, information on the variability of strains circulating in a given population will help to adjust future medical interventions to the population-specific profile of infections.
Our study has obvious limitations related to the small number of whole-genome sequences from Poland available in GISAID. The present estimate of infected people in a country with a population of more than 37.8 million exceeds 70 thousand, and the actual number may be much higher. To alleviate this problem, extensive testing of the whole population should be implemented. Importantly, once SARS-CoV-2 infections are identified, representative sets of sequences should be obtained, including those from asymptomatic cases; this will be the first step towards understanding the relations between the SARS-CoV-2 genetic subtype, its virulence, and the severity of the disease course.