Allelic polymorphism. A total of 77 strains of the anthrax pathogen were studied, including strains isolated at the site of a major reindeer epizootic on the Yamal Peninsula (Russia) in the summer of 2016 and strains isolated from permafrost in Yakutia [10, 11]. B. cereus strains containing pXO2-like plasmids (B. cereus bv. anthracis CI and BC-AK) were also included in the sample. These strains are capable of causing anthrax-like disease in humans and great apes [18–20].
For the strains in the studied sample, whole-genome sequencing data were used to assemble the sequences of genes located on the pXO2 virulence plasmid: the capBCADE operon genes (capB, capC, capA, capD, and capE), which are responsible for B. anthracis capsule synthesis, and the acpA and acpB genes, which encode capsule biosynthesis regulation proteins.
The results allowed us to identify and describe mutations and to split the sample into sequence types (STs). In the studied sample, which included 79 strains, we identified nine STs for the capD gene (D = 0.3546 [0.2178–0.4915]); five STs each for the capA (D = 0.2691 [0.1436–0.3947]), acpA (D = 0.3636 [0.2347–0.4925]), and acpB (D = 0.1654 [0.0542–0.2767]) genes; four STs for the capC gene (D = 0.2293 [0.1073–0.3514]); three STs for the capB gene (D = 0.0731 [–0.0066–0.1529]); and two STs for the capE gene (D = 0.0491 [–0.0173–0.1155]). The identified STs, together with the mutations that distinguish each of them from the ST of the reference genome, are listed in Tables S1–S6. For each gene, the ST to which the Ames Ancestor reference strain belongs is designated ST1. The remaining STs were numbered in decreasing order of the number of B. anthracis strains they include, followed by the STs to which the B. cereus strains belong. All STs differed from one another by nucleotide substitutions or, in the case of the acpA gene, by an insertion.
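The diversity values reported above can be illustrated with a minimal sketch. It assumes that D denotes the Hunter-Gaston discriminatory index (Simpson's index of diversity) with an approximate symmetric 95% confidence interval; the text does not name the formula, so this is an assumption, and the function name and toy input are hypothetical. The sketch is not expected to reproduce the reported values exactly.

```python
from collections import Counter
from math import sqrt


def diversity_index(type_labels):
    """Return (D, lower, upper) for a list of per-strain type labels.

    ASSUMPTION: D is the Hunter-Gaston discriminatory index,
    D = 1 - sum(n_j * (n_j - 1)) / (N * (N - 1)),
    with an approximate 95% CI of D +/- 2 * sigma, where
    sigma^2 = (4 / N) * (sum(p_j^3) - (sum(p_j^2))^2).
    """
    counts = list(Counter(type_labels).values())
    n = sum(counts)
    if n < 2:
        raise ValueError("at least two strains are required")

    d = 1.0 - sum(c * (c - 1) for c in counts) / (n * (n - 1))

    freqs = [c / n for c in counts]
    sum_sq = sum(f ** 2 for f in freqs)
    sum_cu = sum(f ** 3 for f in freqs)
    variance = 4.0 / n * (sum_cu - sum_sq ** 2)
    half_width = 2.0 * sqrt(max(variance, 0.0))  # guard against rounding below zero

    # The interval is symmetric, so the lower bound can fall below zero
    # when one type dominates the sample.
    return d, d - half_width, d + half_width


# Hypothetical toy sample: three strains of ST1 and one strain of ST2.
d, lower, upper = diversity_index(["ST1", "ST1", "ST1", "ST2"])
print(f"D = {d:.4f} [{lower:.4f}–{upper:.4f}]")
```

Because the interval is symmetric around D, a slightly negative lower bound, as reported for capB and capE, is an expected outcome for nearly monomorphic genes under this formula.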
To assess the phenotypic effect of the identified nucleotide polymorphisms, that is, whether the substitution at each identified position is synonymous, results in an amino acid substitution in the corresponding protein, or inactivates the protein by introducing a stop codon, the nucleotide sequences were translated in silico into amino acid sequences (Tables S7–S12). The coordinates of the amino acid substitutions are given for the complete translated protein sequences, disregarding post-translational modifications; such a modification has been described for the CapD protein only and consists of the cleavage of the 28-amino-acid N-terminal signal sequence. The 3D structures of the proteins encoded by the genes studied here have not been described, with the exception of CapD [21, 22]; therefore, amino acid substitutions are reported without indicating the protein domain in which they are located.
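The classification step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name is hypothetical, the coding sequence is a made-up toy (not a real cap or acp gene), and the standard codon assignments are assumed (bacterial translation table 11 uses the same codon-to-amino-acid mapping).

```python
BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (standard codon assignments).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snp(cds, position, alt_base):
    """Classify a single-nucleotide substitution in a coding sequence.

    position is the 1-based nucleotide coordinate within the CDS, as in
    the notation "49T -> G". Returns (codon_number, ref_aa, alt_aa, effect).
    """
    index = position - 1
    offset = index % 3            # position of the SNP inside its codon
    start = index - offset        # 0-based start of the affected codon
    ref_codon = cds[start:start + 3].upper()
    if len(ref_codon) < 3:
        raise ValueError("position falls outside a complete codon")

    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]

    if ref_aa == alt_aa:
        effect = "synonymous"
    elif alt_aa == "*":
        effect = "nonsense"
    else:
        effect = "missense"
    return start // 3 + 1, ref_aa, alt_aa, effect


# Hypothetical toy CDS: ATG, fifteen GCT codons, TTA as codon 17, TAA stop.
toy_cds = "ATG" + "GCT" * 15 + "TTA" + "TAA"
print(classify_snp(toy_cds, 49, "G"))  # first base of codon 17: TTA -> GTA
```

The codon number follows from the nucleotide coordinate as (position - 1) // 3 + 1, which is consistent with the notation used in this section, where nucleotide 49 maps to residue 17.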
The distribution of the strains in the studied sample over the translated CapB, CapC, CapA, CapD, AcpA, and AcpB protein types is given in Tables S13–S18. Strains were assigned to protein types in the same way as to the nucleotide STs. In total, six CapD protein isoforms, five AcpA protein isoforms, four AcpB protein isoforms, three isoforms each of the CapB and CapA proteins, and two isoforms each of the CapC and CapE proteins were described.
The tables do not contain data for the capE gene, since the studied sample proved to be monomorphic for this gene, with the sole exception of the B. cereus BC-AK strain, which carried two SNPs: 49T → G (17L → V) and 138C → T (synonymous). The low variability observed may be partly explained by the short length of the capE gene (only 144 bp): assuming the same mutation frequency across the entire cap operon, the probability of a mutation falling within such a short segment is rather low.
The distribution of strains in the studied sample across genotypes (GTs), each of which represents a particular combination of the STs of the genes under study, is shown in Table 2. The genotypes were numbered according to the principle specified above. The GT that included the reference strain was designated GT1. The remaining GTs were numbered in decreasing order of the number of strains in which they were described, with the last numbers assigned to the GTs characteristic of B. cereus. A total of 17 GTs (D = 0.8908 [0.8604–0.9212]) were described.
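The numbering rule described above can be sketched as a short routine. This is an illustration under stated assumptions: the function name, the strain names other than the reference, and the ST profiles are hypothetical, and ties between equally frequent GTs are broken by profile order, which the text does not specify.

```python
from collections import Counter


def number_genotypes(profiles, reference, cereus_strains=()):
    """Assign GT labels to combined ST profiles.

    profiles maps strain name -> tuple of per-gene ST numbers.
    The profile of the reference strain becomes GT1; the remaining
    profiles are numbered by decreasing strain count, with profiles
    found in B. cereus strains placed last.
    """
    counts = Counter(profiles.values())
    ref_profile = profiles[reference]
    cereus_profiles = {profiles[s] for s in cereus_strains}

    others = [p for p in counts if p != ref_profile]
    # Sort key: B. cereus profiles last, then larger groups first.
    # ASSUMPTION: remaining ties are broken by the profile tuple itself.
    others.sort(key=lambda p: (p in cereus_profiles, -counts[p], p))

    ordered = [ref_profile] + others
    return {profile: f"GT{i}" for i, profile in enumerate(ordered, start=1)}


# Hypothetical toy data: three genes, six strains.
toy_profiles = {
    "Ames Ancestor": (1, 1, 1),
    "strain_a": (1, 1, 1),
    "strain_b": (1, 2, 1),
    "strain_c": (1, 2, 1),
    "strain_d": (2, 1, 1),
    "cereus_x": (3, 3, 2),
}
for profile, label in number_genotypes(
    toy_profiles, "Ames Ancestor", ["cereus_x"]
).items():
    print(label, profile)
```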
A phylogenetic tree showing the relationships between the identified GTs is shown in Fig. 1.
In summary, the present study described the allelic polymorphism of the B. anthracis capsule formation genes and the cap operon expression regulation genes located on the pXO2 plasmid. Their STs were identified, and the strains in the studied sample were clustered based on their GT, which represents a set of STs for the individual genes. In other words, we effectively used multi-virulence-locus sequence typing (MVLST) to study the anthrax pathogen. This method was first applied to this microorganism by the authors in an earlier study, which used the sequences of toxin formation genes located on the pXO1 plasmid. That pipeline was called MVLSTpXO1 genotyping. Accordingly, the genotyping pipeline suggested in this study may be called MVLSTpXO2, while the identified GTs may be referred to as MVLSTpXO2-GTs.
Analysis of the clustering of the studied sample based on MVLSTpXO2-GTs demonstrated that MVLSTpXO2 profiles correlated well both with species identity (they were different for B. anthracis and B. cereus) and with the divergence of evolutionary lineages within the B. anthracis species. However, using MVLSTpXO2, we were not able to accurately divide the sample into groups corresponding to the canSNP groups to which the strains belong. At the same time, a strong correlation has been reported between the MVLSTpXO1-GT of a strain and the canSNP group to which it belongs, as well as, in some cases, an association with a geographical zone. It should be noted that MVLSTpXO2 genotyping, based on gene sequences located on the pXO2 plasmid, had lower resolution (D = 0.6148 [0.488–0.7416]) than the similar approach that used genes located on the pXO1 plasmid (D = 0.8951 [0.8656–0.9245]).
In the A lineage, MVLSTpXO2-GT1 is the most common genotype, which includes strains belonging to almost all canSNP groups. Other GTs in lineage A differed from GT1 by one or two SNPs and contained a single strain or combined small subgroups of strains within the same canSNP group.
The exception was the Korean H9401 strain (canSNP group A.Br.005/007), which differed from MVLSTpXO2-GT1 by two SNPs, namely, capD 796G → A (266V → I) and acpB 495A → G (synonymous). However, it is rather doubtful that these two markers are specific to the A.Br.005/007 group, as they are just as likely to be specific to the H9401 strain. The presence of these two mutations in other strains in the A.Br.005/007 group could not be verified because, at the time of preparing the manuscript, no whole-genome sequencing data were publicly available for other strains in this canSNP group.
Apart from the H9401 strain, two more strains showed the presence of unique markers, namely, the Tangail-1 strain (A.Br.001/002), which contained the capB 230T → C (77V → A) SNP and the Pollino strain (A.Br.011/009), which revealed the presence of the synonymous SNP capC 147T → C.
As stated above, in several cases, not only individual strains but also small groups of strains within canSNP groups in lineage A showed unique markers. For example, the MVLSTpXO2-GT3 genotype included four strains belonging to the A.Br.Aust94 group, which were isolated in the Caucasus region, namely, in Stavropol Krai, the Chechen Republic, and Azerbaijan. These strains form a separate GT based on the presence of the unique synonymous SNP capC 351A → G (capC-ST2). However, this marker is absent in strain 1199, which belongs to the same A.Br.Aust94 group and was isolated in the same region, namely, in Dagestan.
To verify whether the capC 351A → G SNP is specific to the “Caucasian” A.Br.Aust94 strains, its presence was additionally tested in the genomes of 23 strains in this group isolated in different areas outside the Caucasus region (Table 3). We did not detect this SNP in any of the strains included in this additional sample. Thus, it may be suggested that the A.Br.Aust94 capC 351A → G subgroup was formed in the Caucasus region and circulates there together with its “parental” form, and that the presence of the capC 351A → G SNP in a strain may be used as a marker of its region of origin.
The A.Br.Vollum canSNP group was distributed between three GTs in the studied sample. Two strains isolated in Central Asia belong to GT1, the common genotype for lineage A. Four American strains differed from GT1 by the presence of the unique acpB 1381A → G (461I → V) SNP. Three of them belong to GT5 (n = 3), which is characterized by this substitution alone. The fourth belongs to GT12 (n = 1), which contains another unique SNP, acpB 563C → T (188S → L), in addition to this substitution. In a further search for these SNPs, we identified nine strains in the additional A.Br.Vollum sample with different geographical origins belonging to GT1 (n = 4) and GT5 (n = 8) (Table 4). Thus, we may at least say that the A.Br.Vollum group has split into two subgroups with different GTs: GT1 (which includes both strains in this group that were isolated in the former Soviet Union) and GT5. GT12 is apparently a strain-specific genotype.
Markers characteristic of lineages B and C are worth mentioning separately. All the studied strains in lineages B and C were shown to differ from lineage A strains by the presence of the common SNP acpA 853G → A.
The strains in the B.Br.CNEVA canSNP group isolated in Central Europe have a synonymous capD 234T → C SNP. However, the only strain in this group originating from the SRCAMB (Obolensk, Russia) did not carry this mutation. Unfortunately, no data remain on where this strain was isolated, so the presence or absence of the capD 234T → C SNP cannot be considered a conclusive marker of geographical origin. At the same time, the presence of this strain in the SRCAMB (Obolensk, Russia) very likely indicates that it was isolated in the former Soviet Union. In this case, there may be some reason to assume that this SNP does have phylogeographic relevance. However, strain 44 is the only available strain in the B.Br.CNEVA group for which the isolation site is not in Central Europe, which made it impossible to carry out any additional study of the distribution of the capD 234T → C SNP among B.Br.CNEVA strains isolated outside this region.
Evolutionary lineage C (canSNP group C.Br.001) is represented in the studied sample by two strains (2002013094 and 2000031021) isolated in the United States. These strains have the same nucleotide sequences for all the genes studied and therefore fall into the same GT according to the results of phylogenetic analysis of the concatenated sequences. According to the analysis carried out in this study, these strains proved to be phylogenetically closer to B. cereus than the other strains in the studied sample. The following SNPs are shared between these strains and the non-anthracis (B. cereus) strains: capC 239C → T (80T → M), capD 208C → T (70H → Y), capD 667A → G (223K → E), and capD 1135T → A (379F → I).
One of the aims of this study was to examine the anthrax pathogen as the agent of an infection whose outbreaks may potentially be triggered by climatic changes, such as permafrost thawing that leads to the growth of spores previously trapped in frozen soil layers. Several researchers have previously suggested that climatic changes may cause outbreaks of several diseases, including anthrax [26–28].
Strains isolated during the Yamal outbreak, as well as one of the strains isolated from the frozen soil layers in Yakutia, were previously assigned to evolutionary lineage B. Several mutations that are phylogenetically significant markers have been described in the genomes of these strains here and in earlier studies [23, 24]. The discovery of anthrax bacilli in the Holocene alluvial deposits in Yakutia is of considerable interest. The most likely genetic dating indicated that these strains were conserved between the 13th and 16th centuries. However, the geological position of the find, below the seasonal thawing layer, may indicate that they are several thousand years old. If the latter is true, then it appears that anthrax bacteria (apparently as spores) do not accumulate mutations at the rate expected under the molecular clock hypothesis during long-term preservation in permafrost, which leads to the effect of “curiously modern DNA for an ancient bacterium”. The study of such strains is valuable not only in terms of the risk of outbreaks resulting from climate change, but also for a better understanding of the evolutionary history of the anthrax pathogen and other bacilli.
MVLSTpXO2 genotyping of 40 B. anthracis and 2 B. cereus strains whose genomes are deposited in GenBank (Table 1) revealed a 9 bp insertion in the acpA gene in several strains belonging to lineages B and C, which on closer inspection proved to be part of an imperfect tandem repeat, ATA**GATA. The number of tandem repeat units is three in evolutionary lineage A strains, while it is four in lineage B strains belonging to the B.Br.001/002 (n = 3) and B.Br.CNEVA (n = 4) groups and in lineage C. In this way, we identified a previously unknown VNTR locus on the pXO2 plasmid of the anthrax pathogen, which we named VNTRacpA.
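Counting the repeat units at such a locus can be sketched with a regular expression. This is a minimal illustration that assumes the two asterisks in ATA**GATA stand for variable nucleotide positions; the function name and the toy sequences are hypothetical and are not taken from the acpA gene.

```python
import re

# ASSUMPTION: "**" in the motif ATA**GATA marks two variable positions,
# so one imperfect repeat unit is ATA + any two bases + GATA (9 bp).
UNIT_PATTERN = r"ATA[ACGT]{2}GATA"
UNIT_LENGTH = 9


def count_repeat_units(sequence):
    """Return the number of units in the longest uninterrupted run
    of the imperfect 9 bp tandem repeat, or 0 if none is found."""
    runs = re.finditer(rf"(?:{UNIT_PATTERN})+", sequence.upper())
    return max((len(m.group(0)) // UNIT_LENGTH for m in runs), default=0)


# Hypothetical toy sequences with made-up flanks and repeat variants.
three_units = "CCGT" + "ATACCGATA" + "ATATTGATA" + "ATAGAGATA" + "TTGC"
four_units = "CCGT" + "ATACCGATA" * 2 + "ATATTGATA" + "ATAGAGATA" + "TTGC"

print(count_repeat_units(three_units), count_repeat_units(four_units))
```

Applied to assembled acpA sequences, a count of three would correspond to the lineage A allele and a count of four to the allele described for lineages B and C.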
The structure of tandem repeats itself determines the nature of the most likely mutations at these loci, which are insertions and deletions of the repeat unit. Furthermore, the frequency of these events is significantly higher than the average frequency of other mutations in the genome [30, 31]. Therefore, the presence of only two alleles of this tandem repeat in the acpA gene in a sample of 42 phylogenetically different strains suggests its low variability. It may be assumed that four repeats of this motif were characteristic of the B. anthracis “ancestral genome” prior to its segregation into geographically and genetically distinct groups. It is possible that reducing the number of repeats to three resulted in an AcpA protein variant that provided the corresponding strain with some selective advantage, for example, more efficient regulation of pathogenicity factor expression, with the archaic four-repeat variant being preserved only in the lineage B and C strains (Table S5).
Several VNTR loci, including some located on the pXO2 plasmid, have previously been suggested for MLVA genotyping of the anthrax pathogen. However, the loci located on this plasmid represent clusters of 2–3 bp long repeats, so expensive equipment is needed to determine their length. The locus described here is a region containing three or four 9 bp long tandem repeats, which makes it possible to separate the amplified fragments in an agarose gel.
The results showed that the identified polymorphism allows using the number of repeats at this locus as a diagnostic marker to differentiate between the evolutionary lineages. Therefore, we assessed the possibility of using VNTRacpA for MLVA genotyping in this study.
Unfortunately, the method of nucleotide sequence assembly from whole-genome sequencing data employed in this study did not allow the number of repeats in the VNTRacpA region to be determined unambiguously in the strains from the SRCAMB (Obolensk, Russia), including those belonging to lineage B. For this reason, we designed PCR primers flanking the described repeat region and tested them on DNA isolated from the SRCAMB (Obolensk, Russia) B. anthracis strains included in the studied sample. We used the MLVA genotyping approach, which involved amplification of the VNTRacpA locus and separation of the amplified fragments in an agarose gel.
Figure 2 shows an example of using the VNTRacpA locus for MLVA genotyping of B. anthracis strains and for differentiating evolutionary lineages A and B. The results allowed us to differentiate between strains in lineages A and B with high confidence and confirmed the hypothesis that there are two tandem repeat alleles, each specific to one of the two evolutionary lineages of the anthrax pathogen. However, the analysis revealed that the theoretical length of the amplified fragments differed from the observed one. In particular, the theoretical length of the fragment containing three tandem repeats was 245 bp, and that of the fragment containing four repeats was 254 bp (Fig. 2), whereas the observed lengths were 300 and 309 bp, respectively.
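The fragment lengths quoted above can be checked with a short calculation. The sketch below is illustrative only: the flank length is not stated in the text and is inferred here from the theoretical 245 bp fragment with three 9 bp repeats, and all names are hypothetical.

```python
REPEAT_UNIT_BP = 9  # length of one VNTRacpA repeat unit, as stated in the text


def expected_amplicon_length(flank_bp, n_repeats):
    """Amplicon length = non-repeat flanking sequence + repeat array."""
    return flank_bp + n_repeats * REPEAT_UNIT_BP


# ASSUMPTION: the flank length is back-calculated from the theoretical
# three-repeat fragment (245 bp); it is not given explicitly in the text.
FLANK_BP = 245 - 3 * REPEAT_UNIT_BP

theoretical = {n: expected_amplicon_length(FLANK_BP, n) for n in (3, 4)}
observed = {3: 300, 4: 309}  # observed lengths quoted in the text

# Difference between observed and theoretical length for each allele.
offsets = {n: observed[n] - theoretical[n] for n in theoretical}
# Difference between the two alleles as observed in the gel.
observed_allele_gap = observed[4] - observed[3]

print(theoretical, offsets, observed_allele_gap)
```

Both observed fragments exceed their theoretical lengths by the same amount, so the 9 bp difference between the three-repeat and four-repeat alleles is preserved in the observed data.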