Background

Shigella flexneri is the predominant cause of shigellosis in the developing world [1], so appropriate subtyping tools for tracking S. flexneri epidemiology are vital to global public health. The S. flexneri serotyping scheme differentiates isolates serologically based on the expression of the major type-specific somatic antigens (I-VI) and the common group factor antigens (3,4, designated Y, and 7,8, designated X) [2]. The common group factor antigens account for the complex intra-serotype relationships. Currently, there are 15 established serotypes. Traditional S. flexneri serotyping is performed by slide agglutination using antisera raised in rabbits against the type-specific and group factor antigens. Recently, Sun et al. [3] published a multiplex PCR approach for molecular serotyping of S. flexneri. This method differentiates the 15 accepted serotypes based on known differences in (i) the gtr genes encoding the type-specific antigens I, II, IV and V, the group factor antigen 7,8 (X) and the 1c antigen (gtrI, gtrII, gtrIV, gtrV, gtrX and gtrIC); (ii) the oac gene, which mediates O-acetylation in serotypes 1b, 3a, 3b and 4b; and (iii) the wzx6 gene, used for detection of serotype 6.

Public Health England (PHE) holds an historic collection of 16 S. flexneri Type strains isolated between 1949 and 1972. Strains belonging to this set have been used to produce standardised antiserum for the phenotypic serotyping scheme at PHE for over 60 years. To increase the utility of this collection, we report the draft whole genome sequences of the 16 PHE S. flexneri Type strains in order to facilitate a greater understanding of how whole genome phylogenies compare to typing data generated from diagnostic and molecular serotyping targets.

Methods

Bacterial strains

The 16 strains of S. flexneri analysed in this study are shown in Table 1. All strains were serotyped by slide agglutination using commercially available monovalent antisera (Denka Seiken, Japan), monoclonal antibody reagents (Reagensia AB, Sweden) and in-house antisera raised in rabbits [4] against all type-specific somatic antigens and the group factor antigens. All strains were also tested using the PCR serotyping assay described by Sun et al. [3].

Table 1 Comparison of the phenotypic and genotypic serotyping

Genome sequencing and analysis

Genomic DNA was isolated from overnight cultures using the Wizard kit (Promega, Madison, Wisconsin, USA) and was sequenced at the Wellcome Trust Sanger Institute (WTSI) and PHE. Paired-end 100 bp reads were generated on the Illumina HiSeq 2500 instrument (San Diego, California, USA). The resulting FASTQ reads were processed using Trimmomatic v0.27 [5] to remove bases with a Phred score of less than 30 and to discard reads shorter than 50 bp after quality trimming. High quality reads were then mapped to the reference strain, S. flexneri serotype 2a strain 2457T (AE014073.1) [6], using BWA v0.6.2, and variants were called using GATK v2.5.2 in UnifiedGenotyper mode [7]. Positions in the reference genome where the mapping quality was below 30 or the genotype quality was below 50 in any strain were excluded from further analysis. Single Nucleotide Polymorphisms (SNPs) were defined as the sub-set of high quality positions (MQ > 30, GQ > 50) where the base identified differed from the base at the reference position. De novo assembly was performed using Velvet v1.2.3 [8] with the k-mer size selected using VelvetK (Table 2) (http://www.vicbioinformatics.com/software.velvetk.shtml). A maximum likelihood phylogenetic tree was constructed using MEGA v5.1 with 500 bootstraps, based on an alignment of 10632 SNPs called against the S. flexneri serotype 2a strain 2457T reference genome.
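The position filtering and SNP definition described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the Call record, the function names and the toy data are hypothetical, and it assumes that a position is excluded when any strain fails either the MQ or the GQ threshold.

```python
from typing import Dict, List, NamedTuple


class Call(NamedTuple):
    """Hypothetical per-strain call at one reference position."""
    base: str   # base called for the strain
    mq: float   # mapping quality
    gq: float   # genotype quality


MIN_MQ = 30  # thresholds taken from the text (MQ > 30, GQ > 50)
MIN_GQ = 50


def high_quality_positions(calls: Dict[str, Dict[int, Call]]) -> List[int]:
    """Return positions where every strain has a call passing both thresholds."""
    strains = list(calls.values())
    if not strains:
        return []
    # Only positions called in all strains can be compared.
    shared = set(strains[0])
    for per_strain in strains[1:]:
        shared &= set(per_strain)
    return sorted(
        pos for pos in shared
        if all(s[pos].mq > MIN_MQ and s[pos].gq > MIN_GQ for s in strains)
    )


def variant_positions(calls: Dict[str, Dict[int, Call]],
                      reference: Dict[int, str]) -> List[int]:
    """Return high quality positions where at least one strain differs from the reference."""
    return [
        pos for pos in high_quality_positions(calls)
        if any(s[pos].base != reference[pos] for s in calls.values())
    ]


# Toy data (assumed, for illustration only).
reference = {1: "A", 2: "C", 3: "G"}
calls = {
    "strainA": {1: Call("A", 60, 99), 2: Call("T", 60, 99), 3: Call("G", 20, 99)},
    "strainB": {1: Call("A", 60, 99), 2: Call("C", 60, 99), 3: Call("A", 60, 99)},
}
print(variant_positions(calls, reference))
```

In the toy data, position 3 is dropped because one strain has a low mapping quality there, even though the other strain carries a variant; this is the sense in which a low quality call "in any strain" removes the position for all strains.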

Table 2 Genome statistics for the S. flexneri genomes sequenced in this study

Genomic data deposition

Wellcome Trust Sanger Institute sequence data is available in the Short Read Archive under the following accession numbers (serotype): ERS088060 (1a); ERS088061 (1b); ERS088062 (1c); ERS088063 (2a); ERS088064 (2b); ERS088065 (3a); ERS088066 (3b); ERS088067 (3c); ERS088068 (4a); ERS088069 (4b); ERS088071 (5a); ERS088072 (5b); ERS088073 (6); ERS088074 (X); ERS088075 (Y); ERS088076 (E1037).

Findings

Mapping of the sequencing reads to the 4.6 Mbp S. flexneri serotype 2a strain 2457T reference genome resulted in 99- to 455-fold coverage, with between 731 and 47787 SNPs relative to the reference genome (Table 1). De novo assembly produced an average N50 of 31621 and an average of 447 contigs (Table 2).
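For readers unfamiliar with the N50 statistic, the following Python sketch shows the standard definition: the length of the contig at which the cumulative length, counted from the longest contig downwards, first reaches half of the total assembly length. The function name and the example lengths are illustrative assumptions and are not taken from the assemblies reported here.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that takes the running total to half of the
        # assembly length defines the N50.
        if running >= half_total:
            return length
    raise AssertionError("unreachable: the running total always reaches half")


# Illustrative contig lengths (assumed values).
print(n50([5, 4, 3, 2, 1]))
```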

Phylogenetic analysis of the S. flexneri Type strains showed that somatic antigen structure and phylogenetic relationships were broadly congruent for strains expressing type-specific antigens III, IV and V, but not I and II (Figure 1). In addition, serotype 3a was more closely related to the serotype X isolate than to the isolates expressing serotypes 3b and 3c. Serotype 3c was phylogenetically closely related to serotype 3b but differed phenotypically, as it failed to agglutinate with antiserum to the 3,4 (Y) group factor antigen. Serotype 3c is no longer included in the current serotyping scheme [3] as it is very rarely identified (nine isolates submitted to the Gastrointestinal Bacteria Reference Unit (GBRU) since 2004).

Figure 1 Midpoint-rooted phylogenetic tree of S. flexneri Type strains based on 10631 variant positions in the core genome. Node labels are bootstrap values based on 500 bootstraps. S. flexneri serotype 6 is very distantly related to the other S. flexneri strains described in this study (Table 1) and is therefore excluded from the tree.

It has long been reported that the somatic O antigen of S. flexneri serotype 6 differs considerably from that of the other S. flexneri serotypes and that strains of S. flexneri serotype 6 resemble strains of S. boydii immunochemically [9]. Consistent with previous studies and phenotypic information, serotype 6 formed an outgroup to the other S. flexneri serotype sequences (data not shown) [10] and was more closely related to Shigella boydii CDC 3083–94 (GenBank: CP001063.1), differing by 47787 SNPs from S. flexneri 2a (Table 1) and by approximately 7300 SNPs from S. boydii CDC 3083–94 (data not shown).

In 1972, colleagues in our laboratory reported a provisional new serotype, designated E1037. Isolates of this type were frequently submitted to PHE between 2004 and 2013 (276 isolates submitted to GBRU since 2004). Phylogenetically, E1037 is closely related to serotype 4a (Figure 1). Other groups have supported the extension of the accepted classification scheme to include this novel type [11, 12].

The presence of key diagnostic and molecular serotyping genes was also determined. We confirmed the presence of the ipaH gene (the target gene for the detection of Shigella species in diagnostic PCR assays) in all the PHE Type strains. It was not possible to assemble the complete ipaH gene de novo in any strain analysed here, owing to the presence of multiple homologues of ipaH in the genome. However, the entire length of ipaH was detected in all 16 genomes, either by BLAST comparison of multiple contigs or by mapping to the S. flexneri 2a 2457T reference genome.

The molecular serotyping results obtained with the assay of Sun et al. [3] correlated with the phenotypic data for all isolates tested (Table 1). The provisional type, E1037, was the only Type strain to contain a copy of the plasmid-mediated seroconverting lpt-O (opt) gene [12]. In contrast to the serotype 5 strain described by Sun et al. [3], both PHE serotype 5 Type strains encoded an additional oac gene. This gene was intact according to de novo assembly, and its presence was confirmed by PCR [3]. The 5a and 5b serotypes were differentiable by the presence (serotype 5b) or absence (serotype 5a) of gtrX (Table 1).
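The gene-based distinction between serotypes 5a and 5b can be expressed as a simple presence/absence rule. The Python sketch below encodes only what is stated in this report (gtrV marks type-specific antigen V, and gtrX separates 5b from 5a); it does not reproduce the full scheme of Sun et al. [3], and the function name and return labels are hypothetical.

```python
from typing import Set


def call_serotype_5_subtype(genes_detected: Set[str]) -> str:
    """Assign 5a or 5b from detected serotyping genes (illustrative rule only)."""
    if "gtrV" not in genes_detected:
        # Other serotypes are outside the scope of this sketch.
        return "not type V"
    # Within type V, presence of gtrX indicates 5b and its absence indicates 5a.
    return "5b" if "gtrX" in genes_detected else "5a"


# Example gene sets (assumed inputs, e.g. from PCR or assembly searches).
print(call_serotype_5_subtype({"gtrV", "oac"}))
print(call_serotype_5_subtype({"gtrV", "gtrX", "oac"}))
```

The oac gene is deliberately ignored by this rule, since both PHE serotype 5 Type strains carried it and it therefore does not discriminate between them.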

Future directions

The PHE S. flexneri Type strain data set has been used in the validation and evaluation of genotypic and phenotypic assays and has facilitated the study of phylogenetic relationships within this species during outbreak investigations (unpublished observations). Analysis of the genome sequences, in conjunction with the phenotypic serotyping data, provided new insights into this historic strain set. Comparisons with the PCR serotyping scheme highlighted the need to add novel variants [13] in order to maintain a comprehensive collection of relevant Type strains.