Introduction

Infections caused by multidrug-resistant (MDR) Escherichia coli sequence type (ST) 131 are increasing worldwide1,2. ST131 are extraintestinal pathogenic E. coli (ExPEC) associated with bloodstream and urinary tract infections and typically possessing extended-spectrum beta-lactamase (ESBL) genes3,4,5, or more rarely carbapenemase genes6. MDR ST131 is a major cause of ExPEC infection because it has an extensive range of virulence factors7,8,9,10,11 and may be highly pathogenic to hosts. ST131 has been reported in healthcare and community settings around the globe, and its dominant lineage Clade C is fluoroquinolone-resistant (FQ-R)12,13. Clade C has a type 1 fimbrial adhesin gene H30 variant (fimH30)10,14 and can offset the fitness costs of antimicrobial resistance (AMR), plasmid acquisition and maintenance through compensatory mutations at regulatory regions in contrast to FQ-susceptible Clades A and B15.

Historically, E. coli population structure was inferred from allelic variation at seven housekeeping genes to assign ST complexes via multi-locus sequence typing (MLST)16, or at 51 ribosomal genes for rST (ribosomal MLST)17. Outbreak investigation necessitates sufficient biomarker density to allow isolate discrimination, which is only possible with genome sequencing to allow profiling of all AMR genes18,19. Recent work applied core genome MLST (cgMLST) of 2,512 genes, but computational limitations meant examining 288 ST131 genomes where only a single specimen per rST was examined across 1,230,995 SNPs from a 2.33 Mb core genome, within a larger set of 9,479 diverse E. coli20. Given that rST1503 alone may account for ~81% of ST131 and that outbreaks may comprise a single rST21, our understanding of E. coli ST131 transmission dynamics and diversity within single STs may limit inferences of past, present and emerging MDR outbreaks.

A deeper investigation of MDR ST131’s population structure, selective processes and ESBL gene evolution can illuminate its mechanisms of AMR, host colonisation and pathogenicity10,14. Exploring the evolutionary origins, transmission and spread of outbreaks requires extensive sampling to link variation at AMR genes with inferred adaptive and epidemiological patterns22, and previous work suggests a high-resolution large-scale approach to bacterial epidemiological based on genomic data address these questions23.

Deducing evolutionary relationships based on the core genome permits the discovery of novel accessory genome events24. ST131 evolution has been punctuated by plasmid conjugation, plasmid recombination and mobile genetic element (MGE) rearrangements, particularly of the cefotaximase (CTX-M) class of ESBL genes, blaCTX-M-14/15/2725,26,27 that allow third-generation cephalosporin-resistance28. These blaCTX-M gene changes correlate strongly with ST131 subclade differentiation, such that the most common one (C2) is typically blaCTX-M-15-positive29. ESBL and other virulence factor genes likely drive extraintestinal niche colonisation but vary across environments depending on MGE-driven mobility10,15,29,30. When coupled with host immunity, this environmental niche effect results in negative frequency-dependent selection (NFDS) in the ST131 accessory genome, leading to a variable AMR gene repertoire31 that has not yet to be explored within ST131’s subclades. In addition, applying an evolutionary pangenomic approach with core and accessory genome variation within subclades may inform on the origin of new genetic ST131 lineages.

Here, we aggregated all available ST131 Illumina HiSeq read libraries, and automated quality-control, genome de novo assembly, DNA read mapping and ESBL gene screening in the largest ST131 set examined thus far to reconstruct a core genome phylogeny and evaluated the epidemiology of clades and subclades. We established that the two most common C subclades (C1 and C2) co-circulated globally and that their ESBL gene composition was flexible. We hypothesise that the diversity of accessory genomes in isolates with near-identical core genomes due to ST131’s ability to retain newly acquired genes may be driven by environmental pressures.

Results

Collation, screening and generation of 4,071 high quality draft ST131 genome assemblies

We collated accession IDs and linked metadata for 4,071 high quality de novo genome assemblies whose DNA was isolated in 1967–2018 from diverse sources across 170 BioProjects (Supplementary Table S1) following thorough filtering steps (Fig. 1, see Methods). 4,070 genomes from Illumina HiSeq read libraries assembled using Unicycler had with N50s of 195,830 ± 57,037 bp (mean ± standard deviation), lengths of 5,136,890 ± 121,402 bp, 124.3 ± 74.8 contigs and 4,829 ± 142 genes (Supplementary Table S2). The sole PacBio assembly (AR_0058) had five contigs with a N50 of 4,923,470 bp and was 5,132,452 bp long.

Figure 1
figure 1

Methods summary: 4,870 read libraries were downloaded from Enterobase. 718 uninformative ones were excluded. Of those assessed, four were long read libraries (PacBio) and the rest were short paired-end reads (Illumina HiSeq). The adapters of the four PacBio and 4,147 Illumina reads were trimmed using Cutadapt and Fastp, respectively. The resulting adapter-free reads were assembled using Unicycler. An iterative genome quality check eliminated three PacBio and 77 Illumina libraries, yielding 4,071 as the final collection. Cleaned reads after Quast filtering were examined with Roary using Prokka annotation to evaluate the pangenomic diversity. Phylogenetic construction was performed by RAxML on the core genome. The assembled genomes were annotated and screened for AMR genes (including blaCTX-M-14/15/27) and their context. Genetically distinct clusters from the phylogeny were determined using Fastbaps. Distances between the core and accessory genomes of isolate pairs were estimated using Poppunk based on k-mer differences.

We assembled the pangenome of 4,071 ST131 isolates using NCTC13441 as a reference with Roary v3.11.232 resulting in 26,479 genes, most of which were rare (Supplementary Fig. S1). BlaCTX-M-15-positive NCTC13441 was isolated in the UK in 2003, and was from subclade C2. 3,712 genes present in all isolates formed the core genome, and of 22,525 CDSs in the accessory one, 1,018 were shell genes in 15–95% of isolates and 21,507 (81% of the total) cloud genes in less than 15% of isolates (Supplementary Fig. S2). Cloud gene rates were a function of sample size, which explained most (r2 = 0.846, p = 0.00012) of cloud gene number variation, but not that of core (r2 = 0.162), soft core (r2 = 0.258) nor shell (r2 = 0.001) genes.

Population structure classification shows three dominant ST131 C subclades

Clades A (n = 414, 10.1%), Clade B (n = 420, 10.3%) and Clade B0 (n = 13, 0.3%) were relatively rare in comparison to the 3,224 isolates in Clade C (79%) based on fimH typing. This showed 91% of Clade A had fimH41, 66% of Clade B had fimH22, 99% of Clade C had fimH30, and unexpectedly Clade B0 had fimH30, not fimH27 (Table 1). Nine isolates were fimH54, of which eight were in Clade B33.

Table 1 Number of ST131 in Clades A, B, B0, C0, C1 and C2.

Clustering of the 4,071 isolates based on 30,029 core genome SNPs with Fastbaps identified nine genetically distinct subclades (clusters 1–9) and two groups of heterogeneous or rare isolates (clusters 10 and 11) (Fig. 2). Clade A was mainly assigned to clusters 2 (n = 407, 98.3%) and 11 (n = 7, 2.7%) (Supplementary Fig. S3). Clade B isolates were in clusters 1 (n = 90, 21.4%), 3 (n = 96, 22.9%), 5 (n = 64, 15.2%), 7 (n = 115, 27.4%), 8 (n = 4, 1.0%), 10 (n = 34, 8.1%) and 11 (n = 17, 4.0%). All isolates of Clade B0 were in cluster 8, suggesting that it was a subclade within Clade B. As a consequence of the heterogeneity of clusters 10 and 11, whose members came from all three clades, these isolates were assumed to be unassigned to proper groups due to their rare number.

Figure 2
figure 2

(a) A maximum likelihood phylogeny of 4,071 ST131 genome collection and (b) the distribution of these isolates across countries and over time. The phylogeny shows Clades A (n = 414 genomes, pale green), B (n = 420 genomes, dark green), B0 (n = 13 genomes, orange), C0 (n = 52 genomes, blue), C1 (n = 1,121 genomes, bright green) and C2 (n = 2,051 genomes, purple). The phylogeny constructed with RAxML from the 30,029 chromosome-wide SNPs arising by mutation was visualized with iTol. The inner colored strip surrounding the tree represents the subgroups formed from Fastbaps clustering and the cluster (1–11) associated with each isolate. The outer colored strip surrounding the tree is the fimH allele: H41 for Clade A (pink), H22 for Clade B (blue), H30 for Clade C (black) and other alleles (white). The histograms in (b) show the distribution of sampling across countries, and that out of the 4,071 genomes isolated from 1999 to 2018, 2,051 belong to C2 (pink), with most isolates coming from 2002–2017.

Clade C (n = 3,224) had three main subclades determined by Fastbaps: C1_6, C2_4 and C2_9. C1 had 1,121 isolates: 1,113 isolates in Fastbaps cluster 6 (referred to as C1_6) with eight unassigned in cluster 10 (Fig. 3). C2 had 2,051 assemblies: 1,651 in cluster 9 (C2_9) and 386 in cluster 4 (C2_4). C0 (n = 52 isolates) was mainly assigned to cluster 11, consistent with its heterogeneous nature10. 13 C2 genomes were assigned to cluster 10 and one to cluster 6.

Figure 3
figure 3

(a) A maximum likelihood phylogeny of 3,224 ST131 Clade C isolates and (b) the distribution of these isolates across countries and over time. The phylogeny shows C0 (n = 52 genomes, blue), C1 (n = 1,121 genomes, bright green) and C2 (n = 2,051 genomes, purple). As per Fig. 2, the phylogeny constructed with RAxML from the 30,029 chromosome-wide SNPs arising by mutation was visualized with iTol. The inner colored strip surrounding the tree represents the subgroups formed from Fastbaps clustering and the cluster (1–11) associated with each isolate. This indicated most C0 were in Fastbaps cluster 11 (n = 52 genomes) with a single isolate in cluster 10 (grey). Of the 1,121 C1 isolates, 1,113 formed Fastbaps cluster 6 (green) and eight were assigned cluster 10 (black). The C2 subclades corresponded to Fastbaps clusters 9 (C2_9, n = 1,651 isolates, pink) and 4 (C2_4, n = 386, dark purple). The middle colored strip indicates the isolates with an interrupted mppA gene (black) relative to the wild-type intact version (grey). The outer colored strip is the blaCTX-M allele: 2,416 genomes had blaCTX-M-14, blaCTX-M-15, or blaCTX-M-27 genes, 1,790 genomes had blaCTX-M-15 (mainly C2), 177 genomes had blaCTX-M-14 (mainly C1) and 424 genomes had blaCTX-M-27 (mainly C1). The M27PP1 locus gain denoting the C1-M27 lineage (red box) was found in 468 C1_6 genomes (though independent events occurred too). The histograms in (b) show the distribution of sampling across countries, and that since 2002 both C1_6 (green) and C2_9 (pink) were common until the emergence of C2_4 (blue) and to a lesser extent C0 (brown).

Epidemic subclades C1 and C2 co-circulate globally but with stable frequencies

NFDS in the accessory genome driven by AMR gene acquisition, ecological niche colonisation ability and host antigen recognition has stabilised the relative frequencies of ST131 and its clades over time relative to other STs29,31. Here, this pattern could be present for the clades A, B, C and three main C subclades, C1_6, C2_4 and C2_9 spanning 2002–2017 (1,596 out of 1,614 isolates that had year of isolation data) if their relative rates stabilised after emergence (Supplementary Fig. S4).

Of the 1,724 isolates with geographic information, 819 were from Europe, 499 North America, 294 Asia, 80 Oceania, 20 South America, and 12 Africa (Fig. 3) - the remaining 2,347 isolates (58%) had no geographic data (Supplementary Table S3). C1_6 was more common in North America (OR = 1.57, 95% CI 1.25–1.96, p = 0.0004) and rarer in Europe (OR = 0.67, 95% CI 0.53–0.81, p = 0.0004). C2_4 was more frequent in Asia (OR = 1.75, 95% CI 1.18–2.56, p = 0.019) and less so in North America (OR = 0.61, 95% CI 0.40–0.91, p = 0.042). Consequently, there was limited global population structure separating C1 and C2 isolates, suggesting that they have co-circulated globally for some time.

Based on common ancestry with Clade B, the origin of Clade C was in North America because Clade B isolates from 1967–1997 were solely isolated in the USA until one isolate in Spain in 1998. This fitted previous work timing the origin of C to 1985, fimH30 to 1986, and the FQ-R C1/C2 ancestor to 199121 (or 198629), consistent with a North American source. However, the earliest C representative here was from Norway in 1999 and was a blaCTX-M-negative blaTEM-1B-positive FQ-R genome from C2_9 (ERR191263334).

The origin of C2_4 was unclear: although the earliest isolate was in 2008 from the USA, the most basal branches within C2_4 had isolates spanning a range of continents, and C2_4’s long ancestral branch implied that it originated prior to 2008 (Supplementary Fig. S5). The isolates most closely related to C2_4 were a group of 11 assigned to C2_10 that had limited country and year of isolation data bar one from the UK in 2011, one from the USA in 2009, one from the Netherlands (Supplementary Fig. S3). The next most closely related group were nine C2_9 isolates that had no geographic data bar two from Australia in 2017.

Variable bla CTX-M-14/15/27 gene prevalence across time, geography and ST131 subclades

Alignment of the 4,071 assemblies against blaCTX-M-14/15/27 genes and CARD with BLAST showed that these ESBL genes were more common in Clade C (75%) than Clade A (45%) than Clade B (4%) (Fig. 4). Few isolates were both blaCTX-M-14/15-positive (0.4%) or blaCTX-M-15/27-positive (0.3%) (Table 2). Of the 2,408 blaCTX-M-positive Clade C isolates, 1,782 isolates were blaCTX-M-15, 424 isolates were blaCTX-M-27, 177 isolates were blaCTX-M-14, 15 isolates were blaCTX-M-14/15, and 10 isolates were blaCTX-M-15/27 (Supplementary Fig. S6) such that the rates were highest in C2_4 (93.8%) followed by C0 (90%), C2_9 (82.6%) and C1_6 (57%) (Supplementary Fig. S7). The earliest blaCTX-M-positive Clade C genome was from Canada in 2000 (ERR161284, C2_912,). 88% (339 of 386) of C2_4 isolates and 81% (1,338 of 1,651) of C2_9 isolates were blaCTX-M-15-positive with limited geographic or temporal structure (Fig. 4). This reiterated that the C2 ancestor was blaCTX-M-15-positive whose gains of other blaCTX-M genes were likely rare local events.

Figure 4
figure 4

Frequencies of blaCTX-M alleles in C subclades C1_6 (red), C2_4 (green) and C2_9 (blue).

Table 2 ST131 subclades’ blaCTX-M-14/15/27 genes. B0 (n = 13) is not shown because it had no blaCTX-M genes.

C1_6 had a different blaCTX-M gene rates to Clade C: blaCTX-M-27 (38%) was more common than blaCTX-M-14 (14%) or blaCTX-M-15 (6%) (Table 2). Of the blaCTX-M-positive isolates from subclade C1, the earliest was from 2002 (with a blaCTX-M-14-gene). The earliest blaCTX-M-27-positive and blaCTX-M-15-positive ones followed in 2004 and 2008, respectively. C1_6 was found in Japan only until detection in both China and Canada in 2005. BlaCTX-M-14 (but not blaCTX-M-27) was more common in Asia (OR = 4.4, 95% CI 2.21–8.85, p = 0.00007) as previously35, and both blaCTX-M-15 and blaCTX-M-27 were global.

Previous work showed that C1 subclade C1-M27 with the prophage-like genomic island M27PP1 was a key driver of MDR ST13136. Here, 468 (42%) of C1_6 isolates had M27PP1, and 97% of these were blaCTX-M-27-positive (397 of 410 with blaCTX gene data) as expected (Supplementary Table S3)36. These formed a single clade (Fig. 3) that has spread globally since the earliest date for a C1-M27 isolate here in 2004 in Japan (DRR05099736,). 74 of these C1-M27 isolates subsequently acquired the related prophage-like genomic island M27PP2, most likely as a single ancestral event given their monophyly here. Five subclade C2 isolates and one from Clade A had M27PP1: all three of the C2_9 isolates with blaCTX gene data were blaCTX-M-27-positive, and the earliest of these was from Japan in 2010 (DRR051016)36.

Diverse genomic locations of the bla CTX-M-14/15/27 genes’ contigs across ST131 subclades

Screening of the 505,761 contigs from the 4,071 assemblies for blaCTX-M-14/15/27-positive ones identified diverse structures and contexts based on annotation with MARA (Supplementary Table S4). 90 genomes from C2_9 had a blaCTX-M-15 gene in a transposition unit (TU) flanked by a 1,658 bp 5′ ISEcp1 and 3′ orf477 as a 2,971 bp ISEcp1-blaCTX-M-15-orf477Δ TU followed by a 3′ 5.8 Kb Tn2 (Supplementary Fig. S8, Supplementary Table S5), verified previously using long reads26. Normally this TU is on an IncF plasmid35, but selected C2_9 assemblies had a chromosomal insertion of this TU at the mppA gene (Supplementary Fig. S9) as shown previously21. Here, 13 additional genomes with source information had this insertion, indicating that C2_9 had spread to Thailand (as in14), Singapore and the Democratic Republic of Congo by 201437, consistent with a global spread. One C2_4 isolate from Pakistan in 2012 also had this TU inserted at mppA (SRR161005138), affirming that insertions at mppA recur due to local homology to ISEcp1’s 3′ inverted repeat39,40.

Inter-clade but not intra-clade accessory genome divergence

Accessory genome composition varies across genetically distinct groups due to ecological niche specialisation15: this was supported here at the clade level across the 4,071 isolates by a positive correlation of pairwise core and accessory genome distances measured with Poppunk23 (Fig. 5). This matched work on an E. coli dataset with n = 218 ST13129, and was evident in a higher shell gene number in Clade B compared to Clades A and C (Table 3), suggesting higher diversity in Clade B (Supplementary Fig. S10).

Figure 5
figure 5

The distribution of core (π, x-axis) and accessory pairwise genome distances (a, y-axis) with blue dots indicating isolate pairs and the contours indicating dot density (higher in yellow). Top left: All 4,071 assemblies displayed pairwise differences such that the contours indicated the three main clades: Clade A at π = 0.0038, a = 0.15; Clade B at π = 0.0014, a = 0.18; Clade C at both π = 0.0005, a = 0.08 and π = 0.0001, a = 0.09. Top right: 1,113 subclade C1_6 assemblies had a peaks mainly at π ≤ 0.001, a = 0.06. Bottom left: 386 subclade C2_4 assemblies had peaks at π = 0.0006, a = 0.045 and π = 0.0001, a = 0.040. Bottom right: 1,651 subclade C2_9 assemblies had peaks at π = 0.0007, a = 0.065 and π = 0.0, a = 0.090. Results for 2,416 blaCTX-M-positive Clade C assemblies and 52 subclade C0 assemblies were similar. Within subclades C1_6, C2_4, C2_9, isolates had more diverse accessory genomes compared to their core ones.

Table 3 The pangenome composition of ST131 clades and subclades showed stable core, soft core and shell genomes with open pangenomes (alpha).

Previous work on E. coli41 employed a metric of pangenome openness (alpha) that was similarly applied to our Roary pangenome results here to compare with previous findings that n = 648 Clade C isolates had a marginally more open genome (smaller alpha) than n = 140 from Clade B, and both were more open than n = 70 Clade A isolates31. Although our initial results revealed more open pangenome (alpha) in Clade B (0.762) than Clade A (0.807) or Clade C (0.806) or the whole collection (0.823), alpha was higher for small (<250) sample sizes here (Fig. 6) as indicated before41. Alpha estimates averaged across the sample size placed Clade C as more open than Clade A or Clade B (Table 3), highlighting a partial dependence of alpha on sample size that was removed once the sample sizes >250 when the relative rate of new genes became constant (Supplementary Fig. S11). The average alpha for 250 n 386 showed Clade C (0.716) was more open than Clade B (0.753) than Clade A (0.795) or the whole collection (0.771) (Supplementary Fig. S12) The upper limit of n = 386 was corresponded to the smallest group size, which was for C2_4.

Figure 6
figure 6

Top: The average number of genes in the ST131 pangenome (y-axis) increased as the 4,071 genomes were added (x-axis) indicating an open pangenome for the whole collection (black), as well as its clades and subclades: Clade A (blue), Clade B (green), Clade C (grey), subclade C1_6 (red), subclade C2_4 (orange) and subclade C2_9 (brown). Below: Alpha varied with numbers of genomes sampled (shown here for >30 genomes) and was more independent from sample number once the number of genomes examined about >250. Note that the x-axis’ log10 scale.

Within the C subclades, the pairwise core and accessory genome distances were not correlated: the accessory genomes varied extensively even with nearly identical core genomes (Fig. 5). Subclade C2_4 had a more open pangenome (0.696) than subclade C1_6 (0.755) or subclade C2_9 (0.822), which was evident when adjusting for the differing sample sizes (Fig. 6), though the average alpha placed C2_4 (0.754) and C1_6 (0.749) as about equally more open than C2_9 (0.799, Table 3). As above, the relative alpha levels were retained when alpha was averaged from >250 isolates up to the smallest (subclade C2_4) sample size with C2_4 (0.705) and C1_6 (0.731) as more open than C2_9 (0.742) (Supplementary Fig. S12).

Given that the high accessory genome diversity within subclades independent of core genome composition, the observed shell gene numbers were compared to expected values adjusted for sample number and gene frequency category change to investigate shell gene overlap across clades and subclades (see Methods). Pooled groups with divergent accessory genomes should have more net shell genes, whereas similar accessory genomes should have fewer shell genes. Clades B and C together had 6% less shell genes, whereas Clade A had an excess of 1% when combined with Clade C and 6% with Clade B (Supplementary Table S6). Within subclades, there was a small shell gene excess for C1_6 combined with C2_9 (3%), but C2_4′s shell gene composition differed from both C1_6 (22% excess) and C2_9 (23% excess, Supplementary Table S6). The same trend was observed for C2_4 combined with Clade A (41% excess) or Clade B (5%) in contrast to C1_6 (16% with Clade A, -8% with Clade B) and C2_9 (16% with Clade A, -11% with Clade B), indicative of more unique shell genes in subclade C2_4.

Discussion

By collating all available ST131 genomes to produce 4,071 high-quality draft assemblies, we reconstructed their phylogenetic relationships to show that ST131 was dominated by subclades C1 and C2. For isolates with blaCTX-M gene data, subclade C2 was 98% blaCTX-M-15-positive in contrast to C1 that had either blaCTX-M-27 (66%) or blaCTX-M-14 (24%) genes. Although the subclade C1 ancestor may have been blaCTX-M-14-positive, blaCTX-M-27’s increasing levels in C1 and its higher ceftazidime resistance due to a D240G substitution also in blaCTX-M-1542 indicated it will become more common. Although the subclades had different origins and ancestral ESBL gene compositions, both have become global with relatively consistent frequencies and minor differences in rates due to differing evolutionary patterns after emerging14. This worldwide co-circulation suggested newer lineages could become globally disseminated, with implications for infection control if they have altered host adhesion abilities (like fimH3043) or AMR variants (like FQ-R or blaCTX-M-15). This was highlighted by the emergence of C2_4, the C2_9 subgroup with an ISEcp1-blaCTX-M-15-orf477-Tn2 TU mppA chromosomal insertion, and many other contemporary examples such as blaOXA-48-producing ST13144. Within C1, the blaCTX-M-27-positive C1-M27 lineage emerged in this study as an increasingly common cause of infection globally. Tracking plasmid, MGEs and ESBL genes must be a key component of disease monitoring to consider potential future bacterial outbreaks’ spectrum of AMR.

Horizontal DNA transfer allows E. coli adapt to new ecological niches and contributes to its dynamic accessory genome45 where the cloud gene number increases with isolate number and diversity. Our analysis of this large collection’s core (3,712) and accessory (22,525) genes extended previous work showing that 283 predominantly ST131 isolates had 16,236 genes in an open pangenome with a core of 3,079 genes46, 21% less than the core genome count here. Nonetheless, ST131’s accessory genome may be streamlined: a more genetically diverse set of 1,509 E. coli including 266 ST131 had a core genome of 1,744 genes and a 62,753 cloud genes29, and an E. coli-Shigella core genome had 2,608 genes among a total of 128,193 genes41.

NFDS posits that genes associated with adaptation to new hosts, antibiotics and competitors using the same resources remain at intermediate frequencies31. Pangenome openness and shell gene sharing across clades supported inter-clade structure resulting from ecological specialisation47, with Clade A more different to Clades B and C. Within subclades, isolates with minimal core genome differences could have divergent accessory genomes, implying that plasmid, ESBL gene and MGE changes may be detected better using pangenomic approaches than assessing the core alone48.

Global coordination of data processing and bioinformatic interpretation can help identify, trace and control disease outbreaks49, for which resolving recent transmissions may be limited by sampling50. Expanding numbers of non-human isolates and the diversity of geographic regions sampled would help clarify potential sources of E. coli’s ESBL genes, for which there was no evidence of retail meat51 or livestock52 as reservoirs for blood stream infections thus far, though transfer of bacteria has occurred53. Better epidemiological information coupled with genome-sequencing54 could allow inference of adaptations across lineages55, such as blaCTX-M-15-positive ST8313, a putative descendant of ST131 subclade C25.

Methods

Study selection and data extraction

Data on 4,870 E. coli ST131 genomes and linked metadata was collected using an automated text-mining algorithm using a Python implementation of Selenium (Selenium-python.readthedocs.io) from Enterobase (https://enterobase.warwick.ac.uk)56 on the 10th of September 2018 as previously described57. This was used to download read libraries the European Nucleotide Archive (ENA) (www.ebi.ac.uk/ena)58 and NCBI Short Read Archive (SRA) databases as FASTQ files, restricted to complete libraries not labelled as “traces” (Fig. 1). Of the initial 4,870 read libraries, 4,264 were paired-end (PE) Illumina HiSeq ones and four were PacBio, in addition to the PacBio-sequenced NCTC13441 genome used as a reference in this study. 495 libraries predominantly from Illumina MiSeq platforms were not examined to avoid platform-specific artefacts.

Illumina HiSeq read data quality control, trimming and correction

Of the above 4,264 PE Illumina HiSeq read libraries, 4,147 passed stringent quality control. This was implemented using Fastp v0.12.359 to trim sequencing adapters, remove reads with low base quality scores (phred score <30) or ambiguous (N) bases, correct mismatched base pairs in overlapped regions and cut poly-G tracts at 3′ ends (Supplementary Table S2). Individual bases in reads were corrected by BayesHammer in SPAdes v3.1160. Quality control metrics were examined at each step: across the whole collection as a batch report using MultiQC v1.461 and on individual FASTQ files using FastQC v0.11.8 (www.bioinformatics.babraham.ac.uk/projects/fastqc/). 117 (2.7%) Illumina HiSeq libraries did not pass quality control.

Illumina HiSeq read library genome assembly

The 4,147 Illumina HiSeq libraries passing quality control were de novo assembled using Unicycler v4.6 in bold mode to merge contigs where possible62. This used SPAdes v3.1263 to generate an initial assembly polished by Pilon v1.2264, which ran iteratively until no further corrections were required by the optimal assembly. This approach was similar to Enterobase’s56, though Enterobase used BBMap in BBTools65, SPAdes v3.10 and BWA66 during assembly20.

Reference PacBio genome quality control and assembly

The ST131 reference genome NCTC13441 was isolated in the UK in 2003 and was in subclade C267. It had a 5,174,631 bp chromosome with 4,983 protein-coding genes and one pEK499-like type IncFIA/FIIA plasmid with two blaCTX-M-15 gene copies (accession ERS530440). Although four further PacBio read libraries were initially included to test genome assembly contiguity and ESBL gene context using longer read libraries, only one passed assembly annotation screening (AR_0058, accession SRR574973238). Its adapters were removed using Cutadapt v1.1868 followed by excluding duplicate reads with Unicycler v0.4.6. Base correction was implemented during genome assembly with Unicycler via SPAdes v3.12, and the genome assembly was iteratively polished by Racon v1.3.1 until no further corrections were required69. This 5,132,452 bp assembly had five contigs and 5,506 genes was assigned to C1 and had no ISEcp1.

Quality checking and annotation identifies 4,071 genome assemblies for investigation

For the 4,147 Illumina HiSeq assemblies and single PacBio assembly, quality was verified with Quast v5.070 based on the N50, numbers of predicted genes and open-reading frames, and numbers of contigs with mis-assemblies. The quality of the short read de novo assemblies was comparable to previous work whose requirements required assembly length in the range 3.7–6.4 Mb with <800 contigs and <5% low-quality sites20. Initial annotation of 4,147 Illumina HiSeq assemblies using Prokka v1.1071 suggested 77 assemblies had a distinct gene composition and should be excluded because they were either genetically divergent, did not assemble adequately, or had sub-standard read libraries. As a result, 4,070 Illumina HiSeq genome assemblies were selected (Supplementary Table S2) and aligned against the reference genome NCTC13441 and PacBio assembly AR_0058 (Supplementary Table S3). These assemblies had lengths in the range 4.3–6.1 Mb with a mean and standard deviation of 5,137 ± 121 Kb. This identified 4,829 genes on average per assembly (range 3,942 to 5,749, Supplementary Fig. S1). The variation in numbers of genes per assembly was largely explained by the total assembly length (r2 = 0.959). 53% of the 4,071 had no source data and 12% of the remainder had a non-human source.

Pangenome analysis to identify the core and accessory genomes

We created a pangenome based on the 4,072 annotation files using Roary v3.11.232 with a 100% BLAST v2.6.0 identity threshold using the MAFFT v7.310 setting72. The resulting concatenated core CDS alignment spanning 1,244,619 bases and 3,712 genes scaffolded using NCTC13441 was used for core genome analyses. 242 soft core genes were found that may have been due to assembly errors or other artefacts. Pangenomes for each clade, C subclade and various combinations were also created for accessory genome evaluation.

Phylogenetic reconstruction, population structure and subclade assignment

A maximum likelihood phylogeny was generated based on the core genome alignment of 4,071 genome assemblies with NCTC13441 as a reference across 30,029 SNPs (with 26,946 alignment patterns) for 50 iterations of RAxML v8.2.11 with a GTR model and gamma substitution rate heterogeneity73. 88% (3,585) of the assemblies were genetically unique. The total execution time on an Ubuntu v16.04 computer server with 256 Gb RAM using 52 threads was 24.4 days. Phylogenies were drawn and annotated using iTol v4.3.274.

Clade classifications were initially based on published ST131 fimH phylogenetic analyses associating clade A with fimH41, B with fimH22, B0 with fimH27, and C with fimH3021. To classify the large number of isolates in the C subclades, we clustered the 30,029 core genome SNPs as a sparse matrix using a hierarchical Bayesian clustering algorithm implemented in Fastbaps v1.0 (Fast Hierarchical Bayesian Analysis of Population Structure75) in R v3.5.3 with packages ape v5.3, ggplot2 v3.1.1, ggtree v1.14.676, maps v3.3.0 and phytools v6.60. This used default parameters except for a Dirichlet prior variance of 0.006.

The C1-M27 lineage was identified based on the prophage-like 11,894 bp M27PP1 region using the 3′ end of accession LC20943036, which was largely intact in the isolates with this element. The 19,352 bp M27PP2 prophage-like region was also examined using the 5′ end of the same accession. The monophyletic nature of the C1-M27 group was verified by examining their phylogenetic proximity, which contrasted with the paraphyletic M27PP2-positive C1 isolates, as well as other isolates from Clade A and C2 that were diverse.

ESBL gene screening and contig visualisation

We screened for contigs with blaCTX-M-14/15/27 genes across the 4,071 assemblies’ 505,761 contigs using BLASTn77 alignment of these three genes individually, and the Comprehensive Antibiotic Resistance Database (CARD) v3.0 requiring 100% identity for any match with a contig. Selected blaCTX-M-14/15/27-positive contigs were visualised using the Multiple Antibiotic Resistance Annotator (MARA)78, R v3.5.2 and EasyFig v2.2.279 to examine the local contig, MGE and gene annotation. A minority of isolates had incomplete contigs due to the small contig lengths. Frequencies of ST131 clades, subclades and their blaCTX-M-14/15/27 genes across geographic regions and time were examined with R packages dplyr v8.0.1, forcats v0.4.0, ggplot2 v3.1.1, ggridges v5.1, grid v3.5.2, plotly v4.9.0, plyr v1.8.4, purr v0.3.2, questionr v0.7.0, readr v1.3.1, rentrez v1.2.1, stringr v1.4.0, tibble v2.1.1, tidyr v0.8.3, tidyverse v1.2.1 and XML v3.98–1.19.

Pangenome analysis to find shared and unique accessory genomes

Roary assigned genes to the core (c) and the accessory genomes, including the soft core (s), shell (a) and cloud (d). The expected shell gene number (E[ap]) from the Roary output for a given pooled set of isolates (p) taken from groups i = 1..k was determined based on the shell gene number of group i (ai) weighted by the sample size (ni) corrected for the deficit in core (ci) and soft core (si) gene numbers: \({\boldsymbol{E}}[{{\boldsymbol{a}}}_{{\boldsymbol{p}}}]=\frac{{\sum }_{{\boldsymbol{i}}=1}^{{\boldsymbol{k}}}{{\boldsymbol{n}}}_{{\boldsymbol{i}}}{{\boldsymbol{a}}}_{{\boldsymbol{i}}}}{{\sum }_{{\boldsymbol{i}}=1}^{{\boldsymbol{k}}}{{\boldsymbol{n}}}_{{\boldsymbol{i}}}}-\frac{{\sum }_{{\boldsymbol{i}}=1}^{{\boldsymbol{k}}}{{\boldsymbol{n}}}_{{\boldsymbol{i}}}({{\boldsymbol{c}}}_{{\boldsymbol{i}}}-{{\boldsymbol{c}}}_{{\boldsymbol{p}}})}{{\sum }_{{\boldsymbol{i}}=1}^{{\boldsymbol{k}}}{{\boldsymbol{n}}}_{{\boldsymbol{i}}}}-\frac{{\sum }_{{\boldsymbol{i}}=1}^{{\boldsymbol{k}}}{{\boldsymbol{n}}}_{{\boldsymbol{i}}}({{\boldsymbol{s}}}_{{\boldsymbol{i}}}-{{\boldsymbol{s}}}_{{\boldsymbol{p}}})}{{\sum }_{{\boldsymbol{i}}=1}^{{\boldsymbol{k}}}{{\boldsymbol{n}}}_{{\boldsymbol{i}}}}\). The excess fraction of shell genes observed was (ap − E[ap])/E[ap]. Similarly, the expected cloud gene number E[dp] was computed from the cloud gene number of group i (di) weighted by the sample size (ni) adjusted for the difference in core (ci), soft core (si) and shell (ai) gene numbers: \({\boldsymbol{E}}[{{\boldsymbol{d}}}_{{\boldsymbol{p}}}]=\frac{{\sum }_{{\boldsymbol{i}}=1}^{{\boldsymbol{k}}}{{\boldsymbol{n}}}_{{\boldsymbol{i}}}{{\boldsymbol{d}}}_{{\boldsymbol{i}}}}{{\sum }_{{\boldsymbol{i}}=1}^{{\boldsymbol{k}}}{{\boldsymbol{n}}}_{{\boldsymbol{i}}}}-\frac{{\sum }_{{\boldsymbol{i}}=1}^{{\boldsymbol{k}}}{{\boldsymbol{n}}}_{{\boldsymbol{i}}}({{\boldsymbol{c}}}_{{\boldsymbol{i}}}-{{\boldsymbol{c}}}_{{\boldsymbol{p}}})}{{\sum }_{{\boldsymbol{i}}=1}^{{\boldsymbol{k}}}{{\boldsymbol{n}}}_{{\boldsymbol{i}}}}-\frac{{\sum }_{{\boldsymbol{i}}=1}^{{\boldsymbol{k}}}{{\boldsymbol{n}}}_{{\boldsymbol{i}}}({{\boldsymbol{s}}}_{{\boldsymbol{i}}}-{{\boldsymbol{s}}}_{{\boldsymbol{p}}})}{{\sum }_{{\boldsymbol{i}}=1}^{{\boldsymbol{k}}}{{\boldsymbol{n}}}_{{\boldsymbol{i}}}}-\frac{{\sum }_{{\boldsymbol{i}}=1}^{{\boldsymbol{k}}}{{\boldsymbol{n}}}_{{\boldsymbol{i}}}({{\boldsymbol{a}}}_{{\boldsymbol{i}}}-{{\boldsymbol{a}}}_{{\boldsymbol{p}}})}{{\sum }_{{\boldsymbol{i}}=1}^{{\boldsymbol{k}}}{{\boldsymbol{n}}}_{{\boldsymbol{i}}}}\). The excess fraction of cloud genes observed was (dp - E[dp])/E[dp].

Pangenome openness (alpha) was quantified from Roary results as Δn = kN−alpha where Δn was the number of new genes across N genome assemblies with n genes in total41 with R packages poweRlaw v0.70.2, igraph v1.2.4.1 and VGAM v1.1.1 (Supplementary Fig. S11). This power-law regression approximated Heaps’ law (from n = ĸNgamma for alpha = 1−gamma) such that an open pangenome has alpha <1 and a closed one alpha >180. Previously, diverse E. coli had alpha = 0.625 where alpha had a partial negative correlation with N41. Similarly, alpha was ~0.877 for ST131 clade C, ~0.898 for B, ~0.958 for A, and ~0.951 for all ST131, suggesting alpha was higher when genetically distinct clades were combined31.

Accessory genome composition across clades and subclades

The relative pairwise genetic distances of the core (π) and accessory (a) genomes were compared for each clade, each C subclade and all blaCTX-M-positive clade C isolates using Poppunk (Population Partitioning Using Nucleotide Kmers), which can distinguish closely related genomes23. Poppunk used variable length pangenome k-mer comparisons with Mash v2.181 and a Gaussian mixture model to examine the correlation of π and a per sample pair. This annotation- and alignment-free strategy complemented the approaches of Fastbaps, RAxML and Roary.

List of abbreviations

AMR: Antimicrobial resistance

CARD: Comprehensive Antibiotic Resistance Database

cgMLST: Core genome MLST

CTX-M: Cefotaximase

E. coli: Escherichia coli

ENA: European Nucleotide Archive

ESBL: Extended-spectrum beta-lactamase

Fastbaps: Fast Hierarchical Bayesian Analysis of Population Structure

FQ-R: Fluoroquinolone-resistant

Inc: Incompatibility group

MARA: Multiple Antibiotic Resistance Annotator

MDR: Multidrug-resistant

MGE: Mobile genetic element

MLST: Multi-locus sequence typing

NCBI: National Centre for Biotechnology Information

NFDS: Negative frequency-dependent selection

PE: Paired-end

Poppunk: Population Partitioning Using Nucleotide Kmers

rST: Ribosomal MLST

SRA: Short Read Archive

ST: Sequence type

TU: Transposition unit

SNP: Single Nucleotide Polymorphism