Introduction

Healthcare-associated infections (HAIs) are common throughout the world, and in industrialised countries the burden of HAIs exceeds that of all other communicable diseases combined1. The major causative agents are opportunistic bacterial pathogens, which are generally viewed as commensals but can take advantage of the weakened immune system and altered microbiome of hospitalized patients to cause disease2,3,4,5. Particularly concerning are HAIs caused by Gram-negative bacteria including Klebsiella pneumoniae, which is intrinsically resistant to ampicillin and increasingly displays acquired resistance to multiple additional drugs (multidrug resistance, MDR). Notably, this organism readily acquires extended-spectrum beta-lactamase (ESBL) or carbapenemase genes that confer resistance to third-generation cephalosporins or carbapenems, respectively, leaving very limited options for antimicrobial therapy6. K. pneumoniae is amongst the leading causes of HAIs in hospitals globally, including urinary tract infections (UTI), pneumonia, wound infections and sepsis2,7. Recent studies confirm it is also a leading cause of neonatal sepsis in Africa8 and Asia9. K. pneumoniae is a clear opportunist, colonizing the human gut, nasopharynx and skin at high frequency. K. pneumoniae is the ‘K’ in the ESKAPE pathogens, the group of opportunistic pathogens that together account for the majority of clinically significant MDR HAIs3,10. ESBL and carbapenemase-producing (CP) K. pneumoniae combined make up the fastest-growing cause of drug-resistant infections in European hospitals11.

Asymptomatic K. pneumoniae colonization has been shown to be a source of HAIs, with attack rates estimated between 4–35% in colonized hospital inpatients12,13,14,15,16,17. Comparatively little is understood about K. pneumoniae pathogenesis, in contrast with closely related ‘true’ pathogens of the family Enterobacteriaceae, such as Salmonella or Shigella. Nearly all K. pneumoniae express the basic traits required for human infection, which are core to all strains: expression of O antigen lipopolysaccharide (LPS) and polysaccharide capsule (K antigen) (encoded by diverse K and O loci), the siderophore enterobactin (ent locus), and type I and type III fimbriae (fim and mrk loci)18,19. So-called ‘hypervirulent’ K. pneumoniae clones (which are typically hypermucoid, aerobactin-producing, and carry the K. pneumoniae virulence plasmid (Kp-VP) and serum-resistant capsular (K) types such as K1, K2, K518,20,21) are associated with community-acquired invasive infections such as liver abscess and ophthalmitis22,23. However, there is no evidence such strains are more likely than others to cause opportunistic infections in the hospital setting. The acquired siderophore yersiniabactin (ybt locus) has been identified as a virulence factor relevant to nosocomial infection;22,24,25,26 indeed a recent study of infection risk amongst patients colonized with CP sequence type (ST) 258 K. pneumoniae confirmed carriage of yersiniabactin (ybt) and modified O antigen synthesis as independent bacterial risk factors for subsequent infection26. K. pneumoniae also shows a propensity for nosocomial spread within the hospital environment, with sinks, drains, medical devices and cleaning products all being demonstrated as potential reservoirs of infection27,28,29. However, the relative contribution of patients’ own gut bacteria, vs nosocomially acquired bacteria, to the burden of K. pneumoniae HAI remains unclear, as does the question of whether bacterial features contribute to propensity for nosocomial infection.

Here we aimed to dissect the burden of K. pneumoniae HAIs in a tertiary hospital in Australia, via whole genome sequencing (WGS) of all clinical isolates identified in the hospital microbiological diagnostic laboratory for a one-year period. WGS studies have previously unveiled extensive diversity in the global K. pneumoniae population, which comprises hundreds of phylogenetically distinct lineages with variable gene content18,22. Additional diversity is harboured by related species in the K. pneumoniae species complex (KpSC), which includes seven species and subspecies that are difficult to distinguish by MALDI-TOF or biochemical tests30,31,32. However, the implications of this population structure for the role of K. pneumoniae as an opportunistic pathogen in the hospital setting are unclear, due to the limited focus of most WGS studies on either (i) MDR (CP or ESBL) HAI or (ii) hypervirulent community-acquired infections. Each of these clinical manifestations is associated with just a subset of clonal lineages (a few dozen common CP lineages, and fewer than a dozen hypervirulent ones18,33,34,35). Hence WGS studies focused on these subsets of infections reveal little about the diverse agents underlying the general burden of opportunistic infections. WGS has the advantage that it can also be used to identify nosocomial transmission clusters, although this has mainly been applied to investigation of CP or ESBL HAIs36,37,38,39, or restricted to blood isolates18,40,41, and so the overall contribution of nosocomial transmission to total burden of K. pneumoniae HAI has not previously been well characterised.

Results

Infection burden

A total of 362 clinical isolates of KpSC were identified at the microbiological diagnostic laboratory during the 1-year study period, collected from 318 patients (Fig. 1 and Table 1). The patients were 55% female and ranged in age from 20 to 97 years old, with median age 70 years. The median age for females was significantly higher than for males (75 vs 67, p = 0.001 using Wilcoxon rank-sum test, see Fig. 1d). The majority of patients had UTI (66%), 15% had pneumonia and 10% wound/tissue infections (Fig. 1a and Table 1). Ten percent of patients had disseminated infections (bloodstream and/or cerebrospinal fluid (CSF) isolates); most had no other isolate and the primary site of infection was not known (Table 1). UTIs were more common in females whilst pneumonia was more common in males (see Fig. 1d and Table 1). No statistically significant differences were observed in gender or age distribution for other infection types (Table 1). KpSC clinical isolates originated from specimens taken in 49 clinical units/wards; those contributing the greatest number of isolates were the emergency department (n = 72, majority causing UTIs (n = 54) plus 12 disseminated infections), ICU (n = 41, majority causing pneumonia (n = 25) plus four disseminated infections), and haematology ward (n = 15, majority causing pneumonia (n = 5) or disseminated infections (n = 7)). Sixteen isolates were collected in outpatient clinics (n = 15 UTI, n = 1 pneumonia).

Fig. 1: Characteristics of clinical isolates identified as K. pneumoniae in the hospital microbiological diagnostic laboratory.

a Monthly isolate counts (total 362 isolates, from 318 patients), coloured by specimen type (maroon = disseminated, yellow = urine, blue = respiratory, green = wound, grey = other). Red line shows frequency of 3rd generation cephalosporin resistant (3GCR) isolates per month, according to the right-hand y-axis (central point indicates frequency, error bars indicate standard error). b Specimen types (coloured as per panel a) according to inferred mode of acquisition of infection (inferred from hospital contact in past 30 days (nosocomial) or past year (healthcare associated), see Methods). c Bars show proportion of isolates resistant to each drug (AK, amikacin; AMC, amoxicillin–clavulanic acid; AMP, ampicillin; CAZ, ceftazidime; CIP, ciprofloxacin; CRO, ceftriaxone; FEP, cefepime; CN, gentamicin; MP, meropenem; NOR, norfloxacin; SXT, trimethoprim-sulfamethoxazole; TIM, ticarcillin–clavulanic acid; W, trimethoprim; TOB, tobramycin; TZP, tazobactam-piperacillin). MDR = resistant to ≥3 drug classes other than ampicillin; 3GCR = resistant to CRO and/or CAZ. d Characteristics of patients (n = 318) from whom K. pneumoniae was isolated. Age distribution: points indicate individual patients (those with ≥1 3GCR isolate are coloured red). Boxplot elements are: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, all individual patients. Specimen type: stacked bars are coloured as per panel a to indicate the first specimen type, for each patient, from which K. pneumoniae was isolated. e 3GCR rates (amongst first isolate per patient, n = 318 patients), stratified by (i) day of onset of infection, relative to day of hospital admission = 0; (ii) mode of acquisition, inferred from hospital contact in past 30 days (nosocomial) or past year (healthcare associated) (see Methods); (iii) specimen type (note one patient had 3GCR disseminated infection, after an initial susceptible UTI).

Table 1 Characteristics of patients with KpSC clinical isolates.

Only 40% of patients had their first clinical KpSC isolate collected >2 days into their current hospital admission (i.e., meeting the standard definition of nosocomial onset) (Table 1). Taking into account prior contact with the hospital network in the last 1–12 months to ascertain likely mode of acquisition (nosocomial: onset on day >2 and/or prior hospital admission within 30 days; HA: hospital or outpatient contact in last 12 months and onset on day ≤2; CA: no such contact and onset on day ≤2; see Methods for details), we estimate that 49% of KpSC infections were nosocomial and a further 37% were HA. Just 13% of infections could be considered CA, of which 67% were UTIs and 19% wound infections (Fig. 1b). Forty percent of CA infections were admitted via the emergency department (n = 12 UTI, 3 disseminated and 2 wound infections) and 12% via ICU (n = 4 wound infections and 1 pneumonia). Pneumonia was significantly more common amongst nosocomial acquired infections (21%; vs 9.5% of HA and 7.1% of CA, p = 0.005 for test of difference in proportions), whilst wound infections were significantly more common amongst CA infections (19%; vs 5% of HA and 10% of nosocomial, p = 0.045 for test of difference in proportions) (see Fig. 1b).
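The acquisition categories defined above can be written as a small decision rule. The following is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, day 0 is taken to be the day of admission (as in the Fig. 1e legend), and the full criteria in the Methods may include details not captured here.

```python
def classify_acquisition(onset_day, admitted_within_30_days, contact_within_12_months):
    """Assign a likely mode of acquisition to one infection.

    onset_day: day of first isolate relative to admission (admission day = 0).
    admitted_within_30_days: True if the patient had a prior hospital admission
        in the 30 days before the current one (hypothetical field name).
    contact_within_12_months: True if the patient had hospital or outpatient
        contact in the previous 12 months (hypothetical field name).
    """
    # Nosocomial: onset after day 2 and/or a prior admission within 30 days.
    if onset_day > 2 or admitted_within_30_days:
        return "nosocomial"
    # Healthcare associated (HA): early onset with contact in the last 12 months.
    if contact_within_12_months:
        return "HA"
    # Community acquired (CA): early onset and no such contact.
    return "CA"


if __name__ == "__main__":
    print(classify_acquisition(5, False, False))  # late onset
    print(classify_acquisition(1, False, True))   # early onset, prior contact
    print(classify_acquisition(0, False, False))  # early onset, no contact
```

Checking the nosocomial condition first reflects the "and/or" wording: a recent admission overrides an early onset day.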

The frequencies of AMR phenotypes per KpSC isolate are shown in Fig. 1c. Most isolates (63%) were susceptible to all drugs tested except ampicillin. The remaining 37% had acquired resistance to ≥1 drug tested, 21.3% were MDR (acquired resistance to ≥3 drug classes), and 19.6% were 3rd generation cephalosporin resistant (3GCR, of which 96% were MDR). At the patient level, four patients (1.4%) had ≥1 carbapenem (meropenem) resistant isolate and 46 (16%) had ≥1 3GCR isolate. 3GCR KpSC infections (using first clinical isolate per patient) were significantly associated with nosocomial infection, whether defined as onset >2 days after admission (OR 1.12 and p = 0.01, vs day 0–2 and adjusting for patient age, sex, specimen type using logistic regression), or accounting for other recent admissions (OR 1.13 and p = 0.05, vs CA and adjusting as above). Whilst no temporal or seasonal trends in monthly isolation rates were detected, either for individual infection types or in aggregate (p > 0.1 using Bartel’s rank test), the 3GCR frequency showed an increasing trend (p = 0.036 using Bartel’s rank test for trend, see Fig. 1a), rising from mean 15% (range 10–21% per month) in the first nine months of the study to mean 34% (27–46% per month) in the final three months (p = 0.0002, using test of difference in proportions of 3GCR between the two periods).

Genomic diversity

All clinical isolates identified as KpSC were subjected to WGS, yielding 328 pure isolate genomes from 289 patients for detailed comparative genomic analysis (Methods, Supplementary Fig. 1). WGS confirmed the majority of pure-culture isolates were K. pneumoniae (82.3%); the rest were K. variicola subsp. variicola (13.7%), K. quasipneumoniae subsp. similipneumoniae (3.7%) and K. quasipneumoniae subsp. quasipneumoniae (n = 1) (Table 2 and Fig. 2a). There were no significant associations between species and infection type, onset or acquisition; however, K. pneumoniae infections were more likely to display acquired resistance phenotypes compared to the other taxa combined (32.9% vs 14.5%; p = 0.01 using test for difference in proportions) (see Table 2).

Table 2 Frequency of genetic features and infection characteristics, by species.
Fig. 2: Core genome phylogenies for K. pneumoniae species complex isolates.

a K. pneumoniae, b K. variicola, c K. quasipneumoniae. Trees shown are subtrees extracted from a maximum likelihood phylogeny inferred from a complex-wide alignment of core gene SNVs for all 328 sequenced clinical isolates (i.e., each species tree is rooted using the others as outgroups), available at https://microreact.org/project/kaspahclinical. The 21 ‘common’ lineages, each identified in ≥3 patients, are labelled in blue with their sequence type (ST). Columns indicate (1) infection type (disseminated, urinary tract, respiratory, wound/tissue, other); (2) antimicrobial resistance (ESBL, MDR, ESBL + MDR, ESBL + MDR + CP); (3) yersiniabactin (Ybt+, identified in K. pneumoniae only); (4) aerobactin (Iuc+, identified in K. pneumoniae only); coloured as per inset legend. ESBL = extended spectrum beta-lactamase, MDR = resistant to ≥3 drug classes other than ampicillin, CP = carbapenemase producing.

We assessed genomic diversity of the clinical isolates in terms of phylogenetic lineages, gene content, plasmid content, AMR and acquired virulence determinants, and surface antigen synthesis loci. We inferred a maximum likelihood core-genome phylogeny (Fig. 2) and used this to cluster the 328 genomes into 182 lineages (138 K. pneumoniae, 35 K. variicola, 9 K. quasipneumoniae) representing distinct strain types that have been separated from other lineages over many years of evolution (see Methods). These correlated very closely with the 179 unique STs defined by 7-locus MLST (see ST line vs lineage line in Fig. 3a; assignments of individual genomes to lineages and STs are given in Supplementary Data 1 and can be explored against the phylogenetic tree using the interactive viewer at https://microreact.org/project/kaspahclinical). Twenty-six patients contributed more than one sequenced isolate. In most cases (n = 21, 81%), isolates from the same patient matched at the lineage level, consistent with a single infecting strain, and we classified this as a single infection episode (within-patient pairwise distances ranged from 0–16 SNVs, median 1 SNV). Of the remaining cases, three patients had one lineage identified in urine followed by a disseminated infection with a second lineage 2–19 days after; one patient had different MDR lineages (ST347 and ST491) detected in wound swab specimens collected from the same site three days apart; and one patient had one lineage (sensitive ST520) isolated from sputum and a second lineage detected 32 days later in both blood and sputum (norfloxacin-resistant ST111). We therefore classified these five patients as each having two distinct infection episodes, bringing the total number of genomically-defined infection episodes to 294. The cumulative counts of infection episodes, lineages and STs during the study are plotted in Fig. 3a. All curves were nearly linear (linear regression, adjusted R² ≥ 0.97), and the slopes of the lineage/ST linear regression curves were 69% that of total infection episodes (67% considering K. pneumoniae only, dashed lines in Fig. 3), indicating extensive diversity of strains underlying the total infection burden.

Fig. 3: Genomic diversity of clinical isolates identified as K. pneumoniae in the hospital microbiological diagnostic laboratory (one per infection episode, n = 294).

a Accumulation of unique infection episodes (total 294 in 289 patients), lineages or sequence types (STs) over the study period. Solid lines = all K. pneumoniae species complex (KpSC); dashed lines = K. pneumoniae only (total n = 238). b Distribution of pairwise gene content Jaccard similarity between genomes, calculated using either all genes (including n = 3,095 core genes present in all strains), or the n = 4,001 common accessory genes (each present in 5–95% of genomes sequenced). Boxplot elements are: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers. c Distribution of the number of acquired antimicrobial resistance (AMR) genes per genome, stratified by detection of an extended spectrum beta-lactamase (ESBL) gene. d Number of AMR genes per genome that were attributed to plasmid- or chromosome-derived contigs using Kraken. Counts in each category are given for each of the n = 294 genomes. Boxplot elements are: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers. e Theoretical coverage of infections (y-axis) by vaccines targeting capsular antigens encoded by increasing numbers of K loci (x-axis, K loci ordered from most to least common in this data set as shown in Supplementary Table 1). Black line shows cumulative coverage of all unique infections (n = 294); red line shows unique 3GCR infections (n = 47); other lines show coverage of different infection types (maroon = disseminated, yellow = urine, blue = respiratory, green = wound, grey = other). f Frequency of O antigen (LPS) synthesis loci, stratified by specimen type (coloured as per panel e).

The mean number of genes per genome was 5,031 (IQR 4,936–5,161), including 3,095 core genes that were present in all genomes. The pangenome comprised 23,075 genes in total, of which 4,067 were very common (present in ≥95%), 4,001 were common (present in 5–95%), and 15,007 were rare (present in <5%) or very rare (9,845 in <1%) (Supplementary Fig. 2). Gene content was largely conserved within lineages (median pairwise Jaccard similarity 0.93 for all genes including core genes, and 0.75 across common genes) but was quite distinct between lineages of the same species (median values 0.79 and 0.32 for all genes and common accessory genes, respectively) (Fig. 3b). We used multiple methods to assess plasmid load and diversity. Mob markers (associated with distinct plasmid mobility types42) were detected in genomes from 89% of infection episodes (median n = 2, IQR 1–3), and rep markers (associated with distinct plasmid replicons43) were detected in 84% (median n = 5, IQR 2–9). A total of 55 uniquely distributed rep markers were identified, including 25 that were present in ≥5% of infection episodes and 18 in ≥10% (Fig. 4). The numbers of unique mob and rep markers were significantly correlated across genomes (Pearson correlation coefficient = 0.614, p < 1 × 10⁻¹⁵), and both were significantly independently associated with total DNA in contigs predicted to be plasmid-derived (see Fig. 4). In total, 252 infection episodes (86%) had predicted plasmid load ≥5 kbp (89% with >0 bp, 85% with ≥10 kbp). Amongst these presumptive plasmid-positive genomes, median plasmid load was 233 kbp (range 7.3–704 kbp, IQR 171–306 kbp).
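The pairwise gene-content comparison described above rests on the Jaccard similarity: the number of genes shared by two genomes divided by the number of genes present in either. A minimal sketch follows; the function name and the toy gene sets are illustrative assumptions, not the pipeline used in the study.

```python
def jaccard_similarity(genes_a, genes_b):
    """Jaccard similarity between the gene sets of two genomes."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        # Two empty gene sets are treated as identical by convention.
        return 1.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Toy gene sets (illustrative only): 2 shared genes out of 5 in the union.
    genome_1 = {"entA", "fimH", "ybtS", "blaCTX-M-15"}
    genome_2 = {"entA", "fimH", "mrkD"}
    print(jaccard_similarity(genome_1, genome_2))
```

Restricting both inputs to the common accessory genes before calling the function would give the accessory-only similarity, which is expected to be lower because the shared core no longer inflates the intersection.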

Fig. 4: Measures of plasmid number and diversity.

This analysis used the first isolate per unique infection episode, for 294 genomically-defined infection episodes. Distribution of total DNA sequence (bp) in contigs assigned as plasmids are shown, stratified by a number of mob genes identified per genome and b number of uniquely distributed replicon markers identified per genome. Each point represents a unique genome sequence and is coloured according to whether ESBL genes were detected in the genome (inset legend). Boxplot elements are: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range. Linear regression model fit: Total plasmid DNA (kbp) ~ 79 + 18 × (#mob) + 16 × (#rep); p = 8 × 10⁻⁴ for mob; p < 1 × 10⁻¹⁵ for rep. c Frequency of individual mob types. d Frequency of 25 common replicon markers (each present in ≥5% of infections).
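The regression fit reported in this legend can be read as a simple prediction rule for plasmid load. The sketch below only restates the rounded coefficients given in the legend; the function name is hypothetical, and the output is a point estimate from a fitted trend rather than a measurement.

```python
def predicted_plasmid_kbp(n_mob, n_rep):
    """Predicted total plasmid DNA (kbp) from the rounded coefficients in the legend.

    n_mob: number of mob markers detected in the genome.
    n_rep: number of uniquely distributed rep markers detected in the genome.
    """
    intercept, per_mob, per_rep = 79, 18, 16  # rounded values as reported
    return intercept + per_mob * n_mob + per_rep * n_rep


if __name__ == "__main__":
    # Median marker counts reported in the text: 2 mob and 5 rep markers.
    print(predicted_plasmid_kbp(2, 5))  # 79 + 36 + 80 = 195
```

For the median marker counts the rule gives 195 kbp, somewhat below the reported median load of 233 kbp amongst plasmid-positive genomes, which is unsurprising given that the two figures summarise different subsets and the coefficients are rounded.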

AMR determinants

Screening for known AMR determinants identified 60 acquired AMR genes in genomes from 73 infection episodes (24.7%), including 44 (15%) carrying ESBL genes and 5 (1.7%) with carbapenemase genes. Seven distinct fluoroquinolone resistance-associated gyrA/parC mutations were detected in genomes from 18 infections (6%; including n = 15 that also had acquired AMR genes). The number of acquired AMR genes per isolate was bimodal; amongst those with any acquired AMR genes, the median was 10 genes and the interquartile range (IQR) 6–11 genes (range 1–22) (see Fig. 3c). ESBL genes were concentrated in 15 lineages (8%; 13 K. pneumoniae, 2 K. variicola; see Fig. 2). Six lineages accounted for 82% of ESBL infections (corresponding to ST323, ST29, ST491, ST340, ST231, ST17; all except ST17 were linked to nosocomial transmission, see below). Most (98%) ESBL-positive genomes carried other acquired AMR genes (median 10, IQR 9–11; see Fig. 3c), consistent with the high rate of the MDR phenotype amongst 3GCR isolates (96%). Carbapenemase genes were identified in K. pneumoniae ST340 (n = 2 infections, blaIMP-4) and ST231 (n = 3 infections, blaOXA-48). The majority of acquired AMR genes were predicted to be plasmid-borne (67.5%), and 8% chromosomally located (the rest were unassignable, Fig. 3d). Chromosomally integrated AMR genes were identified in genomes from 12 infection episodes and confirmed by long-read sequencing (Supplementary Data 1). These included six genomes (three ST231, three ST340) with blaCTX-M-15 integrated in the chromosome via ISEcp1, and one (ST29) with an entire 243 kbp MDR plasmid fused with the chromosome as previously reported12. Three chromosomes (K. variicola ST616 and ST1456) carried a novel acquired fosA homolog (closest relative being fosA7, 9% nucleotide divergence) in addition to the intrinsic fosA gene; however only one of these isolates (INF136) had elevated fosfomycin MIC (128 mg/L, vs wildtype range 16–32 mg/L, measured using agar dilution) so it is not clear whether this gene confers resistance.

Concerningly, the 16S rRNA methylase gene rmtB (which confers high-level resistance to aminoglycosides) was found in isolates from three patients, displaying resistance to amikacin, gentamicin and tobramycin in addition to 3GC, amoxicillin-clavulanic acid, ticarcillin-clavulanic acid, tazobactam-piperacillin, ciprofloxacin, norfloxacin and trimethoprim/sulfamethoxazole. One was resistant to meropenem (MIC ≥ 16 mg/L) and the other two had elevated MIC (1 mg/L, compared to the wildtype <0.25 mg/L for 96.4% of isolates). These three isolates were all ST231 and harboured three quinolone-resistance mutations (GyrA-83I, GyrA-87Y, ParC-80I) and a blaCTX-M-15 insertion in the chromosome; an IncC plasmid carrying rmtB plus 11 other AMR genes including the ESBLs blaCTX-M-15 and blaVEB-1, ermB (azithromycin resistance) and arr-2 (rifampicin resistance); and an IncL plasmid carrying the blaOXA-48 carbapenemase.

AMR phenotypes for key drugs were quite well predicted by known AMR determinants; however, there were instances of unexplained resistance for most drugs (Table 3). Of the 47 3GCR infections, 42 carried known ESBL genes (n = 33 blaCTX-M-15 (n = 3/33 also carried blaVEB-1), n = 6 blaCTX-M-14, n = 1 blaCTX-M-3, n = 1 blaCTX-M-62, n = 1 blaSHV-12). The remaining five isolates were MDR and had tested positive for ESBL production in the diagnostic laboratory, but the corresponding genome sequences lacked ESBL genes and four lacked any acquired AMR genes. Re-testing of stocked cultures confirmed that four remained resistant to ceftriaxone (and were MDR), but one (INF255) had regained susceptibility to cephalosporins, fluoroquinolones, aminoglycosides, and trimethoprim/sulfamethoxazole. Another (INF018) was an ST323 separated from other ESBL-positive ST323 strains by ~30 SNVs. We therefore speculate that all five isolates initially had ESBL/MDR plasmids upon first isolation and testing, but these were lost during culture for DNA extraction (this would also account for most unexplained resistance for other drugs except trimethoprim).

Table 3 Comparison of AMR phenotype vs genotype.

The ybt locus encoding the acquired siderophore yersiniabactin was detected in isolates from 33% of K. pneumoniae infection episodes associated with 48 lineages (Fig. 2) but was not detected in other KpSC. Fourteen ybt locus types were identified (including the plasmid-borne ybt 4 in 14 strains from 11 STs, as previously reported24). Presence of ybt was not significantly associated with the presence of ESBL or other acquired AMR genes; however, the additional acquired virulence factors iuc, iro, rmp and clb were exclusively found in ybt+ isolates, with overall frequencies of 2.7% (iuc) or 3.1% (iro, rmp, clb). Iuc, iro and rmp were mostly co-located, and restricted to five known hypervirulent clonal groups associated with virulence plasmids (CG86, CG23, CG66, CG420, CG91/subsp. ozaenae; Table 4). Isolates harbouring the complete rmp locus were confirmed to be hypermucoid via the string test (except for a single ST86 isolate). Three clb variants were identified in five STs (Table 4), located with ybt in ICEKp10.

Table 4 Infections associated with carriage of acquired virulence factors.

Surface antigens

Capsule (K) and LPS (O) antigens have been proposed as targets for vaccination against K. pneumoniae infection44. K biosynthesis loci were confidently identified for n = 284 (96.6%) infection episodes, spanning 91 distinct K locus (KL) types (including 4 novel loci, KL167–170, see Supplementary Table 1). Forty-one K loci (45%) were identified only once each (Supplementary Table 1). Eight KL types were found in ≥3% of infection episodes each (Supplementary Fig. 3); together, these accounted for 33% of all infection episodes including 57% of 3GCR cases (Fig. 3e). These top eight KL types were each associated with a dominant ST (Supplementary Fig. 3), suggesting that their comparatively high prevalence was driven by local clonal expansion (and potentially transmission) of specific lineages. However, all except one (KL21) were also found in multiple STs (4–7 STs each), indicating that these K types are also widely distributed across lineages (Supplementary Fig. 3). Operons for synthesis of the most common capsular polysaccharide sugar components, mannose (man) and rhamnose (rml), were present in the K loci of 64% and 29% of unique infection episodes respectively (Supplementary Table 1); these were not significantly associated with infection type (p = 0.71 and p = 0.19, respectively, using Chi-square test).

The theoretical coverage provided by multi-valent vaccines targeting increasing numbers of KL (ordered by KL frequency in the population) is shown in Fig. 3e. KL diversity was similar for each type of infection (Simpson’s diversities between 0.93 and 0.97), and theoretical vaccine coverage was also similar (Fig. 3e). Overall, 16 KL types (each with frequency ≥2%) would need to be targeted to cover 50% of infections (79% of 3GCR), and 31 KL types (each with frequency ≥1%) to cover 70% of infections (89% of 3GCR) (Fig. 3e). Enhanced coverage of 3GCR infections is attributable to high numbers of the ESBL-producing strains ST323 (KL21) and ST29 (KL30), which were transmitted in the hospital (discussed below).
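The coverage curve described above is a cumulative sum over K locus types ranked from most to least frequent. A minimal sketch of that calculation follows; the function names and the toy locus counts are illustrative assumptions and do not reproduce the study data.

```python
from collections import Counter


def cumulative_coverage(k_loci):
    """Return [(locus, cumulative fraction of infections covered), ...].

    k_loci: one K locus label per infection episode. Loci are added to the
    hypothetical vaccine in order of decreasing frequency.
    """
    counts = Counter(k_loci)
    total = len(k_loci)
    covered = 0
    curve = []
    for locus, n in counts.most_common():
        covered += n
        curve.append((locus, covered / total))
    return curve


def loci_needed(k_loci, target):
    """Smallest number of top-ranked loci whose coverage reaches `target`."""
    for rank, (_, coverage) in enumerate(cumulative_coverage(k_loci), start=1):
        if coverage >= target:
            return rank
    return None  # target not reachable (e.g. empty input)


if __name__ == "__main__":
    # Toy data: 10 infections across 4 K loci (illustrative only).
    toy = ["KL21"] * 4 + ["KL30"] * 3 + ["KL2"] * 2 + ["KL10"]
    print(cumulative_coverage(toy))
    print(loci_needed(toy, 0.5))
```

Running the same function on a subset of infections (for example 3GCR episodes only) while keeping the ranking from the full data set would give the stratified curves, which requires passing the ranking in separately; this sketch ranks within whatever list it is given.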

Twelve distinct O types were predicted amongst 284 infections with typeable O loci (96.6%), with similar diversity observed within each of the four most common infection types (Simpson’s diversities between 0.75 and 0.83) (Fig. 3f). The most common O types were O1, O2 (including subtypes O2afg, O2a and O2ac) and O3 (including subtypes O3a and O3b), which together accounted for 77% of infection episodes (including 70% of 3GCR); O1 and O2 accounted for 49% (36% of 3GCR). The common 3GCR STs were O3b (ST323) and O1 (ST29). The mannose-containing O types (O3, O5, OL104) accounted for 35% of infections, but were not associated with infection site (p = 0.33 using Chi-square test).
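Simpson's diversity, quoted above for both K and O types, summarises how evenly infections are spread across types. The sketch below uses the common form 1 − Σ pᵢ², the probability that two infections drawn at random (with replacement) have different types. This form is an assumption: the text does not state which variant of the index was used, and the toy counts are illustrative only.

```python
from collections import Counter


def simpson_diversity(labels):
    """Simpson's diversity index, 1 - sum(p_i^2), for a list of type labels."""
    total = len(labels)
    if total == 0:
        raise ValueError("at least one label is required")
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


if __name__ == "__main__":
    # Toy data: 10 infections with O types in proportions 0.5, 0.3 and 0.2.
    toy = ["O1"] * 5 + ["O2"] * 3 + ["O3"] * 2
    print(round(simpson_diversity(toy), 2))  # 1 - (0.25 + 0.09 + 0.04) = 0.62
```

Values near 1 indicate many types at similar frequency, which is why the 91 K locus types give higher values than the 12 O types.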

Species hybrids

As hybridization between KpSC members has been reported previously22, we screened our genome collection for evidence of cross-species hybridization (see Methods) and identified 12 K. variicola clinical isolates whose genomes harboured imports of between ~100 kbp and ~1 Mbp of sequence from K. pneumoniae. These represent 12 unique infection episodes and 8 lineages, i.e., 27% of K. variicola infections and 23% of K. variicola lineages. Four further hybrids were identified amongst isolates reported previously from screening swabs at the same hospitals12,31.

Ten of the hybrids belonged to K. variicola ST681 (3 UTI, 2 respiratory, 1 disseminated, 1 unknown, 3 throat swabs). One clinical respiratory isolate and two subsequent screening throat swabs were isolated from a single ICU patient. Genomic comparisons with publicly available ST681 genomes suggest that our ST681 isolates were in fact hybrids of K. variicola ST681 and K. pneumoniae (see Methods and Fig. 5). One isolate (INF232; from a woman in her 90s with UTI at Hospital D) comprised a ST681 K. variicola genome backbone with a 281 kbp recombinant region whose sequence closely matched K. pneumoniae (≤0.5% divergence, Fig. 5a). This isolate harboured KL143 (man-positive) and a truncated form of the O3/O3a locus (broken by an insertion of IS903B and likely non-functional). The nine other local ST681 isolates were very closely related to one another (0–7 pairwise SNVs) and shared with INF232 the 281 kbp K. pneumoniae import and also a second recombination import of 311 kbp, which spanned the K and O biosynthesis loci and resulted in import of intact KL10 (man/mannose-positive) and O1/O2v2 (O2afg) loci. Again, the imported region showed close homology (≤0.5% divergence) with K. pneumoniae, in which KL10 was originally described (Fig. 5a).

Fig. 5: Details of hybrids identified between different species of the Klebsiella pneumoniae species complex.

Plots show nucleotide identity to K. variicola (red) or K. pneumoniae (blue) sequences in sliding windows along each genome. For a and b, homology between different genomes is shown in grey. Position of the capsule (K) biosynthesis locus is shown in yellow and is directly adjacent to the outer lipopolysaccharide (O) synthesis locus.

The other six hybrids all included recombinant regions spanning the K locus, resulting in import of various capsule loci from K. pneumoniae (see Fig. 5b–d). Five were associated with infections (UTI, respiratory, wound, sepsis) and one was from a rectal screening swab in the ICU study31. Four of these hybrids belonged to ST925, each comprising a K. variicola ST925 backbone with a different recombinant block spanning the K locus, between 90 and 565 kbp in size, apparently imported from K. pneumoniae (≤0.5% divergence; see Fig. 5b) and encoding distinct K and O types (KL9/O1, KL28/O3, KL102/O1, KL169/OL104). These strains, isolated from four different patients at Hospital A, differed from one another by >12,000 SNVs in the non-recombinant backbone regions, confirming they were not related to one another by recent local transmission. The other two hybrids were novel singleton STs (ST3095 and ST3060), also comprising K. variicola with one or two imported regions from K. pneumoniae (393 to 1043 kbp in size, Fig. 5c, d).

Genomics-informed understanding of disease burden

Of the total 182 lineages associated with 294 unique infection episodes, 139 (76.4%) were unique to an individual patient. These singleton lineages accounted for nearly half (47.3%) of all the infections, which most likely originate from the patients’ pre-existing gut microbiome. The remaining infection episodes were associated with 43 lineages that were detected in multiple patients, including 21 ‘common’ lineages (11.5% of all detected) that were each isolated from ≥3 patients and accounted for 37.8% of all infection episodes (labelled in Fig. 2). These comprised 20 K. pneumoniae and the hybrid K. variicola ST681 cluster. Isolates belonging to the common lineages were significantly and independently positively associated with ESBL, ybt, man-positive K loci, and nosocomial onset; and negatively associated with non-K. pneumoniae species and rml-positive K loci (Fig. 6a).

Fig. 6: Features of common lineages, and contribution of common and transmitted lineages to infection burden.

a Genetic and patient characteristics associated with common lineages (identified in ≥3 patients), across n = 294 unique infection episodes. Circles indicate odds ratios estimated in a single multivariable logistic regression model with all 10 predictors; lines indicate 95% confidence intervals for those odds ratios; unadjusted p-values are shown, significant (p < 0.05) associations are coloured black. Binary variables: ESBL (extended spectrum beta-lactamase detected); Kp spp (K. pneumoniae species); Mannose+ KL (K locus includes man operon); Yersiniabactin (ybt detected); Onset day 3+ (isolated from specimen collected on day 3 or later); Patient sex (male); Mannose+ OL (O locus includes man operon); Aerobactin (iuc detected); Rhamnose+ KL (K locus includes rml operon). Continuous variable: Patient age (years). b Transmission network showing 12 clusters, defined as infection episodes in different patients identified less than 45 days apart with isolate genomes separated by ≤25 SNVs (mean 0.7 SNVs) and with plausible epidemiology (onset of non-index case occurred during hospital stay, at least 2 days after index case, and patients spent time in same hospital). Details of clusters are given in Supplementary Table 2. c Infections of different classes (ESBL+/−, ybt+/−) stratified by lineage type (transmission cluster, as defined in b; other common lineage, identified in ≥3 patients; lineage identified in 2 patients; or singleton lineage identified in 1 patient). d Monthly frequencies of ESBL+ and/or transmission-linked isolates. These analyses were completed using one isolate per unique infection episode (n = 294); where there were ESBL+ and ESBL- variants of the same strain associated with the same infection episode, the ESBL+ variant was included in order to best reflect the nature of the ESBL infection burden.

The frequency of the common lineages could potentially reflect nosocomial transmission, or a higher propensity to cause disease in colonized patients. We defined probable nosocomial transmission clusters as those with ≤25 pairwise SNVs between genomes isolated from the same hospital within 45 days and with plausible epidemiological links (Fig. 6b, see Methods). This identified 12 clusters of 2–9 patients each, involving 11 STs and associated with 41 infection episodes (14%) (Supplementary Table 2). Mean pairwise distance between clustered isolates was 0.7 SNVs (median 0, IQR 0–0, range 0–22). As expected, these infections were significantly associated with onset several days into the hospital stay (median onset day 4 for infections in transmission clusters, vs day 1 for other infections, p = 0.007 using Wilcoxon test). Notably, one of these clusters involved the ST681 hybrid strain, which infected six ICU patients over a 2.5-month period. Patient age was independently associated with transmission (OR 0.98 [95% CI, 0.96–0.996], p = 0.02 in multivariable logistic regression model); but patient sex and bacterial virulence factors were not (Supplementary Table 3). Two-thirds of the transmission clusters involved ESBL+ strains (Fig. 6b); by comparison, just n = 4/139 (2.9%) of singleton lineages were ESBL+ (OR 61 [95% CI, 11–422], p < 1 × 10⁻⁷ for association between ESBL+ and transmission, using Fisher’s exact test). ESBL carriage was a strong predictor of onward transmission of a lineage: we estimate the crude probability of onward nosocomial transmission to be 28% (n = 8/29, [95% CI, 11–44%]) for unique ESBL+ strains and 1.7% (n = 4/236, [95% CI, 0–3.3%]) for ESBL- strains.
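As a rough check on the crude transmission probabilities given above, the reported intervals are consistent with a normal-approximation (Wald) interval for a binomial proportion. The text does not state which interval method was used, so the choice of method and the function name wald_ci are assumptions made purely for illustration.

```python
import math


def wald_ci(successes, total, z=1.96):
    """Normal-approximation (Wald) CI for a proportion, clipped to [0, 1].

    Assumption: the source does not name its interval method; the Wald
    interval is used here only because it is consistent with the reported values.
    """
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Counts taken from the text: strains linked to onward transmission
# out of all unique ESBL+ (8/29) and ESBL- (4/236) strains.
for label, k, n in [("ESBL+", 8, 29), ("ESBL-", 4, 236)]:
    p, lo, hi = wald_ci(k, n)
    print(f"{label}: {100 * p:.1f}% (95% CI {100 * lo:.1f}-{100 * hi:.1f}%)")
```

With so few events in the ESBL- group, an exact or Wilson interval would arguably be a more cautious choice; the Wald form is shown only because it matches the figures quoted.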

Overall, probable transmission clusters accounted for 55% of all ESBL+ infection episodes and 6.1% of ESBL- infection episodes (Fig. 6c). Assuming the first clinical isolate from each cluster represents the index case, this implies that 29 infection episodes (9.9%), including 21 ESBL+ infections (45%), resulted from nosocomial transmission (note this is a lower limit as it is possible that the first clinical isolate was also acquired in the hospital from an unsampled source, such as asymptomatic colonization of another patient or staff member, or an environmental reservoir). Transmission-linked ESBL+ infections occurred throughout the study period but were concentrated in Dec 2013 to Feb 2014 (Fig. 6d), during which time they accounted for 88% of ESBL+ infections (vs 38% in earlier months, p = 0.005 using proportion test). This was associated with clusters of ST29 (n = 8 patients) and ST323 (n = 4 patients) bearing the same blaCTX-M-15 plasmid (described previously12), and the highly resistant ST231 strain noted above (n = 3 patients).

The probable nosocomial transmission clusters account for only seven of the 21 common lineages (ST29, ST45, ST231, ST323, ST340, ST491, ST681). A further three common lineages carried acquired virulence factors (ST86, ybt plus virulence plasmid, n = 4 patients; ST792, ybt plus clb, n = 4 patients; ST133, ybt plus clb, n = 3 patients) and were associated with community onset (see Table 4). The remaining 11 common lineages were fairly typical in terms of AMR (89% susceptible, vs 80% for all other infections, p = 0.2), yersiniabactin (34% vs 24%, p = 0.1) and onset (53% day 0–2, vs 54%, p = 0.9), and it is unclear why they should be detected frequently in clinical infections. Notably, these lineages are also common in clinical isolate collections from other settings (see Methods): genomes of ST17, ST20, ST35, ST37, ST111, ST336, ST629, ST661 have all been reported in five other continents (Africa, Asia, Europe, North America and Latin America; n = 12–200 public genomes each) and ST27, ST221 and ST412 in two or three continents (n = 11–17 public genomes each). In contrast, amongst the other 156 lineages detected in this study, just n = 15 (9.6%) have been sequenced from five other continents and n = 71 (46%) have not been sequenced from any other continents.

Only 38 infections were classed as true community-onset, and these were not significantly different from healthcare-linked infections in terms of age, sex, AMR, hypervirulence determinants, or K/O types (Supplementary Table 4a). Nosocomial onset of infection (i.e., day 3 or later of hospital stay, vs earlier or outpatient onset) was significantly positively associated with male sex, ESBL carriage and rml-positive capsule, independently and in a multivariable model (see Supplementary Table 4b). Similar results were observed when including those with a prior hospital admission within 30 days in the definition of nosocomial onset (Supplementary Table 4c).

Discussion

Here we analysed all clinical isolates identified as K. pneumoniae in a hospital clinical microbiology laboratory for a one-year period (Fig. 1 and Table 1), and found remarkable genomic diversity in the underlying population of organisms (Figs. 2 and 3). Consistent with previous studies, we found that 19% of isolates identified as K. pneumoniae by MALDI-TOF in fact belonged to other common members of the wider K. pneumoniae complex (Table 2)12,22,30,31. However even amongst the isolates confirmed as K. pneumoniae, in this single 1-year local snapshot of disease-associated strains, we detected huge genetic diversity in the form of 138 phylogenetic lineages bearing 78 distinct capsular biosynthesis loci (half of all K loci ever described), 60 acquired AMR genes and 55 plasmid replicons, with just 80% of genes shared pairwise between lineages (Figs. 2 and 3).

The sheer scale of genomic diversity associated here with clinical disease supports the view of K. pneumoniae as a classic opportunistic pathogen, namely that: (i) any member of the population has the potential to cause disease in hospitalized patients whose underlying health is sufficiently compromised; and (ii) much of the hospital-associated disease burden stems from extraintestinal ‘escape’ by the patients’ own colonizing strains, rather than acquisition of the bacteria through nosocomial transmission12,31,45. Indeed, only 10% of infections in this study were attributed to WGS-supported nosocomial transmission. Furthermore, our data suggest that most of the common lineages were not transmitted within the hospital system; rather, they were detected in multiple patients because they circulate widely in the human population. Indeed, in many cases these lineages showed evidence of global dissemination. The reasons for the apparent success of these global clones within the human host population are not yet clear; however, as a group they were enriched for ESBL genes, the ybt locus, as well as man-positive and rml-negative capsule types. Notably, many of these lineages (including ST17, ST35, ST37, ST111, ST629, ST661) are amongst those frequently detected in food animals (cows, pigs and/or poultry46,47,48,49), so may constitute animal-adapted strains to which humans are frequently exposed via the food chain.

Our data reveal a clear association between AMR and nosocomial onset (using either definition). Notably, whilst nosocomial transmission added relatively little to the overall infection burden (~10%), we estimate that it roughly doubled the burden of ESBL infections. In particular, the rise in overall ESBL frequency observed between December 2013 and February 2014 (Fig. 1a) can be attributed to transmission of ST323, ST29 and ST231 during this period (Fig. 6d). Consistent with this, we estimate the crude risk of transmission resulting in secondary infection/s was negligible for non-ESBL infections (95% CI, 0–3.3%) but substantial (95% CI, 11–44%) for ESBL infections. These observations support the notion that interventions aimed at preventing cross-transmission in hospitals (e.g., hand hygiene, or seek-and-contain approaches to CP infections) could have a significant impact on reducing the total burden of AMR infections. However, the data also suggest that the underlying burden of opportunistic K. pneumoniae infections, which originate from diverse strains present in the gut microbiome of patients, might still remain high unless this source of infection is specifically targeted (e.g., by colonization or colonization-density screening)50.

One caveat of these analyses is the use of simple genetic and temporal distance rules to define WGS-supported nosocomial transmission (see Methods). However, the transmission clusters identified using these simple rules are concordant with those we identified previously in related contemporaneous studies12,31. The latter studies incorporated patient movement data, as well as carriage screening isolates, to detect silent transmission in the Hospital A ICU31 and in geriatric wards of Hospital C12. These studies found strong evidence for transmission of ST231, CG323, and ST681 in ICU31, and for transmission of CG29, CG323 and ST340 more widely in Hospital A. These transmission events also accounted for all instances of MDR K. pneumoniae colonization and infection detected at Hospital C, to which Hospital A geriatric patients are often referred for longer term care12. Similar detailed analyses of ST27, ST35, ST111, ST133, ST412, and ST792 in the Hospital A ICU found no evidence to support intra-hospital transmission of these clones, consistent with the analysis in the present study31. Hence, we consider the lack of detailed patient movement data to confirm transmission of the novel clones identified in the present study to be a minor limitation. Another limitation of the study is the reliance on stored clinical isolates for WGS. This provides the opportunity for evolution of the isolate during storage and passage, between the initial identification and susceptibility testing in the clinical laboratory and the later subculture and DNA extraction for WGS. Indeed, we identified five cases of probable plasmid loss, based on comparison of the initial susceptibility data and later WGS data. Additionally, reliance on single representative isolates means that we were unable to assess whether patients were co-infected with multiple KpSC strains. Five of the 26 patients from whom multiple isolates were captured had distinct lineages identified by WGS. We defined these as distinct infection episodes; however, it is possible that some of these instances represent a single prolonged episode of co-infection with two distinct strains, whereby both strains were present in both specimens, but a different strain happened to be picked for storage from each specimen. However, even if all five of these patients actually had a single episode of co-infection rather than multiple unique infection episodes, this would reduce the total number of unique episodes by only 1.7%, which would have little impact on the overall picture of diversity or associations with clinical features.

Besides AMR, we noted some significant associations between other bacterial factors and infection traits. Virulence plasmid-encoded loci (aerobactin, salmochelin, rmp) were associated with known hypervirulent clones and community onset of infection, often with diagnosis made upon presentation to the emergency department or ICU (Table 4). The ybt locus was associated with common lineages (Fig. 6a). This is consistent with (i) previous observations that ybt is enriched amongst clinical infection isolates compared to asymptomatic carriage isolates (in the range of >30% vs <10%);22,24,51 (ii) a recent report that ybt+ ST258 have a higher attack rate than ybt- ST258 in colonised patients;26 and (iii) the known mechanism by which yersiniabactin can enhance the potential for extraintestinal infection, by evading Lcn2-mediated host immunity25,52. Notably, the presence of the man operon in the K locus (correlated with presence of mannose in the expressed capsule53) was also associated with common lineages (Fig. 6a). Mannose-containing capsules have been shown to be recognised by the mannose receptor on human and murine macrophages, promoting clearance and resulting in lower virulence (higher LD50) in a murine infection model54. Hence, we hypothesise that any advantage conferred by man+ K loci likely relates to the process of colonization rather than infection. Consistent with this, the overall prevalence of man+ K loci in our clinical isolates (64%) was the same as in a collection of n = 464 community carriage isolates recently published from Norway51 (65%). Presence of the rml operon in the K locus (expected to produce rhamnose-containing capsule53) was negatively associated with common lineages (Fig. 6a) and positively associated with nosocomial onset (using either definition, Supplementary Table 4b, c) but not nosocomial transmission. Indeed, the frequency of rml+ was particularly high amongst patients whose first clinical isolate was collected after at least a week into their hospital stay (41% vs 23%; OR 2.3, 95% CI, 1.3–4.2, p = 0.003 using Fisher’s exact test). Thus, we hypothesise that K. pneumoniae with rhamnose-containing capsules have reduced virulence, as they are apparently less able to establish an infection until the patient’s condition deteriorates sufficiently in hospital. Consistent with this interpretation, the frequency of rml+ KL was low (10%) amongst infections diagnosed in the emergency department.

In contrast to the findings above, we did not find any evidence of association between infection site and genomic features of the bacteria, but rather infection site was associated with patient demographics and infection onset. Specifically, UTIs were significantly associated with age and female sex; respiratory infections with nosocomial onset; and wound infections with community onset. This is consistent with there being no genetically-determined tissue tropism (or limited effect size of bacterial factors, which we were underpowered to detect here), and suggests the outcome of the host-pathogen interaction is primarily determined by the patient’s health status and vulnerabilities, consistent with the concept of opportunistic infection. The likely exception is the so-called ‘hypervirulent’ strains that express virulence plasmid-encoded siderophores and hypermucoidy;18,20,21 such strains were rare in our study, but were mostly detected in community-onset infections (bold in Table 4). The spectrum of K. pneumoniae disease identified here (~two-thirds UTI, 15% respiratory, ~10% wound, ~10% sepsis) mirrors patterns in other hospitals around the world55, and the genomic diversity we uncovered is consistent with WGS studies of unselected bloodstream isolates18,40,41 and even with AMR-selected studies;39,55,56 hence our results are likely to be broadly representative of the K. pneumoniae clinical picture in other hospital settings.

Aside from K. pneumoniae, the clinical significance of the other species in the K. pneumoniae complex remains open to investigation, although it is clear that both K. variicola and K. quasipneumoniae are capable of causing disease in hospitalized patients30,57,58. However, neither of these species was particularly prevalent in our study (18% of all infections, combined). Interestingly, within just a single year of collection in our hospital, we detected 12 K. variicola / K. pneumoniae hybrid isolates (8 unique strains or variants), all of which involved imports of K. pneumoniae into a K. variicola background and resulted in import of a capsular biosynthesis locus from K. pneumoniae (Fig. 5). Large-scale recombination between K. pneumoniae lineages, centred mainly around the K locus, has been reported several times; the best-known example is the emergence of the carbapenemase clone ST25859,60. However to our knowledge very few species hybrids have been reported previously22,46,61, and those reported appear to represent sporadic isolations consistent with the expectation that cross-species hybrids have compromised fitness. However, in the present study the KL10 ST681 hybrid strain showed evidence of local transmission in the hospital, spreading to cause silent gut colonization of three patients and various infections in six patients, demonstrating that it is clearly fit to transmit. Notably, the imported capsular synthesis locus KL10 is relatively common in K. pneumoniae clinical isolates62, with a recent study of the >13,000 publicly available K. pneumoniae genomes in GenBank reporting frequencies of 2.1% amongst human blood isolates, 1.5% amongst other human isolates, and 0.3% and 0.5% amongst animal and environmental isolates, respectively63. Hence, we hypothesise this may have contributed to the hybrid strain’s ability to colonise or infect humans.

Overall, we show that by sequencing all clinical isolates, we can gain a much more nuanced view of the burden of K. pneumoniae infections. WGS clarified that the burden of infection in this setting resulted mainly from diverse strains present in the patients’ own gut microbiomes, including a very low frequency (<3%) of hypervirulent strains and with enrichment for a small number of successful lineages associated with AMR, yersiniabactin, and man+ K loci; on top of this, there was an additional burden (~10%) resulting from nosocomial transmission that is strongly associated with ESBL.

Methods

Ethics approval and consent to participate

This project complies with all relevant ethical regulations. Ethical approval for the project was granted by the Alfred Hospital Ethics Committee in Melbourne, Australia (Project numbers #550/12 and #526/13). A consent waiver was granted for the inclusion of limited patient data related to clinical isolates, extracted from hospital and laboratory records by hospital staff who normally have access to the data, and shared in deidentified form with research staff for analysis in this project.

Setting and sample collection

The Klebsiella Acquisition Surveillance Project at Alfred Health (KASPAH) was conducted over a one-year period from April 1, 2013 to March 31, 2014 in Melbourne, Australia. All clinical isolates identified as K. pneumoniae by the Alfred Health microbiological diagnostic laboratory as part of routine care were included in the study, if they were reported by the laboratory as associated with infection (see details below). Four hospitals in the Alfred Health Network are served by this laboratory. All patients from whose specimens the isolates were cultured were recruited into the study (consent was not required). Relevant clinical data was extracted from the laboratory and hospital records at the time of recruitment (date of specimen collection, specimen type and referring hospital, patient age and gender). Clinical review of hospital records was undertaken retrospectively for all participants, 4 years post-recruitment, in order to classify each isolate as community-acquired (CA), healthcare-associated (HA), or nosocomial in origin (details below).

Clinical isolates

Clinical isolates were included in the study when the treating physician referred a specimen to the diagnostic service of the microbiology laboratory for analysis based on clinical suspicion of infection, and K. pneumoniae was then identified and reported as a pathogen according to the in-house standard operating procedures as previously described31. Species identification was performed using matrix-assisted laser desorption ionization-time of flight (MALDI-TOF) (Vitek MS®, bioMérieux, Marcy l’Étoile, France). All K. pneumoniae identified from sterile sites (blood cultures, cerebrospinal fluid, deep tissue biopsies, pleural fluid) and from cultured prosthetic material (e.g., central venous catheters) were reported as pathogens, as long as other enteric or skin flora was not detected. For other specimen types, a K. pneumoniae infection was deemed present if sufficient concentrations of neutrophils were seen on microscopy or Gram stain, and K. pneumoniae was found to be the sole organism present or the predominant organism if the sample was also expected to contain normal flora. K. pneumoniae was reported as an infection in the absence of neutrophils if the patient was neutropenic. The vast majority of isolates resulted from wound swabs, sputum samples or urine samples. Where K. pneumoniae was identified in urine samples or wound swabs along with other enteric bacteria (e.g., E. coli), the laboratory reported this as mixed enteric flora; K. pneumoniae isolated from such specimens were excluded from the study. Wound swabs were collected when signs of infection were present (i.e., purulent discharge) and reported as K. pneumoniae only when other enteric or skin bacteria were not also identified. K. pneumoniae was reported as clinically significant when it was the predominant isolate from a well-collected sputum specimen. When samples were clearly from the oral cavity (indicated by the presence of saliva (macroscopically), squamous epithelial cells (microscopically) and mixed oral flora on culture), a small amount of K. pneumoniae was not reported as pathogenic but rather as ‘mixed oral/enteric flora’, and such isolates were not included in this study. Laboratory data entry was done using Microsoft Access 2013 and Excel 2013.

Three distinct infection acquisition statuses were defined: community-acquired, nosocomial, and healthcare-associated. Community-acquired (CA) infections were defined by isolation of K. pneumoniae from an outpatient or on day 0, 1 or 2 of current admission as an inpatient, and with no recorded prior contact with the Alfred Health Network (either as an inpatient or outpatient) in the previous 12 months. Nosocomial infections were defined by isolation of K. pneumoniae on day 3 or later of the current inpatient admission, or with recent inpatient admission (in the last month). K. pneumoniae infections in patients not meeting the criteria for nosocomial infection but having some recorded contact with Alfred Health in the last 12 months (including as an inpatient 1–11 months prior to the current infection, or with one or more prior outpatient visits up to 12 months prior to the current infection), were considered as healthcare-associated (HA).
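The three acquisition definitions above amount to a short decision rule, which can be sketched as follows. The function and parameter names are hypothetical, and the rule assumes the three inputs have already been derived from hospital records; it is an illustration of the stated definitions rather than the procedure used in the clinical review.

```python
from typing import Optional


def classify_acquisition(onset_day: Optional[int],
                         inpatient_last_month: bool,
                         any_contact_last_12_months: bool) -> str:
    """Classify an infection using the three definitions in the text.

    onset_day: day of the current inpatient admission on which the specimen
        was collected (0 = day of admission); None for an outpatient.
    inpatient_last_month: recent inpatient admission (in the last month).
    any_contact_last_12_months: any recorded inpatient or outpatient contact
        with the hospital network in the previous 12 months.
    """
    # Nosocomial: day 3 or later of admission, or recent inpatient admission.
    if (onset_day is not None and onset_day >= 3) or inpatient_last_month:
        return "nosocomial"
    # Healthcare-associated: not nosocomial, but some contact in the last 12 months.
    if any_contact_last_12_months:
        return "healthcare-associated"
    # Community-acquired: outpatient or day 0-2, with no recorded prior contact.
    return "community-acquired"


print(classify_acquisition(None, False, False))  # outpatient, no prior contact
print(classify_acquisition(5, False, False))     # day 5 of admission
print(classify_acquisition(1, False, True))      # day 1, outpatient visit in last year
```

Checking the nosocomial criteria first mirrors the wording of the definitions, in which the healthcare-associated class is explicitly limited to patients not meeting the nosocomial criteria.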

Antimicrobial susceptibility testing

All clinical isolates were subjected to antimicrobial susceptibility testing in the clinical microbiological diagnostics laboratory upon isolation (i.e., in 2013-2014), using the Vitek2 GN card and interpreted using 2020 EUCAST breakpoints. Antimicrobials tested were: ampicillin (to which K. pneumoniae are intrinsically resistant via chromosomally encoded beta-lactamases), amoxicillin-clavulanate, ticarcillin-clavulanate, tazobactam-piperacillin, cefazolin, ceftazidime, ceftriaxone, cefepime, norfloxacin, ciprofloxacin, amikacin, gentamicin, tobramycin, trimethoprim and trimethoprim/sulfamethoxazole. If the susceptibility pattern suggested an ESBL enzyme was present, this was confirmed using the method of Jarlier64. Isolates were classified as MDR based on acquired resistance to three or more classes of antimicrobials (i.e., not counting ampicillin resistance which is intrinsic) as previously defined65. Selected stored isolates were re-tested in 2021 in order to investigate issues identified upon sequencing: INF034, INF155, INF167, INF255, INF018 (whose genomes were ESBL-negative but ceftriaxone resistant) were re-tested via Vitek2 GN cards; INF307, INF048, INF136 had novel fosA genes in the chromosome and were subjected to agar dilution in triplicate to assess MIC to fosfomycin.

DNA extraction and whole genome sequencing (WGS)

All isolates were subjected to DNA extraction for Illumina sequencing in 2015, using a phenol:chloroform method via phase lock gel tubes (5PRIME) as previously described31. Barcoded Illumina DNA libraries were prepared using Nextera XT or TruSeq protocols and sequenced on the HiSeq 2500 platform, generating paired-end reads of 125 bp each. Eighty-seven isolates (26%, see Supplementary Data 1) were later subjected to fresh DNA extraction using a protocol based on GenFind (Beckman Coulter) reagents (doi: 10.17504/protocols.io.p5mdq46), and multiplex long-read sequencing with an Oxford Nanopore Technologies (ONT) MinION device (as described previously66), to facilitate assembly, pan-genome and plasmid analyses.

Species analysis and quality control of WGS data

A total of 362 infection isolates from 318 patients were included in the study. Two of these isolates failed the Illumina library preparation step prior to sequencing. Three of the sequenced read sets were excluded from further analysis because preliminary analysis showed that the sequences were dominated by non-K. pneumoniae DNA (two were predominantly Klebsiella oxytoca and one was predominantly Acinetobacter baumannii). This could be due to either mixed culture with K. pneumoniae and another bacterium, or contamination following the initial identification of K. pneumoniae. Since the original identification was recorded as K. pneumoniae, and the presence of K. pneumoniae DNA was confirmed by sequencing, we included these three specimens in the reporting of K. pneumoniae clinical isolates, but excluded them from further genomic analysis. A further 29 DNA sequences were excluded from genomic analysis due to either (i) failing quality control thresholds (mean read depth <25×, coverage of reference sequence <85%), or (ii) suspicion of mixed Klebsiella strains (ratio of heterozygous:homozygous core gene variant sites ≥2%). The remaining 328 WGS-confirmed K. pneumoniae isolates (from 289 patients) underwent detailed genomic analyses. A flow chart of isolate and genome processing is given in Supplementary Fig. 1, details of isolates and WGS data accessions are given in Supplementary Data 1.
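The exclusion criteria and isolate counts above can be expressed as a small filter plus a bookkeeping check. This is a minimal sketch: the function name and argument names are hypothetical, and only the three thresholds and the counts are taken from the text.

```python
def passes_qc(mean_depth: float, ref_coverage_pct: float, het_hom_ratio_pct: float) -> bool:
    """Apply the three exclusion thresholds stated in the text to one read set."""
    if mean_depth < 25:          # mean read depth below 25x
        return False
    if ref_coverage_pct < 85:    # less than 85% of the reference covered
        return False
    if het_hom_ratio_pct >= 2:   # heterozygous:homozygous ratio of 2% or more (possible mixed strains)
        return False
    return True


# Bookkeeping of the isolate flow, using the counts reported in the text.
total_isolates = 362
failed_library_prep = 2
dominated_by_other_species = 3
failed_qc_or_mixed = 29
remaining = total_isolates - failed_library_prep - dominated_by_other_species - failed_qc_or_mixed
print(remaining)  # 328
```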

Genome assembly, annotation and pan-genome analysis

Illumina short reads were trimmed using Trim Galore v0.5.0 (https://github.com/FelixKrueger/TrimGalore) (default settings) and assembled de novo using SPAdes optimised with Unicycler v0.4.767. Where ONT reads were available, these were processed as described previously66 and combined with Illumina reads to generate hybrid assemblies using Unicycler v0.4.7. Contigs were annotated with Prokka v1.13.368. Pan-genome analysis was performed using panaroo v1.1.269 with default settings (sequence identity threshold 95%, protein family sequence identity threshold 70%, length difference threshold 95%).

Single nucleotide variant analysis and multi-locus sequence typing

Single nucleotide variants (SNVs) were identified by mapping Illumina reads against the ST23 K. pneumoniae strain NTUH-K2044 reference genome (NC_012731.1), using the mapping pipeline RedDog v1b.10.3 (https://github.com/katholt/reddog). RedDog uses Bowtie2 v2.2.570 to map reads and SamTools v1.271,72 to call SNVs with Phred quality score ≥30, as described previously31. Multi-locus sequence typing (MLST) was performed, and sequence type (ST) assigned based on the 7-locus scheme73, by analysing assemblies with Kleborate v2.0.063. Novel STs were submitted to the K. pneumoniae BIGSdb-Pasteur database23 for allele assignments. To identify other geographic continents from which STs identified in this study have previously been reported, we used MLST data reported for 13,156 whole genome sequences publicly available in RefSeq in July 2020 (available as Supplementary Data 2 in Lam et al. 63).

Phylogenetic analysis

Core genes were defined as those that were annotated in the reference genome and present (coverage ≥95% and mean read depth ≥5×) in all of the sequenced isolates based on the mapping analysis. A maximum likelihood phylogenetic tree was inferred from an alignment of all homozygous SNVs (n = 690,727 SNVs) identified within 3,135 core genes in the 328 genomes, using FastTree v2.1.874. The tree file is available via MicroReact (https://microreact.org/project/kaspahclinical). Phylogenetic clusters were defined using a patristic distance threshold of 0.01 and were extracted from the trees using R to define lineages. The threshold for clustering was determined by assessing the distribution of pairwise distances, which showed an inflection point at patristic distance d = 0.01 (0.0044% of pairs cluster using d = 0.01, compared with 0.00025% using d = 0.005, or 0.0376% with d = 0.015). The patristic distance 0.01 in our tree equates to ~6900 core SNVs, or 0.13% nucleotide divergence.
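The stated equivalence between the patristic threshold and a core SNV count can be checked with simple arithmetic, under the assumption that branch lengths in the tree are expressed as substitutions per site of the SNV alignment (the text does not say so explicitly). The variable names are illustrative only.

```python
# Values taken from the text.
n_snv_sites = 690_727        # homozygous SNV sites in the core gene alignment
patristic_threshold = 0.01   # clustering threshold used to define lineages

# Assumption: branch lengths are substitutions per site of the SNV alignment,
# so a patristic distance converts to an SNV count by multiplication.
snvs_at_threshold = patristic_threshold * n_snv_sites
print(round(snvs_at_threshold))  # 6907, consistent with the ~6900 quoted in the text
```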

Surface antigen biosynthesis and acquired virulence loci

Capsule locus (KL) types and lipopolysaccharide O antigen (O) types were identified from the resulting assemblies using Kaptive v2.0;62,75 KL and O types with a match confidence of ‘good’ or better (as described at https://github.com/katholt/Kaptive) were reported; genomes with a match confidence of ‘low’ or ‘none’ were investigated through manual exploration of the assembly graph in Bandage v0.8.176. Putative novel loci were extracted and annotated with Prokka68 followed by manual curation. Loci that could not be resolved via manual inspection were marked as “unknown” (i.e., if the assembly graph was fragmented in the region of the K/O locus, or because there was not a single unambiguous path through the locus). Kleborate v2.0.063 was used to screen each genome assembly for key acquired virulence factors that are significantly associated with invasive infections in humans:22 yersiniabactin24 (ybt), salmochelin35 (iro), colibactin24 (clb), aerobactin35 (iuc), and regulators of the mucoid phenotype (rmp locus, rmpA2).

Plasmid analyses

Plasmid content was assessed using multiple methods. Replicon (rep) markers were identified by screening assemblies against the PlasmidFinder database43 using BLASTn (80% identity and 80% coverage thresholds). Mob types (mob) were assigned using iterative PSI BLAST as described previously42. Contigs were assigned as being of plasmid or chromosomal origin using Kraken as previously described77 (all other contigs were marked as ‘unknown’).

Genetic determinants of antimicrobial resistance (AMR)

Kleborate v2.0.0 was used to screen each genome assembly for acquired resistance genes and known chromosomal mutations associated with resistance to fluoroquinolones, colistin and carbapenems63. The detected AMR determinants were used to predict resistance to ceftriaxone (based on presence of acquired ESBL genes), meropenem (acquired carbapenemases and ompK35/36 alleles), ciprofloxacin (known gyrA and parC mutations and acquired qnr genes, but not aac(6’)-Ib-cr as there is no evidence this gene can raise the MIC above the breakpoint for clinical resistance in the absence of other determinants78), gentamicin (acquired resistance genes defined in CARD v3.0.879), trimethoprim (acquired dfr genes), and trimethoprim/sulfamethoxazole (acquired dfr genes plus sul genes). In line with established norms for reporting on accuracy of susceptibility testing in clinical laboratories80, and translation of these principles to reporting on WGS-based identification of resistance81,82, results were expressed in terms of major and very major errors. A major error was defined as a phenotypically susceptible isolate that carried one or more determinants of resistance for the specified drug; a very major error was defined as a phenotypically resistant isolate in which no known resistance determinant for that drug was identified in the genome.
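The two error definitions above can be written as a simple comparison of phenotype and genotype for one isolate and one drug. The function name and the "concordant" label are hypothetical and used only for illustration; the two error categories follow the definitions in the text.

```python
def classify_prediction(phenotype_resistant: bool, determinant_found: bool) -> str:
    """Compare a phenotypic result with the genotypic prediction for one drug."""
    if determinant_found and not phenotype_resistant:
        # Susceptible by phenotype, but a resistance determinant was detected.
        return "major error"
    if phenotype_resistant and not determinant_found:
        # Resistant by phenotype, but no known determinant was detected.
        return "very major error"
    return "concordant"


print(classify_prediction(phenotype_resistant=False, determinant_found=True))
print(classify_prediction(phenotype_resistant=True, determinant_found=False))
print(classify_prediction(phenotype_resistant=True, determinant_found=True))
```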

Detection of species hybrids

The genome collection was screened for hybrids by using BLASTn to align contigs against a set of reference assemblies for each Klebsiella species. The BLAST alignments were then used to assign per-species sequence identity to each position in the contig. Each assembly’s overall species composition was then quantified, based on assignment of genomic regions to the closest matching species, and hybrids identified as those with ≥3% of the genome assigned to a second KpSC species. This analysis was implemented in a Python script, available at http://github.com/rrwick/Klebsiella-assembly-species.
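
The final thresholding step can be sketched as follows. This is not the published script; it is a simplified illustration that assumes the per-species assignment has already been summarised as a count of bases per species, and that every species in the input belongs to the KpSC. The function names are hypothetical.

```python
def species_composition(assigned_bases):
    """Convert bases assigned per species into percentages of the assigned genome.

    assigned_bases: dict mapping species name -> number of bases whose closest
    match was that species (hypothetical summary of the BLASTn alignments).
    """
    total = sum(assigned_bases.values())
    if total == 0:
        return {}
    return {sp: 100.0 * n / total for sp, n in assigned_bases.items()}


def is_hybrid(assigned_bases, threshold_pct=3):
    """Return True if the second most abundant species accounts for at least
    threshold_pct percent of the assigned genome.

    Integer arithmetic is used for the comparison so that a value of exactly
    3% is not affected by floating-point rounding.
    """
    counts = sorted(assigned_bases.values(), reverse=True)
    if len(counts) < 2:
        return False
    total = sum(counts)
    return total > 0 and counts[1] * 100 >= threshold_pct * total
```

An assembly with 5.2 Mbp assigned to K. pneumoniae and 0.3 Mbp assigned to K. variicola, for instance, would be flagged, since the second species accounts for roughly 5.5% of the assigned genome.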

Transmission analysis

Pairwise core gene SNV counts were extracted from the SNV alignment described above, and used to infer transmission networks comprising nodes (one representative isolate per infection episode) connected by edges where the pairwise distance was ≤25 SNVs (based on our previous investigation of within-patient vs between-patient SNV distances31, and other recent studies of CP K. pneumoniae transmission38,39) and the temporal distance was ≤45 days (twice the median of time-to-infection estimated for colonized patients14). The network function in the R package network (v1.17.1) was used to construct the transmission network, and to extract clusters of connected nodes (isolates). Putative transmission clusters were manually reviewed for plausible epidemiological links; one was removed because onset of the second case occurred on day 1 of admission and no previous admissions with Alfred Health were recorded; another was removed because it comprised specimens taken from one outpatient and one inpatient collected on the same day.
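
The clustering logic can be illustrated with a short sketch. The analysis itself used the R package network; the version below is a plain Python stand-in that applies the same two thresholds and extracts connected components. It assumes that temporal distance is measured between specimen collection dates, and the function name and input structures are hypothetical.

```python
from datetime import date
from itertools import combinations


def transmission_clusters(collection_dates, snv_distance, max_snvs=25, max_days=45):
    """Group isolates into putative transmission clusters.

    collection_dates: dict mapping isolate ID -> datetime.date (assumed to be
                      the specimen collection date).
    snv_distance:     dict mapping frozenset({id_a, id_b}) -> pairwise SNV count.
    Returns a list of clusters (sorted lists of isolate IDs) with >= 2 members.
    """
    # Connect two isolates only if both the genetic and temporal thresholds are met
    adjacency = {iso: set() for iso in collection_dates}
    for a, b in combinations(sorted(collection_dates), 2):
        snvs = snv_distance[frozenset((a, b))]
        days = abs((collection_dates[a] - collection_dates[b]).days)
        if snvs <= max_snvs and days <= max_days:
            adjacency[a].add(b)
            adjacency[b].add(a)

    # Extract connected components by depth-first search
    clusters, seen = [], set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) > 1:  # unlinked isolates are not clusters
            clusters.append(sorted(component))
    return clusters
```

Because clusters are connected components, two isolates may fall in the same cluster without being directly linked, provided a chain of qualifying edges joins them. Clusters identified this way would still need the manual epidemiological review described above.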

Statistical analysis

All statistical analyses were conducted using R (v3.6.3). Specific tests used are given together with each result in the text. The corresponding R functions are: wilcox.test for Wilcoxon rank-sum test (two-sided); prop.test for test of differences in proportions (two-sided); fisher.test for Fisher’s exact test (two-sided); chisq.test for Chi-square test (two-sided); bartels.rank.test with left-sided test for trend, to test for temporal trends in monthly isolate counts; glm with ‘family = binomial(link = ‘logit’)’ for logistic regression. Figures were plotted in R using ggplot2 v3.3.5, ggnetwork83 v0.5.10 and ggtree84 v2.4.2.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.