Background

Campylobacter is one of the most common causes of bacterial gastroenteritis worldwide, outnumbering Salmonella, toxigenic Escherichia coli and Listeria combined [1, 2]. In the United Kingdom (UK), it is estimated that more than 299,000 cases occur annually [3, 4]. In most cases, infections are self-limiting; however, some cases result in persistent or invasive infections where antimicrobial therapy may be necessary [5,6,7]. Diagnosis is usually made using PCR-based rapid detection assays or culture-based isolation of a single presumptive Campylobacter colony from a stool specimen [8], which does not identify co-infection with different Campylobacter species or the presence of multiple sequence types [9]. With an estimated minimum infective dose of between 500 and 10,000 organisms, C. jejuni is responsible for 90% of known human campylobacteriosis infections [10,11,12]. Intraspecific recombination within C. jejuni is frequent [13]. Due to this, genetic exchange events are frequently overestimated in SNP analyses that compare strains at the nucleotide level, which significantly reduces the signals of C. jejuni population structure [14]. Other analytical approaches have demonstrated that some C. jejuni populations that have undergone clonal proliferation exhibit a multi-host profile and may account for a large proportion of clinical strains [15]. A cg/wgMLST genotyping approach identified a lineage of C. jejuni (ST2254-9-1) that had low genetic variability compared to other lineages of C. jejuni [15]. However, classifying C. jejuni strains with the available typing schemes remains challenging. Diversity profiling using a fragment of the porA gene in Campylobacter also identified wide diversity within broiler breeder and broiler flocks, indicating that a diverse population of Campylobacter has the potential to transmit through the poultry meat production route [16].

Poultry meat is the predominant source attributed to human campylobacteriosis cases [16, 17]. Further upstream of consumer exposure, C. jejuni isolated from chicken carcasses at a slaughter plant comprised multiple genotypes that are associated with strains found in human infections [18].

Failure to capture the genetic diversity of a C. jejuni population within a single human case stool specimen may confound source attribution investigations [9]. Moreover, the replacement of culture-based testing by PCR-based analysis in diagnostic laboratories is eliminating the availability of C. jejuni isolates, making epidemiological tracking for outbreak investigation near impossible [19,20,21].

Currently, the most discriminatory method for investigating strain diversity is by using whole genome sequencing (WGS) and analysing the complete genome [22, 23]. Cao et al. [24] estimated that the C. jejuni pangenome consisted of 900 core genes and 4621 accessory genes, based on 173 C. jejuni strains, whilst Rossi et al. identified 678 core and 2117 accessory genes based on 6526 C. jejuni isolates [25]. The different sizes of the pangenomes between these studies can be attributed to the different genomes, software and cut-offs used, but both highlight that C. jejuni has a relatively small core genome compared to its large accessory genome. The genetic variation of the genome within the species is thought to be linked to some strains carrying genes associated with increased pathogenicity in human infection [24]. Pathogenicity genes associated with the organism’s ability to survive in adverse conditions and possible host specificity have been reported [24]. The overall diversity of C. jejuni therefore requires that a large proportion of the population is analysed in epidemiological investigations [24, 26].

Antimicrobial resistant infections are more difficult to treat, can last longer, and can cause further complications. This increases healthcare costs and may further disseminate resistant Campylobacter in the community [25, 27]. The WHO has categorised fluoroquinolone resistant Campylobacter as a priority list pathogen and classified it as a public health threat [28]. Moreover, in recent years, C. jejuni derived from human and chicken specimens have been found to be resistant to β-lactam and tetracycline antibiotics, which are widely used in human medicine [29, 30]. Since Campylobacter is known to exchange genetic material [36], including antimicrobial resistance genes (ARG), the presence of resistance determinants is another indicator of intraspecies diversity.

Genetic diversity among multiple isolates can also be described by mapping DNA sequences to a reference genome of the same species to identify variable sites that display single nucleotide polymorphisms (SNPs) [31]. However, SNP analysis for C. jejuni has drawbacks as this strategy treats horizontal genetic exchange, locally grouped SNPs acquired in a single event, in the same way as dispersed repeats acquired by multiple events [32]. Horizontal gene exchange is common between C. jejuni strains [33] and so standard SNP analysis without removing putative recombinations is likely to overestimate genetic distance between isolates. Campylobacter have high frameshift rates that can contribute to genetic diversity and host adaptation through phase variable gene expression [34].

The aims of this study were to investigate the intraspecies genetic variation of a C. jejuni population at the pangenome level within patients that presented with gastroenteritis and evaluate whether or not this diversity could have been accumulated since the estimated onset of campylobacteriosis.

Results

Patient and Campylobacter characterisation

One diarrhoeal stool specimen from each of four PCR-verified campylobacteriosis patients was cultured for Campylobacter using direct plating and stool filtration prior to plating. All patients presented with gastroenteritis: three presented with acute diarrhoea, while one (patient 3) presented with diarrhoea that had persisted for 2 weeks prior to specimen submission. One patient reported travel prior to infection onset (patient 4). Patient age ranged from 7 to 80 years (Additional file 4: Table S1).

A total of 92 C. jejuni isolates were recovered (Table 1) with 40% (37/92) of isolates originating from direct plating and 60% (56/92) of isolates originating from stool filtrates. The number of isolates per patient specimen ranged from 13 to 30 isolates; all 92 were classified as C. jejuni. A quality check of the 92 C. jejuni assembled genomes revealed no evidence of mixed specimens. For one patient (patient 3), C. jejuni was only isolated using the filtration method, whilst for the other three patients, C. jejuni was isolated using a combination of direct and filtration methods. For the three stool specimens where C. jejuni were cultured using both methods, no genes, SNPs or frameshifts were associated with method of detection.

Table 1 Campylobacter recovery, sequence types and antimicrobial resistance in stool specimens of four patients

For three of the four patients, a single C. jejuni sequence type (ST) was detected within each patient specimen [ST-61 (patient 1), ST-2066 (patient 2) and ST-21 (patient 4)], while for patient 3, two different STs [ST-51 (n = 12) and ST-354 (n = 1)] were identified (Table 1).

SNP and frameshift analysis

SNP analysis was conducted for 91 genomes, excluding the single ST-354 genome from patient 3. The reference genomes used to identify SNPs and frameshifts were CP007191 for patient 1, LR134500 for patient 2, NZ_CP059967 for patient 3 and NZ_CP059969 for patient 4. SNP analysis demonstrated a high density of SNPs in isolates from patient 3, even with the outlier ST-354 excluded, while the density of SNPs for the remaining patients’ isolates remained low (Fig. 1). Gubbins software removed a large number of SNPs from the alignment of genomes from patient 3, indicating that a high amount of putative recombination had occurred amongst the ST-51 isolates from this patient. Isolates collected from patient 3 also contained a large number of frameshifts (Table 2). The proportion of genes containing non-synonymous SNPs or frameshifts was not significantly different between predicted genes with different COG functional groups (Additional file 4: Table S2) in patient 1 (p = 0.621), patient 2 (p = 0.619), patient 3 (p = 0.577) or patient 4 (p = 0.871) (Fig. 2); however, some genes did contain multiple non-synonymous SNPs and frameshifts (Fig. 3). No SNP or frameshift was associated with a method of detection.

Fig. 1

Density plots of the number of single nucleotide polymorphisms in four patients’ C. jejuni along the length of the reference genomes (ST-354 excluded)

Table 2 Number of SNPs and frameshifts between C. jejuni isolates from stool specimens of four patients
Fig. 2

Proportion of gene functional groups (Additional file 4: Table S2) that contained a non-synonymous SNP or frameshift in 91 C. jejuni isolates from four patients (ST-354 excluded)

Fig. 3

Number of non-synonymous SNPs and frameshifts in genes belonging to four patients’ C. jejuni (ST-354 excluded)

Pangenome analysis

For the C. jejuni isolates collected from each patient, the pangenomes consisted of a similar proportion of core genes, ranging from 0.84 to 0.86 (Table 3; Additional file 1: Figure S1). Some of the missing genes could be attributed to pseudogenes (Additional file 2: Figure S2). The proportion of genes in the accessory genome significantly differed between functional groups in all patients (p < 1 × 10⁻⁶) (Fig. 4). The largest difference was between functional group A (RNA processing and modification) and all other functional groups, but for all pangenomes only one gene belonged to this functional group. No gene was associated with method of detection.

Table 3 Pangenome structure of 91 C. jejuni isolates from stool specimens of four patients
Fig. 4

Proportion of gene functional groups (Additional file 4: Table S2) that consisted of accessory genes in 91 C. jejuni isolates from four patients (ST-354 excluded)

SNP modelling

The core non-recombinant SNP data amongst isolates from each patient were modelled to determine how many isolates would be required to identify 95% of SNPs in each specimen. The model estimates were close to subsampled estimates (Additional file 3: Figure S3). Modelling found that if we sampled an infinite number of isolates from each specimen we would identify 46–68 core non-recombinant SNPs (Table 4). In addition, to identify 95% of SNPs we would need to sample 11–81 isolates from each specimen.

Table 4 SNP number modelling prediction using known Campylobacter genome information in four patients

Antimicrobial resistance determinant analysis

In silico antimicrobial resistance determinant analysis found that isolates belonging to the same sequence type contained the same AMR determinants. At least one AMR determinant was found in all STs, with ST-21, ST-51 and ST-2066 containing two AMR determinants and ST-354 containing three AMR determinants. The beta-lactamase resistance gene blaOXA-61 was identified in ST-21, ST-51, ST-61 and ST-354. The tetracycline resistance gene tet(O) was identified in ST-51, ST-354 and ST-2066. A single chromosomal mutation, gyrA T86I, associated with fluoroquinolone resistance, was identified in ST-21, ST-354 and ST-2066 (Table 1; Fig. 5).

Fig. 5

Maximum-likelihood phylogeny based on core gene alignments of 92 C. jejuni isolates from four patients, coloured by the presence of AMR determinants and method of detection

Discussion

In this study we report the genomic diversity of ninety-two C. jejuni isolates from four clinical stool specimens at the pangenome level. The C. jejuni were cultured using two methods: direct culturing and filtration. For one patient, C. jejuni was only isolated using the filtration method, whilst for the other three patients, C. jejuni was isolated using a combination of direct and filtration methods. For the three stool specimens where C. jejuni were cultured using both methods, no genes or mutations were found to be associated with method of detection. This demonstrates that using the two methodologies increased the chances of culturing Campylobacter but did not have an associated effect on the diversity observed.

In this study, we found the maximum number of core non-recombinant SNPs amongst C. jejuni isolates belonging to the same sequence type and from the same specimen was 12–43 SNPs. Since Campylobacter co-infection is known to occur [9] and genomic diversity generated within a patient through mutation and horizontal gene exchange is frequent [35], outbreak investigations using single colonies are unlikely to capture the genetic diversity of isolates within patients, which could lead to false conclusions [35]. Our modelling of SNPs suggests that to capture 95% of the core non-recombinant SNPs from specimens, up to 80 isolates would need to be collected.

In most cases, human campylobacteriosis is self-limiting; however, a significant minority of invasive or chronic infections may require antimicrobial therapy [5, 6]. Campylobacter isolates from humans and chickens have evolved resistance to β-lactam and tetracycline antimicrobials [29, 36]. In this study, antimicrobial resistance determinants were associated with β-lactam, tetracycline and quinolone resistance. Previous studies have reported 50–61% of C. jejuni isolates with ampicillin resistance [37], 50–100% with tetracycline resistance [38, 39], and 11–40.5% with quinolone resistance [40, 41]. All C. jejuni isolates from patients 1, 3 and 4 contained the blaOXA-61 gene, responsible for the production of β-lactamase [29] and associated with ampicillin resistance [36]. All C. jejuni isolates collected from patients 2 and 4, and the outlier ST-354 isolate from patient 3, contained the chromosomal T86I mutation in gyrA associated with quinolone resistance. The single-step T86I amino acid change in the gyrA gene found in ST-21, ST-354 and ST-2066 in our study is one of the most prevalent chromosomal resistance mutations associated with decreased Campylobacter susceptibility to fluoroquinolones [42], and so this was an expected finding. There is worldwide concern that quinolone resistance [43,44,45] threatens the treatment of severe Campylobacter infections in humans [46, 47], but transmission routes are not clear; understanding the diversity in a single patient will help us to track the movement of resistance.

Campylobacter collected from human patients have been shown to vary in genetic diversity. Dunn et al. [48] investigated two Campylobacter isolates collected from the same patient from separate stool specimens on subsequent days and identified a single SNP difference between them, whilst Cody et al. [49] investigated twenty patients, comparing two Campylobacter isolates collected from separate stool specimens and found three patients with isolates belonging to different sequence types and 17 with isolates belonging to the same sequence type but that differed at 3–14 loci (SNP or frameshift differences). In our study, we isolated two sequence types from one of the patients, and a maximum number of 12–43 core non-recombinant SNPs and 0–20 frameshifts amongst isolates belonging to the same sequence type from the same patient. These results indicate more diverse populations of Campylobacter than in the patient described by Dunn et al. [48] and some of the patients described by Cody et al. [49]. However, many of the patients described by Cody et al. [49] had Campylobacter populations with similar diversity measurements as those described in this study, suggesting that only collecting two isolates from a patient is often unable to capture the diversity of Campylobacter from patients. Bloomfield et al. [50] and Baker et al. [7] investigated C. jejuni collected from the same patients over several years and found a maximum number of 53–84 core non-recombinant SNPs and 18–19 frameshifts amongst the isolates collected from the two patients, and these mutations were associated with genes involved in cell motility and signal transduction. These associations were not observed in this study, suggesting the selection pressures identified by Bloomfield et al. and Baker et al. may occur in persistent infections over a longer time period. Frameshifts often occur in genes involved in phase variation and can rapidly accumulate in C. jejuni populations. 
Because of their genetic instability it has been argued that these frameshifts should not be used for public health investigations [22]. However, we also identified core non-recombinant SNPs that are more genetically stable. Bloomfield et al. [50] and Baker et al. [7] both used phylogenetic analysis on core non-recombinant SNPs to determine the date of the common ancestor for the Campylobacter collected from each patient to estimate when the patients were initially colonised. However, these estimates assume the long-term patients were not colonised with a heterogeneous population. Based on the results from this study they may have overestimated the length of time the long-term patients were colonised with Campylobacter.

The most distantly related isolates belonging to the same sequence type from each patient differed by 12–43 core non-recombinant SNPs. The SNP modelling suggests that we did not detect all SNPs from isolates from the same specimen. C. jejuni has a substitution rate of 1.5–4.5 × 10⁻⁶ substitutions site⁻¹ year⁻¹ [51], which equates to 2–8 SNPs per year, suggesting the isolates may have shared a common ancestor years before the specimens were collected. The exact length of time between when the patient became infected with C. jejuni and the collection of stool specimens was unavailable for analysis in this study. In previous studies, patients excreted Campylobacter in their stool for up to 2 months post exposure [52]. However, the substitution rate estimates were based on long-term colonisation of human patients, and the substitution rate may be higher for C. jejuni during initial infections. Also, those long-term patients were all colonised with ST-45 and it has been proposed that substitution rates may differ significantly between different lineages of C. jejuni [53]. Regardless, we believe there is sufficient genetic diversity demonstrated between isolates in this study collected from the same patient to suggest that all patients were colonised with a genetically diverse population of C. jejuni. For patient 3, this diversity took the form of isolates belonging to two sequence types; for patients 1, 2 and 4, isolates belonged to the same sequence type but were genetically diverse in terms of core non-recombinant SNPs. The populations may have become more genetically diverse between infection and specimen collection, but it is unlikely that they accumulated all of the genetic diversity observed in this study after infection. It is also possible that the patients were exposed to C. jejuni on multiple occasions, and since three of the patient specimens contained isolates belonging to the same sequence type, multiple exposures to the same source type may be another exposure scenario.
The level of diversity amongst the multiple isolates within a patient described here suggests an infection with a genetically diverse population of C. jejuni through a single source or repeated infections from different sources containing different strains of C. jejuni, which have persisted in the human host.
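The conversion from a per-site substitution rate to SNPs per year used in the discussion above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the genome length of 1.64 Mb is an assumption (an approximate C. jejuni genome size, not a value taken from this study), and the function names are hypothetical. The elapsed-time figure is deliberately naive, since it treats all differences as accumulating along a single lineage; differences between two isolates accumulate along both lineages, which would roughly halve the estimate under a strict clock.

```python
# Substitution-rate range quoted in the text [51], in substitutions per site per year.
RATE_LOW, RATE_HIGH = 1.5e-6, 4.5e-6

# ASSUMPTION: approximate C. jejuni genome length in base pairs (not from this study).
GENOME_LENGTH = 1.64e6


def snps_per_year(rate, genome_length=GENOME_LENGTH):
    """Expected genome-wide substitutions per year for a per-site rate."""
    return rate * genome_length


def naive_years(snps, rate, genome_length=GENOME_LENGTH):
    """Years needed to accumulate `snps` along a single lineage (naive estimate)."""
    return snps / snps_per_year(rate, genome_length)


low, high = snps_per_year(RATE_LOW), snps_per_year(RATE_HIGH)
print(f"{low:.1f} to {high:.1f} SNPs per year")

# Fastest rate with fewest SNPs gives the shortest time; slowest rate with most SNPs the longest.
shortest = naive_years(12, RATE_HIGH)
longest = naive_years(43, RATE_LOW)
print(f"{shortest:.1f} to {longest:.1f} years to accumulate 12 to 43 SNPs")
```

Under these assumptions, even the fastest published rate implies more than a year of divergence for the least diverse specimen, which is the basis for the argument that the observed diversity predates infection.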

Conclusions

Using direct plating and filtration culture methods, a total of ninety-two C. jejuni isolates were recovered from four different patients with gastroenteritis. For one patient, C. jejuni was only isolated using the filtration method, but for the others there were no genetic associations between isolates and method of detection. SNP analysis revealed genetic diversity amongst the C. jejuni population within each patient’s stool. The isolates may therefore have shared common ancestors before specimens were obtained, indicating that infection could be a result of exposure to a varied population of C. jejuni, or a result of subsequent colonisations. The presence of functional genes found in isolates from the same patient varied greatly, as did non-synonymous SNPs and frameshifts in these genes. However, neither the mutations nor the accessory genes were connected to a specific gene functional category, which indicates an absence of selection at this point in time. The within-patient C. jejuni population variance found in this study highlights the limitations of studying single isolates per patient specimen and reveals the genotypic subtyping information present within campylobacteriosis patients.

Methods

Stool specimen collection

Four surplus diarrhoeal stool specimens were collected from the National Health Service Eastern Pathology Alliance (EPA) network diagnostic laboratory, Norwich, UK. Stool specimens represented four separate anonymised patients with gastroenteritis symptoms who submitted specimens to the laboratory in August 2020. Patient specimens were identified in this study as ‘patient 1–4’ and are referred to by this patient identifier throughout. Demographic patient information, presentation of diarrhoea and duration of illness prior to specimen collection were recovered retrospectively (Additional file 4: Table S1). Campylobacter spp. were initially identified in the stool specimens by the diagnostic laboratory using a rapid automated PCR-based culture-independent testing panel (Gastro Panel 2, EntericBio, Serosep United Kingdom). Once PCR results were confirmed, a 5 mL aliquot of stool was placed into a sterile specimen container, transported to Quadram Institute Bioscience in accordance with guidelines for the transport of diagnostic specimens, and subjected to culture-based isolation.

Bacterial isolation and identification

Stool specimens were cultured for Campylobacter using modified ISO methods (EN ISO 10272-2019) for detecting and enumerating Campylobacter [54] by direct plating and by filtration of stool prior to plating. For the direct plating, a 10 μl aliquot of each stool specimen was directly plated onto modified charcoal-cefoperazone deoxycholate agar (mCCDA) with cefoperazone and amphotericin B supplements (Oxoid, Hampshire, United Kingdom). In parallel, a 4 ml aliquot of each stool specimen was emulsified in 4 ml of phosphate-buffered saline (PBS) and filtered through a 0.65 μm syringe filter (Sartorius, Göttingen, Germany), before 10 µl was inoculated onto a mCCDA plate. All plates throughout the protocol were incubated in a microaerophilic atmosphere using anaerobic jars with CampyGen 2.5 L sachet (Oxoid, Hampshire, United Kingdom) at 37 °C for 48 h. C. jejuni strain 81116 was used as a positive control throughout the protocol.

Once incubated, up to 30 suspected Campylobacter colonies per patient specimen were sub-cultured onto mCCDA for purification, and further sub-cultured onto Columbia agar with 5% horse blood (Oxoid, Hampshire, United Kingdom). Colony morphology, microscopy, and oxidase test (Thermo Fisher Scientific, Loughborough, United Kingdom) were utilised to confirm presumptive Campylobacter isolates.

Genome extraction and library preparation

DNA from each isolate was extracted using Maxwell RSC Cultured Cells DNA Kits (Promega, Madison, Wisconsin, USA) according to the manufacturer’s instructions. A modified library preparation was utilised for high throughput sequencing. Libraries for sequencing were prepared using the Illumina DNA Prep (Illumina Ltd, Cambridge, United Kingdom) as previously described [55]. Sequencing was performed on the Illumina NextSeq 500 platform using a mid-output flowcell (NSQ® 500 Mid Output KT v2 (300 CYS) Illumina, Cambridge, United Kingdom), producing 2 × 150 bp paired-end reads.

Genome analysis

Illumina raw reads were trimmed using Trimmomatic v0.38 [56] and assembled using SPAdes v3.12.0 [57]. Centrifuge v1.0.4 [58] analysis was performed on the trimmed reads to predict bacterial genus and species. The quality of the assemblies was assessed using QUAST v5.0.0 [59] and CheckM v1.1.3 [60], and by aligning reads to the assembly using the Burrows-Wheeler aligner (BWA) v0.7.17 [61].

MLST was conducted using ARIBA v2.14.4 [62] with the C. jejuni pubMLST database to predict the sequence type (ST) [63]. Antimicrobial resistance (AMR) genes were identified with ARIBA and the ResFinder [64] database. A custom database consisting of the 23S, gyrA and gyrB genes from the NC_002163 C. jejuni reference genome was used to identify known mutations conferring macrolide and fluoroquinolone resistance respectively. This database was uploaded to GitHub (https://github.com/samuelbloomfield/C_jejuni_point_mutation_database). StarAMR v0.8.0 [65] was used to confirm the AMR determinants identified. StarAMR identified blaOXA-61 in all isolates from patients 1, 3 and 4, but ARIBA called the gene interrupted in some of these isolates. We compared the gene sequences of the intact and interrupted blaOXA-61 genes according to ARIBA and found them to be identical, so used StarAMR results for the blaOXA-61 genes. For the other AMR determinants, StarAMR and ARIBA gave concordant results.

ReferenceSeeker v1.8.0 [66] was used to identify the best reference for isolates from each specimen. Single Nucleotide Polymorphism (SNP) analysis was completed using Snippy v4.3.2 (https://github.com/tseemann/snippy) to align reads from each C. jejuni genome to the C. jejuni reference genomes. Gubbins v2.3.1 [32] was used to remove areas of putative recombination from full alignments. RAxML v2.3.1 [67] was used to form a phylogenetic tree from the full Snippy alignment using a Generalized Time Reversible model [68].

Phase variation analysis

Tatajuba v1.0.2 [69] was used to align the trimmed reads of all the isolates to the reference genomes and identify tracts that differed in size between the isolates. A cut-off of 0.90 was used when comparing frameshifts between C. jejuni isolates to account for potential small proportions of sequencing errors. eggNOG v5 [70] was used to predict the function of the genes in the reference genome. Fisher’s Exact test, computed using 10⁶ Monte Carlo Markov Chain iterations, was used to determine differences between functional groups (Additional file 4: Table S2) in the proportion of genes that contained a non-synonymous SNP or frameshift, with a p-value < 0.05 regarded as statistically significant. All statistical analyses were performed using R v3.6 [71].
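The simulation-based Fisher’s Exact test described above can be sketched as follows. This is a minimal Python illustration, not the R code used in the study: it assumes each functional group is summarised as a pair of counts (genes with a mutation, genes without), simulates tables with the same row and column totals by shuffling gene labels, and counts how often a simulated table is no more probable than the observed one. The function names, the input layout and the default iteration count are all assumptions made for the example.

```python
import math
import random


def _log_prob_statistic(table):
    """Log-probability of a table with fixed margins, up to an additive constant."""
    return -sum(math.lgamma(count + 1) for row in table for count in row)


def monte_carlo_fisher(table, iterations=10_000, seed=0):
    """Simulated p-value for an r x 2 table of [mutated, not mutated] gene counts.

    Hypothetical helper: one row per functional group.
    """
    group_sizes = [hit + miss for hit, miss in table]
    labels = []  # one entry per gene: 1 = carries a mutation, 0 = does not
    for hit, miss in table:
        labels += [1] * hit + [0] * miss

    observed = _log_prob_statistic(table)
    rng = random.Random(seed)
    as_extreme = 0
    for _ in range(iterations):
        rng.shuffle(labels)  # keeps both sets of margins fixed
        simulated, start = [], 0
        for size in group_sizes:
            hits = sum(labels[start:start + size])
            simulated.append([hits, size - hits])
            start += size
        # Small tolerance guards against floating-point ties.
        if _log_prob_statistic(simulated) <= observed + 1e-9:
            as_extreme += 1
    return (as_extreme + 1) / (iterations + 1)


# Toy example: two functional groups with identical mutation proportions.
print(monte_carlo_fisher([[10, 10], [10, 10]], iterations=500))
```

Adding one to both numerator and denominator keeps the estimate away from zero, so the smallest reportable p-value is bounded by the number of iterations; this is why a very large iteration count is needed to report p-values as small as those in the pangenome comparison.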

Pangenome analysis

Assemblies were annotated using Prokka v1.14.6 [72]. The coding-DNA sequences (CDSs) of these assemblies were extracted and a database was formed after removing duplicates using CD-HIT v4.8.1 (https://github.com/weizhongli/cdhit). ARIBA was used to search for the presence of these CDSs. The report files were concatenated, and pseudogenes were identified as previously described by Mather et al. [73]. CDSs that were found in more than 95% of C. jejuni isolates from a patient specimen were regarded as part of the core genome, whilst the rest were regarded as part of the accessory genome. eggNOG was used to estimate the function of the pangenome and Fisher’s Exact test, computed using 10⁶ Monte Carlo Markov Chain iterations, was used to determine differences between the proportion of accessory genes in each functional group, with a p-value < 0.05 regarded as statistically significant. CDSs, SNPs and frameshifts associated with method of detection (direct plating vs filtration) were investigated by searching each patient’s Campylobacter pangenome for CDSs or mutations found in > 95% of isolates identified through one method and < 5% of isolates identified through the other method.
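The two thresholding rules in this section (core versus accessory, and association with method of detection) can be illustrated with a short sketch. It assumes the CDS search results have already been reduced to a mapping from each CDS to the set of isolates carrying it; that data layout, the function names and the toy identifiers are hypothetical and are not taken from the study’s pipeline.

```python
def split_pangenome(presence, isolates, core_cutoff=0.95):
    """Split CDSs into core (> cutoff of isolates) and accessory (the rest)."""
    core, accessory = [], []
    for cds, carriers in presence.items():
        fraction = len(carriers & isolates) / len(isolates)
        (core if fraction > core_cutoff else accessory).append(cds)
    return core, accessory


def method_associated(presence, direct, filtered, high=0.95, low=0.05):
    """CDSs found in > high of isolates from one method and < low of the other."""
    hits = []
    for cds, carriers in presence.items():
        frac_direct = len(carriers & direct) / len(direct)
        frac_filtered = len(carriers & filtered) / len(filtered)
        if (frac_direct > high and frac_filtered < low) or (
            frac_filtered > high and frac_direct < low
        ):
            hits.append(cds)
    return hits


# Toy data: four isolates, two recovered by each method (all names invented).
direct = {"iso1", "iso2"}
filtered = {"iso3", "iso4"}
presence = {
    "cdsA": {"iso1", "iso2", "iso3", "iso4"},  # present everywhere
    "cdsB": {"iso1", "iso3"},                  # split across methods
    "cdsC": {"iso1", "iso2"},                  # only in direct-plated isolates
}
core, accessory = split_pangenome(presence, direct | filtered)
print(core, accessory, method_associated(presence, direct, filtered))
```

The same `method_associated` check applies unchanged to SNPs or frameshifts if the mapping keys are mutations rather than CDSs.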

SNP modelling

The number of isolates required to identify a proportion of SNPs shared amongst isolates from a specimen was calculated by determining the probability of finding a single SNP (p). This was calculated by taking a SNP (i), counting the number of isolates that contain this SNP (x) and dividing by the number of isolates investigated (N). The process was repeated for all SNPs shared amongst the isolates (n). The arithmetic mean of these proportions was then calculated (Eq. 1). This process assumes every SNP is equally likely to be found, so we only applied this to core non-recombinant SNPs.

$$p\, = \,\frac{1}{n}\sum\limits_{i = 1}^{n} {\frac{{x_{i} }}{N}}$$
(1)

The proportion of SNPs found (q) was then calculated from the probability of finding a single SNP (p) and the number of isolates investigated (N) (Eq. 2).

$$q\, = \,1 - \left( {1 - p} \right)^{N}$$
(2)

The number of SNPs we would likely find if we analysed an infinite number of isolates from a patient (T) was then calculated given that we found n SNPs from N isolates (Eq. 3).

$$T\, = \,\frac{n}{{1 - \left( {1 - p} \right)^{N} }}$$
(3)

We then rearranged the equation to calculate the number of isolates we would need to sample (N) to identify a number of SNPs (n) (Eq. 4). As this is a logarithmic equation, we would need an infinite number of isolates to identify all SNPs, but it does allow us to identify a percentage (e.g. 95%).

$$N\, = \,\frac{{\ln \left( {1 - \frac{n}{T}} \right)}}{{\ln \left( {1 - p} \right)}}$$
(4)
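Equations 1 to 4 can be combined into a single calculation. The sketch below is a direct transcription under one assumption about input format: the data are supplied as a list of x values, one per SNP, giving the number of isolates that carry it. The function name and the toy numbers are hypothetical; the target proportion replaces n/T in Eq. 4.

```python
import math


def snp_model(carrier_counts, n_isolates, target=0.95):
    """Apply Eqs. 1-4 to per-SNP carrier counts x_i from N isolates.

    Returns (p, q, T, N_target), where N_target is the number of isolates
    needed to find the `target` proportion of all SNPs.
    """
    n = len(carrier_counts)
    # Eq. 1: mean probability of finding a single SNP in one isolate.
    p = sum(x / n_isolates for x in carrier_counts) / n
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    # Eq. 2: proportion of SNPs found with N isolates.
    q = 1 - (1 - p) ** n_isolates
    # Eq. 3: SNPs expected from an infinite number of isolates.
    total = n / q
    # Eq. 4 with n/T set to the target proportion.
    needed = math.log(1 - target) / math.log(1 - p)
    return p, q, total, needed


# Toy example: four SNPs carried by 1, 2, 3 and 4 of 10 isolates.
p, q, total, needed = snp_model([1, 2, 3, 4], n_isolates=10)
print(p, q, total, math.ceil(needed))
```

Because Eq. 4 returns a real number, it is rounded up to obtain a whole number of isolates; as the target approaches 1 the logarithm in the numerator diverges, which is why only a percentage of SNPs can be targeted.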

To test these equations, we applied them to the core non-recombinant SNP datasets from each of the four patients investigated in this study using a range of sample sizes. We also repeated the SNP analysis for each dataset using a range of sample sizes. For each sample size, 100 combinations of isolates were randomly selected with replacement, unless fewer than 100 combinations were possible, in which case all combinations were used. For each combination, the number of core non-recombinant SNPs was calculated. The mean and 95% confidence intervals were calculated for each sample size and compared to the model estimates.
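The subsampling check can be sketched along the following lines. Several simplifications are assumed here and are not taken from the study: each isolate is represented by a string of alleles at the core non-recombinant SNP sites, a site counts as a SNP within a subsample when the chosen isolates differ at it, and the 95% confidence interval uses a normal approximation because the text does not state how it was computed. All names are hypothetical.

```python
import itertools
import math
import random
import statistics


def variable_sites(alleles, subset):
    """Count sites at which the isolates in `subset` do not all agree."""
    n_sites = len(alleles[subset[0]])
    return sum(
        1 for site in range(n_sites)
        if len({alleles[isolate][site] for isolate in subset}) > 1
    )


def subsample_snp_counts(alleles, size, draws=100, seed=0):
    """Mean SNP count and approximate 95% CI for subsamples of `size` isolates."""
    isolates = sorted(alleles)
    if math.comb(len(isolates), size) < draws:
        # Fewer combinations than draws: use every combination once.
        subsets = list(itertools.combinations(isolates, size))
    else:
        # Otherwise draw combinations at random; a combination may repeat.
        rng = random.Random(seed)
        subsets = [tuple(rng.sample(isolates, size)) for _ in range(draws)]
    counts = [variable_sites(alleles, subset) for subset in subsets]
    mean = statistics.mean(counts)
    half_width = 0.0
    if len(counts) > 1:
        half_width = 1.96 * statistics.stdev(counts) / math.sqrt(len(counts))
    return mean, (mean - half_width, mean + half_width)


# Toy data: three isolates typed at three sites.
toy = {"iso1": "AAA", "iso2": "AAT", "iso3": "ACA"}
for k in (2, 3):
    print(k, subsample_snp_counts(toy, k))
```

Plotting these means against the value of n implied by Eq. 2 for each sample size gives the kind of comparison between subsampled and modelled estimates described above.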