Background

Multi-locus sequence typing (MLST) has become the gold standard for the analysis of bacterial populations [1, 2]. MLST involves PCR amplification and sequencing of 5–10 loci of ~500 bp in length, with each sequence variant assigned a unique locus variant or allele number. Each unique combination of locus variants is assigned a sequence type (ST), which is then used to denote a precise set of sequences. Public MLST databases are used to store and share information linking DNA sequences to locus variant numbers and sequence types and are available for >85 bacterial species including important human pathogens such as Staphylococcus aureus Haemophilus influenzae and Neisseria species (see http://pubmlst.org). This format allows quick, simple and direct comparison of bacterial populations analysed in different laboratories and over time. The databases also link individual bacterial isolates to STs, serotypes, sources and other meta-information with public health utility.

Advances in sequencing technology continue to bring down the costs of whole-genome sequencing of bacteria – indeed it is already close to the cost of MLST [3]. Because bacterial genomes are relatively small in size (typically 1–10 Mbp) they can be sequenced in multiplex, allowing the high read depths of e.g. a single lane of Illumina HiSeq to be distributed among up to 96 different samples via the use of individually-tagged libraries [47]. Whole-genome sequencing provides a lot more information than MLST and can be used to study microevolution in much finer detail and over small time scales [38]. As an assay, whole genome shotgun sequencing can be simpler than MLST, which requires multiple independent PCR and sequencing reactions to be performed for each sample. This will become increasingly true as automation becomes widely used in whole-genome library preparation workflows. Also, while MLST requires the design and purchase of species-specific specific primer sets, whole genome sequencing can be applied to any bacterium, without species primers or even prior knowledge of species. However, while the advantages of whole-genome sequencing over MLST are clear, it is crucial that newly sequenced isolates (or populations of isolates) can be analysed in the context of the vast amount of population data currently stored in MLST databases. MLST can be used to assess the frequency and expansion of particular clones, and most MLST databases store not only ST information for each isolate, but also detailed meta-information such as serotype, host or source, date of collection and in some cases spatial information (see e.g. MLST-maps, http://maps.mlst.net[9]). MLST provides a framework against which new isolate collections can be compared and interpreted, but can only be utilized if the STs of newly sequenced isolates can be accurately derived from whole-genome sequencing data.

The only method currently available to perform MLST allele assignment on short read data is web-based, requiring sequence data to be uploaded to a server and compared to public MLST databases [10]. This poses problems for data security and confidentiality, is unfeasible for the large datasets typically generated in high-throughput multiplex sequencing projects and excludes the use of privately maintained MLST databases. All of these issues are likely to be significant barriers for use in the majority of research or public health laboratories. Furthermore the method depends on de novo assembly [10], which limits its sensitivity, particularly for genomes sequenced at low read depth.

Here we present an open-source software tool, SRST, to derive STs from Illumina short read sequence data using a mapping-based approach to maximise sensitivity. SRST can be used together with any public or private MLST scheme and generates output files suitable for comparative analysis with existing MLST datasets, compatible with standard MLST tools such as eBURST [11], ClonalFrame [12] and Phyloviz (http://www.phyloviz.net). In this paper, we introduce the SRST approach and demonstrate its accuracy with real datasets including 534 genomes from four species-specific and four plasmid MLST schemes, and discuss the usefulness of the method for quality control in high-throughput sequencing projects and outbreak investigations.

Implementation

SRST takes as input (a) locus variant sequences and ST profile definitions, retrieved from a public or private MLST database (such as http://pubmlst.org); (b) sequences flanking each locus, retrieved from an appropriate reference genome sequence using the supplied script; (c) Illumina read data (in fastq format; any number of paired or single-end read files can be processed in a single command). SRST runs on any Linux based computer or cluster (including Mac OS X) and requires the installation of the free packages BWA and SamTools for alignment functions [13, 14]. Full instructions are available at http://srst.sourceforge.net.

Each readset is mapped to each of the possible locus variants v (with flanking sequence) and a score s is calculated to assess the quality of the match, as follows. Consider a single base position i in the mapping of read set R to locus variant v, in which n i reads map to position i, which has base v i in v and majority-rules consensus (i.e. most prevalent) base r i in R. If r i  ≠ v i we record a mismatch and rule out v as a possible locus variant. Otherwise, we compute the probability that the base call r i (which matches v i ) is erroneous, by calculating Binomial probabilities for the three alternative bases x:

Pr X x = j = k n i n t k p j 1 p n i j
(1)

where k = the observed read count for base x at position i. The probability that a non-consensus nucleotide (which does not match v i ) is the best explanation for the coverage at position i is the sum of the three probabilities. The probability that a sequence other than v is the best explanation for the observed coverage across the whole locus is the sum of these probabilities across all positions i. The final score s reported is the negative log of this probability, so that higher scores reflect more hits. Note this treats the probabilities at each position as independent, which they are not. However the assumption is conservative, and only results in non-trivial over-estimation of the probability in cases where the score will be below a threshold of acceptance in any case.

For each readset and locus, the highest scoring variant, with zero mismatches and passing a user-settable cut-off value, is assigned. If all loci are assigned for a particular readset and the combination of variants is a known ST, this ST will be returned; if a novel combination of locus variants is detected this will be assigned a novel ST. If any loci are not confidently assigned, no ST can be called.

The main output is a file specifying the locus variants and STs of all input datasets, suitable for analysis with common MLST tools such as eBURST or Phyloviz.

A log file is also generated, detailing scores and coverage statistics for each locus and readset. Where an exact match to a known allele cannot be found, the closest allele (and number of mismatching bases) is reported and the closest ST is determined; these results are flagged so as to be distinguished from precise ST assignments.

Optionally, verbose output can be switched on in order to retain full sequence information for novel alleles, including the alignment (bam format), pileup (pileup format) and consensus sequence (fastq format) obtained from mapping to the closest-matching locus variant. See Additional file 1 for an example fastq generated by SRST. This is intended to facilitate investigation of novel alleles, in which case visual inspection of alignments or pileups is recommended. Some MLST databases may also accept fastq or bam files for submission of novel alleles.

Results and Discussion

Accuracy in calling sequence types

To determine a suitable cut-off score and test the accuracy of SRST, we utilized three publicly available datasets each representing a different group of bacteria – Streptococcus pneumoniae[6, 15], Staphylococcus aureus[4, 16] and Salmonella bongori[17, 18]. All short read, MLST and reference data were downloaded from public databases (Table 1). We ran SRST on each read set (N = 341 genomes, 2,387 loci) and examined the sensitivity (call rate; i.e. the proportion of loci for which a variant could be confidently assigned) and specificity (false positive rate; i.e the proportion of loci with incorrect variant calls) obtained using different cut-off scores. As Figure 1 shows, best results were obtained with a cut-off score between 7 and 11. We therefore set the default cut-off score for SRST to 10, and use this cut-off for all subsequent analyses reported below. In the few cases where an allele could not be confidently assigned using a cut-off score of 10, the expected allele and ST were still correctly identified by SRST as the most likely result.

Table 1 High throughput read sets analyzed in this study
Figure 1
figure 1

Score threshold determination for SRST. Each data point indicates the call rate (y-axis) and false positive rate (x-axis) for a given cut-off score, labeled in blue. False positive rate = proportion of loci with incorrect variant calls, call rate = proportion of loci for which a variant could be confidently assigned, total loci = 2,387 (341 samples, 3 species)

Quality control in high-throughput sequencing projects

The most obvious application of SRST is to assign STs to novel isolates whose STs are unknown. However we were also interested to use SRST for quality control in large-scale sequencing studies of bacterial clones, to allow early detection and identification of read sets that should be excluded from comparative analysis of the clonal group of interest. To demonstrate the utility of SRST for this purpose, we used it to analyse a set of Shigella sonnei genomes sequenced using paired-end multiplex Illumina GAII (Table 1) [19]. Note Shigella species are actually sublineages of E. coli[20], hence the E. coli MLST scheme [21] is used to investigate Shigella.

A total of 188 Shigella data sets had sufficient mean read depth (>10x) to analyse and 170 (90 %) of these matched a known ST with SRST scores ≥10 for all loci. The other 18 samples comprised (i) 15 with ≥1 locus scoring <10 but identified as a known S. sonnei allele and (ii) 3 with novel locus variants (each having one locus for which the highest scoring match (scores 20–160) differed from a known S. sonnei allele by one mismatching base). Hence all 18 were recognizable as single-locus variants of known S. sonnei STs, confirming their suitability for inclusion in a S. sonnei study. Consensus sequences and quality scores for the novel alleles, generated using SRST’s verbose option, are given in Additional file 1. Of the 170 STs assigned with score ≥10, 166 matched known S. sonnei STs (ST152, ST1502, ST1504, ST1505). Four matched those of other, non-sonnei, Shigella (S. flexneri, ST245; S. boydii, ST243, ST1025), indicating they were erroneously included in the S. sonnei isolate collection and should be excluded from the intended analysis of S. sonnei. The species status of these four isolates, identified by SRST, was confirmed by serotyping of the original isolates, and comparison of the read sets to reference genomes of S. flexneri and S. boydii. Hence SRST could successfully detect and identify outliers for removal. In contrast, allele assignment by blastn search of de novo assembled contigs (assembled using Velvet 1.0.13 and Velvet Optimiser 2.1.7) succeeded for only 60 % of the Shigella read sets.

Plasmid MLST

There are currently MLST schemes available for four types of plasmid – IncI1 [22], IncN [23], IncHI1 [24] and IncHI2 [25] (http://pubmlst.org/plasmid/). To test the accuracy of SRST for plasmid MLST, we used it to detect and assign 5-locus STs to IncI1 plasmids in the S. sonnei dataset. Since we do not have traditional plasmid MLST sequences available as a control, we mapped the reads to the reference sequence for IncI1 ST16 plasmid pEK204 (NC_013120) [26] and used the proportion of the plasmid covered by each read set as a measure of the real presence of IncI1 plasmids in the data (e.g. 90 % coverage of the reference plasmid would indicate an IncI1 plasmid is present; 10 % coverage of the reference plasmid would indicate no IncI1 plasmid is present). As Figure 2 shows, there was a strong correlation between the number of IncI1 plasmid MLST loci that could be assigned with confidence by SRST, and the coverage of the IncI1 reference plasmid, indicating that SRST is useful for screening for the presence of specific plasmid types. A total of eight read sets were assigned the same IncI1 ST (16) as pEK204; their coverage of pEK204 ranged from 94.2 %-99.0 % (mean 96.9 %), while the highest coverage of pEK204 in other read sets was 92.3 % (assigned to ST37). This suggests that SRST’s assignment of plasmid STs is as accurate as that for chromosomal STs.

Figure 2
figure 2

Use of SRST to detect plasmids. X-axis indicates the number of IncI1 plasmid MLST loci with a high-confidence allele assignment (score > 10) using SRST; y-axis indicates the proportion of the IncI1 ST16 reference plasmid, pEK204, covered by each read set

High-confidence IncI1 plasmid STs were assigned to a total of 26 S. sonnei. As Figure 3 shows, these represent a variety of very distinct IncI1 plasmid types, indicative of multiple transfers of divergent IncI1 plasmids into S. sonnei. The plasmid STs clustered geographically and, within geographic regions, temporally (Figure 3), suggesting there have been several, highly localized, transfers of distinct IncI1 plasmids into the global S. sonnei population.

Figure 3
figure 3

IncI1 plasmid STs detected among S. sonnei. Minimum spanning tree of all IncI1 STs present in the IncI1 plasmid MLST database as at March 13, 2012, generated using Phyloviz (http://www.phyloviz.net). STs detected among S. sonnei read sets are highlighted; node colour indicates the country of origin according to the legend provided (bottom-right); node size indicates the number of isolates according to the inset circle legend (top-left). Nodes representing known STs are labeled in black with the ST number (7, 13, 16, 26, 27, 55), unlabeled nodes are novel STs (novel combinations of known alleles). Dates of isolation are indicated for larger groups (coloured text). Arrow indicates the position of ST31, identified in isolates of E. coli O104:H4 associated with an outbreak in Germany in 2011

Outbreak analysis

Chromosomal and plasmid MLST can provide useful insights into bacterial pathogen outbreaks. SRST allows these insights to be rapidly extracted from whole-genome shotgun sequencing, which can be performed without prior knowledge of the species and with no need for PCR with species-specific MLST primers. To illustrate this, we utilised five Illumina short read data sets from the outbreak of E. coli O104:H4 causing hemolytic uremic syndrome in Germany in 2011 [27, 28] (accessions SRP000285, SRP008003, SRP008032-36, SRP007327). The data was generated in two different sites (BGI, China and Broad Institute, US). We used SRST to screen the outbreak data using the E. coli MLST database to identify the chromosomal ST and all publicly available plasmid MLST databases to identify IncI1, IncN, IncHI1 or IncHI2 plasmids (see above). Details of the datasets and results are provided in Table 2.

Table 2 Chromosomal and plasmid analysis of E. coli outbreak strains

SRST correctly identified the chromosomal ST of all five outbreak isolates as E. coli ST678, which matches that reported using traditional MLST approaches [29]. The closest available finished reference genome sequence to the E. coli outbreak strain, Ec55989, shares this ST, and has formed the reference for phylogenetic and gene content analyses of the German outbreak in all published studies [2730]. This illustrates the utility of SRST to rapidly identify the most suitable reference sequence for whole-genome analysis during an outbreak. At the time of the German outbreak, Ec55989 had not been entered into the E. coli MLST database and extensive read mapping to all available E. coli sequences (approximately N = 60 at the time) was required to identify the most suitable reference [28], however assuming all available reference sequences are entered into the relevant MLST databases, this could be achieved much more quickly and easily in future using SRST.

SRST also correctly identified the presence of an IncI1 plasmid of ST31 in each of the E. coli O104:H4 outbreak isolates (arrow in Figure 3). The outbreak strain’s antibiotic resistance plasmid was previously confirmed as IncI1 ST31 using traditional plasmid MLST (present in the IncI1 database as CTX-I1-O104:H4, see http://pubmlst.org/plasmid/). As Figure 3 shows, this plasmid is quite divergent from any we detected in the S. sonnei data. The IncI1 plasmid MLST database shows IncI1 ST31 plasmids have previously been identified in a variety of other E. coli hosts circulating in both humans and animals, often containing extended spectrum beta-lactamase CTX-M genes similar to that encoded in the outbreak isolates’ IncI1 plasmids (see IncI1 database at http://pubmlst.org/plasmid/). No IncN, IncHI1 or IncHI2 plasmids were identified by SRST, consistent with published reports of the outbreak genomes [2730].

Other potential applications

As SRST is database driven it could be used for other sequence typing tasks beyond MLST, provided appropriate databases are used as input. For example, it could be used to annotate drug resistance genes and alleles. Used in conjunction with the recent ribosomal MLST database [31], SRST could potentially be used for species designation of novel isolates. These applications could be useful in outbreak analysis, strain identification, surveillance, studies of mechanisms and transfer of drug resistance, and a variety of other public health and research applications.

Conclusions

SRST uses read mapping to assign sequence types to novel bacterial genomic sequence data, which offers several advantages over traditional MLST using PCR and Sanger sequencing. SRST is accurate and sensitive at allele assignment and can identify and retrieve novel allele sequences for further investigation. Being mapping-based it is more sensitive than assembly-based allele assignment for short read data sets, and can be run locally without reliance on web-based services or data uploads. SRST can be used in a variety of contexts, including simple allele assignment to novel data sets, quality control in batch sequencing projects, outbreak investigation, plasmid MLST and potentially in any scenario where database-driven sequence typing is required.

Availability and requirements

Project name: SRST (Short Read Sequence Typing)

Project home page: http://srst.sourceforge.net/

Operating system(s): Linux/Mac

Requirements: samtools 0.1.8, BWA 0.5.7 (open-source)

Programming language: Python 2.6.4

License: BSD

Any restrictions to use by non-academics: No