Introduction

Leptospirosis is a globally distributed zoonosis responsible for more than one million severe cases and 60,000 deaths per year, with the highest incidence in tropical countries [1]. The agent of leptospirosis belongs to the genus Leptospira, which is composed of 68 species and more than 300 serovars [2, 3]. The strains responsible for leptospirosis in humans or animals belong to one of the eight pathogenic Leptospira species described to date. Among them, L. interrogans is the most frequently encountered worldwide [4] and several studies have shown that strains belonging to the Icterohaemorrhagiae serogroup (L. interrogans serogroup Icterohaemorrhagiae), of which the main reservoir is the rat, are responsible for the most severe forms of the disease [5,6,7,8].

Pathogenic leptospires are slow-growing bacteria that require a rich culture medium susceptible to contamination by other organisms. Isolation from biological samples is therefore tedious, especially as the bacteria can be present in low concentrations in blood and urine. During the course of infection, the bacteria are present in the blood during the first week after the onset of symptoms, with their concentration, as determined by qPCR, ranging from 10² to 10⁶ Leptospira/mL at the peak of leptospiremia [9, 10]. A decreasing number of leptospires are then found in the blood 6–7 days after the onset of symptoms until Leptospira nucleic acid is no longer detectable [10, 11]. Leptospira can also be detected in urine after symptom onset, and for a longer duration than in blood. However, the bacteria may not be consistently present in the urine during the infection and both their concentration and the duration of their excretion are poorly defined.

The identification of circulating strains in a particular region is essential for establishing appropriate control and prevention measures such as development of vaccines, control of potential reservoirs, information for the general population, etc. Typing of the clinical isolates can also be important for identifying strains or virulence factors associated with disease severity. However, as indicated above, culture isolation is challenging.

Given the value of whole genomes for phylogenetic, epidemiological, and biological studies, there is an increasing interest in obtaining the genomic sequences of pathogens from clinical samples. This is particularly true for pathogens that are found in low quantities in the host organism and difficult to culture, as for pathogenic Leptospira. Single alleles, such as rrs [12], ligB [13], lfb1 [14], and secY [15,16,17], can be directly amplified from the samples and sequenced for subtyping but this approach provides only low-level resolution and does not allow discrimination between serovars and closely related strains. Multi-locus sequence typing (MLST) schemes using several alleles can also be used for direct typing from clinical samples, but this can result in incomplete allelic profiles [18,19,20] and they provide limited genetic information on the infecting strain. We recently developed a core genome MLST (cgMLST) scheme based on 545 genes that are highly conserved across the Leptospira genus [4]. Our cgMLST scheme allows the identification of pathogenic species, serogroups, and closely related serovars. However, this highly discriminatory approach requires culture isolation of clinical strains. Direct sequencing from clinical samples is hampered by high human host DNA contamination. Illumina sequencing of the cerebrospinal fluid of a patient with neuroleptospirosis showed, for example, only 0.016% of the sequence reads corresponding to the bacterial agent of leptospirosis [21]. Due to the low number of pathogenic microbes in clinical samples, several culture-independent genome sequencing methods have been recently developed using host depletion and/or microbial enrichment approaches. Targeted DNA enrichment, which relies on reference genomes of the target bacteria, has thus been used to retrieve the DNA of bacterial pathogens, such as Chlamydia trachomatis [22], Mycobacterium tuberculosis [23], and Treponema pallidum [24,25,26], from clinical samples.

Here, we describe a method utilizing biotinylated RNA probes designed specifically for L. interrogans DNA to capture the Leptospira core genomes defined by our cgMLST scheme [4] directly from routine diagnostic samples. This study demonstrates, for the first time, the successful and accurate high-coverage sequencing of Leptospira genomes directly from biological samples.

Methods

Samples

Twenty routine diagnostic samples (12 blood, 1 serum, and 7 urine samples) that tested positive for leptospirosis by real-time PCR at the French National Reference Center (NRC, Institut Pasteur) were analyzed in this study (Table 1); no culture isolation was attempted for these samples. Total DNA was extracted using DNeasy Blood and Tissue and QIAamp DNA extraction kits (Qiagen), and real-time PCR was performed using lfb1 as a target [27]. Sequencing of the lfb1 PCR products enabled identification of the Leptospira species. Samples infected with L. interrogans or the closely related species L. kirschneri and with a Ct value ≤ 38 were selected for this study (Table 1).

Table 1 Samples used in this study and identification of the infecting Leptospira strain

SureSelectXT target enrichment

In total, 42,117 custom-made SureSelect 120-mer RNA baits (total probe size 1459 Mb) based on 130 L. interrogans genome sequences (Additional file 2: Table S1) spanning the 545 core genes previously defined for our cgMLST scheme [4] were designed and synthesized by Agilent Technologies.

For all samples, libraries were prepared following the SureSelectXT HS Target Enrichment System for Illumina from Agilent Technologies. For pre-capture library preparation, between 2 and 200 ng of total gDNA was used. Briefly, samples were mechanically sheared using a Covaris E220 and end-repaired, and the ends were A-tailed for ligation of barcoded Illumina adapters. Ligated samples were amplified for 14 cycles and library quality was assessed using the Fragment Analyzer HS NGS migration kit. Libraries were captured individually, as recommended by Agilent. Captured libraries were pooled and sequenced using Illumina sequencing technology (MiniSeq or NextSeq 500 sequencers). We also sequenced five libraries (samples 3, 4, 5, 7, and 8) before capture on an Illumina MiniSeq sequencer to assess the efficiency of the capture method on patient samples.

Sequence analysis

Prior to mapping, reads were trimmed of the adapters. The variant calling pipeline described herein includes an additional trimming step performed during the mapping step, which removes bases with a Phred score < 30. The estimated depth of coverage was computed using the genome of the L. interrogans serovar Copenhageni strain Fiocruz L1-130 as a reference. Taxonomic analysis based on Kraken2 [28] using human, bacterial, and viral databases allowed us to classify an average of 99% of the reads with a minimum of 97.2% for one sample.

We generated a database of 273 genome sequences of representative pathogenic Leptospira strains from our publicly accessible online genome database https://bigsdb.pasteur.fr/leptospira/, which is based on the software framework Bacterial Isolate Genome Sequence Database [4]. Our allele database includes genomes from eight pathogenic species from 23 serogroups and 59 serovars isolated from patients from 40 countries (Additional file 2: Table S2).

Variant calling analysis was performed using the variant calling pipeline (v0.10.0) from the Sequana project [29] (https://github.com/sequana/variant_calling). Default parameters were used except for the minimum frequency, which was set to 0.5. The mapping step was performed using BWA (v0.7.17) [30] and variant calling was performed using freebayes (v1.3.2) [31]. Annotations were also included in the final HTML report by using prokka [32] annotation of the core genomes. All VCF (variant call format) files (20 samples times 273 strains) were processed to gather the number of SNPs, INDELs, and MNPs in each sample and each strain. The VCF files and the Python notebooks used to process them are available on Zenodo [https://doi.org/10.5281/zenodo.7584745].
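As an illustration of the counting step described above, the following minimal sketch tallies variant classes from VCF records. It is not the code of the Zenodo notebooks: the function names and the example records are hypothetical, and variants are classified from REF/ALT allele lengths only, which is a simplification (freebayes also reports a variant type in its INFO column, including complex events that this sketch does not distinguish).

```python
from collections import Counter


def classify_variant(ref: str, alt: str) -> str:
    """Classify one REF/ALT pair by allele length (a simplification)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    return "insertion" if len(alt) > len(ref) else "deletion"


def count_variants(vcf_lines):
    """Count variant classes in an iterable of VCF text lines."""
    counts = Counter()
    for line in vcf_lines:
        # Skip blank lines and header lines (## metadata and #CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]
        # A record may list several comma-separated alternate alleles.
        for alt in alts.split(","):
            counts[classify_variant(ref, alt)] += 1
    return counts


# Hypothetical records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "core\t101\t.\tA\tG\t900\t.\t.",
    "core\t250\t.\tAT\tGC\t850\t.\t.",
    "core\t400\t.\tA\tATT\t700\t.\t.",
    "core\t512\t.\tGCA\tG\t650\t.\t.",
]
print(count_variants(example))
```

Applied to each of the sample-by-strain VCF files, a tally of this kind yields one count per variant class for every sample and reference strain.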

Variants were removed from subsequent analysis if one of the following conditions was met: (i) a frequency of the alternate < 0.5 (minor variants), (ii) a strand balance < 0.2 (or > 0.8), indicating an unbalanced count of forward or reverse reads supporting the variant, or (iii) coverage < 10. For sample 18 (highest Ct), we allowed the depth to be as low as 4.
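The three exclusion criteria can be sketched as a single predicate. This is a hedged illustration, not the pipeline's implementation: the `Variant` class and its field names are hypothetical, and strand balance is assumed here to be the fraction of variant-supporting reads on the forward strand (computable, for example, from the forward and reverse alternate-observation counts that freebayes reports).

```python
from dataclasses import dataclass


@dataclass
class Variant:
    depth: int        # total reads covering the position
    alt_forward: int  # variant-supporting reads on the forward strand
    alt_reverse: int  # variant-supporting reads on the reverse strand

    @property
    def alt_count(self) -> int:
        return self.alt_forward + self.alt_reverse

    @property
    def frequency(self) -> float:
        return self.alt_count / self.depth if self.depth else 0.0

    @property
    def strand_balance(self) -> float:
        # Assumed definition: forward fraction of variant-supporting reads.
        return self.alt_forward / self.alt_count if self.alt_count else 0.0


def passes_filters(v: Variant, min_depth: int = 10, min_freq: float = 0.5,
                   min_balance: float = 0.2, max_balance: float = 0.8) -> bool:
    """Keep a variant only if none of the three exclusion criteria applies."""
    if v.depth < min_depth:          # (iii) insufficient coverage
        return False
    if v.frequency < min_freq:       # (i) minor variant
        return False
    # (ii) reads supporting the variant are unbalanced between strands
    return min_balance <= v.strand_balance <= max_balance


# Hypothetical variants, for illustration only.
balanced = Variant(depth=40, alt_forward=18, alt_reverse=20)
one_sided = Variant(depth=40, alt_forward=37, alt_reverse=1)
shallow = Variant(depth=6, alt_forward=3, alt_reverse=3)

print(passes_filters(balanced))              # well supported on both strands
print(passes_filters(one_sided))             # fails the strand-balance test
print(passes_filters(shallow))               # fails the default depth of 10
print(passes_filters(shallow, min_depth=4))  # relaxed depth, as for sample 18
```

Exposing the depth threshold as a parameter makes the relaxed setting used for sample 18 an explicit, per-sample choice rather than a separate code path.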

Results

Targeted enrichment was applied to 20 biological samples, consisting of blood, serum, and urine samples, from leptospirosis patients (Table 1). Samples were collected from patients living in mainland France, although patient E mentioned having traveled to the Philippines 2 weeks before the onset of symptoms. Among the 20 patients, most were male (71.2%) and the median (range) age was 46 (20–74) years. Patients presented with a median (range) of 3.9 (0–8) days of acute illness and most of them exhibited fever, headache, and myalgia; patient Q died.

Library preparation, hybridization, and subsequent enrichment were carried out on samples using the SureSelect Target Enrichment System (Agilent Technologies) [33] and custom-designed RNA baits. To better evaluate the efficiency of Leptospira capture, we compared the proportion of reads mapped to the Leptospira reference genomes with and without the SureSelect system for five samples (Fig. 1). The percentages of reads mapped to the Leptospira reference genomes for samples prepared without the target-enrichment steps were 0.0008% (sample 3), 1.36% (sample 4), 0.0008% (sample 5), 0.15% (sample 7), and 0.013% (sample 8). The percentages of reads mapped to the Leptospira genomes for the same samples prepared using the SureSelect system rose to 10%, 98%, 11%, 86%, and 61%, respectively. Thus, capture increased the proportion of Leptospira reads by two to four orders of magnitude (72- to 13,000-fold).
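The fold enrichment for each sample is simply the ratio of the post-capture to the pre-capture percentage of Leptospira reads. The short sketch below recomputes it from the rounded percentages quoted above; because those inputs are rounded, the recomputed values may differ slightly from the reported range. The variable names are illustrative only.

```python
import math

# Percentage of reads mapped to Leptospira, keyed by sample number
# (values as quoted in the text, already rounded).
before_capture = {3: 0.0008, 4: 1.36, 5: 0.0008, 7: 0.15, 8: 0.013}
after_capture = {3: 10, 4: 98, 5: 11, 7: 86, 8: 61}

# Fold enrichment = post-capture percentage / pre-capture percentage.
fold = {s: after_capture[s] / before_capture[s] for s in before_capture}

for sample, value in sorted(fold.items()):
    print(f"sample {sample}: {value:,.0f}-fold "
          f"(about 10^{math.log10(value):.1f})")
```

The smallest gain is seen for sample 4, which already contained more than 1% Leptospira reads before capture and therefore had the least room for improvement.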

Fig. 1

Percentage of Leptospira reads found in samples 3, 4, 5, 7, and 8 before (blue bars) and after hybridization capture using Leptospira RNA probes (orange bars)

Almost all of the bacterial reads were assigned to the family Leptospiraceae (Additional file 1: Fig. S1, S2); viral content was marginal (< 0.1%) (Additional file 1: Fig. S2). The average depth of coverage in the 20 target-enriched samples was 590 ×, ranging from 0.5 × (sample 18) to 6000 × (sample 4) (Additional file 1: Fig. S3), hence the large standard deviation of 1320 ×. Coverage across the genomes was computed using the Sequana coverage tool [34] to more precisely characterize the genomic variations in the different samples (Additional file 1: Fig. S4). The enrichment results, together with the average coverage (Additional file 1: Fig. S3), highlight several key points. Most samples had coverage above 50 ×, except samples 3, 5, 6, and 18. The coverage of samples 3 and 5 was still sufficient, at 42 × and 28 ×, respectively. Sample 6 had a low coverage of 8 ×. Finally, sample 18 was more problematic, as its coverage was below 1 ×. We also assessed the breadth of coverage (percentage of bases covered by at least one read) using L. interrogans serovar Copenhageni (id246) as a reference (Additional file 1: Fig. S4); it was above 99.5% for most samples, except for sample 6, which had a breadth of coverage of 85%, and sample 18, for which the breadth of coverage was only about 3%.
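The two coverage metrics used above can be made concrete with a small sketch. It assumes a list of per-base read depths over the reference (for example, as exported by a depth-reporting tool); the function name and the toy depth values are hypothetical and do not come from the study data.

```python
def coverage_summary(depths):
    """Return (mean depth, breadth in %) from per-base read depths.

    Breadth is the percentage of positions covered by at least one read,
    matching the definition given in the text.
    """
    n = len(depths)
    if n == 0:
        raise ValueError("no positions supplied")
    mean_depth = sum(depths) / n
    breadth = 100.0 * sum(1 for d in depths if d >= 1) / n
    return mean_depth, breadth


# Toy example: 8 reference positions, 3 of them uncovered.
toy_depths = [0, 0, 12, 30, 28, 10, 0, 40]
mean_depth, breadth = coverage_summary(toy_depths)
print(f"mean depth = {mean_depth:.1f} x, breadth = {breadth:.1f}%")
```

The two metrics are complementary: a sample can have a reasonable mean depth driven by a few highly covered regions while its breadth remains low, which is why both are reported.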

The Ct values of real-time PCR targeting the pathogen-specific target lfb1 ranged between 20 and 38, corresponding to 10⁵ bacteria/µl to less than 1 bacterium/µl [27]. There was a good correlation between the Ct values and both the proportion of mapped Leptospira reads and the depth of coverage (Fig. 2). Thus, the six samples with more than 90% Leptospira reads (samples 4, 7, 10, 11, 13, and 17) had Ct values < 32 (Table 1).

Fig. 2

Final depth of coverage as a function of the measured Ct for the 20 samples. The depth of coverage is based on the L. interrogans core genome. The correlation coefficient between the depth of coverage and Ct values was 0.76

The variant calling approach was performed to identify the genotype of each sample. We searched for the closest genome for a given sample using a database of 273 genomes of pathogenic Leptospira strains from different species, serogroups, and serovars originating from various geographic areas (Additional file 2: Table S2) by minimizing the distance between the raw sequencing data of the sample and the Leptospira reference genomes. The distance used was the count of high-quality variants found in a given sample relative to the different strains, as explained below. As the capture was designed using probes covering the core genomes only, we solely considered the core genome of the 273 strains; the average core genome length was 574.7 ± 8.5 kb. Although coverage was uneven in some samples, with the presence of spikes (excess coverage in short regions, low frequency trend in sample 20) (Additional file 1: Fig. S4), it was generally high enough for variant calling analysis. We first examined the number of SNPs. The distribution of SNPs across all genomes and samples was highly variable, with values ranging from 0 to 23,000 SNPs (average of 10,000). The SNP count histogram across the 273 strains is shown in Additional file 1: Fig. S5, in which 95% of the strains show a count above 100, whereas a few strains had SNP counts below 10 (Fig. 3A; Additional file 1: Fig. S5).

Fig. 3

Analysis of multiple nucleotide polymorphisms. Histograms of the number of single-nucleotide polymorphisms (SNPs) (A), INDELs, including multiple nucleotide polymorphisms (MNPs) (B), and insertions and deletions, excluding MNPs (C) found in sample 1 across all 273 Leptospira genomes

Interestingly, most of the examined samples had fewer than 10 SNPs relative to the reference genomes of L. interrogans serovar (sv) Icterohaemorrhagiae strain RGA (id97), L. interrogans sv Icterohaemorrhagiae strain Verdun (id106), and L. interrogans sv Copenhageni strain Fiocruz L1-130 (id246) (Table 2). These three strains are phylogenetically related (Fig. 4) and belong to L. interrogans serogroup (sg) Icterohaemorrhagiae. In particular, L. interrogans sv Copenhageni (id246) appeared to be the strain with the smallest number of SNPs in most samples (if we ignore samples 5, 6, and 7, which were distant from the serogroup Icterohaemorrhagiae, and sample 18, which had no SNPs due to low coverage). Using the minimum number of SNPs as the criterion for assignment, 15 of the 19 samples were assigned to L. interrogans sv Copenhageni (id246). Sample 19 had 2 SNPs relative to id246 and none relative to id106 and was thus assigned to L. interrogans sv Icterohaemorrhagiae (id106). Sample 5 was assigned to L. interrogans sv Zanoni (id228) from serogroup Pyrogenes, with a large number of SNPs (1272); the second-best hit among the 273 reference genomes was another Pyrogenes strain (id11). Samples 6 and 7 were assigned to L. kirschneri sv Grippotyphosa (id117) with 6 and 15 SNPs, respectively (Table 2). Sample 18 was excluded from the analysis due to its low average depth of coverage. Nevertheless, when we decreased the required depth for variant calling to 4 ×, the number of SNPs rose to approximately 20, on average, across all genomes (Additional file 1: Fig. S5). One strain had no SNPs (L. kirschneri sg Grippotyphosa; id149), and 15 other strains had 1 or 2 SNPs. These strains are close to each other in the phylogenetic tree (Fig. 4) and belong to L. kirschneri. In particular, id117 had only 1 SNP and was also the best hit for samples 6 and 7 (Table 2).
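The assignment rule used above, choosing the reference genome with the fewest high-quality SNPs, can be sketched as follows. This is an illustrative reconstruction rather than the authors' notebook code: the function name is hypothetical, the two counts for sample 19 are those quoted in the text, and the third entry is an invented placeholder standing in for a distant reference.

```python
def closest_references(snp_counts):
    """Return (minimum SNP count, sorted reference ids tied at that minimum)."""
    if not snp_counts:
        raise ValueError("no reference genomes supplied")
    best = min(snp_counts.values())
    ties = sorted(ref for ref, count in snp_counts.items() if count == best)
    return best, ties


# SNP counts for sample 19 against two references (from the text),
# plus one hypothetical distant reference for contrast.
sample_19 = {
    "id246": 2,
    "id106": 0,
    "hypothetical_distant_strain": 9500,  # placeholder, not a real result
}

best_count, best_refs = closest_references(sample_19)
print(best_count, best_refs)
```

Returning every reference tied at the minimum, rather than a single identifier, keeps ambiguous cases visible; such ties can then be examined with a second class of variants such as INDELs and MNPs.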

Table 2 SNPs, INDELs, and MNPs found in the 20 samples
Fig. 4

Phylogeny of Leptospira genomes and the sequenced samples. The Leptospira sequences from samples were compared using a database of 273 core genomes of pathogenic Leptospira strains (Additional file 2: Table S2) to find the closest homologue. Samples 1–4, 8–17, 19, and 20 had a minimal number of single nucleotide polymorphisms (SNPs), deletions, insertions, and multiple nucleotide polymorphisms (MNPs) relative to id246 (L. interrogans sv Copenhageni). Samples 6 and 7 are closely related to id117 (L. kirschneri sv Grippotyphosa) and sample 5 is closely related to id228 (L. interrogans sv Zanoni, sg Pyrogenes). The phylogenetic tree was inferred with RAxML from a multiple alignment generated with MAFFT (v7.490). Visualization was performed via the iTOL web service (https://itol.embl.de)

To confirm these results, we also examined other types of variants: insertions, deletions, and multiple nucleotide polymorphisms (MNPs). The counts of insertions and deletions are hereafter summarized together as the number of INDELs. The number of INDELs in sample 1 varied from 0 to 10,000 (Fig. 3B, C). Using the minimum of the total count of INDELs and MNPs across genomes, we obtained results similar to those of the SNP analysis. Sample 6 had several best hits (id117, id110, and id700), each with no deletions, one insertion, and one MNP. Interestingly, id117, id110, and id700 belong to serogroup Grippotyphosa and lie next to each other on the same phylogenetic branch. For sample 7, id117 was the closest, with one insertion and 2 MNPs. Other reference genomes from serogroup Grippotyphosa (id110, id315, and id700) were close by, with only 1 or 2 additional INDELs/MNPs. For sample 5, id228 had the minimum number of INDELs/MNPs, with 0 deletions, 6 insertions, and 100 MNPs. All other samples, which were close to id246 in terms of SNPs, had no INDELs or MNPs relative to id246, id97, and id106 (except sample 15, which had 1 MNP, and sample 20, which had several insertions or MNPs) (see Table 2 for details). Overall, the analyses of SNPs, INDELs, and MNPs converge to provide a robust assignment for each sample.

We further analyzed the SNPs and INDELs identified in coding regions for samples 1–4, 8–17, and 19–20, which were assigned to L. interrogans serovars Copenhageni and Icterohaemorrhagiae (Tables 1, 2). Comparison of the genome of L. interrogans serovar Copenhageni strain Fiocruz L1-130 with the sequences of the 545 core genes of the samples identified 2 to 7 SNPs per sample, distributed across 22 different genes (Table 3). SNPs in genes LIC11311, LIC13481, and LIC12955 were conserved in most if not all samples (Table 3); other SNPs were sample specific. Of the identified mutations in the core genes, 9 were synonymous and 13 were non-synonymous. The vast majority of genes with non-synonymous SNPs are annotated as involved in various biological processes such as amino acid transport and metabolism, energy production and conversion, lipid transport and metabolism, transcription, translation, ribosomal structure and biogenesis, etc. (Additional file 2: Table S3). We also noted 1 insertion causing a frameshift in LIC11604, which encodes a hypothetical protein (Additional file 2: Table S3).

Table 3 Genes showing SNPs and INDELs

Discussion

Genomics studies are proving to be important for the characterization of pathogen diversity and pathogenicity, yet the fastidious growth of Leptospira and the low abundance of Leptospira in clinical samples have presented a challenge for such studies. Indeed, it can take up to four months of incubation for a primary culture to become positive [35, 36]. In addition, certain Leptospira serovars require additional culture media supplements for their growth [37]. Our data demonstrate, for the first time, the suitability of target-capture technology for purifying very low quantities of Leptospira DNA from complex DNA populations in which the host genome is in vast excess. We show the successful enrichment of Leptospira DNA by the significant increase in the ratio of bacterial to human DNA post-hybridization in a subset of samples. The Ct strongly correlated with capture efficiency. In six samples with a high leptospiral burden (qPCR Ct values between 20 and 31, i.e., 10⁵ bacteria/µl to approximately 10² bacteria/µl), Leptospira reads accounted for > 90% of the total reads. Targeted enrichment was applied to blood, serum, and urine samples from leptospirosis patients a few days after symptom onset (0–8 days, average of 3.9 days), showing that the method can be applied to routine diagnostic samples. We show that enrichment of L. interrogans reads provides sequencing data that match the quality and quantity of data obtained via sequencing from cultures, with coverage above 50 × for 16 of the 20 samples, providing an opportunity to compare Leptospira strains from routine diagnostic samples with greater resolution than previously possible. This approach is still relatively expensive, currently costing approximately $300 per sample in our laboratory, but as next-generation sequencing costs continue to decline, it should become more affordable and accessible. To reduce the cost in future studies, samples can be barcoded and pooled before enrichment, thus enabling multiplexing of hybridization reactions. This approach has already been shown to decrease the cost considerably [24, 38].

The use of specific probes for L. interrogans is justified by the cosmopolitan nature of this species, which is found worldwide [4]. The species L. interrogans also hosts the most pathogenic serovars, such as those belonging to the Icterohaemorrhagiae serogroup [5,6,7,8]. Finally, L. interrogans is particularly appropriate for the use of target enrichment, as it has a relatively well-characterized clonal nature and L. interrogans strains from different origins show high genetic relatedness [4]. The specificity of the target enrichment probe sets was confirmed by our ability to specifically target L. interrogans (17/20). Furthermore, we were also able to target L. kirschneri (3/20), which is the closest species phylogenetically to L. interrogans (Table 1).

More importantly, this enrichment method effectively captures regions of diversity in the Leptospira core genome, which enables precise molecular typing of infecting strains. Comparison of these assembled sequences to the pathogenic Leptospira reference core genomes revealed only a limited number of SNPs for most of the samples. Remarkably, most samples had only a few SNPs (e.g., 4 SNPs in sample 1 relative to strain Fiocruz L1-130), even those with a low sequencing yield. Analyzing the SNPs, INDELs, or MNPs independently also provided coherent results, leading to robust assignment. Sample 5 is an exception, however: it presents more than 1,200 SNPs relative to the closest reference genome, which belongs to serogroup Pyrogenes. Interestingly, this sample was collected from a patient who had traveled to the Philippines, where Pyrogenes is the most prevalent serogroup in both humans and animals [18]. However, the high number of SNPs reported for this sample suggests that the genome of the infecting strain is not present in our database. Further work will be needed to isolate and sequence more strains from patients to provide a better picture of the strains that are circulating worldwide.

Because we have a significant number of samples from patients infected with L. interrogans serovars Copenhageni and Icterohaemorrhagiae, we can analyze the genetic diversity of the core genes among these samples. Direct sequencing from clinical samples avoids the numerous mutations that occur during in vitro passages, as previously described [39,40,41,42]. Previous comparative genomic studies of in vitro cultures of L. interrogans serovars Copenhageni and Icterohaemorrhagiae from different origins revealed that the genomes of these two serovars are highly conserved [43]. Similarly, we found a low proportion of mutations among the core genes during human infection and we did not identify any bias toward any particular biological function. One of the limitations of our study is that we analyzed only a part of the genome (575/4600 kb or 12.5%) and did not have access to the accessory genome, which usually includes a number of virulence functions. In the near future, custom-synthesized RNA probe sets could be designed to span the entire chromosome of L. interrogans. This will provide insights into the bias that may be introduced by culture, as previously shown for the spirochete Treponema pallidum [44], as well as increase our understanding of genetic diversity among strains and its impact on immune evasion, persistence, and disease outcome.