Background

The Leptotrichiaceae are a family of underexplored and rarely isolated microorganisms within the phylum Fusobacteria. The family contains species known from certain pathologies as well as colonising members of the resident microbiota. Many if not all species of the Leptotrichiaceae inhabit the oral cavities, gastrointestinal or urogenital tracts of humans and animals [1–3]. One of the reasons they are rarely encountered is that these fastidious bacteria depend on obligately anaerobic or capnophilic growth conditions and usually occur together with a high number of concomitant microorganisms. Some members of this family are well-known pathogens, such as Streptobacillus (S.) moniliformis, one of the two causative organisms of the bacterial zoonosis rat bite fever [4]. Recently, a number of novel species have been described, most of which could be attributed to clinical disease [5–8]. Numerous phylotypes also indicate that Leptotrichiaceae normally colonize mucous membranes [9–15], but when introduced into new tissue or host sites they are able to shift their pathogenic potential and cause severe and even life-threatening disease. With the increasing availability of next generation sequencing, a number of single genomes have been published [6, 16–20]. However, almost no comprehensive genomic studies including these microorganisms have been completed, nor have virulence properties been identified in these species. Phylogenetic studies and identifications within the phylum Fusobacteria have been based on single or multiple gene sequences such as 16S rRNA, 16S–23S rRNA internal transcribed spacer, gyrB, groEL, recA, rpoB, conserved indels and genes for group-specific proteins, 43-kDa outer membrane protein and zinc protease [18, 21–30].
In an attempt to characterize different members of this phylum, Gupta & Sethi proposed various conserved signature indels (CSIs) in amino acid sequences for the Leptotrichiaceae, of which three CSIs were found to be specific for this family [31]. However, no detailed phylogenetic and comparative genome studies dedicated to the Leptotrichiaceae have been published so far. Furthermore, owing to a general paucity of strains and of attempts to differentiate members of the same species, there is currently no tool available to type isolates in order to prove transmission chains. The data presented here were derived from 46 complete genomes from 20 different taxa of the family Leptotrichiaceae and aim to provide the first such comparative analysis. Our results confirm the picture of earlier phylogenies of this group, which are now based on a larger set of orthologous genes. We give a survey of the investigated genomes, including recently described species from this family. With a novel approach it was, furthermore, possible to accurately and unequivocally type isolates of S. moniliformis based on three variable number tandem repeat (VNTR) sequences. We thereby present, for the first time, a culture-independent, species-specific fingerprinting tool to type the most important causative organism of rat bite fever.

Results

Accession numbers

The GenBank/EMBL/DDBJ accession numbers for the genome sequences used in this study are summarized in Table 1.

Table 1 Strains as well as origins, clinical symptoms and host species of the Leptotrichiaceae members used in this study

Phylogenetic analysis based on orthologous genes

To determine the phylogeny within the genus Streptobacillus we aligned the allelic variations of 281 orthologous genes from 29 strains of S. moniliformis, S. ratti, S. notomytis, S. felis and S. hongkongensis, which resulted in 57,841 single nucleotide polymorphisms (SNPs). From these SNPs we inferred a maximum likelihood phylogeny showing the distance between the different species within this genus (Fig. 1). To resolve the phylogeny of the S. moniliformis group in more detail, we repeated this analysis with 775 orthologous genes present in 23 S. moniliformis strains, which resulted in 5,211 SNPs. These SNPs were also used to construct a maximum likelihood phylogeny (Fig. 2).

Fig. 1

Maximum likelihood phylogenetic tree of the genus Streptobacillus (strains 1–29 according to Table 1). The tree is based on 281 orthologous genes including 57,841 SNPs

Fig. 2

Unrooted maximum likelihood phylogenetic tree of 23 Streptobacillus moniliformis strains from this study. The tree is based on 775 orthologous genes including 5,211 SNPs

As shown in the tree, most S. moniliformis strains used for this study are unrelated and form a heterogeneous population without any significant clustering. Only strains A378/1 and B5/1, which both originate from the same source but lack a common epidemiological background, were phylogenetically indistinguishable.

Analysis of genomes and protein functions

The genome size in members of the Leptotrichiaceae varies between 1.22 and 4.42 Mbp, with Caviibacter (C.) abscessus and Sebaldella (Se.) termitidis having the smallest and largest genomes, respectively. Generally, and with the exception of Sebaldella termitidis, genomes are smaller than 2.45 Mbp. The genera Caviibacter and Sneathia (Sn.) are comparable with respect to genome size (1.22–1.34 Mbp), as are the genera Streptobacillus and Oceanivirga (O.) (1.38–1.90 Mbp). Members of the genus Leptotrichia (L.) form the second largest group with 2.31–2.47 Mbp. A general overview of the genomes of all strains under study is given in Table 2. A similar order can be observed with respect to coding DNA sequences (CDS), i.e., C. abscessus and Sneathia spp. possess 1212–1282 CDS, followed by Streptobacillus spp. and O. salmonicida (1293–1679), Leptotrichia spp. (1930–2365) and Sebaldella termitidis (4083). The average percentage of CDS within the whole genome displays a graded distribution within the family: a highly coding group consisting of the genera Caviibacter, Oceanivirga and Sneathia (89–93 %), an intermediate Streptobacillus spp. group (87 %) and a group containing the genera Leptotrichia and Sebaldella (84 %) with lower coding density. Although intra-genus variability can be considerable, the same grading is also reflected in the average gene densities and the average intergenic regions (in parentheses: average genes/Mbp; number of intergenic nt): O. salmonicida (1056; 79), C. abscessus (996; 76), Sneathia spp. (989; 84), Streptobacillus spp. (987; 115), Leptotrichia spp. (967; 144) and Sebaldella (936; 149). The organization of the genomes under study into clusters of orthologous groups (COGs) is depicted in Additional files 1 and 2 and shows, however, high intra-species as well as inter-species variation.
On a generic level, gene contents of COG classes J, L, D and F are inversely correlated with increasing genome size, whereas COG classes K, N, T and Q are positively correlated (see Additional files 1 and 2).
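The density figures quoted above follow from simple ratios. As a minimal sketch, assuming each genome is summarised by its total length and a list of CDS lengths (the function name and the toy numbers are hypothetical and not taken from Table 2), the quantities could be derived as follows. The mean intergenic length used here is a simplification that ignores overlapping genes and non-CDS features, so it need not reproduce the values reported above.

```python
def genome_stats(genome_bp, cds_lengths):
    """Summarise a genome from its length (bp) and the lengths of its CDS (bp)."""
    n_cds = len(cds_lengths)
    if genome_bp <= 0 or n_cds == 0:
        raise ValueError("need a positive genome length and at least one CDS")
    coding_bp = sum(cds_lengths)
    return {
        "cds": n_cds,
        # share of the genome covered by CDS, in percent
        "coding_percent": 100.0 * coding_bp / genome_bp,
        # number of genes per million base pairs
        "genes_per_mbp": n_cds / (genome_bp / 1_000_000),
        # simplification: non-coding bases spread evenly over all genes
        "mean_intergenic_nt": (genome_bp - coding_bp) / n_cds,
    }


# Hypothetical toy genome: 1 Mbp carrying 1000 CDS of 900 bp each
toy = genome_stats(1_000_000, [900] * 1000)
for key, value in toy.items():
    print(key, value)
```

Applied to every genome in a table such as Table 2, the same function would allow the genera to be ranked by coding percentage or gene density.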

Table 2 Analysis of genome data as well as predictions of coding regions of the Leptotrichiaceae members used in this study

Multiple-Locus Variable number tandem repeat Analysis (MLVA)

In silico VNTR analysis

Under default conditions, 127 repeats were identified by the tandem repeat finder. For further analysis, the three most variable VNTRs were identified according to the degree of variability of allele types identified by alignment analysis (Table 3). These three allelic loci were only present in S. moniliformis and thus proved to be specific for this microorganism (all other members of the Leptotrichiaceae were negative). The combination of the three loci yielded a high discriminatory index (0.94296 DI; Table 4).
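The discriminatory index of a typing scheme is commonly computed as Simpson's index of diversity in the Hunter-Gaston formulation, DI = 1 - Σ n_j(n_j - 1) / (N(N - 1)), where N is the number of strains and n_j the number of strains sharing genotype j. The sketch below assumes that this is the formula behind the reported value (the text only gives the result) and uses made-up genotype labels rather than the profiles of Table 4.

```python
from collections import Counter


def discriminatory_index(genotypes):
    """Hunter-Gaston discriminatory index for a list of genotype labels."""
    n = len(genotypes)
    if n < 2:
        raise ValueError("at least two strains are required")
    counts = Counter(genotypes)
    # number of ordered strain pairs that share a genotype
    shared_pairs = sum(c * (c - 1) for c in counts.values())
    return 1.0 - shared_pairs / (n * (n - 1))


# Hypothetical example: five strains, two of which share genotype "A"
example = ["A", "A", "B", "C", "D"]
print(round(discriminatory_index(example), 4))
```

A value of 1 means that every strain has its own genotype, whereas a value of 0 means that all strains are indistinguishable by the scheme.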

Table 3 Streptobacillus moniliformis specific Variable Number of Tandem Repeat (VNTR) primer sequences used in this study
Table 4 VNTR allele types of the Streptobacillus moniliformis strains used in this study

PCR-based validation of in silico results

The absence of the calculated VNTR loci was confirmed by polymerase chain reaction (PCR) in all Leptotrichiaceae members other than S. moniliformis (data not shown). In contrast, each of the ten S. moniliformis strains exhibited a specific band corresponding to its predicted tandem repeat pattern. Analysis of the sequenced PCR products confirmed the allele type allocation determined in silico (Table 4). VNTR_Sm1 alleles of two isolates, which were not found in silico, were successfully assigned (Table 4). Re-calculation revealed a DI of 0.9529 after including these two isolates as well as one isolate for which no genome data were available. In order to facilitate comparisons of results in future studies, every genotype (the combination of allele types at the three loci) was expressed as a specific allele code (Table 4). An online database dedicated to MLVA results of S. moniliformis has been established on the webserver of University Paris-Saclay, Orsay, France (http://microbesgenotyping.i2bc.paris-saclay.fr/databases/public), which is open to future entries and strain comparisons.
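Turning the allele types of three loci into one allele code amounts to labelling each distinct combination. The following sketch illustrates the idea under stated assumptions: profiles are tuples of repeat copy numbers, codes are numbered in order of first appearance, and the prefix "GT" as well as the strain names are hypothetical. It does not reproduce the actual code assignment of Table 4.

```python
def assign_allele_codes(profiles, prefix="GT"):
    """Map each strain to a code that identifies its combination of allele types.

    profiles: dict of strain name -> tuple of allele types (one per VNTR locus).
    Codes are numbered in the order in which new combinations first appear.
    """
    code_of_profile = {}
    code_of_strain = {}
    for strain, profile in profiles.items():
        key = tuple(profile)
        if key not in code_of_profile:
            code_of_profile[key] = f"{prefix}{len(code_of_profile) + 1}"
        code_of_strain[strain] = code_of_profile[key]
    return code_of_strain


# Hypothetical copy numbers at three loci for three strains
toy_profiles = {
    "strain_1": (3, 5, 2),
    "strain_2": (4, 5, 2),
    "strain_3": (3, 5, 2),  # same combination as strain_1
}
print(assign_allele_codes(toy_profiles))
```

Strains that receive the same code are indistinguishable by the three loci and would be candidates for further comparison.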

Discussion

Members of the Leptotrichiaceae are rarely encountered microorganisms, a phenomenon that seems to depend largely on difficulties with cultivation. With the availability of molecular methods in this field, the number and frequency of findings have increased significantly [10–15, 32–36]. Nevertheless, we still need deeper insight into the genomes of this group. In particular, the mechanisms involved in pathogenesis and virulence of pathogenic species are completely unexplored. We have taken a first step in this direction by analysing a broad spatio-temporal collection of strains, including especially species with regular evidence for pathologies. Firstly, the large dataset from this study has been used to confirm the phylogenetic picture from our earlier studies [18, 30, 37, 38]. An intra-genus phylogeny that was based on 775 orthologous genes revealed a very similar picture to previous studies involving only four selected functional genes (Figs. 1 and 2). Conversely, and in contrast to almost identical average nucleotide identity (ANI) values [30], full genome analyses revealed a high level of heterogeneity for all but two strains (no. 15 and 18) of S. moniliformis without any significant clustering. This is, however, not surprising, because the present study included a large spatio-temporal collection of 23 S. moniliformis strains that were isolated over a period of 90 years from at least five different host species and from almost all subcontinents. We were also able to identify the three predicted Leptotrichiaceae-specific CSIs of MreB/MrI (2 aa deletion), AlaS and RecA (5 and 2 aa insertions, respectively) in all of our genomes as well as in the recently described members of the family (data not shown) [31].

Genome size dependent gene content has been described and could also be confirmed for the genomes from this study [19]. With increasing genome size, the gene contents of COG classes J, L, D and F, involved in DNA replication, cell cycle regulation and protein translation, are inversely correlated, whereas those of COG classes K, N, T and Q, involved in transcription, signal transduction, cell motility and the biochemistry of secondary metabolites, are positively correlated (see Additional files 1 and 2). This is plausible if essential gene functions are preserved in smaller genomes, whereas less important gene functions, which are dispensable or can be ‘outsourced’ to the host, are lost [19]. At first glance, the group of S. moniliformis strains is highly similar, as can be concluded from related morphological and phenotypical properties and also from their high intra-species ANI of 98.5–99.3 % (cf. Table S2 in [30]). Based on data from this study, very similar COG classes were also observed within this group (see Additional files 1 and 2), but differences in coding densities suggested remarkable discrepancies. Motivated by the idea that these discrepancies could be exploited for epidemiological purposes, we have developed a specific MLVA typing scheme for the major pathogen of this group, S. moniliformis, the causative organism of rat bite fever. This scheme proved sufficient to unequivocally type all 23 S. moniliformis strains under study plus one additional isolate with high discriminatory power (DI 0.9529). Interestingly, only four allele codes (genotypes; LHL2, LHL5, LHL10 and LHL11) were found more than once among isolates (Table 4). At least for the LHL2 isolates, a connection could be traced in that both isolates had been stored in the same strain collection, although a direct transmission could not be proven.
To check the clonality of isolates belonging to these four genotypes, we investigated further loci with high discriminatory potential, i.e., the clustered regularly interspaced short palindromic repeats (CRISPR) region known to occur in S. moniliformis (http://crispr.u-psud.fr/cgi-bin/crispr/SpecieProperties.cgi?Taxon_id=519441). In contrast to the isolates of all other shared allele codes (LHL5, LHL10, LHL11), both strains (no. 15 and 18) belonging to the allele code LHL2 indeed shared an identical CRISPR region, pointing towards a clonal relation of these two isolates (data not shown), as could also be concluded from the phylogenetic tree (Fig. 2). Owing to its length of up to approximately 3,000 nucleotides and its high level of heterogeneity, the CRISPR region presently does not seem well suited as a direct typing tool, but could be useful in certain situations to confirm or refute clonality of strains. A further advantage of the MLVA method described in this study is that it can be applied directly to the original matrix (e.g., a mouth microbiota swab or a clinical sample) without prior cultivation of the organism, which offers the possibility to better understand transmission chains in the future. This seems especially relevant since established PCR assays are not species specific but limited to genus level specificity [37, 39, 40]. The majority of diagnoses of rat bite fever cases in the recently published literature rely only on partial 16S rRNA gene sequence analysis, which may, in the light of very similar novel Streptobacillus spp. that also colonize rats, be quite uncertain for proper pathogen identification [41]. Hopefully, the newly established MLVA database will help to clarify regional infectious clusters and confirm transmission of certain lineages.

Conclusion

We have undertaken a first analysis of Leptotrichiaceae genomes using a large spatio-temporal collection of strains that also includes novel members of this group. Our dataset provides a first insight into characteristics underlying a stable phylogeny, genome structure and COG classes. Besides apparent intra-species similarities, we have also detected genetic heterogeneities that provided a basis for fingerprinting the most relevant pathogen from this group, the rat bite fever organism S. moniliformis. This useful and economical tool can be applied directly to clinical samples without demanding prior cultivation and offers high discriminatory power. Our data form the basis for a newly established MLVA database that provides the opportunity to store and compare isolate-specific information in future cases of this neglected zoonosis.

Methods

Generation of genomic data

Twenty-two strains of S. moniliformis were sequenced in this study, ten strains were taken from previous publications of our group and 15 strains originated from other projects (Table 1). Genomic DNA was extracted from a 72 h bacterial culture with a commercial kit according to the manufacturer’s instructions (MasterPure™ Complete DNA and RNA Purification Kit, Epicentre, distributed by Biozym Scientific, Hessisch Oldendorf, Germany). Whole genome sequencing of the strains was performed on an Illumina MiSeq with v3 chemistry, resulting in 300 bp paired-end reads and a coverage of greater than 90×. Quality trimming and de novo assembly were performed with CLC Genomics Workbench, Version 7.5 (CLC Bio, Aarhus, Denmark). For automatic annotation we used the RAST Server (Rapid Annotations using Subsystems Technology) [42]. Data from further relevant reference genomes of the Leptotrichiaceae were obtained from the National Center for Biotechnology Information (NCBI) database (http://www.ncbi.nlm.nih.gov). Sequence analyses and genome calculations as well as oligonucleotide primer design were carried out with Geneious (v. 8.1.3; Biomatters, Auckland, NZ) [43]. Table 1 lists the set of strains and reference genomes used for this study.

Phylogenetic analysis based on orthologous genes

The maximum common genome (MCG) alignment was determined from those genes present in all genomes considered for comparison [44]. Based on the parameters sequence similarity (minimum 70 %) and coverage (minimum 90 %), the genes were clustered, and those genes that were present in each genome and fulfilled the threshold parameters were defined as the MCG. This resulted in 281 orthologous genes for the comparison of 29 strains of S. moniliformis, S. ratti, S. notomytis, S. felis and S. hongkongensis and in 775 orthologous genes for the comparison within 23 strains of S. moniliformis only.
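The selection rule for the MCG can be illustrated with a short sketch. It assumes that, for every candidate gene, the best hit in each genome is already available as a pair of similarity and coverage percentages; the data layout, the function name and the toy values are hypothetical, and the clustering procedure of reference [44] is not reproduced here.

```python
def maximum_common_genome(hits, genomes, min_similarity=70.0, min_coverage=90.0):
    """Return the genes found in every genome above both thresholds.

    hits: dict of gene -> dict of genome -> (similarity %, coverage %).
    A gene missing from a genome's dict counts as absent from that genome.
    """
    core = []
    for gene, per_genome in hits.items():
        passes_everywhere = all(
            genome in per_genome
            and per_genome[genome][0] >= min_similarity
            and per_genome[genome][1] >= min_coverage
            for genome in genomes
        )
        if passes_everywhere:
            core.append(gene)
    return sorted(core)


# Hypothetical hits for four genes in three genomes
genomes = ["strain_A", "strain_B", "strain_C"]
hits = {
    "gene1": {"strain_A": (100.0, 100.0), "strain_B": (95.2, 98.0), "strain_C": (88.0, 91.5)},
    "gene2": {"strain_A": (100.0, 100.0), "strain_B": (65.0, 99.0), "strain_C": (90.0, 95.0)},
    "gene3": {"strain_A": (100.0, 100.0), "strain_C": (99.0, 99.0)},
    "gene4": {"strain_A": (100.0, 100.0), "strain_B": (80.0, 85.0), "strain_C": (90.0, 95.0)},
}
print(maximum_common_genome(hits, genomes))
```

In this toy set only gene1 qualifies: gene2 falls below the similarity threshold in one genome, gene3 is absent from one genome and gene4 falls below the coverage threshold. Because every added genome can only remove genes from the set, a broader strain collection yields a smaller MCG, which is consistent with 281 genes for 29 strains versus 775 genes for 23 strains.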

The allelic variants of these genes were then extracted from all genomes by a BLAST-based approach, aligned individually for each gene and concatenated, which resulted in an alignment of 219,961 bp for the 29 strains and of 546,508 bp for the 23 S. moniliformis strains [45].

This alignment was used to generate a phylogenetic tree with randomized axelerated maximum likelihood (RAxML) 8.1 [46] using a General Time Reversible model and gamma correction for among site rate variation.
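The concatenation of per-gene alignments and the identification of SNP positions can be sketched as follows. The example assumes that each gene has already been aligned, so that all sequences of one gene have the same length, and it counts a column as a SNP only if at least two different unambiguous bases occur; ignoring gaps and ambiguity codes is an assumption of this sketch, not a statement about the published pipeline. All names and sequences are hypothetical.

```python
def concatenate_alignments(gene_alignments, strains):
    """Join per-gene alignments (dict strain -> aligned sequence) strain by strain."""
    parts = {strain: [] for strain in strains}
    for alignment in gene_alignments:
        lengths = {len(alignment[strain]) for strain in strains}
        if len(lengths) != 1:
            raise ValueError("sequences within one gene alignment differ in length")
        for strain in strains:
            parts[strain].append(alignment[strain].upper())
    return {strain: "".join(chunks) for strain, chunks in parts.items()}


def snp_columns(alignment):
    """Return 0-based column indices at which two or more of A, C, G, T occur."""
    sequences = list(alignment.values())
    positions = []
    for i in range(len(sequences[0])):
        # gaps and ambiguity codes are dropped before counting distinct bases
        bases = {seq[i] for seq in sequences} & set("ACGT")
        if len(bases) > 1:
            positions.append(i)
    return positions


# Hypothetical alignments of two genes in three strains
strains = ["a", "b", "c"]
gene1 = {"a": "ATGC", "b": "ATGA", "c": "ATGC"}
gene2 = {"a": "GG-T", "b": "GGAT", "c": "GCAT"}
concatenated = concatenate_alignments([gene1, gene2], strains)
print(len(concatenated["a"]), snp_columns(concatenated))
```

The resulting SNP columns, or the full concatenated alignment, would then be written to a file format accepted by the tree-building program.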

Analysis of genomes and protein functions

Genes were predicted with Prodigal [47] and assigned to COGs with the NCBI’s Conserved Domain Database [48].

Multiple-Locus Variable number tandem repeat Analysis (MLVA)

In silico VNTR analysis

The complete genome sequence of the S. moniliformis type strain DSM12112T (accession number CP001779.1) was used to search for potential VNTRs using a tandem repeat finder web tool (http://tandem.bu.edu/trf/trf.basic.submit.html). We focused our search on repeats that were characterized by high purity, large size, and/or large number of repeat copies [49]. Repeats of interest were aligned against the set of available genomes listed in Table 1 using Geneious, and allele types were determined from the repeat copy numbers. The DI was calculated for the combination of the three most variable VNTRs using an online discriminatory power calculator (http://insilico.ehu.es/mini_tools/discriminatory_power/).
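Determining an allele type from a repeat copy number can be illustrated with a small function that finds the longest run of consecutive copies of a repeat unit in a sequence. This is a deliberately simplified sketch: it counts exact copies only, whereas the tandem repeat finder used in this study also tolerates imperfect copies, and the repeat unit and sequences shown are invented for illustration.

```python
def max_copy_number(sequence, unit):
    """Return the largest number of consecutive exact copies of unit in sequence."""
    seq = sequence.upper()
    unit = unit.upper()
    if not unit:
        raise ValueError("repeat unit must not be empty")
    step = len(unit)
    best = 0
    # try every start position so that runs in any phase are found
    for start in range(len(seq) - step + 1):
        copies = 0
        pos = start
        while seq.startswith(unit, pos):
            copies += 1
            pos += step
        best = max(best, copies)
    return best


# Hypothetical locus with flanking sequence around three copies of "ACG"
print(max_copy_number("TTACGACGACGTT", "ACG"))
```

Comparing the copy number obtained for the same locus in different genomes gives the allele type of each strain at that locus.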

PCR-based validation of in silico results

Ten S. moniliformis strains (strain nos. 1, 2, 3, 12, 14, 15, 21, 22 and 23 according to Table 1 plus strain A40-13 for which complete genomic data were not available) as well as all accessible members of the Leptotrichiaceae other than S. moniliformis were used for validation. DNA was extracted from the respective isolates (2–3 colonies) by boiling in 100 μL distilled water for 20 min followed by centrifugation at 20,817 × g for 5 min. The 20 μL final PCR reaction contained 10 μL of Hotstar Taq MasterMix (Qiagen, Hilden, Germany), 1 μL of each forward and reverse primer (10 pmol/μL) (TIB MOLBIOL, Berlin, Germany) (Table 3), 6 μL DNase free PCR grade water (Qiagen), and 2 μL of the extracted DNA. PCR conditions were as follows: 1× (95 °C, 15 min), 40× (94 °C, 30 s; 58 °C, 30 s; 72 °C, 30 s), 1× (72 °C, 10 min). PCR products were separated on a 2 % agarose gel (100 V for 1.5 h) stained with ethidium bromide and then analyzed with a gel documentation system (BioDoc-It, UVP, UK). The PCR amplicons were purified using MicroElute DNA Cycle-Pure Kit (OMEGA bio-tek, Norcross, USA) and sequenced at Seqlab-Microsynth laboratories (Göttingen, Germany). All sequences were analyzed with the tandem repeat finder web tool and/or BLASTN 2.3.1+ [50] hosted on the NCBI website and compared to the in silico results.
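For setting up several reactions at once, the per-reaction volumes given above can be scaled to a master mix. The sketch below takes the component volumes from the text; the pipetting overage of 10 % and the function name are assumptions added for illustration and are not part of the published protocol. The template DNA is kept out of the mix because it is added to each tube individually.

```python
# Volumes per 20 uL reaction as stated in the text (template handled separately)
PER_REACTION_UL = {
    "Hotstar Taq MasterMix": 10.0,
    "forward primer (10 pmol/uL)": 1.0,
    "reverse primer (10 pmol/uL)": 1.0,
    "PCR grade water": 6.0,
}
TEMPLATE_UL = 2.0


def master_mix(n_reactions, overage=0.10):
    """Scale the per-reaction volumes to n reactions plus a pipetting overage.

    The default overage of 10 % is a hypothetical convenience value.
    """
    if n_reactions < 1 or overage < 0:
        raise ValueError("need at least one reaction and a non-negative overage")
    factor = n_reactions * (1.0 + overage)
    return {name: round(volume * factor, 2) for name, volume in PER_REACTION_UL.items()}


# Example: mix for ten reactions; 18 uL of mix plus 2 uL template go into each tube
for name, volume in master_mix(10).items():
    print(f"{name}: {volume} uL")
```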