Background

Clostridium difficile (reclassified as Clostridioides difficile [1]) is an enteric pathogenic bacterium that can cause symptomatic disease, which ranges in severity from fever and diarrhoea to the development of pseudomembranous colitis and toxic megacolon [2]. Clostridium difficile infection (CDI) occurs following antibiotic treatment as new ecological niches become available upon disruption of the normal microbiota [3]. CDI may arise from ingested endospores transmitted via the faecal-oral route, or from vegetative cells already present in the patient, as the bacterium can be asymptomatically carried in adults and children [4]. CDI may also be contracted outside the hospital setting [4], and C. difficile has been isolated from food products [4–6], from surfaces around the home [7] and from swimming pools [7]. It has also been isolated from the natural environment including river water, soils, sea water and estuarine sediments [7–10]. The presence of C. difficile at these sites may be due to contamination with sewage or agricultural run-off, yet bacteria from these locations could be re-introduced to the food chain, for example via contaminated shellfish or seafood [11, 12], and they have been implicated in the infection of marine mammals [13].

The movement of C. difficile between reservoirs is particularly pertinent for isolates of the PCR ribotype 078 (R078). This is an epidemic strain, first identified in livestock and subsequently in clinics across Europe [14]. Although R078 is pathogenic, it is not clear to what extent virulence, as opposed to strain fitness, determines which strains come to prominence in the hospital environment [15, 16]. R078 strains form a lineage divergent from other major ribotypes [17], as also determined via multilocus sequence typing (MLST) analysis [18, 19] and core genome phylogenies [20, 21]. Previously, we isolated a R078 strain, CD105HS27, from estuarine sediment [9] and sequenced its genome using Illumina HiSeq 2000, generating a draft assembly [22]. The carriage of transposon Tn6293 (previously unnamed) and the absence of Tn6164 were confirmed in this study from the results of Single Molecule Real-Time (SMRT) sequencing. The accessory gene content in C. difficile as a species is high relative to the size of its core genome [23], and it is characterised by multiple mobile genetic elements (MGEs) which include transposons, integrated conjugative elements, plasmids and prophages (for recent reviews see [23–25]). The acquisition of antibiotic resistance and novel virulence factors is thought to drive the evolution of C. difficile strains as pathogens [26], but the ecology of the species outside of the human host is little understood.

Recently, SMRT technology has been applied to sequence C. difficile genomes, exploiting the long read data to determine chromosomal structure, mobile genetic content and methylation patterns [27–31]. The re-sequencing of previously analysed strain CD630 showed differences in its ribosomal operon, transposon and tRNA content [28, 31]. In this study we first determined if re-sequencing the reference strain M120 (R078) using SMRT would reveal differences in the chromosomal architecture. Next, we compared SMRT generated genome sequences of M120 with CD105HS27 in order to gain a better understanding of the differences between an environmental isolate and a clinical strain. To date, SMRT has not been applied to isolates of R078. In addition to analysing the genomic data, we compared methylation patterns across the genome. Because the CRISPR/Cas system can also provide immunity to invading DNA elements, we assessed its potential to target MGEs for each strain. In both cases, understanding mechanisms that govern horizontal gene transfer in C. difficile provides insight into the genome evolution of this pathogen.

Results and discussion

Genome features of M120 and CD105HS27

The two genome assemblies generated using SMRT are in near-complete condition; the genome of M120 is 4,082,634 bp with an average coverage of 16.3×, an average 28.73% GC content, and is comprised of two contigs of 4,069,609 bp and 13,024 bp in length. The total sequence for CD105HS27 is 4,122,476 bp, with an overall coverage of 15.75× and an average 29.15% GC content, and consists of five contigs of 3,462,540 bp, 339,877 bp, 174,028 bp, 146,675 bp and 1156 bp, respectively.

Both assemblies were compared to the reference genome of M120, which is a single chromosome 4,047,729 bp in length with an average 28.76% GC content. The 13,024 bp contig contains a set of 5S, 16S and 23S rRNA genes and 19 tRNA genes, and has a duplicated region encoding an identical tRNA (Alanine) and 16S rRNA gene (dot plot data not shown), in addition to predicted CDSs encoding glycosyl transferases, a DNA polymerase subunit and recombination protein RecR. The relative coverage of this contig is on average 1.3× (see Fig. 1). Determining whether this contig represents a sequence mobilisation event present at a low copy number requires experimental investigation.

Fig. 1

Genome features and comparisons of M120 and CD105HS27. Comparison between M120 reference genome (top), M120 sequenced with SMRT (middle) and CD105HS27 (below). The genomes are connected by regions indicating nucleotide (nt) sequence similarity with notable genomic features annotated at locations along the genome including the PaLoc (Pathogenicity Locus), C. difficile binary toxin (CDT) genes, C. difficile sigK intervening (skinCd) element, flagella gene region 1 (F1) and annotated transposons. The GC% is provided for all three genomes alongside the coverage and methylation modifications for N4-methylcytosine (m4C), N6-methyladenine (m6A) and undetermined modified bases. Boxes highlight the different methylation patterns observed across each of the unique transposons

Annotation of the re-sequenced M120 genome identified 3541 CDSs, 101 tRNAs and 39 rRNAs; this is consistent with the reference genome, but includes an additional 15 tRNAs and 7 rRNA genes. Similar observations were made in a SMRT sequenced genome of CD630Δerm, with additional tRNA and rRNA genes located in a novel ~5 kbp insertion [28]. This was attributed to adaptation during laboratory culture, as extra ribosomal gene operon copies have been shown to affect fitness in E. coli with regard to nutrient availability [32]. Furthermore, recombination events have been suggested as a mechanism for generating the diversity of ribotypes in C. difficile [33].

The genome of CD105HS27 has 3598 CDSs, 93 tRNA and 47 rRNA genes. The chromosome breaks are located in regions encoding ribosomal genes, which appear to have undergone duplication events across the genome. The application of SMRT can also improve the assembly of other regions containing repeat sequences. For example, toxin gene carriage had previously been confirmed by PCR for CD105HS27 [9], but an Illumina generated draft assembly of its genome resulted in fragmented versions of tcdA and tcdB [22]. Here, these genes have been resolved fully. CD105HS27 has 79 CDSs that are not present in M120, most of which are encoded on Tn6293. In contrast, M120 has 103 CDSs that are not present in CD105HS27, of which 102 are encoded on Tn6164. The predicted genetic content of these two transposons suggests that they may be conjugative transposons, although this has yet to be demonstrated experimentally. Therefore, these should be re-termed as putative conjugative transposons CTn6164 and CTn6293. Tn6164 is a large (~100 kbp) element that appears to comprise two MGEs, including a prophage region which shares similarity with the Streptococcus conjugative phage Φ1207.3 [34]. Φ1207.3 has been demonstrated to transfer between strains via conjugation and was originally annotated as a conjugative transposon [35], but it contains conserved phage genes including those predicted to encode terminases, capsid, tail and holin proteins, leading to its re-designation as a conjugative prophage [36]. Prophages transmitting via conjugation appear rarely in the literature (e.g. [37]). Whether these prophages also transfer via conjugation has not been established; however, their discovery suggests that this mechanism may occur more widely than previously known.

The two genomes are related, sharing an average nucleotide identity of 99.98% based on the whole genome sequence (following the method described in [38]). Alignment of the whole genomes using MAUVE and its SNP (Single Nucleotide Polymorphism) detection tool showed that the aligned sequences differed in 85 positions by single nucleotide changes. Further comparison of the two genomes via BLASTn (Fig. 1) and within a dotplot (Fig. 2) revealed extensive sequence similarity between the two strains, with the exception of two large indel (insertion-deletion) regions (~100 kbp) that carry the putative CTns Tn6164 and Tn6293, the movement of Tn6190, and inversion rearrangements. Use of SMRT has previously shown major chromosomal rearrangements from resequencing the genome of strain CD630, in addition to duplication of ribosomal gene operons [28, 30]. One mechanism for these rearrangements is the movement of MGEs, as seen in the mutant CD630Δerm, where the re-mobilisation of transposon CTn5 led to the inversion of the genome sequence [28]. What effect such chromosomal re-engineering has on the physiology of the cell in terms of gene expression is not known, but it may be significant, as has been described for the control of DNA elements from the chromosome in the regulation of diverse bacterial processes [39].
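To illustrate what a count of single nucleotide differences between two aligned sequences involves, the following minimal Python sketch counts mismatched columns in a pairwise alignment while ignoring gaps and ambiguous bases. It is not the MAUVE implementation; the function name count_snps and the toy sequences are hypothetical and chosen only for illustration.

```python
def count_snps(aligned_a: str, aligned_b: str) -> int:
    """Count columns where two aligned sequences carry different bases.

    Columns containing a gap ('-') or any character other than A/C/G/T
    in either sequence are skipped, so indels are not counted as SNPs.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    bases = set("ACGT")
    return sum(
        1
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a in bases and b in bases and a != b
    )


# Toy alignment (hypothetical): one substitution and two gapped columns.
print(count_snps("ACGT-ACGT", "ACCTTAC-T"))  # prints 1
```

Gapped columns are excluded deliberately, so that large indel regions such as those carrying the putative CTns do not inflate the count of single nucleotide changes.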

Fig. 2

Dotplot of the two genome sequences with indel regions and chromosomal rearrangements. Pairwise comparison of the two nucleotide sequences was performed using a dotplot matrix. The results show regions of shared sequence along the chromosomes (black line) and where there are insertion-deletion (indel) events that result in no sequence similarity shared between the genomes (white gap). The two largest gaps (~100 kbp each) correspond to the positions of the putative CTns, Tn6164 in M120 and Tn6293 in CD105HS27. The conserved but differently positioned Tn6190 is also shown. The contigs for each genome are illustrated along the sides to show that the chromosomal rearrangements occur within the assembled contig boundaries

In silico typing of M120 and CD105HS27

In C. difficile, ribotyping is one of the main methods used to categorise strains. In silico ribotyping was performed to assess the outcomes from the SMRT generated genomes and to explain how the duplication events affect the ribotype profile. As expected from the different numbers of total rRNA genes, the profiles differ, with 11 bands predicted from the M120 reference, 12 from M120 SMRT and 16 from CD105HS27 (Additional file 1: Table S1). The profiles differ by duplication of identical sized regions, in addition to bands of different lengths, which may affect the ribotypes assigned. While ribosomal gene regions assemble poorly in Illumina datasets, the ability to generate near complete genomes using SMRT technology shows how ribosomal operon duplication and recombination events could be tracked.
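The idea behind in silico ribotyping can be sketched as a simple in silico PCR: locate each forward primer site, pair it with the nearest downstream site of the reverse primer, and report the predicted product lengths as bands. The Python sketch below is a minimal illustration under stated assumptions. It searches one strand only, requires exact primer matches, and uses short placeholder primers and a toy sequence that are hypothetical (they are not the oligonucleotides of Bidet et al. [80]). The function names and the max_len cut-off are likewise assumptions.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_all(text: str, pattern: str) -> list:
    """Return the start index of every (possibly overlapping) exact match."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def predicted_bands(genome: str, fwd: str, rev: str, max_len: int = 2000) -> list:
    """Predict amplicon lengths for one primer pair on the given strand.

    Each forward-primer site is paired with the nearest downstream site of
    the reverse primer's reverse complement; products longer than max_len
    (an arbitrary illustrative cut-off) are discarded.
    """
    genome = genome.upper()
    fwd = fwd.upper()
    rev_rc = reverse_complement(rev)
    rev_sites = find_all(genome, rev_rc)
    bands = []
    for f in find_all(genome, fwd):
        downstream = [r for r in rev_sites if r >= f + len(fwd)]
        if downstream:
            length = min(downstream) + len(rev_rc) - f
            if length <= max_len:
                bands.append(length)
    return sorted(bands)


# Hypothetical placeholder primers and a toy "genome" with two operons
# whose intergenic spacers differ in length (10 bp and 25 bp).
FWD, REV = "ACGTAC", "TTGCAG"
toy = ("TT" + FWD + "A" * 10 + reverse_complement(REV)
       + "GG" + FWD + "C" * 25 + reverse_complement(REV) + "TT")
print(predicted_bands(toy, FWD, REV))  # prints [22, 37]
```

In this toy case, a duplicated operon with an identical spacer length would add a second band of the same size, which is how duplication events can change the number of predicted bands without changing the set of band lengths.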

Another method used to type C. difficile is MLST (multilocus sequence typing), a scheme that compares the sequence data for seven conserved genes [40]. The two isolate genomes were assigned to Sequence Type (ST) 11, clade 5, which is consistent with previously typed isolates of R078 [19, 40, 41]. The C. difficile MLST tool also analysed additional key genes, such as toxins Toxin A, Toxin B and the CDT and also genes that encode for antibiotic resistance. The results confirmed both M120 and CD105HS27 have wild type toxin genes cdtAB and tcdB and a 39 bp deletion in tcdC which has been characteristic of R078 isolates from its early identification [14]. Furthermore, tetM, predicted to encode a ribosomal protection protein (CDM120_RS02595) carried on Tn6190 in M120 [34], is absent in CD105HS27, which has two copies of a variant tetM, that share 67% identity at the aa level to that in M120.

Mobile genetic element content of M120 and CD105HS27

Like other isolates, those from R078 have been found to carry different sets of MGEs which encode predicted virulence factors and antibiotic resistance [24, 25]. These include the conjugative transposons related to those in other strains of C. difficile; Tn6073 (CTn1-like), Tn6107 (CTn5-like) and CTn4 in the clinical R078 strain QCD-23 M63 [42], as well as those more distantly related, such as Tn6164 in the reference strain M120 [34]. Tn6164 is a composite MGE containing a prophage and has several regions that originate from different bacterial lineages [34]. This is considered likely to be a transposon as it can excise and circularise, and carries genes encoding products predicted to be involved in conjugation [34]. While Tn6164 is characteristically associated with R078 strains, not all R078 isolates carry it [34]. R078 isolates may also harbour Tn6190 (previously termed CTnCD3a [20]), a Tn916-related element that carries the tetracycline resistance gene tetM [42], as well as Tn6235, which carries aphA1, an aminoglycoside 3′-phosphotransferase suggested to confer streptomycin resistance [19]. M120 and CD105HS27 both have Tn6190, but, as described previously, M120 has Tn6164 whereas CD105HS27 does not. However, the environmental isolate does have a different large ~104 kbp element [22], now assigned as Tn6293. Encoded on Tn6293 are several genes with predicted functions that could potentially enhance cell survival and growth, including homologs of aadE (which confers aminoglycoside resistance [43]), a LexA repressor (involved in the SOS response regulation [44]) and 23S rRNA methyltransferase RlmN (that could impact on cellular growth [45]). It has predicted transposases and conjugation transfer genes as well as homologs of plasmid maintenance and replication protein encoding genes; parA and parB, and repA, suggesting this MGE is also a composite with several origins as determined for other C. difficile transposons, Tn9194 and Tn6103 [34, 42].
Interestingly, the amino acid sequence of AadE was 100% identical to that of plasmid-carried aadE genes in Campylobacter jejuni (YP_009079621) and Pediococcus acidilactici (YP_001965484), and is present in several Firmicutes sp. sequences from WGS projects, further supporting prior observations that this resistance can transmit between bacterial genera [46]. To determine the carriage of Tn6293 in C. difficile, its sequence was searched using BLASTn against C. difficile (taxid 1496) sequences. Homologous regions were found in the genomes of three of the seven isolates that are related to M120 (Additional file 2: Table S2); E1 and T5 (R126, human isolates) and NAP08 (R078, human isolate) [21]. To determine its potential origin, the nt sequence was searched against the NCBI nt/nr database. It has similarity to regions in Eubacterium and Ruminococcus spp. genomes. The shared nt sequence similarity is primarily located in genes whose predicted products are involved in genetic element mobilisation and maintenance functions. These include a serine recombinase (CD105HS27_00591), DNA binding and mobilization proteins (CD105HS27_00611 and CD105HS27_00612) and plasmid recombinase (CD105HS27_00634). Both Eubacterium sp. and Ruminococcus sp. belong to the same order as C. difficile, the Clostridiales, and the shared sequence similarity observed supports previous findings of MGEs being exchanged between these genera [25].

Both genomes carry a predicted R-type bacteriocin. R-type bacteriocins resemble phage tail-like particles (PTLPs) and have genes predicted to encode proteins involved in structural roles for tail assembly. However, they lack predicted capsid genes and thus are not complete virion particles. These bacteriocins, or PTLPs, have been observed in culture supernatants of diverse isolates [9, 47, 48], and have been used as typing tools or investigated as alternative therapeutics [49, 50]. Due to the specificity required of proteins that target the cell surface, obtaining sequence information from the genomes of clinically relevant strains could aid in using a synthetic biology approach for designer antimicrobials; this has been demonstrated for the bacteriocin carried in a R027 isolate [51], with subsequent genetic modification for enhancing its antimicrobial application [52].

It is not possible to conclude whether these strains have transferred from the environment to patients or vice versa from the comparisons we have performed here, which are based on a sample size of two. However, the putative origins of these CTns have been examined based on sequence homology. Tn6164 and Tn6293 are clearly distinct from one another, and from known elements in other bacterial species. For example, for Tn6164, similarity to other sequences is split over the length of the transposon into at least two major regions: the phage containing region is most closely related to a single Clostridium difficile genome, Z31 (CP013196.1), based on a nt identity of 93% covering 35% of its length. In the same region, the next most closely related elements are found in the complete genomes of Thermoanaerobacter spp. (CP002210 and CP000923.1) and a draft genome of Clostridium bornimense (GCA_000577895). Thermoanaerobacter strains were originally isolated from anaerobic enrichments of environmental samples from the subsurface. C. bornimense is a hydrogen producing Clostridium; this species does not have an associated history with human infections, but was isolated from a laboratory bioreactor [53]. The second region of the transposon has homology to Streptococcus and Anaerococcus spp. In contrast, Tn6293 shows sequence similarity in multiple regions across its full length to different bacterial genera including Ruminococcus, Clostridium and Eubacterium spp. It is interesting that the second region of homology in Tn6164 is to pathogenic species. However, as this is based on few sequences, it is not possible to state conclusively that this has been acquired while in clinics despite its absence from CD105HS27 (and thus infer that CD105HS27 has evolved outside of clinics). One possibility is that the two isolates have evolved in isolation.
SNP analysis has been used to track the transfer of strains across the world [54] and in different reservoirs [19, 54], with estimated mutation rates of 1–2 sites per year. The number of substitutions observed here (n = 85) therefore suggests that these two isolates have diverged from one another over some time. Increasing numbers of R078 genomes will aid in determining the movement of strains from clinics to the environment and vice versa, in addition to how these strains further evolve when in different reservoirs.
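The reasoning that links a SNP count to elapsed time can be made explicit with a back-of-envelope calculation. The Python sketch below is illustrative only and is not an estimate made in this study. It assumes a strict clock-like accumulation of substitutions, that the quoted rate of 1–2 sites per year applies per genome, and (via the hypothetical lineages parameter) that substitutions accumulate independently along both lineages since their common ancestor.

```python
def divergence_years(snps: int, rate_per_year: float, lineages: int = 2) -> float:
    """Rough time since divergence under a strict molecular clock.

    snps          -- pairwise single nucleotide differences
    rate_per_year -- substitutions per genome per year (assumed constant)
    lineages      -- 2 if both lineages accumulate changes independently,
                     1 if the rate is treated as a pairwise rate
    """
    if rate_per_year <= 0 or lineages <= 0:
        raise ValueError("rate_per_year and lineages must be positive")
    return snps / (rate_per_year * lineages)


# 85 pairwise differences at 1 and 2 substitutions per genome per year.
for rate in (1.0, 2.0):
    print(rate, divergence_years(85, rate))  # prints 42.5, then 21.25
```

Under these assumptions the 85 differences would correspond to roughly two to four decades of separate evolution, or about twice that if the rate is treated as pairwise (lineages=1). Such figures are highly sensitive to the assumed rate and should be read only as an order-of-magnitude guide.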

Methylome of R078 isolates

To establish genome-wide methylation patterns of the two isolates, the profiles for methylation modifications N4-methylcytosine (m4C) and N6-methyladenine (m6A) were analysed from the SMRT data [55]. Methylation (the addition of methyl groups to bases) in bacteria may play a regulatory role in terms of gene expression [56], but is also one way in which DNA elements can protect themselves against degradation by restriction modification systems in the host cell [57]. Both strains M120 and CD105HS27 show adenine methylation of the consensus sequence CAAAAA with high efficiency of target methylation (7484/7579, or 98.75% of sites in M120 and 7469/7559, or 98.8% in CD105HS27). This target specificity had been previously assigned to the N6-adenine methyltransferase named M.Cdi25 or Cdi630V (locus tag CD630_27580, protein Id YP_001089271.1) of strain CD630 [22] and is reported in the REBASE database [58]. The respective methyltransferases of M120 (CDM120_RS14295, WP_003422891.1) and CD105HS27 (CD105HS27_02520) are identical and show 98% identity (565/577) to the CD630 orthologue. Strain M120 showed signatures for an N4 modified cytosine in the ACGGC methylation target (398/414), and a consensus sequence CGGCNTGTGNNNNNNT was identified but with unknown modified base calls (12/13). In REBASE, the ACGGC target is assigned to two tandem methyltransferases of Tn6164, M1.CdiMORFAP (CDM120_RS02255, WP_041160334.1) and M2.CdiMORFAP (CDM120_RS02260, WP_041160335.1). No further modified base was detected in strain CD105HS27. The finding that the m4C GCCGT/ACGGC methylation pattern was absent in CD105HS27 may be explained by the absence of Tn6164 and of both of these methyltransferases. In contrast, both M120 and CD105HS27 encode CdiMORFEP, a homolog of M.CdiG46II (amino acid identity of 565/577 (98%)) which is predicted to recognise CAAAAA sites. Three further predicted methylases on Tn6164 are present in M120 [34] and absent from CD105HS27, as the latter lacks this mobile element.
While it was expected for the two Tn6164 m5C methyltransferases M.CdiMORFBP (CDM120_RS02360, WP_041160353.1) and M.CdiMORFCP (CDM120_RS02725, WP_041160386.1) to show no signature on the SMRT dataset, we would have expected to identify a signature for the putative m6A methyltransferase (CDM120_RS02520, WP_000662263.1). The fact that no additional adenine methylation pattern was detected could be due to one of many reasons including target identity of this enzyme and M.Cdi25/Cdi630V, lack of expression of the enzyme in CD105HS27 or inappropriate annotation of predicted CDSs.
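The methylation efficiencies quoted above are ratios of modified sites to all occurrences of a motif. As a minimal sketch of that arithmetic, the Python code below counts exact occurrences of a motif on both strands and converts a modified/total pair into a percentage. It is not the SMRT Portal motif-finding procedure. The function names are hypothetical, degenerate bases such as N are not supported, and the short test sequence is a toy example.

```python
import re


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def count_sites(genome: str, motif: str) -> int:
    """Count exact motif occurrences on both strands (overlaps included).

    Reverse-strand sites are found by searching the forward strand for the
    reverse complement of the motif.
    """
    genome = genome.upper()
    total = 0
    for pattern in (motif.upper(), reverse_complement(motif)):
        # A zero-width lookahead lets overlapping matches be counted.
        total += len(re.findall(f"(?={pattern})", genome))
    return total


def methylated_percent(modified: int, total: int) -> float:
    """Percentage of motif sites carrying the modification."""
    return 100.0 * modified / total


print(count_sites("CAAAAAGGTTTTTG", "CAAAAA"))   # prints 2 (one per strand)
print(round(methylated_percent(7484, 7579), 2))  # prints 98.75 (M120, CAAAAA)
print(round(methylated_percent(7469, 7559), 2))  # prints 98.81 (CD105HS27, CAAAAA)
print(round(methylated_percent(398, 414), 2))    # prints 96.14 (M120, ACGGC)
```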

Just as there are different sets of methylation genes functional in C. difficile, strains carry genes encoding multiple restriction enzymes [59]. Notably, although M120 and CD105HS27 are highly related, they share only core genome methylation systems, such as the adenine methylase described above and the McrBC system, which they also share with strain CD630. This is because the majority of methyltransferases are in Tn6164, which is absent from CD105HS27. In addition to methylation Restriction Modification (RM) systems, MGEs have other defence systems against super-infection [60]. Here, Tn6164 carries three putative methylase genes on the transposon region and two on the prophage region of the element. The two sequenced strains were also found to contain defence mechanisms to combat RM systems; notably, Tn6190 carries ardA, which encodes ArdA, an anti-restriction protein for type I restriction systems [61]. Whether this system is active remains to be determined, but evidently there are multiple mechanisms employed by MGEs in C. difficile to be maintained.

CRISPR/Cas system of M120 and CD105HS27

Immunity to phage infection can also be conferred via the CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats)/Cas system, which works as an RNA based interference against invading DNA elements [62], but may also act as regulatory machinery for other aspects of the cell biology and genome evolution [63]. The function of the CRISPR/Cas system depends on the action of CRISPR associated (Cas) proteins that are highly diverse in operons across prokaryotes, and ultimately involves the processing and matching of spacers to target DNA with its subsequent restriction [64]. It comprises arrays that have conserved direct repeat (DR) sequences that flank spacer sequences. Spacers are homologous to phage or plasmid sequences, as they have been incorporated into arrays following unsuccessful past invasions, and in this way they can provide information about past interactions with such elements [65].

In this study, six CRISPR arrays and three cassettes of Cas genes were identified in each genome. Two Cas gene operons belonged to the I-B/TNeap group and contained all gene components to be functionally complete [64], and the third set comprised cas6, cas7, cas5 and cas3, but lacked cas1 and cas2. Multiple cas sets within a single genome, of both complete and incomplete operons, have been described previously in C. difficile strains CD630 [66] and R20291, but it appears unusual that these two isolates have two complete yet distinct cassettes. The two complete sets are adjacent to CRISPR arrays CRISPR 4 and CRISPR 5.

The six CRISPR arrays are conserved between the two isolates. Five of the arrays have identical spacer contents with 17 (CRISPR 1), 44 (CRISPR 2), 13 (CRISPR 3), 32 (CRISPR 5) and 9 (CRISPR 6) spacers. The remaining array, CRISPR 4, has one more spacer in CD105HS27 than in M120, with 39 and 38 spacers, respectively (spacer number 12, indicated by an asterisk in Additional file 3: Table S5). Previously, we showed that spacers targeted C. difficile phages [66]. Here, we searched spacers from the six arrays against 20 C. difficile phage genomes (Fig. 3, Additional file 4: Table S3). Of the total 154 spacers present in both isolates, 19 spacers have at least one identical match to a phage sequence from 18 phages. Perfect matches were identified between spacers and phage sequences from all arrays, except CRISPR arrays 3 and 6. Spacers with matches were located throughout the arrays, but differed with regard to location and type of phage (Fig. 3). We focused on perfect matches as phages phiCDHM1, phiCDHM19, phiCDHM14 and phiCDHM13 do not produce lysis of either strain [22]. To identify matches for the remaining spacers and to a wider range of DNA sequences, we searched the viral and plasmid databases in CRISPRTarget [67], the metaviromic datasets publicly available on MetaVir [68] and C. difficile genomes (Additional file 4: Table S3 and Additional file 5: Table S4). We did not detect any perfect matches to the viromic datasets, but identified matches for spacers from all six CRISPR arrays to prophage and phage-like genes in the C. difficile bacterial genomes (Fig. 4, Additional file 3: Table S5). It has been found that CRISPR systems may also have regulatory roles in genomes [69]. To identify if there were spacers that matched to genomic sequence, we searched the genome of CD105HS27 and identified one perfect match for a spacer in CRISPR 6. The protospacer sequence is located in CD105HS27_02420, a gene encoding a putative carboxylase.
This does not have either of the previously identified CCT or CCA Protospacer Adjacent Motif (PAM) sequences [66] so whether this has a functional role is unknown.
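The search for perfect spacer matches and their adjacent motifs can be sketched as follows. This minimal Python example looks for exact matches of a spacer on both strands of a target sequence and reports the bases immediately 5' of each protospacer as a candidate PAM. It is not the CRISPRTarget implementation. The function names, the three-base PAM length, the convention of reading the PAM upstream on the strand that matches the spacer, and the toy sequences are all assumptions made for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_all(text: str, pattern: str) -> list:
    """Return the start index of every (possibly overlapping) exact match."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def find_protospacers(spacer: str, target: str, pam_len: int = 3) -> list:
    """Return (strand, start, pam) for every perfect spacer match.

    start is the leftmost forward-strand coordinate of the match (0-based).
    pam holds the pam_len bases immediately 5' of the protospacer on the
    matching strand, or an empty string if the match is too close to the end.
    """
    spacer = spacer.upper()
    results = []
    for strand, seq in (("+", target.upper()), ("-", reverse_complement(target))):
        for pos in find_all(seq, spacer):
            pam = seq[pos - pam_len:pos] if pos >= pam_len else ""
            if strand == "+":
                start = pos
            else:
                # Convert the reverse-strand index to a forward coordinate.
                start = len(seq) - pos - len(spacer)
            results.append((strand, start, pam))
    return results


# Toy example (hypothetical sequences): a protospacer preceded by CCT.
SPACER = "ACGTTGCAAGTC"
PHAGE = "GGCCT" + SPACER + "TTT"
print(find_protospacers(SPACER, PHAGE))  # prints [('+', 5, 'CCT')]
```

A candidate PAM reported in this way could then be compared with the CCT and CCA motifs identified previously [66] to judge whether a perfect match is likely to be functional.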

Fig. 3

CRISPR spacer content with perfect matches to C. difficile phages. Left. Positions of spacers for each array with matches to the 18 phages (key coloured according to groups of medium myoviruses (MMs), long tailed myoviruses (LTMs), small myoviruses (SMVs) and siphoviruses (SVs)). The arrays show clear differences in terms of protospacer content with spacers that match to multiple phages. Right. Histogram showing the matches to protospacers in phage genes encoding portal, terminase, tape measure (TMP), tail fiber, cell wall hydrolase, repressor, anti-repressor, DNA binding and hypothetical proteins in addition to those outside predicted CDSs with their respective frequencies, and the table below corresponds to the gene’s functional region in the phage genome, phage type and the consensus Protospacer Adjacent Motifs (PAMs) detected

Fig. 4

CRISPR spacer content with perfect matches to C. difficile isolate genomes. The spacer sequences from the 6 CRISPR arrays (on y axis). Protospacer locations (x axis) are shown in first column from perfect and imperfect matches for annotation (details in figure key). The next 53 columns contain perfect matches between spacers and corresponding C. difficile bacterial isolate sequences, coloured according to protospacer location (see key). The protospacer locations include those in conserved prophage genes. A total of 201 perfect matches were identified, with the spacer with most protospacers (n = 39) identified for CRISPR_2_17, in a phage protein of unknown function

In C. difficile, CRISPR arrays appear to undergo horizontal exchange between strains via their presence on MGEs, including prophages, plasmids and the C. difficile sigK intervening (skinCd) element [18, 66]. In the genome of C. cellulolyticum H10, two CRISPR arrays are proximal to a transposase gene, which suggests that recombination events could shift immunity profiles via the introduction of novel arrays with new spacer content [70]. Similarly, in M120 and CD105HS27, two of the arrays, CRISPR 1 and CRISPR 2, are in proximity to CDSs containing either integrase or transposase domains, which suggests past integration events. Whether these genes still function and these regions are mobile is not clear from annotation alone. However, these findings of arrays on MGEs and signatures of past integration events nearby suggest that arrays could move following genome insertion and excision events by a variety of mechanisms.

Conclusions

SMRT technology has been used to generate near complete genomes for two R078 strains, allowing the comparison of clinical and environmental isolates. The two genomes differ in chromosomal structure and number of ribosomal operons. Additionally, the two genomes differ in the carriage of two transposons, Tn6164 in M120 and Tn6293 in CD105HS27, which we suggest be termed putative conjugative transposons CTn6164 and CTn6293.

The majority of unique genes are carried on the two putative CTns and include predicted methylases. The methylome analysis for each genome suggests a vastly different methylation pattern with no consensus m4C motif in CD105HS27 detected. This likely impacts the immunity of each isolate to DNA elements including phages, and to the type of HGT that may occur for each. In contrast, their CRISPR/Cas systems are highly similar with only one spacer different between the two. Our findings support previous work that the CRISPR/Cas and RM systems are not mutually exclusive [71], and show this indeed appears to be the case in C. difficile.

Methods

Bacterial genomic DNA extraction

Bacterial genomic DNA (gDNA) extraction was performed using 1 ml overnight culture from a single colony grown in Brain Heart Infusion (BHI) broth (Oxoid, UK). DNA was extracted using a Qiagen GenomicTip 500/G kit (Qiagen, UK) following the manufacturer’s instructions. Pulsed Field Gel Electrophoresis was performed to assess gDNA degradation, with 100 µl of each sample separated on a 1% agarose gel (Manufacturer info) for 18 h at 6 V. Gels were stained with 10 µl of ethidium bromide and visualised using a UV G Box (Syngene). Sample gDNA quantity and quality were measured by Qubit assay on a Qubit fluorometer (Life Technologies, USA) according to the manufacturer’s instructions, and by measuring absorbance at 260 nm and 280 nm using a Nanodrop Spectrophotometer (Thermo Scientific, UK).

Genome sequencing and bioinformatics analysis

Genomic DNA sequencing using a SMRT Pacific Biosciences platform was performed at the Centre for Genomic Research, University of Liverpool. SMRTbell libraries were prepared by Margaret A. Hughes, with 3 SMRT cells used per library for sequencing. High quality genome assemblies were generated using HGAP (Hierarchical Genome Assembly Process) as part of the SMRT Portal, and methylation patterns were detected. Contig structure was assessed and plasmids were identified from dotplots generated using Gepard [72].

Genomes were visualised using Artemis Genome Browser [73]. Coverage was determined from alignment of the corrected reads to the final assembly using BWA-SW [74], with samtools used for indexing and conversion of file formats [75]. Coverage was assessed using Qualimap v.1.0 [76] and coverage plots were generated using the Artemis DNAplotter perl script [77]. Genome annotation was performed using PROKKA v1.7 [78], with a custom guide database containing proteins from the reference genome of M120 (accession NC_017174.1). RNA genes were predicted using RNAmmer v1.2 [79]. In silico ribotype profiles were predicted using the oligonucleotide sequences from Bidet et al. [80]. Shared gene content was identified with BLAST+ v2.2.28 using blast-all-v-all [81]. This publication made use of the Clostridium difficile Multi Locus Sequence Typing website (http://pubmlst.org/cdifficile/) developed by Keith Jolley and sited at the University of Oxford [82]. The characterised C. difficile CD630 CTns were used as a reference set for the identification of similar MGEs by BLASTn. Whole genome alignment and single nucleotide differences were generated using MAUVE v.2.4.0 [83]. Average nucleotide identity was calculated following the method described in [38], using the online web based tool which can be accessed at http://enve-omics.ce.gatech.edu/ani/ with parameters of min. length 700 bp, min. identity 70% and min. alignment 50. Dotplot analysis was generated using Gepard [72]. Genome comparison maps were generated using EasyFig v.2.2.2 [84]. Restriction modification systems were analysed using entries from REBASE (the Restriction Enzyme database) [58]. Prophage regions were predicted using PHAST [85]. CRISPR arrays were identified using CRISPRfinder [86], and the CRISPR content of the genomes was compared using CRISPRcompar [87].
Spacer sequences were searched against the GenBank-Phage, RefSeq-Plasmid, RefSeq-Viral and GenBank-Environmental databases (accessed 1/10/2015) using CRISPRTarget [67] in addition to virus metagenome datasets (Additional file 5: Table S4). Spacer protein targets were identified using a curated approach based on annotations on the NCBI genome browser at locations identified from the CRISPRTarget search. Where no annotation was available from perfect spacer-target matches on CRISPRTarget, consensus annotations from imperfect matches (up to 7 mismatches) were used.