Background

Crohn’s disease (CD), one of the major forms of inflammatory bowel disease (IBD), is a chronic generalized inflammation of the gastrointestinal tract. The histological picture of Crohn’s disease includes thickened submucosa, transmural inflammation, fissuring ulceration, and non-caseating granulomas. Common intestinal complications include strictures, abscesses, fistulas and, in the long run, colon cancer. Extraintestinal complications include arthritis, erythema nodosum, uveitis, and primary sclerosing cholangitis.

Many factors, both genetic and environmental, are thought to contribute to CD pathogenesis. It is generally accepted that CD results from an abnormal immune response of genetically susceptible individuals to an imbalance in the intestinal microbiota (reviewed in [1, 2]). Host susceptibility factors include intestinal barrier dysfunctions (decreased levels of antimicrobial peptides defensins, discontinuous tight junctions, and aberrant mucin assembly) and defects in innate immunity, autophagy, and phagocytosis. Polymorphisms in certain genes involved in these processes (e.g. NOD2, ATG16L1, and IRGM) have been reported to be associated with CD (reviewed in [1]). At least 71 susceptibility loci identified by genome-wide association studies are considered to be involved in the pathogenesis of Crohn’s disease [3].

Among the dysbioses seen in CD patients, a 10–100-fold increase in the abundance of Escherichia coli relative to healthy individuals is often observed [4,5,6,7,8], which has prompted several studies of E. coli isolated from these patients. The obtained strains were assigned to the adherent-invasive E. coli (AIEC) pathotype because of their ability to adhere to and invade intestinal epithelial cells [9, 10]. They are also able to survive and replicate within macrophages, and their intracellular replication is selectively favored by impaired autophagy [11]. In comparison with commensal E. coli, AIEC are more often resistant to antibiotics [12] and are strong biofilm producers [13]. Some of the isolated strains were shown to induce chronic inflammation when colonizing the mouse intestine [14, 15]. An adhesion-invasion model was proposed in which the interaction between the bacterial porin OmpC (outer membrane protein C) and the human CEACAM6 (carcinoembryonic antigen related cell adhesion molecule 6) receptor is a key step in pathogenesis [16].

Published observations on the phylogenetic diversity of the AIEC group are contradictory. In several independent studies that used various techniques, including genomic hybridization assays, RAPD-PCR, serotyping, and phylotyping by a multiplex PCR protocol, E. coli strains from different patients were shown to be highly heterogeneous and were assigned to several phylogroups (A, B1, B2, D) [5, 13, 16,17,18,19,20,21,22,23,24]. Ribotyping analysis, on the contrary, suggests that the majority of CD-associated E. coli (CDEC) have evolved from the same ancestral strain from phylogroup B2 [4], perhaps by acquisition of additional virulence factors via mobile element transfer or insertion of one or more pathogenicity islands into the bacterial chromosome [25, 26].

Results of whole-genome shotgun sequencing supported the single ancestor hypothesis. To date, four complete genomes of E. coli isolated from CD patients have been sequenced. They all belong to phylogroup B2. Although these isolates were obtained from independent clinics (NRG857c from Canada [27], LF82 from Germany [28], UM146 from France [29], and HM605 [30] from the United Kingdom), their genomes showed considerable sequence similarity and synteny (more than 99% sequence identity at 93–99% genome coverage). Several pathogenicity islands were observed in these genomes [27, 28], as well as plasmids homologous to those from Klebsiella and Salmonella [28, 29]. However, no comparative analysis of these plasmids was performed.

In a recent paper B2-phylogroup E. coli genomes from CD patients were compared with 25 strains from patients with ulcerative colitis (UC) and non-IBD, and the phylogenetic heterogeneity of AIEC and CD strains was established [31]. No gene common to all, or even a majority of AIEC was identified. Previously, genes encoding polyethylene glycol utilization and iron acquisition were reported to be overrepresented in AIEC relative to nonpathogenic E. coli [32].

In the present paper we report whole-genome sequences of 28 E. coli isolates from the ileum and feces of ten CD patients. The comparative analysis of these genomes and previously published strains revealed their high phylogenetic diversity as a group, high homogeneity within a single inflamed intestine, and specific genome features.

Methods

Patient selection

Patients were selected from two clinical centers (Central Scientific Institute of Gastroenterology and State Scientific Center of Coloproctology) in Moscow, Russian Federation, from 2012 to 2014.

Ten patients (seven males and three females, 23–47 years old, mean age 33) who met the eligibility criteria were enrolled in the study (Table 1). The inclusion criteria were the following: age above 18, endoscopically and radiologically diagnosed, and histologically confirmed Crohn’s disease. The exclusion criteria were signs of indeterminate colitis, infectious diseases, history of total colectomy, presence of stoma, and recent antibiotic treatment.

Table 1 Samples and patients

Diagnosis and treatment

Duration of the disease ranged from four months to eight years. Two patients had acute disease (less than six months), and eight patients had chronically relapsing disease. All patients had a confirmed diagnosis of Crohn’s disease at least three months before enrolment. Seven patients had ileocolitis (L3), two of them with perianal disease; two patients had ileitis (L1); and one patient had colitis (L2) [33]. At enrolment, three patients had clinically severe disease (Crohn’s Disease Activity Index, CDAI > 450), one patient had moderate disease (CDAI = 320), five had mild disease (CDAI 150–220), and one patient was in clinical remission (CDAI = 110) [34]. Most patients received immunosuppressive therapy, five of them with infliximab. Two patients received steroids. None of the patients received antibiotics at the time of enrolment or during the two months prior to it.

Study procedures

Three types of samples were collected for the purpose of this study. Fecal samples were collected prior to preparation for endoscopy. Bowel preparation was performed with polyethylene glycol solution. Patients underwent ileocolonoscopy at the clinical centers. During this procedure, samples of two types were collected: liquid content was aspirated from the ileum, and mucosal biopsies were taken with sterile biopsy forceps from the ileum, caecum, and sigmoid colon (inflamed tissue near ulcers).

Strains and cell culture

Isolation of E. coli was performed as follows. Liquid aspirates were diluted approximately 10^6-fold with sterile PBS (phosphate-buffered saline). Approximately 0.05 ml of feces was placed into 0.5 ml of sterile PBS and vortexed to homogeneity, and an aliquot was diluted approximately 10^6-fold. Biopsy samples were vortexed in 0.2 ml of sterile PBS. For all samples, 0.1 ml of the resulting liquid was spread onto Luria-Bertani agar plates. After overnight incubation at 37 °C, isolated colonies were identified with the Matrix Assisted Laser Desorption/Ionization (MALDI) Biotyper software (Bruker Daltonics, Germany) using the Microflex LT mass spectrometer (Bruker Daltonics, Germany). For DNA extraction, all E. coli strains were grown in Luria-Bertani broth at 37 °C with shaking (200 RPM) overnight and collected by centrifugation. Samples and corresponding E. coli isolates are listed in Table 1.

The testing of susceptibility to ampicillin/sulbactam, ceftriaxone, cefotaxime, ceftazidime, cefepime, imipenem, meropenem, gentamicin, levofloxacin, and ciprofloxacin (all from Bio-Rad, USA) was performed by the disc-diffusion method using the Mueller-Hinton agar plates. The E. coli strain ATCC 25922 was used as a control. Current CLSI and EUCAST criteria were used for interpretation.

Genome sequencing

Genomic DNA from individual cultures was extracted with the QIAamp DNA Mini Kit (Qiagen) according to the manufacturer’s protocol. Extracted DNA (100 ng for each sample) was sheared into 200–300 bp fragments with the Covaris S220 System (Covaris, Woburn, Massachusetts, USA). The barcoded shotgun library was prepared with the Ion Xpress™ Plus Fragment Library Kit (Life Technologies). Emulsion PCR was performed with the Ion PGM™ Template OT2 200 Kit (Life Technologies). DNA sequencing was performed on the Ion Torrent PGM (Life Technologies) with the Ion 318 chip and the Ion PGM™ Sequencing 200 Kit v2 (Life Technologies).

Genome assembly and annotation

Genomes were assembled using Mira 4.0 with standard parameters for the Ion technology.

To correct Ion Torrent homopolymer errors, which could result in assembly errors [35] and artificial frame shifts in coding sequences (CDS), our HomoHomo tool was applied (freely available at www.github.com/paraslonic/HomoHomo). In short, the method consists of the following steps: mapping reads to an assembly; searching for positions with indel polymorphisms in the mapped reads; a BLASTN [36] search of the assembly region around the positions found; and selecting the sequence variant that is consistent with both the best BLAST hit and the reads. This method reduces artificial indels in the assembly by a factor of about 2.5. The estimate is based on comparing assemblies of Ion Torrent reads before and after correction with reads from more accurate sequencing technologies such as Illumina, SOLiD, and Sanger.

To produce meta-assemblies, reads from different colonies obtained from the same patient were assembled together and processed as described above.

The obtained genome sequences were annotated using PROKKA 1.7 [37].

The draft genomes are available in GenBank with the following accession numbers: RCE01 (JUDV00000000), RCE02 (JUDW00000000), RCE03 (JUDX00000000), RCE04 (JUDY00000000), RCE05 (JWJZ00000000), RCE06 (JWKA00000000), RCE07 (JWKB00000000), RCE08 (LAXB00000000), RCE10 (LAXA00000000), RCE11 (LAWZ00000000).

Genome analysis

Several phylogenetic methods were used in order to verify the results. Two of them are based on multiple alignment and maximum likelihood: an assembly-free method that uses precalculated orthology groups (OGs), and a method based on de novo assembly, annotation, and OG construction.

Comparison of individual colonies by an all-vs-all method

First, we used a reference-free approach to examine relationships among the sequenced strains. Assemblies of individual colonies served as references. All-vs-all mapping was done with the bowtie2 tool. SNPs (single nucleotide polymorphisms) were called with the samtools mpileup tool [38] and filtered using vcftools [39] with a p-value threshold of 10^-5, a 90% frequency threshold, and a minimum coverage of four reads. The distance between samples was calculated as the SNP count divided by the length of the reference (the total number of nucleotides with at least 4× coverage). The Neighbour Joining tree was built from the distance matrix with the ape package for R [40]. Scripts are available at GitHub (https://github.com/paraslonic/Rakitina_etal_Crohn_paper/tree/master/snpSimilarity) [41].
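The distance defined above can be illustrated with a minimal Python sketch. It is not the authors' R code: the function names, the toy input values, and the choice to average the two mapping directions into one symmetric value are all assumptions made for illustration.

```python
from itertools import combinations


def snp_distance(snp_count, covered_length):
    """Return SNPs per nucleotide: SNP count / positions with >= 4x coverage."""
    if covered_length <= 0:
        raise ValueError("covered_length must be positive")
    return snp_count / covered_length


def distance_matrix(samples, snp_counts, covered_lengths):
    """Build a symmetric distance matrix from all-vs-all mapping results.

    snp_counts[(ref, query)] and covered_lengths[(ref, query)] hold the
    values obtained when reads of `query` are mapped to the assembly `ref`.
    Averaging the two mapping directions is an assumption of this sketch.
    """
    n = len(samples)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        a, b = samples[i], samples[j]
        d_ab = snp_distance(snp_counts[(a, b)], covered_lengths[(a, b)])
        d_ba = snp_distance(snp_counts[(b, a)], covered_lengths[(b, a)])
        matrix[i][j] = matrix[j][i] = (d_ab + d_ba) / 2
    return matrix


# Toy values for two hypothetical colonies (not data from the study).
samples = ["colony_1", "colony_2"]
snp_counts = {("colony_1", "colony_2"): 150, ("colony_2", "colony_1"): 170}
covered_lengths = {
    ("colony_1", "colony_2"): 4_000_000,
    ("colony_2", "colony_1"): 4_000_000,
}
matrix = distance_matrix(samples, snp_counts, covered_lengths)
```

A matrix of this shape is the kind of input a neighbour-joining implementation expects.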

Phylogenetic analysis by an assembly-free method

To assign the sequenced isolates to E. coli phylogroups, orthology groups (OGs) from 32 phylogenetically diverse E. coli and Shigella strains were taken from [42]. Alignments of proteins within each universal group (OGs with single-copy genes present in all analyzed genomes) were produced with ClustalW version 2.1 [43]. Consensus sequences were generated from the resulting alignments with the EMBOSS package ver. 6.6.0 [44]. These consensus sequences were then used as a reference for read mapping with bowtie2 ver. 2.1.0 [45]. The reads from each isolate were mapped individually, and the consensus for each OG for a particular isolate was generated with samtools [32]. The resulting consensus sequences were added to the OGs, and the groups were realigned. Alignments of all universal OGs were concatenated, all columns with gaps were removed, and the final alignment was used to construct a phylogenetic tree with PhyML v. 3.0 [46] (with 100 bootstrap replicates and the tlr optimization parameter). Previously sequenced CD-associated strains, the uropathogenic (UPEC) strain JJ1886, E. albertii strain KF1 (GenBank ID: CP007025), and E. fergusonii strain ECD227 (GenBank ID: CM001142) were added in a similar manner, but nucleotide sequences of genes were used instead of reads.
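The concatenation and gap-removal step can be sketched in Python as follows. This is an illustrative sketch only: the function names and the dictionary representation of an alignment (strain name mapped to aligned sequence) are assumptions, not part of the published pipeline.

```python
def concatenate_alignments(alignments):
    """Concatenate per-OG alignments strain by strain.

    `alignments` is a list of dicts mapping strain name -> aligned sequence.
    Every alignment must contain the same strains and equal-length rows.
    """
    strains = sorted(alignments[0])
    concatenated = {strain: "" for strain in strains}
    for alignment in alignments:
        if set(alignment) != set(strains):
            raise ValueError("all alignments must contain the same strains")
        if len({len(seq) for seq in alignment.values()}) != 1:
            raise ValueError("rows of an alignment must have equal length")
        for strain in strains:
            concatenated[strain] += alignment[strain]
    return concatenated


def drop_gap_columns(alignment, gap="-"):
    """Remove every column in which at least one strain has a gap."""
    strains = list(alignment)
    columns = zip(*(alignment[strain] for strain in strains))
    kept = [column for column in columns if gap not in column]
    return {
        strain: "".join(column[i] for column in kept)
        for i, strain in enumerate(strains)
    }


# Two toy OG alignments for two hypothetical strains.
og1 = {"strain_A": "MK-L", "strain_B": "MKAL"}
og2 = {"strain_A": "GT", "strain_B": "G-"}
final = drop_gap_columns(concatenate_alignments([og1, og2]))
```

In the toy case the third and sixth columns contain a gap, so both strains end up with the four-column sequence MKLG.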

Phylogenetic analysis based on assemblies and de-novo OG construction

Additionally, we evaluated whether the E. coli isolated in the present study arose from strains with similar lifestyles, e.g. commensal or pathogenic. For this purpose a larger ML phylogenetic tree was built without bootstraps. E. coli genome sequences obtained in our experiments were compared with all available complete and some unfinished E. coli genomes from GenBank. Only unfinished genomes that were top BLASTN hits for each CD-associated isolate were selected (the complete list is in Additional file 1). All selected genomes were assigned to one of the following groups: Crohn, genomes sequenced in this study; CrohnLit, publicly available genomes associated with CD [27,28,29,30]; Non-pathogenic, commensal and laboratory cultivated non-pathogenic strains; Pathogenic, strains associated with diseases other than CD; and Other, with no reliable phenotype information. To avoid artificial differences resulting from different annotation pipelines, genomes from GenBank were reannotated with PROKKA 1.7 [37]. OGs were obtained using the OrthoFinder software [47, 48] with default parameters. Universal groups were selected, and OGs with large gene length variation (more than 80% of the median length) were filtered out. Nucleotide sequences of genes from the selected OGs were aligned with ClustalW [39]. Aligned sequences were concatenated by strain, and a Maximum Likelihood (ML) tree was built using the dnaml tool from the EMBOSS package [44]. All scripts for tree construction from OGs are available at GitHub (https://github.com/paraslonic/Rakitina_etal_Crohn_paper/tree/master/phylogeny) [41].
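The length-variation filter can be sketched as below. The text does not spell out how variation is measured, so this Python sketch assumes it is the range of gene lengths (longest minus shortest) divided by the median length; that definition, the function names, and the toy lengths are assumptions.

```python
from statistics import median


def length_variation(gene_lengths):
    """Range of gene lengths relative to the median (assumed definition)."""
    return (max(gene_lengths) - min(gene_lengths)) / median(gene_lengths)


def filter_ogs(og_gene_lengths, max_variation=0.8):
    """Keep OGs whose length variation does not exceed `max_variation`."""
    return {
        og: lengths
        for og, lengths in og_gene_lengths.items()
        if length_variation(lengths) <= max_variation
    }


# Hypothetical OGs with gene lengths in nucleotides.
og_gene_lengths = {
    "OG_uniform": [900, 1000, 1100],   # variation 0.2, kept
    "OG_variable": [300, 1000, 1200],  # variation 0.9, filtered out
}
kept = filter_ogs(og_gene_lengths)
```

Filtering before alignment keeps truncated or fused gene predictions from introducing long gapped regions into the concatenated alignment.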

Multilocus sequence typing (MLST) characterizes isolates of microbial species using the DNA sequences of internal fragments of multiple housekeeping genes [49]. MLST groups were assigned with the web service at mlst.warwick.ac.uk/mlst/dbs/Ecoli.

Comparison of the gene and domain content

Principal component analysis of PFAM domains and families was performed in order to find Crohn-enriched genes and domains. For domain and domain-family annotation we applied the pfam_scan.pl v. 1.5 script [50] to annotated proteins from all strains. Annotation results were combined and binarized. For each strain a pandomain profile was obtained, defined as the vector of presence/absence values of each domain in the studied genome. The length of this vector is the number of domains present in at least one strain. Bray-Curtis similarities were calculated and used to build a multidimensional scaling (MDS) plot with a custom R script available at GitHub [41] (https://github.com/paraslonic/Rakitina_etal_Crohn_paper/tree/master/pfamProfiles).
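The pandomain profile and the Bray-Curtis comparison can be illustrated with a short Python sketch. It is not the authors' R script; the function names and the Pfam-style identifiers are hypothetical. Similarity is taken here as one minus the Bray-Curtis dissimilarity.

```python
def pandomain_profiles(domains_by_strain):
    """Turn per-strain domain sets into binary presence/absence vectors.

    The vector length equals the number of domains seen in at least one strain.
    """
    all_domains = sorted(set().union(*domains_by_strain.values()))
    profiles = {
        strain: [1 if domain in domains else 0 for domain in all_domains]
        for strain, domains in domains_by_strain.items()
    }
    return all_domains, profiles


def bray_curtis_dissimilarity(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


def bray_curtis_similarity(u, v):
    """Similarity taken as 1 - dissimilarity."""
    return 1.0 - bray_curtis_dissimilarity(u, v)


# Hypothetical domain content of two strains.
domains_by_strain = {
    "strain_X": {"PF00001", "PF00002", "PF00003"},
    "strain_Y": {"PF00002", "PF00003", "PF00004"},
}
all_domains, profiles = pandomain_profiles(domains_by_strain)
similarity = bray_curtis_similarity(profiles["strain_X"], profiles["strain_Y"])
```

For binary vectors this measure depends only on how many domains are shared and how many are unique to each strain; the two toy strains share two of four domains, giving a similarity of 2/3.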

To identify over- or under-represented OGs in certain groups of strains, the two-way Fisher test was used separately for each domain or OG. The Crohn group was compared with the Commensal group. The Crohn group contained 10 assemblies from the 10 patients involved in the current study (multiple genomes from one patient assembled together) and 17 previously published genomes [27,28,29,30,31]. The Commensal group included only strains isolated from healthy individuals [51,52,53,54] (Additional file 1 (B)). For the OG content analysis, E. coli genomes were reannotated. The Holm method was applied to adjust for multiple comparisons [55]. The retention index, which is an indicator of consistency between a feature (i.e. the OG composition) and a tree, was calculated for each domain based on the large ML tree (see Phylogenetic analysis based on assemblies and de-novo OG construction) using the phangorn package for R. Functions were assigned to OGs and domains using the PFAM, Uniprot, and KEGG databases [50, 56, 57].
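The per-OG test and the multiple-testing adjustment can be sketched in pure Python. The sketch assumes that the "two-way" test is the two-sided Fisher's exact test on a 2×2 table of group (Crohn or Commensal) against OG presence or absence; the function names and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups, columns are OG present/absent. The p-value sums the
    probabilities of all tables with the same margins that are no more
    likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def probability(x):
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    observed = probability(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        probability(x)
        for x in range(low, high + 1)
        if probability(x) <= observed * (1 + 1e-9)  # tolerance for rounding
    )
    return min(p_value, 1.0)


def holm_adjust(p_values):
    """Holm step-down adjustment; returns adjusted p-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[index]))
        adjusted[index] = running_max
    return adjusted


# Hypothetical OG: present in 20 of 27 Crohn and 5 of 24 Commensal genomes.
p_example = fisher_exact_two_sided(20, 7, 5, 19)
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

Holm's procedure multiplies the smallest p-value by the number of tests, the next smallest by one fewer, and so on, while keeping the adjusted values non-decreasing.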

Detection of plasmids

A contig was considered a candidate plasmid if it had no links with any other contigs of the same assembly (no reads were mapped both to an edge of this contig and to an edge of another contig) and its coverage was at least twice as high as the average coverage of the genome. All candidate contigs were then aligned with blastn against a database containing the results of the query “plasmid[title]” from the NCBI nucleotide database. Contigs with at least 80% nucleotide identity and 75% length coverage of a reference plasmid sequence were considered potential plasmid contigs.
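The two selection rules above can be written as simple predicates. The following Python sketch is illustrative only: the Contig record, its field names, and the example values are hypothetical, and the thresholds are those stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    coverage: float       # mean read coverage of the contig
    linked_contigs: int   # other contigs sharing edge-spanning reads


def is_candidate_plasmid(contig, mean_genome_coverage):
    """No links to other contigs and coverage at least twice the genome mean."""
    return (
        contig.linked_contigs == 0
        and contig.coverage >= 2 * mean_genome_coverage
    )


def is_potential_plasmid_hit(identity_pct, reference_coverage_pct):
    """blastn hit with >= 80% identity covering >= 75% of a reference plasmid."""
    return identity_pct >= 80 and reference_coverage_pct >= 75


# Hypothetical contigs from one assembly with a mean genome coverage of 30x.
contigs = [
    Contig("contig_12", coverage=95.0, linked_contigs=0),
    Contig("contig_3", coverage=31.0, linked_contigs=2),
    Contig("contig_40", coverage=45.0, linked_contigs=0),
]
candidates = [c.name for c in contigs if is_candidate_plasmid(c, 30.0)]
```

The coverage rule reflects the expectation that a plasmid present in several copies per cell is sequenced more deeply than the chromosome.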

Candidate plasmid contigs were realigned with plasmid sequences using Mauve 2.4.0 [58] and visualized with the genoPlotR R package. In addition, the presence of a particular plasmid was identified by read mapping. Reads from each of the 28 isolates that were left after mapping to universal OGs were mapped with bowtie2 ver. 2.1.0 (local alignments) to the studied plasmids: plLF82 (NC_011917) and pJJ1886_1–5 (NC_022661, NC_022649, NC_022662, NC_022650, NC_022651). The per-nucleotide coverage was extracted with bedtools ver. 2.18.2.

Bacteriocin production test

CDEC strains were tested for bacteriocin production by the method from [59] with minor modifications. Bacterial cells were used to inoculate liquid TY medium containing tryptone 8 g/L, yeast extract 5 g/L, and sodium chloride 5 g/L. The 1.5% TY agar plates were subsequently inoculated by a needle stab with fresh broth cultures, and the plates were incubated at 37 °C for 48 h. Bacteria were killed using chloroform vapours for 10 min. Each plate was overlaid with 5 ml of warm soft agar (0.7% TY agar, w/v) containing 10^7 cells/mL of an indicator strain (K12 or MG1655). The plates were then incubated at 37 °C overnight. The assessment of bacteriocin production was based on the diameter and intensity of the growth inhibition or lysis zone. The indicator strains were obtained from an in-house collection of strains. Five minutes of ultraviolet-C irradiation was used as an inducer of bacteriocin expression.

The intensity of inhibition was evaluated as “strong” (a clear lysis zone), or “weak” (an opaque zone, indicating some growth inhibition).

Phage resistance test

E. coli strains were tested for resistance to phages (virulent, temperate, Salmonella-specific, and male-specific) by the cross-streak and spot-test methods as described in [60]. All phages were taken from the collection of the Laboratory of Bacterial Genetics (Gamaleya Institute for Epidemiology and Microbiology).

Results

E. coli strains cultivated from the inflamed intestine of a CD patient are closely related

Genome assemblies were obtained for 28 E. coli isolates from 10 Crohn’s disease patients (Table 1). SNP analysis of these genomic sequences (all-vs-all method) revealed that bacteria isolated from one patient tend to cluster together, even when they are isolated from different parts of the intestine, as in the case of patients RCE01, RCE03, and RCE04 (Fig. 1, Table 1, Additional files 2 and 3). The number of SNPs within a patient was negligible (less than 200) compared to the interpatient diversity (on average more than 28,000). Alignment of genome sequences from different parts of the intestine of one patient revealed some deletions (usually a deletion of one of the smaller contigs, probably a plasmid) and minor heterogeneity (Mauve analysis, Additional file 4A). This suggested that the whole inflamed intestine of a CD patient is colonized by a single strain of E. coli. Based on this conclusion, we were able to merge the E. coli genomes from each patient into a meta-assembly and use the latter to compare strains from different patients.

Fig. 1

Genomic similarity of E. coli from individual colonies. Heatmap colors represent the number of SNPs per nucleotide (all-vs-all method). Lighter colors mean higher sequence similarity

CDEC is a polyphyletic group

The phylogenetic analysis by two methods (SNP analysis of de novo genome assemblies and alignment of concatenated conserved proteins from 653 universal orthologous groups) shows that isolates from different patients fall into different phylogroups of E. coli (Figs. 2 and 3, Additional file 4B). E. coli from patients RCE04, RCE07, RCE11, and RCE01 belong to phylogroup A, the RCE02 isolate to phylogroup B1, the RCE05 isolate is close to phylogroup D, while RCE06 and RCE10 are placed in phylogroup B2 along with previously published genomes of CDEC strains. Isolates from patient RCE03 were shown to be more distant from E. coli BL21 than E. fergusonii is. However, like all E. coli, they were resistant to lphi7S1, a phage to which all S. typhimurium are susceptible and all E. coli are resistant (see below and Additional file 5). The set of RCE03 genes was similar to that of other E. coli (see the comparison of orthologous gene groups below). The disease symptoms and clinical course of patient RCE03 were also quite typical for CD (Additional file 2).

Fig. 2

Phylogenetic analysis of E. coli strains. The phylogenetic tree of all universal single-copy genes was constructed by the maximum-likelihood algorithm with 100 bootstrap replicates for E. coli isolates from ten patients (this study), 32 E. coli and Shigella strains from [42] from phylogroups A (yellow), B1 (light green), B2 (green), D (cyan), E (blue), S (violet), previously published [27,28,29,30] CD-associated strains (red), and uropathogenic strain JJ1886. Escherichia albertii KF1 and Escherichia fergusonii ECD227 were used as outgroups

Fig. 3

Genomic comparison of 14 CD-associated strains with pathogenic and non-pathogenic E. coli genomes. Maximum likelihood unrooted tree is based on the core genes. Strains from this study are colored black (here and on the images below indicated as Crohn); previously published CDEC (here and on the images below indicated as CrohnLit) are grey; nonpathogenic, green; pathogenic, pink; strains of undetermined pathogenicity, white. Derivatives of one laboratory strain are merged. Strains containing a plasmid homologous to plLF82 and pJJ1886 are indicated. Pdu operons from LF82 are marked with red

Hence CDEC do not form a single phylogenetic group sharing a common ancestry. The same conclusion could be drawn from the MLST typing (Additional file 6). The CD-associated strains from ten patients sequenced here fall into nine different STs including one unknown (RCE04 genome). Only ST131 has two representatives (RCE06, RCE10). This ST has been described as the fastest spreading among the B2 group [61].

At the same time, some CDEC isolates from independent sources are very similar (Figs. 2 and 3). Two classic AIEC strains from France (LF82) and Germany (O83:H1) share more than 99% sequence identity (chromosome coverage 98%) [27, 28]. Here, BLASTn alignment of the chromosome sequences of two strains isolated in different clinics (patients RCE06 and RCE10) revealed a similarly high level of identity: 99% sequence identity at 94% chromosome coverage. A weaker but still pronounced similarity is observed between strains RCE07 and RCE11 (99% sequence identity at 92% chromosome coverage). Notably, eight of the ten examined CDEC strains appeared to be phylogenetically closest to pathogenic E. coli (Additional file 2).

CDEC genomes contain plasmids from pathogenic strains

Bacterial plasmids often carry genes associated with pathogenicity. To search for candidate plasmid contigs, we analyzed individual CDEC isolates independently. In 24 assemblies, one to three plasmids of various lengths (5–100 kb) and origins were detected. Some of them had plasmids of pathogenic bacteria, e.g. uropathogenic E. coli and Salmonella, as the closest homologs, but most had high sequence identity (more than 80% coverage, 95–99% similarity) with genomes of commensal E. coli isolated from healthy individuals (Additional file 2). Two plasmids were found to be specific for CDEC strains.

Plasmid contigs identified in isolates from three patients (RCE01, RCE02, RCE04) (Fig. 4a) were highly similar to plasmid plLF82 of the previously published reference AIEC strain LF82 [28] (Additional files 2 and 7A). Regions of plLF82 that are common to all meta-assemblies contain 99 CDS. These are mostly phage proteins, proteins involved in DNA maintenance and conjugation, and possible virulence determinants such as enterotoxins, outer membrane proteins, resistance proteins, etc. (Additional file 7B).

Fig. 4

Full-length alignment of plasmids shared by CDEC strains from the present study: E. coli LF82 plasmid (a), and JJ1886 plasmid 4 (b). The first row in each case represents the plasmid map; other rows show homologous regions and rearrangements (MAUVE 2.4.0, default parameters) between the plasmid of interest and meta-assemblies for specific patients. Each homologous region is shown by a specific color

A candidate plasmid detected in isolates from patient RCE02 and in three previously published CDEC genomes [31] was found to be similar to plasmid pJJ1886_4 from the fatal urosepsis E. coli isolate JJ1886 (Fig. 4b, Additional files 2 and 7A). Genomes of two other isolates (RCE06 and RCE10) were closely related to the genome of the JJ1886 strain [30, 62]. Predicted functions of proteins shared by the CD isolates and the JJ1886 UPEC plasmid are plasmid DNA maintenance, type IV secretion, and resistance (Additional files 2 and 7C).

Plasmids plLF82 and pJJ1886_4 have no homologs in commensal or non-pathogenic E. coli (Additional file 7D). Plasmids with considerable similarity exist in Yersinia pestis (plLF82), multidrug-resistant E. coli from hospitals (pJJ1886_4), Salmonella enterica, and Klebsiella pneumoniae (both) (Additional file 7E, F). plLF82 has been suggested to be acquired by E. coli via horizontal gene transfer from Yersinia or Salmonella [28].

No sequence similarity was observed between plasmids plLF82 and pJJ1886_4, but functional analysis revealed some common functions, such as plasmid DNA maintenance and conjugation, and a few enterotoxins, outer membrane proteins, and multidrug-resistance proteins (Additional file 7B,C).

There are no domains or genes found only in CDEC genomes

In order to identify potential virulence domains, we compared CDEC with other pathogenic and non-pathogenic E. coli. The principal component analysis of PFAM domains and families shows that CDEC do not cluster together (Fig. 5) and are scattered among pathogenic and non-pathogenic strains. Thus, CDEC as a group cannot be attributed to either pathogenic or non-pathogenic E. coli on the basis of their pandomain profile.

Fig. 5

Multidimensional scaling plot of distances between the PFAM-domain content in CDEC, pathogenic, non-pathogenic, and commensal E. coli. The colors of strains are as in Fig. 3

In order to identify genes that could influence CDEC virulence, we compared the protein composition of CDEC (27 genomes) with that of commensal E. coli strains isolated from healthy individuals (24 genomes) [29, 30, 51,52,53,54]. The complete list of strains is given in Additional file 1. In total, 143 orthologous groups (OGs) are overrepresented in the CDEC group, and 237 OGs are underrepresented (Fisher test p-value ≤0.05, Additional file 8). No difference was significant after adjustment for multiple testing (the Holm correction), even for OGs that were 10 times more frequent in CDEC genomes (Additional file 8). This can be partially explained by the small size of the analyzed E. coli dataset (51 genomes) compared to the number of OGs considered (11,886). Hereinafter, those OGs are referred to as enriched in CDEC or commensal E. coli genomes.
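The effect of the number of tests on the Holm correction can be made concrete with a small calculation. In the first step of Holm's procedure the smallest raw p-value is compared with the significance level divided by the number of tests. The Python sketch below assumes a significance level of 0.05 and one test per OG; the function name is hypothetical.

```python
def holm_first_step_threshold(alpha, n_tests):
    """Threshold the smallest raw p-value must not exceed under Holm."""
    return alpha / n_tests


# 11,886 OGs are tested, so the smallest p-value has to fall below
# roughly 4.2e-06 for any OG to remain significant after adjustment.
threshold = holm_first_step_threshold(0.05, 11886)
```

If the smallest raw p-value exceeds this threshold, Holm's procedure rejects no hypothesis at all.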

OGs enriched in CDEC genomes tend to form operons

Most genes from CDEC-enriched or commensal-enriched OGs are located on the chromosome; moreover, 156 of them form operons with defined functions (Figs. 6 and 7). In the reference CDEC strain LF82 all CD-enriched operons are present, while the commensal-enriched operons are not. Six operons from LF82, namely glyoxylate metabolism (the gcx part of the ptn-cgl-gcx-ibe operon), capsular assembly PAI IV LF82, iron uptake operon I, sorbose uptake and utilization, prophage I LF82, and the propanediol utilization operon, showed a number of enriched OGs above the random probability level (Additional file 9, Fig. 7); their enrichment in CDEC genomes is therefore considered valid. Genes of CD-specific plasmids did not pass Fisher’s test.

Fig. 6

Functions of overrepresented OGs (Fisher’s test p-value <0.05 prior to Holm’s correction). The number of overrepresented OGs with a given function is shown on the horizontal axis for commensal strains (grey, left panel) and CDEC (yellow, right panel)

Fig. 7

Operons and gene groups enriched in CDEC (yellow) and commensal E. coli (grey) (Fisher test p-value <0.05 prior to Holm’s correction). OGs form horizontal rows, strains form vertical columns. Genomes with similar OG patterns were clustered together using a custom R script (see Methods) that is available in the GitHub repository https://github.com/paraslonic/Rakitina_etal_Crohn_paper/tree/master/ogEnrichment [41]. Phylogroups of strains are indicated (A, B2, B1, D)

OGs overrepresented in CDEC are involved in metabolism, horizontal gene transfer (HGT), and virulence (Fig. 6).

OGs with functions associated with metabolism are mainly enriched in commensal strains (aromatic compound degradation, fatty acid biosynthesis, and glycerolipid metabolism). The only metabolic function of CDEC-enriched OGs was utilization of sugar alcohols (propanediol, galactitol, glycerol). This function in CDEC is represented by the propanediol (15 genes) and galactitol (7 genes) utilization operons (Fig. 7).

Enrichment in OGs associated with HGT was previously reported to be characteristic of pathogenic strains, leading to accumulation of pathogenic genes [49]. In our comparison, however, OGs with such HGT functions as “transposases” (transposon proteins) and “foreign DNA transfer” were enriched in commensal strains (Fig. 6). OGs with the function of “foreign DNA resistance” were also enriched in the commensal group, due to the CRISPR-Cas locus (Figs. 6 and 7). The main difference observed in the HGT category is the presence of distinct prophages: Mu-like prophages tend to occur in commensal strains, while lambda-like prophage I from LF82 is specific for CDEC (Figs. 6 and 7). However, the phage resistance test of CDEC revealed that the presence or absence of a particular prophage in a genome cannot be directly interpreted as evidence of the strain’s sensitivity (or resistance) to this phage (Additional file 5).

OGs involved in carbohydrate metabolism and uptake are equally represented among CD-enriched and commensal-enriched OGs (Fig. 6). Most of the CD-enriched OGs are involved in invasion rather than metabolism (see below).

OGs responsible for adhesion-invasion are more common in commensal E. coli (Fig. 6), and are represented by the type III secretion system locus. At the same time, in CDEC this function is represented by the GimA island, containing three carbohydrate and glycerol metabolism operons (ptn, cgl and gcx) and one invasion ibe operon (Fig. 7). This island was first identified in meningitis-causing E. coli and proved to be responsible for carbon-source induced invasion of the blood-brain barrier [63].

In the commensal group, enriched OGs associated with toxins are represented by the microcin operon, while in the CD group they are represented by one gene from a type II toxin-antitoxin system.

Other potentially pathogenic functions enriched in CDEC are iron uptake (Fig. 6), represented by the chu operon and enterobactin gene clusters (iron uptake operons I and II, Fig. 7), and lipid A biosynthesis (3 separate OGs).

OGs encoding membrane proteins, fimbrial proteins and transporters, as well as those involved in cell wall/envelope assembly, are present in both the commensal-enriched and CD-enriched groups (Fig. 6).
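To make the notion of an OG being "enriched" in one group concrete, the following is a minimal sketch of a one-sided Fisher's exact test on presence/absence counts. It is an illustrative assumption, not a reproduction of the enrichment criterion actually used in this study; the function name and the example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(cd_with: int, cd_total: int,
                       comm_with: int, comm_total: int) -> float:
    """One-sided Fisher's exact test (hypothetical helper, not the study's method).

    Returns the probability of seeing at least `cd_with` carriers of an OG
    among `cd_total` CD-associated genomes, given that the OG is carried by
    `cd_with + comm_with` genomes out of `cd_total + comm_total` overall.
    """
    total = cd_total + comm_total      # all genomes compared
    carriers = cd_with + comm_with     # genomes that contain the OG
    max_in_cd = min(cd_total, carriers)
    denominator = comb(total, cd_total)
    # Sum the hypergeometric probabilities of the observed and more extreme tables.
    tail = sum(
        comb(carriers, x) * comb(total - carriers, cd_total - x)
        for x in range(cd_with, max_in_cd + 1)
    )
    return tail / denominator


if __name__ == "__main__":
    # Hypothetical counts: OG found in 3 of 4 CD genomes and 1 of 4 commensal genomes.
    print(enrichment_p_value(3, 4, 1, 4))
```

With many OGs tested at once, a multiple-testing correction would normally be applied to the resulting p-values; that step is omitted here for brevity.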

CDEC are resistant to various antibiotics

The antibiotic susceptibility test confirmed interpatient heterogeneity (Additional file 10). The tested E. coli strains displayed differing resistance phenotypes. Isolates recovered from patients RCE01 and RCE03 were pan-susceptible. Isolates from three other patients (RCE04, RCE05 and RCE06) were resistant to three or more antibiotics and were therefore classified as multidrug-resistant. All studied isolates were susceptible to ampicillin and carbapenems (imipenem and meropenem).
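The classification rule used in this paragraph (no resistance means pan-susceptible; resistance to three or more antibiotics means multidrug-resistant) can be written as a short sketch. The isolate and antibiotic names below are hypothetical placeholders, not the profiles reported in Additional file 10, and the intermediate label "resistant" is an assumption introduced for isolates that fall between the two categories.

```python
from typing import Dict, Set

# Threshold taken from the text: "resistant to three or more antibiotics".
MDR_THRESHOLD = 3


def classify_isolate(resistant_to: Set[str], threshold: int = MDR_THRESHOLD) -> str:
    """Label an isolate from the set of antibiotics it resists (hypothetical helper)."""
    if not resistant_to:
        return "pan-susceptible"
    if len(resistant_to) >= threshold:
        return "multidrug-resistant"
    # Assumed intermediate category; the text does not name it.
    return "resistant"


if __name__ == "__main__":
    # Placeholder profiles for illustration only.
    profiles: Dict[str, Set[str]] = {
        "isolate_A": set(),
        "isolate_B": {"drug_1"},
        "isolate_C": {"drug_1", "drug_2", "drug_3"},
    }
    for name, resistances in profiles.items():
        print(name, classify_isolate(resistances))
```

Counting individual antibiotics follows the wording of the text; stricter definitions that count antimicrobial classes rather than single drugs would require a drug-to-class mapping.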

CDEC produce bacteriocins, inhibiting the growth of other E. coli strains

Four CDEC strains (isolates from RCE04, RCE06, RCE10 and RCE11) were tested for bacteriocin production by the method of Kohoutova [59] with slight modifications. All strains showed bactericidal effects on indicator strains (Additional file 11). RCE04 had the weakest bactericidal ability, showing only a weak effect on the most susceptible indicator. Notably, the RCE04 strain was isolated from both the ileal lumen and a caecal biopsy (six isolates altogether) of one patient, suggesting that it had colonized the whole intestine. One may conclude either that bacteriocins are not necessary for E. coli to dominate the intestine, or that expression of the RCE04 bacteriocin was not induced under the cultivation conditions.

Discussion

Several studies have attempted to establish whether the CD-affected intestine is colonized by a single strain or by multiple strains of E. coli. Indeed, different strains could reside in mucosa and lumen, within lesions and in non-affected sites (reviewed in [2]). Our analysis shows that complete genome sequences obtained from a given patient have a very low SNP rate, confirming the genetic homogeneity of E. coli within the same intestine. Even the genomes of E. coli from caecum biopsy, ileum lumen and feces (patient RCE01) demonstrate high similarity, indicating that all parts of the inflamed intestine are colonized by a single strain.

Since the CDEC group was defined, its phylogeny has been debated. It has been suggested [24] that this group might have evolved from a common commensal ancestor that became pathogenic by acquiring virulence factors via horizontal gene transfer from related pathogenic organisms (Klebsiella, Shigella and Yersinia) [28]. Because of that, recent studies concentrate mostly on phylogroup B2 [31]. However, in other cases high heterogeneity of CDEC serotypes and MLST groups has been observed, implying that there are only functional similarities between CDEC and no common origin [16].

The results of our study suggest a combination of the above hypotheses. On the one hand, high interpatient heterogeneity has been demonstrated, with the isolated CDEC attributed to several distinct phylogroups (Fig. 2). On the other hand, some independently isolated strains (LF82 and O83:H1, RCE06 and RCE10) are highly similar and likely share a common origin. Strains with similar chromosomes may contain unrelated plasmids and vice versa. For example, the chromosomes of LF82 and O83:H1 share more than 99% homology, but their plasmids have no homologous genes. Conversely, the chromosome of RCE03 is more distant from LF82 than from E. fergusonii (Fig. 2), while the identity between their plasmids exceeds 98% at 86% coverage. This supports the hypothesis that the CD-associated phenotype could have arisen by horizontal gene transfer (plasmid or phage), possibly from non-E. coli bacteria.
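For readers unfamiliar with how a statement such as "98% identity at 86% coverage" is typically derived, the sketch below computes both figures from a list of aligned blocks. It is an illustration under stated assumptions: the block format, the function name and the example numbers are hypothetical, blocks are assumed not to overlap, and the sketch does not reproduce the alignment tool or settings used in this study.

```python
from typing import List, Tuple

# Hypothetical block format: (start, end, matches), half-open coordinates on the query.
Block = Tuple[int, int, int]


def identity_and_coverage(blocks: List[Block], query_length: int) -> Tuple[float, float]:
    """Return (percent identity, percent coverage) for non-overlapping aligned blocks."""
    aligned = sum(end - start for start, end, _ in blocks)
    matches = sum(block_matches for _, _, block_matches in blocks)
    if aligned == 0:
        return 0.0, 0.0
    identity = 100 * matches / aligned        # matching positions among aligned ones
    coverage = 100 * aligned / query_length   # share of the query that is aligned
    return identity, coverage


if __name__ == "__main__":
    # Made-up example: two blocks covering 800 of 1000 query positions.
    example_blocks = [(0, 500, 490), (600, 900, 300)]
    print(identity_and_coverage(example_blocks, 1000))
```

The two numbers answer different questions: identity describes how similar the aligned regions are, while coverage describes how much of the sequence could be aligned at all, which is why both are reported together.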

Another question concerning E. coli and Crohn’s disease is whether it is a pathogen or just a survivor. The mechanism of strain domination remains to be discovered, one possibility being a bactericidal effect on other E. coli strains. Indeed, all tested CDEC demonstrated some bactericidal activity. However, in co-cultivation experiments under standard conditions, CD isolates failed to outcompete isolates from healthy individuals (Additional file 12). Another explanation for the increased abundance of CDEC in the microbiota may be better fitness under conditions of acute inflammation. This hypothesis is supported by the observed proliferation of AIEC during severe ileitis in non-sterile mice, initially induced by chemicals or protists [64].

The role of AIEC in CD pathogenesis is thought to be mediated by bacterial cell surface proteins (porins, pili, membrane proteins, glycoproteins and protein complexes with lipopolysaccharides) [2]. In that regard it is interesting that many OGs overrepresented or underrepresented in CDEC have those functions. This suggests possible differences between the outer cell surfaces of CD-associated and commensal E. coli, namely in the membrane protein repertoire and polysaccharide composition.

Our study provides evidence that CDEC as a group are closer to pathogenic E. coli than to commensal E. coli. The genomes of some CDEC strains share more than 99% identity with defined pathogenic strains and/or contain plasmids closely related to those of defined pathogenic strains (Additional file 2). While no universal pathogenic feature was found in the genomes of the analyzed strains, several protein functions were more prominent in the CD-associated group of E. coli, which could be relevant to the possible pathogenicity of these strains.

One of the functions of genes enriched in CDEC is propanediol utilization (similar results were obtained in [12]). It is interesting that the propanediol utilization operons in CDEC are of diverse origins: the operon from LF82 (also found in O83:H1, four strains from the present study and seven CD strains from [31]) has homologs in pathogenic E. coli strains, E. albertii, and Shigella sp. The operon from two other CD strains (UM146 and RCE10) is similar to pathogenicity island II from E. coli strain 536. The operon from the RCE11 strain is similar to that of Citrobacter and Klebsiella spp. This provides additional support for the suggestion that CD-specific features in E. coli strains are not specific genes but functions, probably obtained from independent sources via horizontal gene transfer (Fig. 8). Indeed, in many cases these genes form operons flanked by genes encoding transposases or recombination proteins (Fig. 8).

Fig. 8

Schematic representation of the propanediol and galactitol operons in CDEC genomes. For each operon, the reference strain and the percentage of genomes containing it are indicated (CDEC vs commensal)

Recent publications show that the utilization of 1,2-propanediol is closely linked to intestinal proliferation and virulence of Listeria monocytogenes, enteropathogenic E. coli (EPEC), Salmonella enterica and Enterococcus faecalis (reviewed in [65]). Further, the genes required for 1,2-propanediol degradation are necessary for Salmonella replication within macrophages [66]. 1,2-propanediol utilization is important for growth in host tissues, since its precursor, fucose, is found in glycoconjugates of intestinal cells involved in host-parasite interactions [67]. 1,2-propanediol can be utilized by members of Enterobacteriaceae via aerobic and/or anaerobic pathways [68]. The intestine is normally anaerobic, whereas the aerobic pathway is much more efficient. Thus, the inflamed intestine would provide bacteria possessing this pathway with both an abundant substrate and the conditions for its optimal utilization. Moreover, Salmonella typhimurium has been suggested to induce acute inflammation in the intestine to provide aerobic conditions for ethanolamine utilization (a pathway close to that of propanediol utilization) [69]. One could speculate that a similar mechanism underlies CD pathogenesis.

Of the 14 strains containing a pdu operon similar to that of LF82, nine carry it close to a galactitol utilization locus (Fig. 8). The genes of this locus are functionally analogous to the gat operon that is common to all E. coli, but share no sequence similarity with it. The closest relatives of the CD-specific galactitol catabolism operon from LF82 are found in Klebsiella sp., Salmonella enterica, Enterobacter spp., and Listeria monocytogenes. Previously, the sets of genes for galactitol catabolism in Enterobacteriaceae were reported to be involved in horizontal gene transfer and recombination events [70]. It is hard to tell whether the additional galactitol operon has any specific function, but this pathway is connected with gut colonization. For example, genes involved in galactitol catabolism are induced in E. coli by growth on mucus [71] and show differential expression during biofilm formation [72]. Also, multiple mutations in those genes rapidly occur in laboratory strains of E. coli transferred from minimal growth media to the mouse gut, suggesting that they are under specific selective pressure in natural conditions [73].

The above observations suggest that the CD-enriched genomic features of CDEC provide the bacteria with an increased ability to colonize the intestine. These genes are organized in clusters that were likely acquired by E. coli from other members of Enterobacteriaceae via horizontal gene transfer.

Hence it seems that CDEC are not just commensal strains able to survive the acute inflammation. They have some characteristics of pathogenic E. coli. None of these are as straightforward as the Shiga toxin. All of them have been reported to improve colonization and to increase survival and fitness. It is possible that while persisting in the intestine, certain E. coli strains accumulate more and more of such improvements until taken together they may push the strain from commensality to pathogenesis.

Common CD-factors, if any, may not be specific genes or proteins, but rather functions performed by different genes in different strains. The heterogeneity of CDEC does not exclude the possibility that different groups of CDEC can possess different mechanisms for the survival in the inflamed intestine and therefore for the development of Crohn’s disease response in a patient, suggesting that specific treatment might be required in each case.

Conclusions

Our findings suggest that CDEC are of diverse phylogeny. However, some strains isolated from independent sources possess highly similar chromosomes or plasmids. No CD-specific genes or functional domains were found to be present in all CD-associated strains. However, some genes and operons are more often found in the genomes of CDEC than in commensal E. coli. They are mainly linked to the gut colonization and utilization of propanediol and other sugar alcohols.