Background

Microorganisms play a crucial role in human health by forming endosymbiotic relationships with their hosts and other microorganisms. These complex networks of microbial communities found throughout various environments, particularly in the human gut, are referred to as microbiomes [1–3]. Aside from bacteria-host interactions, bacteria are constantly engaged in an evolutionary arms race with mobile genetic elements (MGEs), such as phages and plasmids. To defend against these antagonists, prokaryotes have developed a variety of mechanisms, one of which is the CRISPR-Cas system, an adaptive immune system that provides sequence-specific defense against invading MGEs [4–6].

CRISPR-Cas systems are highly prevalent, existing in approximately half of bacterial and most archaeal genera [7–9]. The extreme diversity of CRISPR-Cas systems is reflected by their ever-changing classification scheme, owing to the constant discovery of new CRISPR-Cas system types and subtypes [5, 10, 11]. CRISPR-Cas systems can be grouped into two main classes: Class I and Class II. Class I CRISPR-Cas systems include Types I, III and IV and use a complex of Cas proteins to degrade foreign nucleic acids. Class II CRISPR-Cas systems include Types II, V and VI and use a single, large Cas protein for the same purpose (Types II, V and VI use Cas9, Cas12 and Cas13, respectively) [12]. The diversity of CRISPR-Cas systems provides a fitness edge against invaders and is suggested to be a product of advantageous evolution [13–15]. Similarly, invaders have been observed to evolve in tandem with host adaptive immunity so as to evade host defense mechanisms, for example by acquiring anti-CRISPR genes [4, 5, 16–19].

CRISPR arrays contain short DNA segments, known as spacers, which are the cornerstone of CRISPR-Cas-derived adaptive immunity. Spacers retain the memory of past immunological encounters and are primarily acquired through Cas protein complex-mediated acquisition [5]. Newly acquired spacers are typically integrated towards the leader ends of arrays [20, 21]. Additionally, leader sequences, usually found upstream of CRISPR arrays, have been linked to the efficiency of the CRISPR-Cas-derived immune response [22]. Several studies have also suggested that spacer acquisition remains possible through alternative means such as homologous recombination [20, 23, 24] and ectopic spacer integration, where spacers are inserted into the middle of arrays as a result of leader sequence mutations [22, 25]. While CRISPRs hold immunological memory of past encounters, with some arrays spanning several hundred spacers [26], CRISPR arrays are on average less than 30 spacers long, suggesting that some spacers are purged over time [27]. While a specific underlying mechanism of CRISPR array maintenance has not yet been elucidated, various studies have suggested several mechanisms of spacer loss, such as spontaneous deletions, recombination, and DNA polymerase slippage during replication [23, 28–30].

In recent years, much effort has been placed into expanding our understanding of the interactions between microbiomes and their host, as well as the potential modulation of the human microbiome to improve human health. One particular member of the microbiome, B. fragilis, has been proposed as a potential probiotic due to its ability to facilitate the alleviation of certain disease conditions [31]. In contrast, although Bacteroides is one of the most common genera of bacteria in the lower intestinal tract and B. fragilis accounts for only a small fraction (∼2%) of the total Bacteroides found in the gut microbiome, this species contributes to over 70% of Bacteroides infections [32–34]. This is due to B. fragilis’ extensive pan-genome and susceptibility to horizontal gene transfer events. As a result, certain strains of B. fragilis have become known pathobionts and opportunistic pathogens [35–37]. The perplexing interplay between the pathogenic and probiotic nature of B. fragilis strains highlights the importance of understanding pathobiont evolutionary dynamics, elements that contribute to a species’ pathogenicity, and CRISPR-Cas dynamics.

Many studies of the adaptation process in CRISPR-Cas systems involve an individual bacterial species challenged with invaders in controlled assays. Taking advantage of the increasing number of B. fragilis reference genomes and, more importantly, the large number of B. fragilis isolates from 12 individuals [38], we re-investigated the CRISPR-Cas systems of B. fragilis in its natural living environment. The availability of hundreds of time-resolved genomes from B. fragilis isolates from 7 individuals (some involving multiple time points) allowed us to investigate both the intra- and inter-personal dynamics of interactions between B. fragilis and its invaders, and to expand upon previous surveys of B. fragilis CRISPR-Cas systems [37]. Insights into how B. fragilis interacts with its invaders, as well as how its CRISPR-Cas systems confer immunity, help improve our understanding of the factors that contribute to B. fragilis virulence, horizontal gene transfer, and evolution.

Results

CRISPR-Cas systems in B. fragilis

To better understand the dynamics of CRISPR-Cas systems within B. fragilis, we analyzed a total of 823 B. fragilis genomes, which included 222 NCBI (National Center for Biotechnology Information) reference genomes and 601 isolates from 12 healthy individuals, 7 of which include longitudinal observations (referred to as the Zhao2019 dataset; see Methods) [38].

Our analysis identified three types of CRISPR-Cas systems among all B. fragilis genomes: Type I-B (class 1), Type II-C (class 2), and Type III-B (class 1; see a review article [39] for the classification of CRISPR-Cas systems). The three types of CRISPR-Cas systems contain universal cas genes, including cas1 and cas2, and other type-specific genes. For example, the Type II-C CRISPR-Cas systems contain the Type II signature cas9 gene; the Type III-B CRISPR-Cas systems contain the signature cas10 gene; and the Type I-B CRISPR-Cas systems contain the subtype I-B specific cas5 gene, although cas5 is universally found in all class I CRISPR-Cas systems [39]. Example illustrations of these CRISPR-Cas systems are depicted in Fig. 1A, and Table 1 shows the signature cas genes and the repeat sequence of the CRISPR arrays in these CRISPR-Cas systems. We note that besides the three types of predicted CRISPR-Cas systems, additional putative CRISPR arrays were predicted in a de novo fashion. However, they were deemed to be false CRISPR arrays due to their lack of spacer content heterogeneity, despite the fact that they superficially contain repeat-spacer structures (details of these CRISPR artifacts and the reasons why they were discarded are given at the companion website). The discarded CRISPR groups include a putative fourth CRISPR-Cas system that was previously reported in B. fragilis genomes [37]. This CRISPR-like artifact was found in 137 out of the 222 (61.7%) reference B. fragilis genomes, and in isolates of B. fragilis from 10 out of 12 individuals in the Zhao2019 dataset (other CRISPR artifacts were rare, found in one or very few genomes). Additionally, this CRISPR artifact was predicted to contain protein-coding genes encoding transcriptional regulators in some genomes (e.g., CP036550.1 and CP0811922.1), further suggesting that it is unlikely to be a genuine CRISPR.

Fig. 1

The three types of CRISPR-Cas systems found in B. fragilis genomes. (A) Representatives of B. fragilis CRISPR-Cas systems found in B. fragilis strain S14 (accession number: GCA_001682215.1_ASM168221v1). The figure legend indicates the CRISPR-Cas types by color where each arrow represents a gene and its orientation. While some cas genes are labeled in italicized text below the arrows, further details can be found on the companion website for this accession number. CRISPR arrays are labeled by their number of repeats in blue text. The predicted anti-repeat that is partially complementary to the CRISPR repeat in the Type II system is shown as a red triangle in the plot. (B) Prevalence of CRISPR-Cas systems among the NCBI reference genomes and Zhao2019 isolates. (C) Spacer content heterogeneity of the CRISPR arrays

Table 1 Signature cas genes and CRISPR repeat sequences for the three types of CRISPR-Cas systems found in B. fragilis

Inter-subject spacer diversity of B. fragilis CRISPR-Cas systems

An evaluation of the CRISPR-Cas system distribution among all isolates of the Zhao2019 dataset showed that CRISPR-Cas system types were unevenly and dissimilarly distributed between individuals (Fig. 1B and Table 2). Type I-B CRISPR-Cas systems were the least prevalent, with only isolates from two individuals (S07 and S08) containing this type of CRISPR-Cas system. Type II-C and Type III-B CRISPR-Cas systems are similarly prevalent with isolates from five and six individuals, respectively (see Table 2 for the lists of individuals that contain these systems). No CRISPR-Cas systems were found within Zhao2019 subjects S04, S05, S11 and S12. The lack of uniformity in CRISPR-Cas system presence suggests that lineages of B. fragilis between individuals are, for the most part, unique from each other, reaffirming the findings of Zhao et al. [38]. Similarly, little spacer content was shared between individuals: the majority of the spacers observed were individual-specific (Fig. 3), and the most commonly shared spacer was the anchor spacer on the trailer end of the observed CRISPR arrays.

Table 2 Presence of the different types of CRISPR-Cas systems in the 12 individuals

The spacer content heterogeneity score (Fig. 1C) and the compressed spacer graph of Type I-B CRISPR-Cas systems (Fig. 2A) found within Zhao2019 isolates show that the CRISPR arrays of Type I-B systems have low heterogeneity and are less active in terms of spacer turnover compared to other CRISPR-Cas systems in B. fragilis. In comparison, Type III-B CRISPR-Cas systems contained mostly individual-specific spacers and shared very few spacers between individuals. This pattern of individual-specific spacers is reflected in the branching structures observed in the compressed spacer graphs (Fig. 2B). Each branch within the spacer graph represents a unique CRISPR array structure; bottleneck nodes (e.g. 79, 53, and 48) represent spacer(s) uniformly shared among spacer-sharing CRISPR arrays. The observed branching structure in the compressed spacer graph indicates a diverse CRISPR array structure between individuals, indicative of the activity and heterogeneity of Type III-B CRISPR-Cas systems within B. fragilis. For comparison, the spacer content heterogeneity score shows similar trends among the B. fragilis reference genomes (see Fig. 1C).

Fig. 2

Compressed spacer graphs of Type I-B (A) and Type III-B (B) CRISPR-Cas systems in B. fragilis. Numbered nodes represent a single, unique spacer whereas nodes labeled with numerical intervals represent several consecutive spacers, and directed edges represent the ordering of the spacers in arrays. For the spacers in the same array, spacers closer to the trailer end are numbered with smaller numbers whereas the spacers closer to the leader end get bigger numbers. Nodes containing leader end spacers are highlighted in blue, and nodes containing trailer end spacers are shown in yellow. The compressed spacer graphs provide visual summary of 80 Type I-B CRISPR arrays identified from 80 Zhao2019 isolates (A), and 465 Type III-B CRISPR arrays identified from 410 isolates (B), respectively

The Type II-C CRISPR-Cas systems found within the Zhao2019 isolates were the most diverse among the three observed types of CRISPR-Cas systems found within B. fragilis. Among the seven subjects that contained Type II-C CRISPR-Cas systems, many of the identified Type II-C CRISPRs did not share inter-subject spacers, except for trailer spacers (i.e., end spacers such as nodes 1 and 19 shown in Fig. 3), which have been previously hypothesized to be ancient spacers or anchor spacers [28, 40, 41]. The spacer sequence diversity can be seen in Fig. 3, where each branch path represents a unique CRISPR array observed. The diversity of the CRISPR arrays observed in Type II-C CRISPR-Cas systems suggests that Type II-C systems have greater spacer activity (e.g. spacer acquisition and loss), and also highlights the evolutionary pressures that MGEs exert on B. fragilis.

Fig. 3

Spacer graph of the Type II-C CRISPR-Cas systems in B. fragilis. Each numbered node represents a unique spacer. Spacers that are unique to individuals are highlighted in different colors (see color scheme in the plot; spacers shared by isolates from at least two individuals are shown in white). This spacer graph summarizes the spacer organization in 250 CRISPR arrays of Type II-C identified from 187 Zhao2019 isolates. Of particular interest, the funneling shape highlighted in a red triangle is a result of a total of 28 CRISPR arrays involving insertion of six different new spacers (nodes numbered 119, 129, 124, 120, 130 and 63) into the CRISPR arrays after the same spacer (numbered 62)

Intra-individual CRISPR-Cas dynamics of B. fragilis

Taking advantage of the temporal intra-individual isolates of B. fragilis, we were able to study the micro-dynamics of the B. fragilis community and the adaptation of its CRISPR arrays over time within each individual. Overall, CRISPR array structures remain relatively stable across samples from the same individual, with some slight variations between observed CRISPR arrays (e.g., Fig. 3 S01, S02, S03, S06, S07). A notable example is the arrays of Type III-B CRISPR-Cas systems found in S01 isolates (Fig. 4A). Isolates were derived from this individual at 10 different time points spanning more than 2 years, and we observed only a small variation of the arrays in those isolates, resulting in a simple spacer graph with only one branching structure involving the loss (or gain) of a spacer in the middle of the arrays.

Fig. 4

Intra-individual CRISPR array diversity. Each numbered node represents a single, unique spacer, and directed edges represent the ordering of spacers in the CRISPR arrays. (A) Spacer graph of Type III-B CRISPR arrays from individual S01 isolates, consisting of 10 time points, showing low spacer diversity. (B) Spacer graph of Type II-C CRISPR arrays for individual S02. Green nodes are unique spacers identified from time point 1 (S02-0001), purple nodes are from time point 24 (S02-0024), and white nodes represent spacers found in multiple time points. (C) and (D) are the spacer graph, and the pileup view of five representative CRISPR arrays found in five isolates from individual S06, respectively. The five isolates were from two time points, including four from time point S06-0001 and one from time point S06-0122. The arrays involving spacers shown in green are unique to S06-0001 time point isolates, whereas the arrays involving spacers shown in open ovals are common to isolates from both time points. The numbers included in red boxes in figures A-C represent the number of isolates contributing to each major form of CRISPR arrays; for example, 1, 2, and 33 isolates contributed to the three forms of Type II-C CRISPR arrays in individual S06 shown in figure C, respectively

However, in some instances, periods of diverse spacer acquisition were observed from samples from the same individual (e.g., S08). As shown in Fig. 3, various strains of B. fragilis with varying CRISPR array structures were observed from isolates obtained in the same individual (S08) at a single time point. The spacer graph shows a funneling pattern, where multiple nodes on the leader end of the CRISPR array converge into a single neighboring node (see blue nodes on the left of Fig. 3; highlighted in a red triangle). This observed pattern in the spacer graph suggests that multiple ‘lineages’ have gained alternative leader end spacers in comparison to each other, specifically when the bacteria are exposed to different MGEs and are evolving according to the observed threat.

More examples of divergent lineages can be found in Fig. 4B–D, where both time-point-specific CRISPR arrays that share no spacers and CRISPR arrays shared across time points were present within B. fragilis Type II-C CRISPR-Cas systems. There are multiple lineages of B. fragilis containing diverse Type II-C CRISPR arrays in S02 (at time point S02-0001) and S06 (at time point S06-0001). Figure 4D shows a few representative CRISPR arrays found in individual S06, in which the three representative spacer-sharing CRISPR arrays in two time points, S06-0001 and S06-0122, were mostly similar except for differences likely resulting from the loss of two consecutive spacers in their corresponding isolates. We note that the time-point-specific spacers/arrays were rare and found in a small number of isolates as compared to the arrays that share many spacers (see the numbers of isolates in Fig. 4). While previous studies have shown that intra-individual populations of B. fragilis are dominated by a single strain [42–44], our findings here show that in some cases many lineages, or strains, of B. fragilis remain present within the same individual at a given time point. The observation of various intra-individual B. fragilis strains is yet another example of the evolutionary arms race between the host and invading MGEs.

Interaction network of B. fragilis and its invaders

A total of 1531 unique spacers were identified from B. fragilis genomes (including the 601 Zhao2019 isolates and 222 reference genomes). Among these spacers, 136 were shared by the two collections, 1290 were found in reference genomes only, and 104 were unique to the Zhao2019 isolates. We note that although Zhao2019 isolates outnumbered the collection of reference genomes we analyzed, fewer unique spacers were identified in the Zhao2019 collection because of the redundant nature of its isolates (from 12 individuals). All the spacers were used to identify potential MGEs that had left their traces in the B. fragilis genomes.

Among the 1531 unique spacers identified from B. fragilis isolates, 522 found matches (protospacers) in 161 MGEs (153 phages and 8 plasmids). Of the 153 phages, 108 could be assigned to a family by PhaGCN [45], with a majority of them being Siphoviridae (93, 86%). Using these spacers, interaction networks between B. fragilis and its invaders were inferred. Analysis of the networks (Fig. 5A and B) showed varying levels of micro-dynamics within B. fragilis CRISPR-Cas systems. The spacer-MGE network (Fig. 5A) contains a few modules, each containing a large number of MGEs and spacers (e.g., modules a, b, c and d highlighted in the figure), likely a result of the arms race between B. fragilis and MGEs (B. fragilis acquired new spacers to maintain immunity and invaders mutated to evade immunity). The spacer-MGE network shows that B. fragilis used its Type I-B and II-C CRISPR-Cas systems extensively to defend against MGEs that were mostly phages (the network contains 353, 163, and 3 spacers that were exclusively found in Type II-C, Type I-B, and Type III-B CRISPR-Cas systems, respectively). It also suggests differential defense activities of the Type I-B and II-C CRISPR-Cas systems against some invaders (e.g., those included in modules a and b were preferentially targeted by Type II-C CRISPR-Cas systems; by contrast, invaders included in modules c and d do not show such a preference). Figure 5B (focusing on B. fragilis isolates from several individuals) showed that some invaders (such as P1, P2, P3 and P4 located at the center of the network) have left traces in B. fragilis in many different individuals, likely the result of the ubiquitous presence of these MGEs in the human gut. Despite these central MGEs, which make the whole network highly connected, we observed groupings of B. fragilis isolates from one or two subjects with some more localized MGEs (e.g., the MGEs that were targeted by the B. fragilis CRISPR-Cas systems in individual S09). Similar trends can be observed in the heatmap visualizations of the spacer-MGE and host-MGE interaction networks, as shown in Supplementary Fig. 1 and Fig. 2, respectively.
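As an illustration of how such a spacer-MGE network can be assembled from spacer-protospacer matches, the following minimal Python sketch builds a bipartite graph and summarizes it. It is not the pipeline used in this study: the identifiers are hypothetical toy data, the function names are our own, and connected components are used only as a crude stand-in for the modules highlighted in Fig. 5A, which are denser groupings inside a highly connected network.

```python
from collections import defaultdict

def build_network(matches):
    """Build an undirected bipartite graph from (spacer_id, mge_id) matches."""
    adj = defaultdict(set)
    for spacer, mge in matches:
        s_node, m_node = ("spacer", spacer), ("mge", mge)
        adj[s_node].add(m_node)
        adj[m_node].add(s_node)
    return adj

def connected_components(adj):
    """Return the connected components of the graph as a list of node sets."""
    seen, components = set(), []
    for start in list(adj):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            comp.add(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(comp)
    return components

def spacers_per_mge(adj):
    """Number of distinct spacers with a protospacer in each MGE."""
    return {name: len(nbs) for (kind, name), nbs in adj.items() if kind == "mge"}

# Toy matches (hypothetical identifiers, not taken from the study).
toy_matches = [("sp1", "phageA"), ("sp2", "phageA"),
               ("sp2", "phageB"), ("sp3", "plasmidC")]
network = build_network(toy_matches)
components = connected_components(network)
targets = spacers_per_mge(network)
```

In this toy example, an MGE with a high count in `targets` corresponds to a node that shares edges with many spacers, whereas a spacer linked to several MGEs would appear as the hub of a hairball-like structure.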

Fig. 5

Interactions between B. fragilis and MGEs, as well as the distribution of protospacers in NC_011222. MGEs containing protospacers are shown in ovals with plasmids highlighted in red and phages in orange. (A) The spacer-MGE network includes spacers identified from Zhao2019 and the genomes of their MGE targets (if protospacers were identified in any known MGE reference genomes). Spacers identified in different types of CRISPR-Cas systems are shown in rectangles of different colors (Type I-B in light green, Type II-C in dark green, and Type III-B in blue). (B) The host-MGE network includes spacer-contributing B. fragilis isolates from Zhao2019 only. Hosts are shown in hexagons with B. fragilis isolates from different individuals shown in different shades of green; four phage MGEs are highlighted with labels: P1 (uvig_425872, Siphoviridae), P2 (uvig_422350, Siphoviridae), P3 (k141_68_round8-12 1718861, Microviridae), and P4 (NC_011222, Siphoviridae). (C) The distribution of protospacers in NC_011222, with green and red vertical lines representing protospacers matching spacers found in Type I-B and Type II-C CRISPR-Cas systems, respectively (gray arrows represent the genes). (D) PAMs detected in the downstream regions of the protospacers associated with Type I-B and Type II-C CRISPR-Cas systems, respectively

We analyzed the protospacers and their downstream regions in the identified MGEs. Figure 5C shows the distribution of protospacers in NC_011222 (Bacteroides phage B40-8, labelled as P4 in Fig. 5B), which was targeted extensively by both Type I-B and Type II-C CRISPR-Cas systems. We were able to infer the protospacer adjacent motifs (PAMs) that follow the protospacers. We extracted 10 bases from the downstream region of each protospacer and applied MEME [46] to detect and visualize the motifs among these sequences. Figure 5D shows the PAMs found in the downstream regions of the protospacers that matched spacers found in the Type I-B and Type II-C CRISPR-Cas systems, respectively. The logos show that the Type I-B and Type II-C systems target segments with distinct PAMs: the Type I-B CRISPR-Cas system tends to target segments with a G at downstream positions 2 and 3, whereas the Type II-C CRISPR-Cas system tends to target segments with a conserved C at downstream position 6.
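The downstream-region extraction described above can be sketched as follows. This is a simplified illustration rather than the procedure used in the study: it tabulates per-position base frequencies instead of running MEME, it ignores strand orientation, and the sequences and function names are hypothetical.

```python
from collections import Counter

def downstream_region(genome, protospacer_end, length=10):
    """Return `length` bases immediately after a protospacer.

    `protospacer_end` is a 0-based, end-exclusive coordinate on the forward
    strand (an assumption of this sketch; reverse-strand hits would need the
    reverse complement of the upstream bases instead).
    """
    return genome[protospacer_end:protospacer_end + length]

def position_frequencies(regions):
    """Per-position frequencies of A, C, G and T across the given regions."""
    length = min(len(r) for r in regions)
    freqs = []
    for i in range(length):
        counts = Counter(r[i] for r in regions)
        total = sum(counts.values())
        freqs.append({base: counts[base] / total for base in "ACGT"})
    return freqs

# Toy downstream regions (hypothetical), with G fixed at positions 2 and 3.
toy_regions = ["AGGTACCTAA", "TGGCATCAAT", "CGGAATCGTA"]
toy_freqs = position_frequencies(toy_regions)
toy_flank = downstream_region("AAAACCCCGGGGTTTT", protospacer_end=4, length=4)
```

A position whose frequency table is dominated by one base, such as indices 1 and 2 in `toy_freqs`, is what would appear as a tall letter in a sequence logo.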

Discussion

In this paper, we expanded upon previous work [37] and explored the CRISPR-Cas dynamics within B. fragilis genomes, focusing on a time-resolved study of B. fragilis within and between individuals. We analyzed a total of 823 genomes, a 7.5-fold increase over the number of genomes analyzed in previous B. fragilis CRISPR-Cas papers [37]. While B. fragilis is a common commensal bacterium of the human gut microbiome, sometimes a probiotic candidate and sometimes a pathogen, its role as one of the most virulent members of the Bacteroides genus should not be overlooked [47]. Part of B. fragilis virulence is due to its potent virulence factors, and as such, a thorough understanding of the mechanisms and factors that contribute to its virulence, horizontal gene transfer, and evolution is important. By utilizing CRISPR-Cas systems and focusing on time series isolates, we were able to reveal micro-dynamics found in B. fragilis isolates within and between individuals.

The analysis of NCBI’s reference genomes and genomes from the Zhao2019 dataset enabled us to update the evaluation of known CRISPR-Cas systems found within B. fragilis. Particularly, we found three types of CRISPR-Cas systems (Type I-B, Type II-C, and Type III-B) with varying distributions among the genomes. Our analysis also shows that a fourth previously reported CRISPR-Cas system in B. fragilis was a false CRISPR-like artifact. This CRISPR-like artifact was previously characterized as an orphaned CRISPR array [37], but due to its structure containing only two spacers, three repeats, as well as non-uniform repeat sequences, we believe this is not an orphaned CRISPR array.

While differentiating between active, inactive, and false-positive CRISPRs remains a challenging and active research area, we employed various methods to reduce the risk of including false-positive CRISPR arrays in our analysis. Identification of CRISPR arrays can be confounded by repetitive sequences that mimic CRISPR array structures. Here we used CRISPRone (which employs an ensemble method to remove potential false positives) [44], and additionally introduced the filtering of putative CRISPR arrays by spacer content heterogeneity. Our analysis shows that while all B. fragilis CRISPR-Cas system types had some level of plasticity, where CRISPR arrays across different time points and individuals were heterogeneous, the level of heterogeneity varied between CRISPR types and even time points. Intra-individual variations of CRISPR arrays, such as those found in Individual S08 (Fig. 3), showed periods of rapid expansion and diversification of CRISPR spacers between strains of observed isolates; these periods of diversification can be observed in the branching structures of the spacer graph. In comparison, periods of contraction, where little to no CRISPR spacer content heterogeneity was observed, were similarly present in intra-individual CRISPR-Cas systems, such as those found in Individual S01 (Fig. 4A). Unsurprisingly, most CRISPR-Cas systems did not share many spacers between individuals. This could be explained by individuals having acquired different isolates of B. fragilis. Here we also show that some CRISPRs go through periods of expansion while others go through periods of stability, suggesting that CRISPR evolution is not a constant process but occurs in modes. Uncovering these CRISPR-Cas dynamics would not be possible without time series analysis of the same bacterial lineage. While exploring the interplay between B. fragilis and its MGEs, we found that B. fragilis CRISPR-Cas systems seemed to prefer targeting phage genomes over plasmid genomes. This contrasts with some studies which found CRISPRs favoring the targeting of plasmids over phages [48, 49]. CRISPR spacer-MGE networks also revealed micro-dynamics of B. fragilis CRISPR targets, where we observed several notable network structures. Hairball-like structures, where a single spacer targeted many unique MGEs, exemplified that in some cases a single CRISPR spacer was likely able to target multiple MGEs. This suggests that the protospacer is conserved across many targets. In addition to hairball-like structures, several spacer nodes and MGE nodes formed cliques/modules, where nodes clustered more closely with each other than with other members of the network. Within these modules, MGE nodes shared an edge with many spacer nodes, suggesting that these MGEs contained many protospacers. This observation of many spacers targeting the same MGE may be suggestive of a process known as ‘primed CRISPR adaptation’. In primed CRISPR adaptation, the presence of an existing spacer enhances the acquisition of new spacers from the same MGE target [50, 51]. Alternatively, these instances of multiple targeting may be a result of naive adaptation, where spacers were acquired independently.

Not all spacers identified in B. fragilis had a matching MGE protospacer target, which might have biased our analysis to spacer targets based on available MGE database genomes. However, it has been suggested that most unidentified spacers relate to host-specific mobile elements [52, 53] and thus without adequate sequencing and annotation of the hosts’ microbiome, many of the spacer targets will remain unresolved. Another hypothesis to the limited spacer-MGE associated matches, especially in trailer end (older) spacers, is that protospacer sites of targeted MGEs have since mutated to evade detection by the CRISPR spacer and the MGE target pre-dates sequencing technology; thus, spacers are unable to match to any known protospacer targets within the available MGE databases.

Additionally, in compressed spacer graphs, we observed periods of expansion and contraction of CRISPR arrays. Funneling patterns are of particular interest and were mostly observed at the leader end of spacer graphs. The lack of these funnel-shaped patterns in the middle or trailer end of compressed spacer graphs suggests that certain spacers may provide an evolutionary advantage over others, allowing the strain carrying them to establish itself as the dominant strain and outcompete strains containing less fit CRISPR arrays; thus we do not see this branching structure in ‘older’ segments of the CRISPR array.

Although CRISPR-Cas systems are commonly found in prokaryotes, only about half of the bacterial species contain them [9, 44]. We recently showed that the prevalence of CRISPR-Cas systems varies widely among human-associated bacterial species; for example, Staphylococcus aureus has the lowest tendency to carry CRISPR–Cas systems, with only 0.55% of its isolates containing them, whereas most isolates of Clostridioides difficile analyzed have CRISPR–Cas systems, each having multiple CRISPRs [54]. This variability is reflected in the Zhao2019 collection: isolates from 8 out of 12 individuals contain one or more of the CRISPR-Cas systems found in B. fragilis (see Table 2). This poses a limitation on using the evolution of CRISPR arrays to study the adaptation of bacterial species to changing environments. On the other hand, due to the hypervariable nature of CRISPR arrays, they provide a sensitive approach for studying the microevolution of bacterial species, as shown in [55, 56].

While our work improves the understanding of B. fragilis adaptation to MGE exposure by using inferred host-MGE networks, more work is still needed to understand how CRISPR adaptation plays a role in B. fragilis acquisition of virulence factors, evolution, and horizontal gene transfer. In particular, one main challenge to Host-Invader analysis is the limitation of available MGE databases. Future efforts and resources to maintain databases of MGEs and other elements of the microbiome (e.g. fungome) remain invaluable for further understanding of the microbiome, and not just prokaryotic members. A better understanding of how B. fragilis and other pathobionts interact with their invading mobile elements will enable a better understanding of their evolution and the elements responsible for their pathogenicity.

Conclusions

By exploring the CRISPR-Cas systems present in B. fragilis and the dynamics of its host-MGE networks, we uncovered micro-dynamics of B. fragilis adaptation against invaders. We have made all annotated CRISPR-Cas systems, their target MGEs, and their interaction networks available as a web resource at https://omics.informatics.indiana.edu/CRISPRone/Bfragilis. We anticipate it will become an important resource for studying B. fragilis, its CRISPR-Cas systems, and its interactions with mobile genetic elements, providing insights into the evolutionary dynamics that may shape the species’ virulence and lead to its pathogenicity.

Methods

Genomic data processing and assembly

Reads from 601 B. fragilis isolates from the Zhao et al. study [38] were downloaded from the NCBI BioProject Accession PRJNA524913, henceforth referred to as the ‘Zhao2019 dataset’. All isolates from the Zhao2019 dataset were obtained from the OpenBiome stool bank whose donors abstained from antibiotics for a minimum of 3 months prior to donation [38]. Raw shotgun sequencing reads were trimmed using Trimmomatic v0.39 [57] (parameters used: LEADING:5 TRAILING:5 SLIDINGWINDOW:4:10 MINLEN:60). Trimmed reads were then assembled using SPAdes v3.12 [58] with default settings. FragGeneScan [59] was then used to predict protein coding genes of metagenome assemblies.

A total of 222 B. fragilis reference genomes, 16 complete and 202 draft genomes, were downloaded from the NCBI ftp website as of Jan 18, 2021. A list of genomes included in this analysis can be found at the companion web resource.

Characterization of CRISPR-Cas systems

To identify CRISPR-Cas systems in B. fragilis genomes, we used CRISPRone [44], which predicts both CRISPR arrays and cas genes within a given input genome sequence. Predicted CRISPR-Cas systems were then further refined through a reference-based approach. Repeat sequences of the CRISPRone-predicted arrays were extracted and clustered with CD-HIT-EST [60] at 85% sequence identity to obtain consensus reference repeats. The consensus reference repeats were then used as input for CRISPRAlign [61], a reference-based approach to identify CRISPR arrays. Because the exact boundaries of CRISPR arrays predicted by de novo approaches may be blurred by small CRISPR arrays, sequencing errors, and mutations in repeat sequences, we used this reference-based approach to redefine the repeat-spacer boundaries of the CRISPR arrays predicted by CRISPRone.

To compare spacer sequences across different arrays, reduce spacer redundancy, and eventually compute spacer content heterogeneity, spacers were clustered with CD-HIT-EST [60] at 85% sequence identity. The 85% threshold was chosen to tolerate a small amount of sequence variation between individual spacers, whether due to sequencing errors or real mutations. Spacer sequences that clustered together were considered identical. The spacer clusters were used in the downstream computation of spacer content heterogeneity (Fig. 6A) and the construction of compressed spacer graphs (Fig. 6B).

Fig. 6

Approaches used for the identification and refinement of CRISPR arrays and the construction of spacer graphs. (A) CRISPR arrays are analyzed in groups such that each group shares identical or very similar repeats (repeats are shown as diamonds and spacers are shown as boxes). CRISPR arrays that lacked spacer content heterogeneity and had no adjacent cas genes were considered false positives and discarded. (B) CRISPR arrays that share spacers can be represented as a simplified graph structure (spacer graph), in which the edges record the ordering of the spacers in the arrays.

In some cases, it may be difficult to differentiate between true CRISPR-Cas systems and false-positive CRISPR-Cas systems (e.g., false CRISPR arrays, inactive CRISPR-Cas systems). While manual curation can help filter out some of these cases, it becomes impractical when screening hundreds to thousands of genomes. CRISPRone uses a set of heuristics to identify and filter out potential false-positive CRISPR arrays, including STAR-like elements [44]. To additionally help filter out potential false-positive arrays and inactive CRISPR-Cas systems, we propose a heterogeneity metric that measures the rate of change (i.e., growth and turnover of spacers) in CRISPR arrays, under the assumption that CRISPR arrays of active CRISPR-Cas systems undergo active expansion and turnover of spacers. In instances where spacer content heterogeneity was zero but the arrays had adjacent cas genes, the arrays were considered true CRISPRs. We define the spacer content heterogeneity score as:

$$ \textrm{Spacer Heterogeneity} = \frac{m - \max_{i}(c_{i})}{\sum_{i=1}^{n} c_{i} - \max_{i}(c_{i})} \tag{1} $$

Here, n is the number of CRISPR arrays in a group, c_1, c_2, ..., c_n are the numbers of unique spacers in the individual arrays (in some rare cases, a CRISPR array may contain multiple copies of the same spacer, which are counted as one spacer), and m is the number of unique spacers found in all n arrays combined. Spacer heterogeneity scores range from 0 to 1, where 0 indicates no spacer content heterogeneity (e.g., two CRISPR arrays share all spacers) and 1 indicates the greatest possible extent of spacer content heterogeneity (e.g., two CRISPR arrays share no spacers).
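As an illustration of Eq. 1, the score can be computed from arrays represented as lists of spacer-cluster identifiers. The following is a minimal sketch rather than the code used in this study; the function name and the handling of a zero denominator (for example, a group with a single array) are assumptions made for the example.

```python
from typing import Iterable, Sequence


def spacer_heterogeneity(arrays: Sequence[Iterable[str]]) -> float:
    """Spacer content heterogeneity (Eq. 1) for one group of CRISPR arrays.

    Each array is an iterable of spacer-cluster IDs (hypothetical labels).
    """
    # Multiple copies of a spacer within one array are counted once.
    unique_per_array = [set(array) for array in arrays]
    counts = [len(spacers) for spacers in unique_per_array]  # c_1 .. c_n
    m = len(set().union(*unique_per_array))  # unique spacers over all arrays
    largest = max(counts, default=0)
    denominator = sum(counts) - largest
    # Assumption: Eq. 1 is undefined when the denominator is zero
    # (e.g., a single array); this sketch reports 0 in that case.
    if denominator == 0:
        return 0.0
    return (m - largest) / denominator
```

For two arrays that share two of their three spacers each, m = 4, the largest array has 3 unique spacers, and the sum of the c_i is 6, so the score is (4 - 3) / (6 - 3) = 1/3.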

Because spacer content heterogeneity alone is not enough to rule out false-positive or inactive CRISPR-Cas systems, predicted CRISPR-Cas systems were further filtered by coupling spacer content heterogeneity with gene content information. CRISPR groups that lacked spacer content heterogeneity and had no adjacent cas genes were considered inactive or false positive, and were discarded from further analysis; all filtered arrays were also manually inspected prior to their removal.
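The automated part of this decision rule can be written as a simple predicate. The sketch below is illustrative only: the CrisprGroup record and its field names are hypothetical, and the manual inspection step is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CrisprGroup:
    """Hypothetical summary of one group of arrays sharing similar repeats."""
    name: str
    heterogeneity: float      # spacer content heterogeneity of the group
    has_adjacent_cas: bool    # True if cas genes are adjacent to the arrays


def keep_group(group: CrisprGroup) -> bool:
    # A group is discarded only when it has no spacer content heterogeneity
    # AND no adjacent cas genes; either signal alone is enough to keep it.
    return group.heterogeneity > 0 or group.has_adjacent_cas


def filter_groups(groups: List[CrisprGroup]) -> List[CrisprGroup]:
    return [group for group in groups if keep_group(group)]
```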

Compressed spacer graph for summarizing the sharing of spacers among a group of CRISPR arrays

Compressed spacer graphs [41] were constructed for each CRISPR-Cas type to summarize and illustrate CRISPR array dynamics. Each spacer in a given array was represented by the node of its spacer cluster, and a directed edge was added between the nodes of neighboring spacers in sequential order. Once all CRISPR arrays were represented in the graph, the spacer graph was simplified by collapsing two neighboring nodes if both had an in-degree and an out-degree equal to or less than one (Fig. 6B). Compressed spacer graphs highlight CRISPR array structure and dynamics (e.g. branching structures representing spacer gain and loss). Arrays that share no spacers result in disconnected components in the compressed spacer graph.
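The construction and compression steps can be sketched as follows. This is a minimal illustration under one literal reading of the collapsing rule (an edge is collapsed only when both of its endpoints have in-degree and out-degree of at most one), not the implementation from [41]; the function names are hypothetical, and edges that would become self-loops after compression are dropped for simplicity.

```python
from typing import Dict, Sequence, Set, Tuple

Node = Tuple[str, ...]  # a compressed node is an ordered run of spacer IDs


def build_spacer_graph(arrays: Sequence[Sequence[str]]) -> Dict[str, Set[str]]:
    """One node per spacer cluster; directed edges between neighboring spacers."""
    succ: Dict[str, Set[str]] = {}
    for array in arrays:
        for spacer in array:
            succ.setdefault(spacer, set())
        for a, b in zip(array, array[1:]):
            succ[a].add(b)
    return succ


def compress(succ: Dict[str, Set[str]]) -> Tuple[Set[Node], Set[Tuple[Node, Node]]]:
    pred: Dict[str, Set[str]] = {n: set() for n in succ}
    for u, targets in succ.items():
        for v in targets:
            pred[v].add(u)
    # "Simple" nodes have in-degree <= 1 and out-degree <= 1.
    simple = {n for n in succ if len(pred[n]) <= 1 and len(succ[n]) <= 1}

    def next_simple(n: str):
        # Successor of n along a collapsible edge (both endpoints simple).
        if n in simple and succ[n]:
            (v,) = succ[n]
            if v in simple and v != n:
                return v
        return None

    owner: Dict[str, Node] = {}  # spacer -> compressed node containing it

    def walk(start: str) -> None:
        chain, seen = [start], {start}
        nxt = next_simple(start)
        while nxt is not None and nxt not in seen:
            chain.append(nxt)
            seen.add(nxt)
            nxt = next_simple(nxt)
        for spacer in chain:
            owner[spacer] = tuple(chain)

    # Start chains at nodes that are not the target of a collapsible edge.
    targets_of_collapsible = {next_simple(n) for n in succ} - {None}
    for n in succ:
        if n not in targets_of_collapsible:
            walk(n)
    for n in succ:  # leftovers lie on cycles made only of simple nodes
        if n not in owner:
            walk(n)

    edges = {(owner[u], owner[v])
             for u, targets in succ.items() for v in targets
             if owner[u] != owner[v]}
    return set(owner.values()), edges
```

For example, the arrays s1-s2-s3-s4 and s5-s2-s3-s4 yield a branching structure in which s3 and s4 are collapsed into one node, while s2 stays separate because it has two incoming edges.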

Mobile genetic element databases

A collection of mobile genetic element (MGE) databases was gathered, including phage and plasmid databases. The phage databases included the Gut Phage Database [62] (GPD), the MicrobeVersusPhage [63] (MVP) database, and the Reference Viral Database [64] (RVDB). The plasmid databases included the Comprehensive and Complete Plasmid Database [65] (COMPASS) and PLSDB [66]. The phage and plasmid databases included sequences from the NCBI reference database, the NCBI nucleotide database, MGEs identified from metagenomic assemblies, and prophages identified in prokaryotic genomes. We collectively refer to these databases as the ‘MGE database’ for simplicity.

Identification of CRISPR targets

All unique spacer sequences extracted from B. fragilis CRISPR arrays were queried against the MGE database using BLASTN [67] to search for putative invaders targeted by B. fragilis. For this analysis, we used all unique spacers instead of the 85%-similarity nonredundant set to increase the search sensitivity. Results were filtered to retain hits with greater than 90% sequence identity, query coverage per HSP greater than 80%, and an e-value of less than 0.001. We noticed that even after dereplication by dRep [68] (with default parameters), there was still a large redundancy in the identified MGEs. We therefore devised a greedy algorithm to select a minimal set of MGEs that collectively contain all protospacers matching the spacers. Similarly, we selected a minimal set of B. fragilis isolates that contained all identified spacers and included only those isolates in the network. The selected MGEs and isolates were then used to build spacer-MGE and host-MGE networks. In the spacer-MGE network, spacer sequence clusters (called spacers for simplicity) and MGEs are represented as nodes, and an edge is added between a spacer node and an MGE node if the MGE contains a segment that matches the spacer (i.e., a protospacer). In the host-MGE network, an edge is added between a host and an MGE if the pair contains at least one matching spacer and protospacer. For MGEs that are phages (or prophages), we applied PhaGCN [45] to assign their taxonomic groups (ICTV [69] families). All visualizations and manual inspection of the networks were performed using Cytoscape [70].
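The hit filtering, greedy selection, and host-MGE edge construction described above can be sketched as follows. This is a minimal illustration, not the code used in this study: it assumes the standard greedy set-cover heuristic (repeatedly choosing the candidate that covers the most still-uncovered spacers), which yields a small but not provably minimum set, and all function names, identifiers, and the alphabetical tie-breaking are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def passes_filter(pident: float, qcovhsp: float, evalue: float) -> bool:
    """Thresholds from the text: identity > 90%, coverage per HSP > 80%, e-value < 0.001."""
    return pident > 90 and qcovhsp > 80 and evalue < 1e-3


def greedy_cover(candidates: Dict[str, Set[str]]) -> List[str]:
    """Select candidates (MGEs or isolates) until every spacer is covered.

    candidates maps a candidate ID to the set of spacers it contains
    (as protospacers for MGEs, or as spacers for isolates).
    """
    uncovered = set().union(*candidates.values())
    chosen: List[str] = []
    while uncovered:
        # Assumption: ties are broken alphabetically for reproducibility.
        best = max(sorted(candidates),
                   key=lambda c: len(candidates[c] & uncovered))
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen


def host_mge_edges(host_spacers: Dict[str, Set[str]],
                   mge_spacers: Dict[str, Set[str]]) -> Set[Tuple[str, str]]:
    """An edge links a host and an MGE that share at least one spacer match."""
    return {(host, mge)
            for host, hs in host_spacers.items()
            for mge, ms in mge_spacers.items()
            if hs & ms}
```

The same greedy_cover function can be applied twice, once to the MGE-to-spacer mapping and once to the isolate-to-spacer mapping, before the selected entries are passed to host_mge_edges.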