Structure of the space of taboo-free sequences

Models of sequence evolution typically assume that all sequences are possible. However, restriction enzymes that cut DNA at specific recognition sites provide an example where carrying a recognition site can be lethal. Motivated by this observation, we studied the set of strings over a finite alphabet with taboos, that is, with prohibited substrings. The taboo-set is referred to as $\mathbb{T}$ and any allowed string as a taboo-free string. We consider the so-called Hamming graph $\Gamma_n(\mathbb{T})$, whose vertices are taboo-free strings of length $n$ and whose edges connect two taboo-free strings if their Hamming distance equals one. Any (random) walk on this graph describes the evolution of a DNA sequence that avoids taboos. We describe the construction of the vertex set of $\Gamma_n(\mathbb{T})$. Then we state conditions under which $\Gamma_n(\mathbb{T})$ and its suffix subgraphs are connected. 
Moreover, we provide an algorithm that determines if all these graphs are connected for an arbitrary $\mathbb{T}$. As an application of the algorithm, we show that about 87% of bacteria listed in REBASE have a taboo-set that induces connected taboo-free Hamming graphs, because they have fewer than four type II restriction enzymes. On the other hand, four properly chosen taboos are enough to disconnect one suffix subgraph, and consequently connectivity of taboo-free Hamming graphs could change depending on the composition of restriction sites.


Introduction
In bacteria, restriction enzymes cleave foreign DNA to stop its propagation. To do so, a double-stranded cut is induced by a so-called recognition site, a DNA sequence of length 4-8 base pairs (Alberts et al. 2004). As part of their restriction-modification (R-M) system, bacteria can escape the lethal effect of their own restriction enzymes by modifying recognition sites in their own DNA (Kommireddy and Nagaraja 2013). Nevertheless, Gelfand and Koonin (1997) and Rocha et al. (2001) found a significant avoidance of recognition sites in bacterial DNA, and Rusinov et al. (2015) showed that this avoidance was characteristic of type II R-M systems. Also in bacteriophages, the avoidance of the recognition sites is evolutionarily advantageous (Rocha et al. 2001), mainly for non-temperate bacteriophages affected by orthodox type II R-M systems (Rusinov et al. 2018a). Therefore, in those instances the recognition site is, as we call it, a taboo for host and foreign DNA.
Although avoidance of recognition sites is well studied, e.g. by Rusinov et al. (2018b), taboo-free DNA evolution has not yet been modelled. To initiate models of sequence evolution with taboos, we studied the Hamming graph $\Gamma_n(\mathbb{T})$, whose vertices are the strings of length $n$ over a finite alphabet $\Sigma$ that do not contain any taboo of the set $\mathbb{T}$ as a substring. Two vertices of the Hamming graph are adjacent if the corresponding taboo-free strings have Hamming distance equal to one. In biological terms, the sequences differ by a single substitution.
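As a concrete illustration of this definition, the following Python sketch enumerates taboo-free strings by brute force and links those at Hamming distance one. The function names (`is_taboo_free`, `taboo_free_strings`, `hamming_graph`) and the toy taboo-set {AA} are illustrative assumptions, not part of the original work, and the enumeration is exponential in n, so it is only meant for small cases.

```python
from itertools import product

def is_taboo_free(s, taboos):
    """True if no taboo occurs in s as a contiguous substring."""
    return not any(t in s for t in taboos)

def taboo_free_strings(n, alphabet, taboos):
    """Brute-force vertex set: all taboo-free strings of length n."""
    return [s for s in map("".join, product(alphabet, repeat=n))
            if is_taboo_free(s, taboos)]

def hamming_graph(n, alphabet, taboos):
    """Adjacency lists of the taboo-free Hamming graph (illustrative only)."""
    vertices = set(taboo_free_strings(n, alphabet, taboos))
    adj = {v: [] for v in vertices}
    for v in vertices:
        for i, old in enumerate(v):
            for new in alphabet:
                if new != old:
                    u = v[:i] + new + v[i + 1:]  # single substitution
                    if u in vertices:
                        adj[v].append(u)
    return adj

graph = hamming_graph(3, "ACGT", {"AA"})
print(len(graph))            # number of taboo-free strings of length 3
print(sorted(graph["ACA"]))  # taboo-free neighbours of ACA
```

Storing the vertex set as a Python set keeps each neighbour lookup constant-time, which is enough for the small n used in the examples of this paper.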
We note that, for a binary alphabet $\Sigma = \{0, 1\}$ and taboo-set $\mathbb{T} = \{11\}$, the corresponding Hamming graphs $\Gamma_n(\mathbb{T})$ are known as Fibonacci cubes. Some properties of the Fibonacci cubes like the Wiener Index or the degree distribution were surveyed by Klavžar (2013). Further results have been obtained for taboo-sets forbidding arbitrary numbers of consecutive "1"s, $\mathbb{T} = \{1 \ldots 1\}$, by Hsu and Chung (1993), or when $\mathbb{T} = \{s\}$ for an arbitrary binary string $s$ by Ilić et al. (2012). Recently, the equivalent problem of lattice paths that avoid some patterns has been described using automata and generating functions by Asinowski et al. (2018, 2020). We are not so much interested in enumerative properties of Hamming graphs. We want to define conditions under which the Hamming graphs stay connected for arbitrary finite alphabets and arbitrary finite taboo-sets. From an evolutionary point of view, connectivity guarantees that any taboo-free sequence can be generated by point mutations from any initial taboo-free sequence without ever containing a taboo during evolution. To include further biological realism, we will also study the connectivity of subgraphs $\Gamma^{s}_{n}(\mathbb{T})$ of the Hamming graph, where $s$ is a taboo-free suffix. Suffix $s$ can be viewed as a conserved DNA fragment, that is, a sequence that remained invariable during evolution (Shoemaker and Fitch 1989; Fitch and Margoliash 1967).
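The link to Fibonacci cubes can be checked numerically on small cases. The sketch below counts binary strings avoiding 11 and compares the counts with the Fibonacci numbers F_{n+2} (with F_1 = F_2 = 1), which is the known number of vertices of the Fibonacci cube. The helper names are hypothetical and chosen only for this illustration.

```python
from itertools import product

def count_taboo_free(n, alphabet, taboos):
    """Number of strings of length n that contain no taboo as a substring."""
    return sum(1 for p in product(alphabet, repeat=n)
               if not any(t in "".join(p) for t in taboos))

def fibonacci(k):
    """k-th Fibonacci number with F_1 = F_2 = 1."""
    a, b = 1, 1
    for _ in range(k - 1):
        a, b = b, a + b
    return a

counts = [count_taboo_free(n, "01", {"11"}) for n in range(1, 8)]
expected = [fibonacci(n + 2) for n in range(1, 8)]
print(counts)
print(expected)
```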
The inclusion of Hamming graphs with a constant suffix provides more general results, because $\Gamma^{e}_{n}(\mathbb{T}) = \Gamma_n(\mathbb{T})$, where $e$ is the empty string. Given a taboo-set $\mathbb{T}$, if for every taboo-free string $s$ and integer $n$ the Hamming graph $\Gamma^{s}_{n}(\mathbb{T})$ is connected, then evolution can explore the space of taboo-free sequences by simple point mutation, no matter which DNA suffix fragments remain invariable, as long as the taboo-set $\mathbb{T}$ does not change in the course of evolution.
[Figure caption: Vertex TTT is a cut vertex, because if we remove TTT and its incident edges (dashed lines, coloured red), then the resulting graph is disconnected. Consequently, graph $\Gamma_3(\mathbb{T}^*)$ induced by taboo-set $\mathbb{T}^* = \mathbb{T} \cup \{TTT\}$ is disconnected. Red, blue and yellow edges connect vertices with a different distribution of letter T (colour figure online)]
In set $\Psi(\mathbb{T})$, consider every pair of strings with Hamming distances 1 or 0. For example, the pair (AA, AA) has distance 0; the pair (AA, CA) has distance 1; and the pair (AA, TT) has distance 2. If every pair with Hamming distance 1 or 0 can be taboo-free extended to the left by the same letter, then all graphs $\Gamma^{s}_{n}(\mathbb{T})$ are connected. For example, the pair (AA, AA) can be extended by C, because CAA is taboo-free, and the pair (AA, CA) can be extended by T, because TAA and TCA are taboo-free. After checking all possible pairs with Hamming distance 0 or 1, we see that all such pairs in $\Psi(\mathbb{T})$ are extendable to the left, and thus taboo-set $\mathbb{T}$ generates connected taboo-free Hamming graphs.
(3) If Proposition 24 cannot be applied, then we apply the characterization of Theorem 22. Assume for example that $\mathbb{T} = \{AAA, CCA, TAA, GAA\}$. Since the pair $\{AA, CA\} \subset \Psi(\mathbb{T})$ with Hamming distance one is not taboo-free extendable to the left by any letter, we proceed as follows. First we construct $\mathrm{suf}(\mathbb{T})$, the set of all proper suffixes of the taboos in $\mathbb{T}$. In our example, $\mathrm{suf}(\mathbb{T}) = \{AA, CA, A, e\}$, where $e$ is the string with no letters. Now we consider, for every suffix $r \in \mathrm{suf}(\mathbb{T})$, the graph $\Gamma^{r}_{|r|+M}(\mathbb{T})$, where $|r|$ is the length of $r$ and $M$ is the length of the longest taboo(s) in $\mathbb{T}$. If all graphs $\Gamma^{r}_{|r|+M}(\mathbb{T})$ are connected, then every graph $\Gamma^{s}_{n}(\mathbb{T})$ is connected. In our example, graphs $\Gamma^{AA}_{5}(\mathbb{T})$, $\Gamma^{CA}_{5}(\mathbb{T})$, $\Gamma^{A}_{4}(\mathbb{T})$ and $\Gamma_{3}(\mathbb{T})$ are connected, implying that all taboo-free Hamming graphs are connected. When graph $\Gamma^{r}_{|r|+M}(\mathbb{T})$ is disconnected for some $r \in \mathrm{suf}(\mathbb{T})$, then suffix $r$ induces disconnected taboo-free Hamming graphs of the form $\Gamma^{r}_{n}(\mathbb{T})$ for $n \ge |r| + M$. Therefore evolution cannot explore the whole space of taboo-free sequences. This is the case for taboo-set $\mathbb{T}^*$ of Fig. 2, where $r = e$ yields the disconnected graph $\Gamma_3(\mathbb{T}^*)$.
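The check described in step (3) can be sketched by brute force for the example taboo-set. The code below builds the proper suffixes of the taboos, constructs each suffix graph of length |r| + M, and tests connectivity with a breadth-first search. All function names are hypothetical, the empty vertex set is treated as connected by convention, and the sketch does not verify that the taboo-set is left proper, which the characterization assumes.

```python
from itertools import product
from collections import deque

def suffix_vertices(n, suffix, alphabet, taboos):
    """Taboo-free strings of length n that end with the given suffix."""
    k = n - len(suffix)
    out = set()
    for p in product(alphabet, repeat=k):
        s = "".join(p) + suffix
        if not any(t in s for t in taboos):
            out.add(s)
    return out

def is_connected(vertices, alphabet):
    """BFS over single-substitution neighbours that stay in the vertex set.
    An empty vertex set counts as connected (a convention of this sketch)."""
    if not vertices:
        return True
    start = next(iter(vertices))
    seen, queue = {start}, deque([start])
    while queue:
        v = queue.popleft()
        for i in range(len(v)):
            for a in alphabet:
                u = v[:i] + a + v[i + 1:]
                if u in vertices and u not in seen:
                    seen.add(u)
                    queue.append(u)
    return len(seen) == len(vertices)

def proper_suffixes(taboos):
    """All proper suffixes of the taboos, including the empty string."""
    return {t[i:] for t in taboos for i in range(1, len(t) + 1)}

def check_suffix_graphs(alphabet, taboos):
    """Map each proper suffix r to the connectivity of its graph of length |r| + M."""
    M = max(len(t) for t in taboos)
    return {r: is_connected(suffix_vertices(len(r) + M, r, alphabet, taboos), alphabet)
            for r in proper_suffixes(taboos)}

T = {"AAA", "CCA", "TAA", "GAA"}
result = check_suffix_graphs("ACGT", T)
for r, ok in sorted(result.items()):
    print(repr(r), ok)
```

For this taboo-set the four graphs have at most 60 vertices each, so the exhaustive search is immediate; for longer taboos the cost grows as |Σ|^M.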

Outline
We will characterize taboo-sets $\mathbb{T}$ such that every Hamming graph of the form $\Gamma^{s}_{n}(\mathbb{T})$ is connected. To this end, we describe in Sect. 5 basic properties of taboo-sets. In Sect. 6, we introduce a very general type of taboo-sets, called left proper (Definition 4), which are our main object of study. In Proposition 11.b we show that, to construct graph $\Gamma^{s}_{n}(\mathbb{T})$, we only need the longest prefix of $s$ which is a suffix of a taboo, which we call $s[1, k_s]$. In Sect. 7 we state the graph isomorphism $\Gamma^{s}_{n}(\mathbb{T}) \cong \Gamma^{s[1,k_s]}_{n}(\mathbb{T})$ (Theorem 16). In Sect. 8 we explain how the edges of a quotient graph are related to the structure of graph $\Gamma_{n}(\mathbb{T})$ (Proposition 17). Combining all these results, in Sect. 8 we characterize the connectivity of Hamming graphs $\Gamma^{s}_{n}(\mathbb{T})$. We prove by induction that the connectivity of a small number of quotient graphs implies the connectivity of all Hamming graphs with long suffixes (Proposition 20). This result can be used to prove connectivity of Hamming graphs with short suffixes (Proposition 21). These two results yield the characterization of the connectivity of every suffix Hamming graph in Theorem 22. Section 9 provides examples of bacterial taboo-sets and their connectivity.

Basic notations
We will introduce some standard notations concerning strings as well as some relevant terms from graph theory.

Strings
We will use the term string to refer to a sequence of symbols over an arbitrary finite alphabet Σ = {a 1 , . . . , a m }, where m ≥ 2, while (DNA) sequence is reserved for biological contexts, where the alphabet consists of the four nucleotides Σ = {A, C, G, T }.
We denote the set of strings of length $n$ over the alphabet $\Sigma$ by $\Sigma^n$. The length of a string $s$ is denoted by $|s|$. The empty string will be denoted by $e$, and satisfies $|e| = 0$ and $\{e\} = \Sigma^0$.
Given a string $s = b_1 \ldots b_n \in \Sigma^n$, the expression $s[i, j]$ denotes the substring $b_i \ldots b_j$. If string $s_1$ is a substring of string $s_2$, we write $s_1 \prec s_2$, while $s_1 \nprec s_2$ denotes that $s_1$ is not a substring of $s_2$. By convention, $e \prec s$ for any string $s$. For strings $s_1$ and $s_2$, we define $s_1 s_2$ as the concatenation of $s_1$ and $s_2$. Note that $es = se = s$ for any $s$. For a string $s$ and a set of strings $S = \{s_1, \ldots, s_k\}$, the concatenation of $s$ with all elements in $S$ is denoted by $s \bullet S := \{ss_1, \ldots, ss_k\}$. If $S_1$ and $S_2$ are disjoint sets, then the disjoint union of $S_1$ and $S_2$ will be denoted by $S_1 \sqcup S_2$.
Finally, given two strings s 1 , s 2 of equal length, d(s 1 , s 2 ) denotes their Hamming distance, that is, the number of positions at which the corresponding symbols differ.

Graph theory
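We will use common graph theory terminology following Wilson (1986). Let $G = (V, E)$ denote a simple, undirected graph with vertex set $V$ and edge set $E$. We say that graph $G_1 = (V_1, E_1)$ is a subgraph of $G_2 = (V_2, E_2)$ if $V_1 \subseteq V_2$ and $E_1 \subseteq E_2$, and we denote this as $G_1 \subseteq G_2$.
Given a graph $G = (V, E)$ and a subset $V_1 \subseteq V$, the subgraph induced by $V_1$ is the graph with vertex set $V_1$ whose edges are all edges of $E$ with both endpoints in $V_1$. Two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are isomorphic, written $G_1 \cong G_2$, if there exists a bijection $f : V_1 \to V_2$ such that $\{u, v\} \in E_1$ iff $\{f(u), f(v)\} \in E_2$. That is, $G_1$ and $G_2$ are isomorphic if there exists an edge-preserving bijection between their vertex sets.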
We will also need the quotient graph, as defined by Sanders and Schulz (2013), to study the connectivity of Hamming graphs. To define it, consider a graph $G = (V, E)$ and a partition of its vertex set $V$, namely $V = \bigsqcup_{b \in J} V_b$ for some index set $J$. The quotient graph of $G$, denoted as $Q[G] = (J, E_J)$, is the graph whose vertices are $J$ and such that $\{b_1, b_2\} \in E_J$ iff an edge connects a vertex in $V_{b_1}$ with a vertex in $V_{b_2}$. Figure 3 gives an example of a quotient graph. Our strategy to prove connectivity of taboo-free Hamming graphs will use the following propositions, whose proofs are simple enough to be omitted.
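The quotient graph construction can be sketched in a few lines. In the example below, the function name `quotient_graph` and the dictionary `block_of` are hypothetical, and edges joining two vertices of the same block are ignored, which is our reading of the definition for simple graphs.

```python
def quotient_graph(edges, block_of):
    """Edge set of Q[G] for a partition of the vertices.

    edges:    iterable of (u, v) pairs of the original graph
    block_of: dict mapping each vertex to the index of its block
    Returns a set of frozensets {b1, b2} with b1 != b2.
    """
    q_edges = set()
    for u, v in edges:
        b1, b2 = block_of[u], block_of[v]
        if b1 != b2:  # edges inside one block create no quotient edge
            q_edges.add(frozenset((b1, b2)))
    return q_edges

# Toy example: binary strings of length 2 avoiding 11, partitioned by last symbol.
edges = [("00", "01"), ("00", "10")]
block_of = {"00": "0", "10": "0", "01": "1"}
q = quotient_graph(edges, block_of)
print(q)
```

Partitioning by a common suffix, as in this toy example, is the kind of partition used later for the suffix Hamming graphs.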

Proposition 1 Consider graph
Proposition 2 For graph G = (V , E), the following statements are equivalent:

Properties of taboo-sets
We will repeatedly use the following terminology.
Definition 1 - A finite set of strings $\mathbb{T}$ such that every $t \in \mathbb{T}$ satisfies $|t| \ge 2$ is called a taboo-set.
With Definition 1 in mind, we can prove some simple properties of taboo-sets.
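Proposition 3 Given taboo-sets $\mathbb{T}_1$ and $\mathbb{T}_2$, it holds that: (a) $\mathbb{T}_1 \cup \mathbb{T}_2$ is a taboo-set. (b) For any $n \in \mathbb{N}$, $V_n(\mathbb{T}_1) \cap V_n(\mathbb{T}_2) = V_n(\mathbb{T}_1 \cup \mathbb{T}_2)$. (c) If for every $t_1 \in \mathbb{T}_1$ there exists $t_2 \in \mathbb{T}_2$ such that $t_2 \prec t_1$, then for any $n \in \mathbb{N}$, $V_n(\mathbb{T}_2) \subseteq V_n(\mathbb{T}_1)$.
Proof (a) Every $t \in \mathbb{T}_1 \cup \mathbb{T}_2$ has length at least 2, and thus $\mathbb{T}_1 \cup \mathbb{T}_2$ is a taboo-set.
(b) All strings $s \in V_n(\mathbb{T}_1) \cap V_n(\mathbb{T}_2)$ satisfy $t_1 \nprec s$ for all $t_1 \in \mathbb{T}_1$ and $t_2 \nprec s$ for all $t_2 \in \mathbb{T}_2$; this is equivalent to $s$ satisfying $t \nprec s$ for all $t \in \mathbb{T}_1 \cup \mathbb{T}_2$. (c) Consider $s \in V_n(\mathbb{T}_2)$. Assume that $s \notin V_n(\mathbb{T}_1)$; then there exists $t_1 \in \mathbb{T}_1$ such that $t_1 \prec s$. But there also exists a $t_2 \in \mathbb{T}_2$ such that $t_2 \prec t_1$, and thus $t_2 \prec s$, a contradiction. Hence $s \in V_n(\mathbb{T}_1)$.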
For a given $n$ and $\mathbb{T}$, we can find a taboo-set $\mathbb{T}' \neq \mathbb{T}$ such that $V_n(\mathbb{T}) = V_n(\mathbb{T}')$. In this sense, taboo-sets are not unique, as we illustrate in the following proposition.
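Proposition 4 For a string $t$ and $n \ge |t| + 1$, it holds that $V_n(\{t\}) = V_n\big((t \bullet \Sigma) \cup (\Sigma \bullet t)\big)$.
Proof ⊆: Any taboo in $\mathbb{T}_1 := (t \bullet \Sigma) \cup (\Sigma \bullet t)$ has $t \in \mathbb{T}_2 := \{t\}$ as substring, and thus Proposition 3.c implies $V_n(\mathbb{T}_2) \subseteq V_n(\mathbb{T}_1)$. ⊇: Assume that there exists $s \in V_n(\mathbb{T}_1)$ with $t \prec s$. Since $|s| = n$ and $n \ge |t| + 1$, the substring $t$ is either preceded or followed by some symbol $a \in \Sigma$, so that $at \prec s$ or $ta \prec s$. This contradicts $\{at, ta\} \subseteq \mathbb{T}_1$.
Proposition 4 implies that, for any $\mathbb{T}$, we can construct many taboo-sets $\mathbb{T}'$ such that $V_n(\mathbb{T}) = V_n(\mathbb{T}')$ as long as $n \ge \max(M, M')$, where $M$ and $M'$ denote the length of the longest taboo in $\mathbb{T}$ and $\mathbb{T}'$, respectively. Example 2 and Proposition 4 motivate the following definition.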
Definition 2 A taboo-set $\mathbb{T}$ is minimal if the following conditions hold:
Condition (a) is easy to justify: If string AA is a taboo, it is redundant that AAA be a taboo. Condition (b) avoids unnecessarily complicated taboo-sets. For example, using the four-nucleotide alphabet, taboo-set $\mathbb{T} = \{AAA, AAC, AAG, AAT, CAA, GAA, TAA\}$ can be minimized as $\mathbb{T}' = \{AA\}$. In general, one can minimize a taboo-set according to Example 2.
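The minimization example can be verified by brute force. The sketch below compares the taboo-free strings of the seven-element taboo-set with those of {AA}; the helper name and the range of n are arbitrary choices made for this illustration. The two vertex sets agree once n is at least the length of the longest taboo (here 3) and differ for n = 2, where only {AA} excludes the string AA.

```python
from itertools import product

def taboo_free_strings(n, alphabet, taboos):
    """Set of all strings of length n with no taboo as a substring."""
    return {s for s in map("".join, product(alphabet, repeat=n))
            if not any(t in s for t in taboos)}

SIGMA = "ACGT"
T_BIG = {"AAA", "AAC", "AAG", "AAT", "CAA", "GAA", "TAA"}
T_MIN = {"AA"}

same = {n: taboo_free_strings(n, SIGMA, T_BIG) == taboo_free_strings(n, SIGMA, T_MIN)
        for n in range(2, 7)}
for n, flag in same.items():
    print(n, flag)
```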
Since we want to study taboo-free strings of arbitrary lengths, we need conditions to concatenate taboo-free strings such that the concatenated sequence is taboo-free. The following result gives such a condition.
Proposition 5 Given taboo-set $\mathbb{T}$, consider three strings $s_1, s_2, s_3$ such that $s_1 s_2$ and $s_2 s_3$ are taboo-free and $|s_2| \ge M - 1$. Then $s := s_1 s_2 s_3$ is taboo-free.
is taboo-free and the result follows.
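A small exhaustive search illustrates why the length condition on the middle string matters in Proposition 5. The sketch below uses a hypothetical toy taboo-set {AAA, CAC} over the alphabet {A, C}, so that M = 3, and looks for triples that violate the statement; the function names are our own. With a middle string shorter than M - 1 the conclusion can fail: AA and AA are taboo-free, yet gluing A, A, A gives the taboo AAA.

```python
from itertools import product

def is_taboo_free(s, taboos):
    return not any(t in s for t in taboos)

def strings_up_to(alphabet, max_len):
    """All strings over the alphabet with length 0..max_len."""
    return ["".join(p) for k in range(max_len + 1)
            for p in product(alphabet, repeat=k)]

def find_counterexample(alphabet, taboos, max_len=3):
    """Search for s1, s2, s3 with |s2| >= M - 1 violating Proposition 5."""
    M = max(len(t) for t in taboos)
    pool = strings_up_to(alphabet, max_len)
    for s2 in pool:
        if len(s2) < M - 1:
            continue  # the proposition only covers middle strings of length >= M - 1
        for s1 in pool:
            if not is_taboo_free(s1 + s2, taboos):
                continue
            for s3 in pool:
                if is_taboo_free(s2 + s3, taboos) and not is_taboo_free(s1 + s2 + s3, taboos):
                    return s1, s2, s3
    return None

TOY = {"AAA", "CAC"}
print(find_counterexample("AC", TOY))                            # no violation expected
print(is_taboo_free("AA", TOY), is_taboo_free("AAA", TOY))       # short middle string fails
```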

Prefixes and suffixes of a taboo-free string
Given a taboo-free string s, the construction of set V s n (T) for n > |s| depends on which string w can be concatenated to the left side of s, such that ws ∈ V n (T). This motivates the following definition.
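Definition 3 Given a taboo-set $\mathbb{T}$, consider a taboo-free string $s$ and $k \in \mathbb{N}_0$. The $k$-prefixes of $s$ are the elements of the set $L_k(s)$, defined as $L_k(s) := \{w \in \Sigma^k : ws \in V_{k+|s|}(\mathbb{T})\}$. If $L_k(s) \neq \emptyset$, then we will say that $s$ is $k$-prefixable.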
By construction, given s ∈ V |s| (T), for any k ∈ N 0 it holds that Moreover, the following proposition shows that the k-prefixes of a string s induce a disjoint partition of the set V s n (T).
Proposition 6 Given a taboo-set $\mathbb{T}$ and a taboo-free string $s$, consider integers $k \in \mathbb{N}_0$ and $n \ge k + |s|$. It holds that $V^{s}_{n}(\mathbb{T}) = \bigsqcup_{w \in L_k(s)} V^{ws}_{n}(\mathbb{T})$. That is, the set $V^{s}_{n}(\mathbb{T})$ can be partitioned into the disjoint sets of taboo-free strings of length $n$ with suffix $ws$, where $w \in L_k(s)$.
Proof If s is not k-prefixable, then L k (s) = ∅ and V s n (T) = ∅, hence the equation holds. Otherwise, the inclusion ⊇ is clear, while the ⊆ follows from the fact that, for any string w ∈ Σ k preceding the suffix s, this w must necessarily belong to L k (s).
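The k-prefixes and the partition of Proposition 6 can be computed directly for small cases. The sketch below assumes that the k-prefixes of s are the strings w of length k such that ws is taboo-free, which is how we read Definition 3, and uses the taboo-set {AAA, CCA, TAA, GAA} from the introduction. The function names are hypothetical.

```python
from itertools import product

def is_taboo_free(s, taboos):
    return not any(t in s for t in taboos)

def k_prefixes(s, k, alphabet, taboos):
    """Strings w of length k such that w + s is taboo-free (assumed reading of L_k(s))."""
    return {w for w in map("".join, product(alphabet, repeat=k))
            if is_taboo_free(w + s, taboos)}

def suffix_vertices(n, suffix, alphabet, taboos):
    """Taboo-free strings of length n ending with the given suffix."""
    k = n - len(suffix)
    return {p + suffix for p in map("".join, product(alphabet, repeat=k))
            if is_taboo_free(p + suffix, taboos)}

SIGMA = "ACGT"
T = {"AAA", "CCA", "TAA", "GAA"}

print(sorted(k_prefixes("AA", 1, SIGMA, T)))
print(sorted(k_prefixes("CA", 1, SIGMA, T)))

# Partition of the strings of length 4 with suffix A by their 2-prefixes.
s, k, n = "A", 2, 4
blocks = [suffix_vertices(n, w + s, SIGMA, T) for w in sorted(k_prefixes(s, k, SIGMA, T))]
whole = suffix_vertices(n, s, SIGMA, T)
print(set().union(*blocks) == whole, sum(map(len, blocks)) == len(whole))
```

The second printed check (sizes add up) confirms that the blocks are pairwise disjoint, not only that they cover the whole set.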
Clearly, if a taboo-free string s is k * -prefixable, then it is also k-prefixable for any integer k < k * , while nothing can be said a priori about the case k > k * . Consequently, we need to find conditions under which one can concatenate at least one symbol to the left of a taboo-free string. We will first introduce such taboo-sets in Definition 4 and then characterize prefixability in Proposition 7.
Proposition 7 Consider a left proper taboo-set T and a taboo-free string s such that one of the following conditions holds: Proceeding analogously with (as) [1, M], we infer that s is 2-prefixable. Continuing with this process, we deduce that s is k-prefixable for any k ∈ N. If condition (b) holds, then we can take any string in V s M (T) and proceed as we did assuming (a).
We mainly study left proper taboo-sets due to Proposition 7, because the existence of arbitrary k-prefixes is necessary in many of our proofs. Analogous results for right proper taboo-sets are obtained by reversing the order of the symbols composing the string.
According to Proposition 7, if the length of a taboo-free string is at least $M$, then the taboo-free string can be prefixed for arbitrary lengths. Otherwise, one needs to check the $(M - |s|)$-prefixability of this string. To that end, the following result comes in handy.

Proposition 8 Consider a left proper taboo-set T and a taboo-free string s.
, then s is (M − |s|)-prefixable, and thus Proposition 7.b implies that s is k-prefixable for every k ∈ N.
Note that, since the assumptions of Proposition 8.a are the negation of the assumptions of Proposition 8.b, in Proposition 8 we have proved that $V^{s}_{n}(\mathbb{T}) = \emptyset$ for $n \ge M$ iff string $s$ satisfies $|s| \le M - 1$ and $s \notin \mathrm{suf}(V_M(\mathbb{T}))$. To study the connectivity of Hamming graphs $\Gamma^{s}_{n}(\mathbb{T})$, we need to know whether two different strings have a $k$-prefix in common. Thus, we introduce the following.
Definition 5 Given a taboo-set T, we say that two taboo-free strings s 1 and then we say that s 1 and s 2 are right k-synchronized.
In words, two taboo-free strings are left $k$-synchronized if they are $k$-prefixable by at least one common string $w$. Clearly, two taboo-free strings $s_1, s_2$ that are left $k^*$-synchronized are also left $k$-synchronized for any $k \le k^*$ (one simply has to "cut" the leftmost $k^* - k$ symbols of a string in $L_{k^*}(s_1) \cap L_{k^*}(s_2)$). The following proposition states when we can also guarantee $k$-synchronization for $k > k^*$:
Proposition 9 Consider a left proper taboo-set $\mathbb{T}$ and two taboo-free strings $s_1, s_2$, with length greater than zero, such that $s_1$ and $s_2$ are left $(M - 1)$-synchronized. Then $s_1$ and $s_2$ are left $k$-synchronized for any $k \in \mathbb{N}$.
Proof If $k \le M - 1$, then the assertion is true since $s_1$ and $s_2$ are left $(M - 1)$-synchronized.
For $k > M - 1$, consider a string $w \in L_{M-1}(s_1) \cap L_{M-1}(s_2)$. We know that $ws_1$ and $ws_2$ are taboo-free strings with length at least $M$. Since $\mathbb{T}$ is left proper, Proposition 7.a applied to $ws_1$ and $ws_2$ implies that $ws_1$ and $ws_2$ are $k'$-prefixable for any $k' \in \mathbb{N}$. Therefore $w$ is $k'$-prefixable for any $k' \in \mathbb{N}$. For any $k'$, take $x \in L_{k'}(w)$ and consider strings $xws_1$ and $xws_2$. The fact that $|w| = M - 1$, together with the fact that $xw$ and the pair $ws_1$, $ws_2$ are taboo-free, allows applying Proposition 5, hence $xws_1$ and $xws_2$ are also taboo-free.
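Left k-synchronization can be tested by searching for one common k-prefix. The sketch below assumes that two strings are left k-synchronized when some string w of length k makes both extensions taboo-free, in line with the verbal description above; the function name is hypothetical and the search is exhaustive.

```python
from itertools import product

def is_taboo_free(s, taboos):
    return not any(t in s for t in taboos)

def left_synchronized(s1, s2, k, alphabet, taboos):
    """True if some w of length k makes both w + s1 and w + s2 taboo-free."""
    return any(is_taboo_free(w + s1, taboos) and is_taboo_free(w + s2, taboos)
               for w in map("".join, product(alphabet, repeat=k)))

SIGMA = "ACGT"
T = {"AAA", "CCA", "TAA", "GAA"}

# The pair (AA, CA) from the introduction has no common left extension.
print(left_synchronized("AA", "CA", 1, SIGMA, T))
# The pair (AC, AG) can be extended by a common letter.
print(left_synchronized("AC", "AG", 1, SIGMA, T))
# With the single taboo AA, the pair (AC, CC) stays synchronized for longer prefixes.
print([left_synchronized("AC", "CC", k, SIGMA, {"AA"}) for k in range(1, 5)])
```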
The following proposition provides a Hamming-distance based criterion to quickly decide whether two taboo-free strings of length M are left k-synchronized.

Proposition 10 Consider a left proper taboo-set T. If all pairs s
Proof Given any left 1-synchronized pair $s_1, s_2$ with $d(s_1, s_2) = 1$, there exists an $a \in \Sigma$ such that $as_1$ and $as_2$ are taboo-free. Since $(as_i)[1, M] \in V_M(\mathbb{T})$ for $i \in \{1, 2\}$ and the Hamming distance between these two strings is at most 1, $as_1$ and $as_2$ are left 1-synchronized, hence there exists a symbol $b \in \Sigma$ such that $bas_1$ and $bas_2$ are taboo-free, i.e. $s_1$ and $s_2$ are left 2-synchronized. Continuing with this process, it follows that $s_1$ and $s_2$ are left $k$-synchronized.
We will now discuss conditions that allow increasing the string length of an entire set of taboo-free strings. To this end, consider two taboo-free strings $s_1$, $s$ and the set $V^{s_1 s}_{n+|s_1|+|s|}(\mathbb{T})$. It is generally not true that $V^{s_1 s}_{n+|s_1|+|s|}(\mathbb{T}) = V^{s_1}_{n+|s_1|}(\mathbb{T}) \bullet s$, because the concatenation of $s$ to a taboo-free string from $V^{s_1}_{n+|s_1|}(\mathbb{T})$ can create a taboo string around the junction of both strings. For the remainder of this section we will discuss when the equality holds.
Definition 6 For a taboo-set $\mathbb{T}$ and a taboo-free string $s$, we define the length of the longest taboo suffix-prefix match as $k_s := \max\{k \in \{0, \ldots, |s|\} : s[1, k] \in \mathrm{suf}(\mathbb{T})\}$, i.e. $k_s$ denotes the length of the longest prefix of $s$ being a proper suffix of a taboo.
Note that the length k s is well defined, because s[ Using this length k s , in Proposition 11 we give conditions implying that equality V s 1 s n+|s 1 |+|s| (T) = V s 1 n+|s 1 | (T) • s holds. Proposition 11 For a taboo-set T and a taboo-free string s, the following holds: Proof (a) The inclusion ⊆ is clear. The inclusion ⊇ follows from the fact that, if we are given r w ∈ V w n (T) such that ws ∈ V M−1+ j (T), since |w| = M − 1, Proposition 5 yields that the concatenated string r ws is taboo-free. (b) The result is obvious if |s| = 0 or n = 0, hence assume |s| > 0 and n > 0.
Clearly
Proof Strings $s_1$ and $s_2$ are left $k$-synchronized iff $L_k(s_1) \cap L_k(s_2) \neq \emptyset$. We just have to apply Corollary 12.
Thus, the string s[1, k s ], which is the longest prefix of s that matches a proper suffix of the taboos, provides all the information we need to construct V s n (T) or L k (s).
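The quantity k_s can be computed by scanning the prefixes of s against the proper suffixes of the taboos. The sketch below does this for the taboo-set {AAA, CCA, TAA, GAA} and then checks, for every taboo-free string of length 3, that the 1-prefixes of s coincide with those of s[1, k_s]. The function names are hypothetical, the empty prefix is taken to match the empty suffix, and the k-prefixes are read as the strings w for which ws is taboo-free.

```python
from itertools import product

def is_taboo_free(s, taboos):
    return not any(t in s for t in taboos)

def proper_suffixes(taboos):
    """All proper suffixes of the taboos, including the empty string."""
    return {t[i:] for t in taboos for i in range(1, len(t) + 1)}

def longest_match(s, taboos):
    """Length of the longest prefix of s that is a proper suffix of a taboo."""
    suf = proper_suffixes(taboos)
    return max(k for k in range(len(s) + 1) if s[:k] in suf)

def one_prefixes(s, alphabet, taboos):
    """Letters a such that a + s is taboo-free."""
    return {a for a in alphabet if is_taboo_free(a + s, taboos)}

SIGMA = "ACGT"
T = {"AAA", "CCA", "TAA", "GAA"}

for s in ["CAG", "ACG", "GTT", "AAC"]:
    print(s, longest_match(s, T))

# The prefix s[1, k_s] determines the 1-prefixes of s.
agree = all(one_prefixes(s, SIGMA, T) == one_prefixes(s[:longest_match(s, T)], SIGMA, T)
            for s in map("".join, product(SIGMA, repeat=3))
            if is_taboo_free(s, T))
print(agree)
```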

Isomorphisms between taboo-free Hamming graphs
Here we will discuss isomorphism between Hamming graphs. Let us first introduce the formal definition of a taboo-free Hamming graph. Examples of disconnected Hamming graphs are given in Figs. 1 and 2. When dealing with taboo-free Hamming graphs, the following proposition is a simple way to establish graph isomorphisms.

Proposition 14
Consider a taboo-set T, a taboo-free string s and a taboo-free string w satisfying ws ∈ V |w|+|s| (T). If V ws n+|s| (T) = V w n (T) • s for some n ≥ |w|, then Γ ws n+|s| (T) and Γ w n (T) are isomorphic.
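Proof By assumption, the vertex set of $\Gamma^{ws}_{n+|s|}(\mathbb{T})$ is $V^{w}_{n}(\mathbb{T}) \bullet s$, hence the map $f : V^{w}_{n}(\mathbb{T}) \to V^{ws}_{n+|s|}(\mathbb{T})$ given by $f(r) := rs$ is well defined and bijective. Moreover, $f$ is an edge-preserving bijection: Given any pair of strings $r_1, r_2 \in \Sigma^n$ and any string $s \in \Sigma^{|s|}$, then $d(r_1, r_2) = 1$ iff $d(r_1 s, r_2 s) = 1$.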
Proposition 15 does not describe in which cases V s n+|s| (T) = ∅. However, if T is left proper, Proposition 8 implies that this happens iff |s| ≤ M − 1 and s / ∈ suf(V M (T)). This suggests that we can state a version of Proposition 15 for left proper T. But first, due to our interest in taboo-free strings of length M, we introduce the following. For any s ∈ V 2 (T 1 ), we see k s > 0, hence e / ∈ lsc(T 1 ). Moreover, yielding lsc(T 1 ) = Σ 1 . If we consider Σ 2 := {A, C, G, T , C }, where C could represent a 5-methylcytosine, and T 2 := T 1 , then string s = C A satisfies k s = 0, hence lsc(T 2 ) = suf(T 2 ).
The following theorem classifies graphs Γ s n (T) for left proper T. ], which by definition belongs to suf(T). Since by assumption either |s| ≥ M or s ∈ suf(V M (T)), it follows from Proposition 7 that s is k-prefixable for any k, and thus also w := s[1, k s ] is k-prefixable. We consider x ∈ L M−k s (w), which satisfies xw ∈ V M (T). Therefore

$w = (xw)[M - k_s + 1, M] \in \mathrm{suf}(V_M(\mathbb{T}))$. All in all, $w \in \mathrm{suf}(V_M(\mathbb{T})) \cap \mathrm{suf}(\mathbb{T})$.
This w is trivially unique since k s is uniquely determined given s.

As for the case |s| ≥ M, the fact that s[1, M] ∈ V M (T) and the definition of lsc(T) implies that w ∈ lsc(T).
In formal terms, Theorem 16 states that the equivalence relation "being isomorphic" divides all graphs $\Gamma^{s}_{n+|s|}(\mathbb{T})$ into equivalence classes. The representative of each class is a graph $\Gamma^{w}_{n+|w|}(\mathbb{T})$, where $w \in \mathrm{suf}(V_M(\mathbb{T})) \cap \mathrm{suf}(\mathbb{T})$. When $|s| \ge M$, string $w$ belongs to $\mathrm{lsc}(\mathbb{T})$. This is why $\mathrm{lsc}(\mathbb{T})$ is called the long suffix classification.
To efficiently compute lsc(T), we recommend that T be minimal. Theorem 16 implies that and thus we define the short suffix classification as The set ssc(T) is called short suffix classification because only when |s| < M it can happen that a graph Γ s n+|s| (T) is represented by a graph Γ w n+|w| (T) with w ∈ ssc(T). Note that, if a string w satisfies the condition |w| < M − 1 and w . This property is used in the following example.

Connectivity of taboo-free Hamming graphs
We will make extensive use of the quotient graph to study the connectivity of taboo-free Hamming graphs. Before we start with the technicalities, we briefly describe our initial strategy. For a Hamming graph $\Gamma_{n+j}(\mathbb{T})$, let us consider two different subsets of its vertex set, namely $V^{s_b}_{n+j}(\mathbb{T})$ and $V^{s_c}_{n+j}(\mathbb{T})$, where $s_b, s_c \in V_j(\mathbb{T})$. These two subsets are disjoint, so we can use the quotient graph $Q[\Gamma_{n+j}(\mathbb{T})]$ to make each of them collapse into a single vertex, represented respectively by $s_b$ and $s_c$. We will prove in Proposition 17 that $s_b$ and $s_c$ are adjacent in $Q[\Gamma_{n+j}(\mathbb{T})]$ iff strings $s_b$ and $s_c$ have Hamming distance 1 and are left $n$-synchronized. This is especially interesting, because we know from Proposition 9 that two left $(M - 1)$-synchronized strings are left $n$-synchronized for any $n \in \mathbb{N}$. Thus, it is enough to know that $s_b, s_c$ are adjacent in $Q[\Gamma_{(M-1)+j}(\mathbb{T})]$ to claim that $s_b, s_c$ are adjacent in all quotient graphs $Q[\Gamma_{n+j}(\mathbb{T})]$ for $n \in \mathbb{N}$ (that is the essential content of Lemma 18). More formally, we have the following results.

Proposition 17
Given taboo-set T, j ∈ N 0 and n ∈ N 0 , consider graph Γ n+ j (T) and a subset S ⊆ V n+ j (T) partitioned as S = b∈J V s b n+ j (T), where s b are taboo-free The combination of Propositions 17 and 9 gives the following lemma.

If Q[Γ s |s|+k+M−1 (T)] is connected, then Q[Γ s n (T)] is connected for n ≥ |s| + k.
Proof For some n 0 ≥ |s| + k, consider an edge Regarding connectivity, given graphs G 1 and G 2 with the same vertex set V 1 = V 2 such that G 1 ⊆ G 2 , if subgraph G 1 is connected, then G 2 is connected. We are finally ready to study the connectivity of graphs Γ s n (T) for |s| ≥ M. Let us begin with the following lemma. Proof Proposition 2 states that, in a connected graph, every quotient graph is connected, and thus (b) implies (a) by considering n = 2M. Now we prove by induction that (a) implies (b). For n = M and w ∈ V M (T), we have that

Lemma 19 Given a left proper T, for any w ∈ V M (T) consider the set V w 2M (T) and partition
For the inductive step, assume that Γ w n (T) is connected for every w ∈ V M (T) and up to an integer n ≥ M. We will prove that also every Let us write w separating the first M − 1 symbols from the last one, that is w = rc for r ∈ Σ M−1 and c ∈ Σ. Then for any a ∈ L 1 (w), V aw n+1 (T) = V arc n+1 (T). Since |r | = M − 1, Proposition 11.a implies V arc n+1 (T) = V ar n (T) • c, while the isomorphism established in Proposition 14 yields Thus, every Γ aw n+1 (T) is connected, because the induction hypothesis implies that Γ ar n (T) is connected since ar ∈ V M (T). To prove that graph Γ w n+1 (T) is connected, it remains to apply Proposition 1, so we need to prove that the quotient graph induced by partition V w is connected. Applying Lemma 18 with s = w and k = 1, we get the following chain of inclusions: is connected, every quotient graph of the chain of inclusions is connected, as shown in Lemma 18. In particular, graph Q[Γ w n+1 (T)] is an element of the chain of inclusions because n + 1 ≥ M + 1, so it is connected, as desired.
Lemma 19 is very interesting: We wanted to characterize the connectivity of graphs  L 1 (w[1, k w ]). Moreover, for any w ∈ V M (T) and a, b ∈ L 1 (w), we claim that the following statements are equivalent: Indeed, the implication (i) ⇒ (ii) is obvious, so let us prove (ii) ⇐ (i). Given a taboofree string s ∈ V j (T) such that saw[1, k w ] and sbw[1, k w ] are taboo-free, we want to prove that also saw and sbw are taboo-free. But if that were not the case, it would be the consequence of either (saw) [c, d] ∈ T or (sbw) [c, d] ∈ T for some integers 1 ≤ c ≤ j < j + 1 + k w ≤ d ≤ j + 1 + M. However, that contradicts the maximality of k w , yielding ii) ⇐ i. Our previous claim and Proposition 17 imply that, if r = w[1, Theorem 16 implies that, for every w ∈ V M (T), there exists r = w[1, k w ] ∈ lsc(T). Applying Lemma 19, finally (d) ⇒ (b) follows.
It is worth noticing how much simpler the connectivity problem has become. Initially, we were studying whether every $\Gamma^{s}_{n}(\mathbb{T})$ with $|s| \ge M$ is connected, obtaining in Lemma 19 that this is equivalent to the connectivity of graphs $\Gamma^{w}_{2M}(\mathbb{T})$ for $w \in V_M(\mathbb{T})$, which are $|V_M(\mathbb{T})|$ graphs. Now we see, using Proposition 20 and the fact that $\mathrm{lsc}(\mathbb{T}) \subseteq \mathrm{suf}(\mathbb{T})$, that we only need to prove the connectivity of $|\mathrm{lsc}(\mathbb{T})| \le |\mathrm{suf}(\mathbb{T})| \le (M - 1)|\mathbb{T}| + 1$ graphs, namely either $Q[\Gamma^{r}_{M+|r|}(\mathbb{T})]$ or $\Gamma^{r}_{M+|r|}(\mathbb{T})$ for $r \in \mathrm{lsc}(\mathbb{T})$. We give an example.
implies that any $\Gamma^{w}_{n}(\mathbb{T})$ with $w \in \mathrm{suf}(\mathbb{T})$ is connected. Proposition 15 implies that, for any taboo-free string $s$ and $n \ge |s|$, $\Gamma^{s}_{n}(\mathbb{T})$ is connected.
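Proposition 20 characterizes the connectivity of every $\Gamma^{s}_{n+|s|}(\mathbb{T})$ for $|s| \ge M$. We know from Theorem 16 that there exists $r \in \mathrm{lsc}(\mathbb{T}) \subseteq \mathrm{suf}(V_M(\mathbb{T})) \cap \mathrm{suf}(\mathbb{T})$ such that $\Gamma^{s}_{n+|s|}(\mathbb{T}) \cong \Gamma^{r}_{n+|r|}(\mathbb{T})$. Since $\mathrm{ssc}(\mathbb{T}) := \big(\mathrm{suf}(V_M(\mathbb{T})) \cap \mathrm{suf}(\mathbb{T})\big) \setminus \mathrm{lsc}(\mathbb{T})$, to complete our characterization of the connectivity of every taboo-free Hamming graph, some cases (such as Example 6) require considering the connectivity of graphs $\Gamma^{p}_{n}(\mathbb{T})$ for $p \in \mathrm{ssc}(\mathbb{T})$. We have the following.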

Proposition 21 Given a left proper T and p ∈ ssc(T), assume that, for every r
satisfies that (wp)[1, k wp ] ∈ lsc(T) for each w ∈ L k ( p), and moreover Q[Γ p | p|+k+M−1 (T)] is connected, then Γ p n (T) is connected for n ≥ |p| + k.
w ∈ V M (T) and n ≥ 0. For any r ∈ lsc(T), there exists by construction a w ∈ V M (T) 1] ∈ Ψ (T) and itself is 0, and thus the assumption of the statement implies that s[1, c a m − 1] is left 1-synchronized with s[1, c a m − 1]. In other words, a symbol a ∈ Σ exists such that as[1, c a m − 1] is taboo-free, which is a contradiction. All in all, s must be 1-prefixable. Taking s ∈ V M (T) we see that T is left proper. (b) Given taboo-free strings s 1 , s 2 such that d(s 1 , s 2 ) = 1, assume that they are not 1-synchronized. Then for every a ∈ Σ, either (as 1 )[1, c a ] ∈ T or (as 2 )[1, c a ] ∈ T for some c a ≥ 2. Denote by C 1 ⊆ a∈Σ {c a } those c a such that (as 1 )[1, c a ] ∈ T, and analogously with C 2 . If C 1 were empty, then s 2 would not be 1-prefixable, contradicting (a). Thus, both C 1 and C 2 must be nonempty. Consider d 1 := max{c : c ∈ C 1 } and d 2 := max{c : c ∈ C 2 }. It holds that s 1 [2, d 1 ] ∈ Ψ (T) and s 2 [2, d 2 ] ∈ Ψ (T). Moreover, we have that the pair s 1 [2, d 1 ], s 2 [2, d 2 ] is not left 1-synchronized. Since d(s 1 , s 2 ) = 1, that contradicts the assumptions of the statement, hence s 1 and s 2 must be left 1-synchronized, as desired. (c) Clearly Γ s |s| (T) is connected, so let us proceed by induction. Assume Γ s n (T) is connected for a fixed n ≥ |s| and consider Γ s Otherwise we take different s 1 , s 2 ∈ V s n+1 (T); we will prove that they are connected. We know that s 1 , s 2 ∈ Σ • V s n (T), hence let us write s 1 = c 1 w 1 and s 2 = c 2 w 2 for c i ∈ Σ and w i ∈ V s n (T). If w 1 = w 2 , the result is obvious, so assume w 1 = w 2 . By hypothesis, Γ s n (T) is connected, and thus there exists a path of vertices of V s n (T), namely y 1 , . . . , y D , such that d(y i , y i+1 ) = 1, y 1 = w 1 and y D = w 2 . For every j ∈ [1, D − 1], the pair y j , y j+1 is left 1-synchronized, and thus there exists b j ∈ Σ such that b j y j and b j y j+1 are taboo-free. 
Since $d(b_j y_j, b_j y_{j+1}) = 1$, $b_j y_j$ and $b_j y_{j+1}$ are adjacent in $\Gamma^{s}_{n+1}(\mathbb{T})$. Moreover every pair of taboo-free strings contained in $\Sigma \bullet y_i$ is adjacent for $i \in [1, D - 1]$. Since the relation "being connected" is transitive, vertices $s_1 \in \Sigma \bullet y_1$ and $s_2 \in \Sigma \bullet y_D$ are connected, as desired.
Proof (a) Assume that taboo-free strings $s_1, s_2$ satisfy $L_1(s_1) \cap L_1(s_2) = \emptyset$. That is, for each $a \in \Sigma$, either $as_1$ or $as_2$ has a taboo as prefix, contradicting $|\mathbb{T}[1, 1]| < |\Sigma|$. Therefore every two taboo-free strings are left 1-synchronized, so we can apply Proposition 24.c, implying (a). (b) If $|\mathbb{T}| < |\Sigma|$, then $|\mathbb{T}[1, 1]| < |\Sigma|$. Thus, statement (a) yields the result.

Example 10
Corollary 25.b implies that, if $|\mathbb{T}| < |\Sigma|$, then every $\varGamma^{s}_{n}(\mathbb{T})$ is connected. In Examples 11 and 12, we give taboo-sets over an alphabet with $|\Sigma| = 2$ and $|\Sigma| > 2$ symbols, respectively, such that $|\mathbb{T}| = |\Sigma|$ and at least one suffix graph is disconnected. In this sense, the bound $|\mathbb{T}| < |\Sigma|$ that guarantees connectivity of every suffix graph cannot be improved.
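Example 12 For $m \ge 3$, $\Sigma = \{a_1, \ldots, a_m\}$ and the left proper taboo-set $\mathbb{T} := \{a_3 a_1, a_4 a_1, a_5 a_1, \ldots, a_m a_1\} \cup \{a_1 a_2, a_2 a_2\}$, we claim that $\varGamma^{a_1}_{n}(\mathbb{T})$ is disconnected for $n \ge 3$. Indeed, every string of $V^{a_1}_{n}(\mathbb{T})$ ends in $a_2 a_1 a_1$, $a_1 a_1 a_1$ or $a_i a_2 a_1$ with $i \in [3, m]$, so take $s \in V^{a_2 a_1 a_1}_{n}(\mathbb{T}) \cup V^{a_1 a_1 a_1}_{n}(\mathbb{T})$ and $r \in \bigcup_{i \in [3,m]} V^{a_i a_2 a_1}_{n}(\mathbb{T})$. It holds that $d(s, r) \ge 2$, hence $s$ and $r$ lie in different connected components of $\varGamma^{a_1}_{n}(\mathbb{T})$. This is consistent with $|\mathbb{T}[1, 1]| = |\Sigma| = m$.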
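The claim of Example 12 can be checked by brute force for small cases. The following is a minimal sketch, assuming $m = 3$ with the letters `a`, `b`, `c` standing for $a_1, a_2, a_3$, and assuming the taboo-set reads $\{a_3 a_1\} \cup \{a_1 a_2, a_2 a_2\}$ (which is what $|\mathbb{T}[1,1]| = m$ suggests). The function names are illustrative and do not come from the paper.

```python
from itertools import product

# Assumed encoding of Example 12 for m = 3: a -> a_1, b -> a_2, c -> a_3.
ALPHABET = "abc"
TABOOS = {"ca", "ab", "bb"}  # {a_3 a_1} together with {a_1 a_2, a_2 a_2}


def is_taboo_free(s, taboos):
    """True if no taboo occurs as a substring of s."""
    return not any(t in s for t in taboos)


def suffix_vertices(alphabet, taboos, suffix, n):
    """Taboo-free strings of length n that end with the given suffix."""
    free_len = n - len(suffix)
    candidates = ("".join(p) + suffix for p in product(alphabet, repeat=free_len))
    return [s for s in candidates if is_taboo_free(s, taboos)]


def hamming(u, v):
    """Hamming distance between two strings of equal length."""
    return sum(x != y for x, y in zip(u, v))


def count_components(vertices):
    """Number of connected components when edges join strings at distance 1."""
    seen = set()
    components = 0
    for start in vertices:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:  # depth-first search over Hamming neighbours
            u = stack.pop()
            for w in vertices:
                if w not in seen and hamming(u, w) == 1:
                    seen.add(w)
                    stack.append(w)
    return components


# For n = 3 the vertices are aaa, baa and cba, which form 2 components.
print(count_components(suffix_vertices(ALPHABET, TABOOS, "a", 3)))
```

The search is exponential in $n$, so it is only meant as a sanity check on tiny instances, not as a replacement for the characterization results.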
To generalize this example, for $i \in \mathbb{N}_0$, denote by $s_i := a_1 \overset{i)}{\cdots} a_1$ the concatenation of $i$ $a_1$'s. The taboo-set …

In this section, we have stated various results regarding the connectivity of every suffix Hamming graph given a left proper taboo-set $\mathbb{T}$. Up to Theorem 16, our aim was to characterize the connectivity of every suffix Hamming graph. Then we found sufficient conditions in Proposition 24 and Corollary 25 that are easier to apply. When studying this connectivity problem, the practitioner should first try the results with easy-to-check assumptions, and only then move to the more complicated ones. Given a taboo-set $\mathbb{T}$, a possible workflow would be the following:

(1) Check whether $|\mathbb{T}[1, 1]| < |\Sigma|$. If it holds, apply Corollary 25.a. Otherwise go to step (2).

(2) In order to apply Proposition 24, check whether every pair of taboo-free strings $w_1, w_2 \in \Psi(\mathbb{T})$ with $|w_1| \ge |w_2|$ and $d(w_1[1, |w_2|], w_2) \le 1$ is left 1-synchronized. If it does not hold, go to step (3).

(3) Check whether $\mathbb{T}$ is left proper (this holds in all the biological examples that we considered so far). Otherwise, redefine an equivalent left proper taboo-set. Then apply the characterization of Theorem 22.

Two possibilities can arise: either every suffix Hamming graph is connected, and thus evolution can explore the whole space of taboo-free strings; or some taboo-free strings belonging to $\mathrm{lsc}(\mathbb{T})$ or $\mathrm{ssc}(\mathbb{T})$ induce disconnected suffix graphs $\varGamma^{s}_{n_0}(\mathbb{T})$ for some $n_0 \ge |s| + M$, implying that $\varGamma^{s}_{n}(\mathbb{T})$ stays disconnected for $n \ge n_0$.
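Step (1) of the workflow is a simple counting test. The sketch below shows one way to code it, under the assumption that $\mathbb{T}[1,1]$ denotes the set of first symbols of the taboos. The helper names are hypothetical, and a `False` result only means that the step is inconclusive and the later steps are needed.

```python
def first_symbols(taboos):
    """Set of first symbols of the taboos (assumed meaning of T[1, 1])."""
    return {t[0] for t in taboos if t}


def step_one_applies(taboos, alphabet):
    """True if |T[1, 1]| < |Sigma|, the easy-to-check condition of step (1)."""
    return len(first_symbols(taboos)) < len(set(alphabet))


# A taboo-set whose taboos all start with the same symbol passes step (1).
print(step_one_applies({"GATC", "GGACC", "GGTCC"}, "ACGT"))
# A taboo-set whose first symbols cover the alphabet does not; go to step (2).
print(step_one_applies({"ca", "ab", "bb"}, "abc"))
```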

Examples of plausible bacterial taboo-sets
Taboo-sets generated by the avoidance of restriction sites can assume various levels of complexity. In this section, we discuss some examples from REBASE (Roberts et al. 2014) using the theory developed in this work. Note that many restriction enzymes in the REBASE database have an unknown recognition site, hence our taboo-sets may underestimate the actual number of taboos. Before describing the examples, we briefly review essential nomenclature for DNA sequences.
DNA is double-stranded, where A pairs with T and G pairs with C, hence it suffices to discuss only one of the strands. We adopt the convention that, given any of the strands, the DNA sequence is always represented from the 5' end to the 3' end (which is chemically determined). As a consequence, given a DNA sequence, its complementary DNA sequence, the one lying on the opposite strand, is obtained by inverting the order of the symbols and carrying through substitutions A ↔ T and C ↔ G. If a DNA sequence s is identical to its complementary DNA sequence, we say that s is an inverted repeat (Ussery et al. 2008). For example, sequence CCGG is an inverted repeat.
The fact that DNA is double-stranded implies that each recognition site induces taboos in pairs, namely itself and its complementary DNA sequence. For example, if AGGGC is a recognition site, then the complementary sequence GCCCT is also a taboo. If, however, the recognition site is an inverted repeat such as TGCA, then this pair is actually one single recognition site. Recognition sites of type II R-M systems are nearly always inverted repeats (Rusinov et al. 2015; Gelfand and Koonin 1997), and therefore one recognition site induces one single taboo. This is especially interesting because, according to Rusinov et al. (2015, 2018a), only type II R-M systems induce taboos.

A permutation of the symbols of the alphabet $\Sigma$ does not alter any of the results that we proved in this work. Moreover, by reversing the order of the symbols, any statement regarding e.g. left-properness and suffixes has an analogous one in which right-properness and prefixes are involved. On the other hand, taboo-sets induced by restriction enzymes remain invariant when we exchange every recognition site for its complementary sequence. Therefore, for a bacterial taboo-set $\mathbb{T}$, if we prove that every suffix graph $\varGamma^{s}_{n}(\mathbb{T})$ is connected, then every prefix graph ${}^{s}\varGamma_{n}(\mathbb{T})$ is connected as well.
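The pairing of recognition sites with their complementary sequences described above can be sketched in a few lines. This is an illustrative helper, assuming recognition sites are given as plain strings over A, C, G, T without degenerate symbols. The function names are not from the paper or from REBASE.

```python
# Translation table for the substitutions A <-> T and C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complementary_sequence(site):
    """Sequence on the opposite strand, read 5' to 3' (reverse complement)."""
    return site.translate(COMPLEMENT)[::-1]


def is_inverted_repeat(site):
    """True if the site equals its own complementary sequence."""
    return site == complementary_sequence(site)


def induced_taboos(sites):
    """Taboo-set induced by recognition sites: each site plus its complement."""
    taboos = set()
    for site in sites:
        taboos.add(site)
        taboos.add(complementary_sequence(site))
    return taboos


print(complementary_sequence("AGGGC"))   # GCCCT
print(is_inverted_repeat("TGCA"))        # True
print(sorted(induced_taboos(["AGGGC", "TGCA"])))
```

An inverted repeat contributes a single taboo because the set absorbs the duplicate, which mirrors the remark that one such recognition site induces one taboo.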

A frequent case: Turneriella parva
The Turneriella parva strain (REBASE organism number 8970) produces a restriction enzyme with recognition site GATC, an inverted repeat. Similarly, another of its enzymes has recognition sites GGACC and GGTCC. Thus, these restriction enzymes generate the taboo-set $\mathbb{T}_{T.pa} = \{GATC, GGACC, GGTCC\}$. Since $|\mathbb{T}_{T.pa}[1, 1]| < 4$, Corollary 25.a implies that every graph $\varGamma^{s}_{n}(\mathbb{T}_{T.pa})$ is connected. Therefore the evolution of a DNA sequence can potentially reach any other taboo-free DNA sequence, no matter which suffix was conserved along this process.
Among the 3623 bacteria in REBASE (2020a), only 465 have more than three type II restriction enzymes. Assuming that only type II restriction enzymes induce taboos, as stated by Rusinov et al. (2015, 2018a), Corollary 25.b implies that at least 87% (3158/3623) of bacterial taboo-sets in REBASE (2020a) yield connected taboo-free Hamming graphs. Similarly, at least 90% (139/153) of archaea in REBASE (2020b) induce connected taboo-free Hamming graphs, because they have fewer than four type II restriction enzymes. The following example describes a more complex collection of restriction enzymes.
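The counting behind these percentages can be written out explicitly. The short derivation below assumes, as the text does, that each type II enzyme contributes a single inverted-repeat taboo and that $\Sigma = \{A, C, G, T\}$:

```latex
\begin{align*}
\#\{\text{bacteria with at most three type II enzymes}\} &= 3623 - 465 = 3158, \\
|\mathbb{T}| \le 3 < 4 = |\Sigma| &\;\Longrightarrow\; \text{Corollary 25.b applies}, \\
\frac{3158}{3623} \approx 0.872, &\qquad \frac{139}{153} \approx 0.908 .
\end{align*}
```

The values are lower bounds because organisms with four or more enzymes may still have connected graphs; they are simply not covered by this counting argument.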

Helicobacter pylori
In H. pylori 21-A-EK1, studied by Ailloud et al. (2019), many restriction enzymes have been identified. For the sake of clarity, let us write $\mathbb{T}_{H.py} = {}_{A}\mathbb{T} \cup {}_{G}\mathbb{T} \cup {}_{C}\mathbb{T} \cup {}_{T}\mathbb{T}$, where ${}_{a}\mathbb{T}$ denotes those taboos in $\mathbb{T}_{H.py}$ whose first symbol is $a \in \Sigma$. Then we have … where $GT \bullet \Sigma^{2} \bullet AC$ represents taboos of the type $GTabAC$ with $a, b \in \Sigma$, and so on for analogous notations.

An imaginary bacterium
The taboo-set can significantly influence evolution in the cases where some $\varGamma^{s}_{n}(\mathbb{T})$ is disconnected. To explain this, we will create a plausible but nonexistent example. Suppose that a strain of Bacterium imaginara has taboo-set … where the second set contains the complementary DNA sequences of the first set, except that of GGCC, which is an inverted repeat. Thus, taboo-set $\mathbb{T}_{B.im}$ is induced by 4 restriction enzymes. At first glance, taboo-set $\mathbb{T}_{B.im}$ seems less restrictive than $\mathbb{T}_{H.py}$, which has 6 taboos of length four and 22 taboos of length five or more.
Proposition 24 cannot be applied because CCC and GCC are not left 1-synchronized, and in fact we can find a disconnected suffix graph. Let us take $V^{CCC}_{n}(\mathbb{T}_{B.im})$, which satisfies … This produces the following evolutionary implications: Assume that we have two correctly aligned DNA fragments $f_\alpha$ and $f_\beta$ of the genome of Bacterium imaginara. Assume moreover that we can write $f_\alpha = r_\alpha GCCC$ and $f_\beta = r_\beta CCCC$ for some strings $r_\alpha$ and $r_\beta$, and that the suffix CCC is invariable due to functional constraints. Then $f_\alpha$ cannot have evolved from $f_\beta$ by simple point mutations, because at some point in evolution a taboo string is produced that is lethal for the carrier. Thus, the standard models of sequence evolution (Strimmer and von Haeseler 2009) do not apply.

Concluding remarks
Using the results proven in this work, it is possible to decide whether every Hamming graph $\varGamma^{s}_{n}(\mathbb{T})$ is connected. The connectivity of the taboo-free Hamming graphs induced by the restriction enzymes of the bacteria listed in REBASE could be quickly analysed with our tools. Unfortunately, for many organisms listed in REBASE, the recognition sites of restriction enzymes are not available.
Based on the current version of REBASE (2020a), we conclude using Corollary 25 that taboo-sets of at least 87% (3158/3623) of bacteria in REBASE induce connected taboo-free Hamming graphs, because they have fewer than four type II restriction enzymes. For larger taboo-sets, Proposition 24 can be used, as we did in Sect. 9.2, or one can directly use the characterization of Theorem 22. Thus, restriction enzymes in bacteria generally do not lead to any disconnected taboo-free Hamming graph, and our models of sequence evolution are by and large applicable. However, the influence of some missing sequences in the Hamming graph on the estimation of evolutionary parameters deserves further investigation. We would also like to emphasize that many recognition sites remain to be identified, and thus it may well be possible that disconnected taboo-free Hamming graphs are found in the near future.
We consider the formal framework developed in this paper as a first and necessary step to understand the effect of restriction enzymes (and possibly other taboo sequences) on the DNA composition of bacteria and viruses, or more generally on the sequence space modelled as a Hamming graph. Consider, for example, the phylogenetic studies by Ailloud et al. (2019), from which the H. pylori taboo-set $\mathbb{T}_{H.py}$ of Sect. 9.2 was taken. The following natural questions arise: How are inferred evolutionary times between the two H. pylori populations affected by $\mathbb{T}_{H.py}$? Has their GC content varied due to the taboos of restriction enzymes?
To answer such questions, we need to develop models of sequence evolution that take taboos into account. Taboo avoidance induces complex dependencies along a DNA sequence, which can be measured using Markov chain Monte Carlo (MCMC) simulations. If all taboo-free Hamming graphs $\varGamma^{s}_{n}(\mathbb{T})$ are connected, then MCMC methods are easy to apply (Manuel et al. unpublished). A disconnected taboo-free Hamming graph, however, leads to a reducible Markov chain, which complicates simulation of taboo-free evolution.
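To illustrate why connectivity matters for simulation, the following is a minimal sketch of a random walk on a taboo-free Hamming graph. It is a generic Metropolis-style walk with uniform single-site proposals, not the method of Manuel et al., and the function names, the seed and the choice of the T. parva sites as taboo-set are assumptions made for the example.

```python
import random


def is_taboo_free(s, taboos):
    """True if no taboo occurs as a substring of s."""
    return not any(t in s for t in taboos)


def taboo_free_walk(start, taboos, steps, alphabet="ACGT", seed=0):
    """Random walk by point mutations that never enters a taboo-containing string.

    Each step proposes one substitution at a uniformly chosen position and
    rejects it (the walk stays in place) if the proposal contains a taboo.
    """
    if not is_taboo_free(start, taboos):
        raise ValueError("start must be taboo-free")
    rng = random.Random(seed)
    current = start
    path = [current]
    for _ in range(steps):
        pos = rng.randrange(len(current))
        new_symbol = rng.choice([a for a in alphabet if a != current[pos]])
        proposal = current[:pos] + new_symbol + current[pos + 1:]
        if is_taboo_free(proposal, taboos):
            current = proposal
        path.append(current)
    return path


# Taboo-set built from the T. parva recognition sites named in the text.
path = taboo_free_walk("A" * 12, {"GATC", "GGACC", "GGTCC"}, steps=500)
print(path[-1])
```

Because the proposal is symmetric, such a walk targets the uniform distribution on the connected component that contains the start string. If the graph is disconnected, the walk never leaves that component, which is the reducibility problem mentioned above.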
Another application of our framework is the construction of combinations of restriction enzymes that lead to a disconnected Hamming graph, and thus limit evolutionary freedom. This may help to efficiently treat viral infections. Some progress has been made in the usage of restriction enzymes for the treatment of viral infections (Weber et al. 2014). Since one or just a few SNPs can significantly alter the symptoms or even the mortality associated with a pathogen (Collery et al. 2017; Yuan et al. 2017), our characterization of the connectivity of taboo-free Hamming graphs could help to delete SNPs from the viral genome that are detrimental to humans. Although the treatment of an infection using restriction enzymes is mostly unexplored, this work could be a first theoretical guide to a successful treatment.