Appendix
Proof of Theorem 4.7
Claim (1): Let \(l_{max}(X)=4\) and assume that \(l_1 \rightarrow d_1 \rightarrow l_2 \rightarrow d_2 \rightarrow l_3\) is a longest path in \(\mathcal {G}(X)\). Since the path is maximal, there is no trinucleotide of the form \(dl_1\) and no trinucleotide of the form \(l_3d\) in X. It follows that \(c(l_3)=l_1\) and \(d_1, d_2 \in \{l_2,c(l_2) \}^2\). Note that all the nucleotides \(l_1,l_2,l_3\) must be pairwise different by circularity. Thus, we have 4 possibilities for \(d_1, d_2\), namely \(l_2l_2\), \(l_2c(l_2)\), \(c(l_2)l_2\) and \(c(l_2)c(l_2)\). As \(l_2l_2l_2 \not \in X\) by circularity, we have the following options for the 2 trinucleotides \(d_1l_2 \in X\) and \(l_2d_2 \in X\):
$$\begin{aligned}&d_1l_2: \quad \quad l_2c(l_2)l_2 \quad c(l_2)l_2l_2 \quad c(l_2)c(l_2)l_2; \\& l_2d_2: \quad \quad l_2l_2c(l_2) \quad l_2c(l_2)l_2 \quad l_2c(l_2)c(l_2). \end{aligned}$$
If \(d_1l_2\) or \(l_2d_2\) is equal to \(l_2c(l_2)l_2\) then self-complementarity yields \(c(l_2)l_2c(l_2) \in X\) and the word \(c(l_2)l_2c(l_2)l_2c(l_2)l_2\) contradicts circularity. The combinations \(c(l_2)l_2l_2\), \(l_2l_2c(l_2)\) and \(c(l_2)c(l_2)l_2\), \(l_2c(l_2)c(l_2)\) are excluded, since in each of them the 2 trinucleotides are circular permutations of each other. Hence only 2 combinations remain: \(c(l_2)l_2l_2\), \(l_2c(l_2)c(l_2)\) and \(c(l_2)c(l_2)l_2\), \(l_2l_2c(l_2)\). Here too, self-complementarity yields a contradiction to circularity since, for example, the complementary trinucleotide of \(c(l_2)c(l_2)l_2\) is in the same equivalence class as \(l_2l_2c(l_2)\).
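The last step of Claim (1) can be checked mechanically for each choice of \(l_2\). The following is a minimal sketch in Python, assuming that the complementary trinucleotide is the reversed complement (as in the path of Claim (2)) and that an equivalence class consists of the circular permutations of a trinucleotide. The helper names `reverse_complement`, `same_class` and `remaining_pairs_fail` are introduced here for illustration only.

```python
# Complement map on the nucleotide alphabet B = {A, C, G, T}.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(word):
    """Return the reversed complemented word, e.g. ATT -> AAT."""
    return "".join(COMPLEMENT[b] for b in reversed(word))


def same_class(t1, t2):
    """True if t2 is a circular permutation of t1."""
    return len(t1) == len(t2) and t2 in (t1 + t1)


def remaining_pairs_fail(l2):
    """Check the 2 remaining combinations of Claim (1) for a given l2."""
    c = COMPLEMENT[l2]
    pairs = [
        (c + l2 + l2, l2 + c + c),  # c(l2)l2l2 and l2c(l2)c(l2)
        (c + c + l2, l2 + l2 + c),  # c(l2)c(l2)l2 and l2l2c(l2)
    ]
    for first, second in pairs:
        rc = reverse_complement(first)
        # A self-complementary X containing `first` also contains `rc`.
        # If `rc` differs from `second` but lies in its class, X holds
        # 2 trinucleotides of one class and cannot be circular.
        if rc == second or not same_class(rc, second):
            return False
    return True


if __name__ == "__main__":
    for l2 in "ACGT":
        print(l2, remaining_pairs_fail(l2))
```

The check only covers the two remaining combinations. The cases involving \(l_2c(l_2)l_2\) are handled by the explicit word given in the proof.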
Claim (2): Let \(l_{max}(X)=6\) and assume that \(d_1 \rightarrow l_1 \rightarrow d_2 \rightarrow l_2 \rightarrow d_3 \rightarrow l_3 \rightarrow d_4 \) is a longest path in \(\mathcal {G}(X)\). By self-complementarity, there is the reversed complemented path
$$\begin{aligned} \overleftarrow{{c(d_4)}} \rightarrow c(l_3) \rightarrow \overleftarrow{{c(d_{3})}} \rightarrow c(l_2) \rightarrow \overleftarrow{{c(d_2)}} \rightarrow c(l_1) \rightarrow \overleftarrow{{c(d_1)}}. \end{aligned}$$
Now, the middle nucleotides \(l_2\) and \(c(l_2)\) of the 2 paths are either the pair A and T, or C and G. Therefore, it suffices to show that there are paths \(A \rightarrow d \rightarrow T\) or \(T \rightarrow d \rightarrow A\) and \(C \rightarrow d \rightarrow G\) or \(G \rightarrow d \rightarrow C\) in \(\mathcal {G}(X)\); since then, we will obtain a path of length 8 combining the 2 paths, e.g.,
$$\begin{aligned} d_1 \rightarrow l_1 \rightarrow d_2 \rightarrow l_2 \rightarrow d \rightarrow c(l_2) \rightarrow \overleftarrow{{c(d_2)}} \rightarrow c(l_1) \rightarrow \overleftarrow{{c(d_1)}} \end{aligned}$$
contradicting \(l_{max}(X)=6\). Now, by maximality, the code X must contain exactly one trinucleotide of the class \(\{ ATT, TTA, TAT\}\) and its complementary trinucleotide as well as exactly one trinucleotide from the class \(\{ GCC, CCG, CGC\}\) and its complementary trinucleotide. It is easy to verify that in each case we obtain a path of the form \(A \rightarrow d \rightarrow T\) or \(T \rightarrow d \rightarrow A\), and a path of the form \(C \rightarrow d \rightarrow G\) or \(G \rightarrow d \rightarrow C\), e.g., if \(ATT \in X\) then also \(AAT \in X\) and we get the path \(A \rightarrow AT \rightarrow T\) in \(\mathcal {G}(X)\).
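The verification of the six cases can be sketched in Python. The sketch assumes the usual construction of \(\mathcal {G}(X)\), in which every trinucleotide \(b_1b_2b_3 \in X\) contributes the arrows \(b_1 \rightarrow b_2b_3\) and \(b_1b_2 \rightarrow b_3\), which is consistent with the path \(A \rightarrow AT \rightarrow T\) above. The function names are hypothetical and only the chosen trinucleotide and its reversed complement are put into the graph.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(word):
    """Return the reversed complemented word, e.g. ATT -> AAT."""
    return "".join(COMPLEMENT[b] for b in reversed(word))


def edges(code):
    """Arrows b1 -> b2b3 and b1b2 -> b3 for every b1b2b3 in the code."""
    result = set()
    for t in code:
        result.add((t[0], t[1:]))
        result.add((t[:2], t[2]))
    return result


def bridges(code, x, y):
    """All dinucleotides d such that x -> d -> y is a path in the graph."""
    e = edges(code)
    return sorted(d for (s, d) in e if s == x and (d, y) in e)


def has_bridge(t):
    """Check the claim for t together with its reversed complement."""
    code = {t, reverse_complement(t)}
    x, y = sorted(set(t))  # the two complementary letters occurring in t
    return bool(bridges(code, x, y) or bridges(code, y, x))


if __name__ == "__main__":
    for t in ["ATT", "TTA", "TAT", "GCC", "CCG", "CGC"]:
        print(t, reverse_complement(t), has_bridge(t))
```

Since arrows are only added when X grows, a bridge found in the two-element subcode is also present in \(\mathcal {G}(X)\).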
Claim (3): Let \(l_{max}(X)=8\) and assume that \(l_1 \rightarrow d_1 \rightarrow l_2 \rightarrow d_2 \rightarrow l_3 \rightarrow d_3 \rightarrow l_4 \rightarrow d_4 \rightarrow l_5\) is a longest path in \(\mathcal {G}(X)\). Since there are only 4 nucleotides, 2 out of the 5 nucleotides \(l_1,l_2,l_3,l_4,l_5\) must be equal, which yields a cycle in \(\mathcal {G}(X)\), contradicting the circularity of X. \(\square \)
Proof of Theorem 5.11
Let \(X\subseteq \mathcal {B}^3\) be a maximal self-complementary circular code and \(\mathcal {G}(X)\) its associated graph. Since X is circular, \(\mathcal {G}(X)\) is acyclic, so it has a path \(p=p_{max}(X)\) of maximal length l(p).
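The existence of a longest path in an acyclic \(\mathcal {G}(X)\) can be made concrete. The following Python sketch assumes that every trinucleotide \(b_1b_2b_3 \in X\) contributes the arrows \(b_1 \rightarrow b_2b_3\) and \(b_1b_2 \rightarrow b_3\). The names `build_graph` and `longest_path_length` are illustrative, and the two small codes in the usage lines are toy examples, not maximal codes.

```python
def build_graph(code):
    """Adjacency sets of the graph associated with a trinucleotide code."""
    graph = {}
    for t in code:
        graph.setdefault(t[0], set()).add(t[1:])
        graph.setdefault(t[:2], set()).add(t[2])
    return graph


def longest_path_length(code):
    """Number of arrows of a longest path, or None if the graph has a cycle."""
    graph = build_graph(code)
    state = {}  # vertex -> "open" while on the DFS stack, else best length

    def visit(v):
        if state.get(v) == "open":
            raise ValueError("cycle detected")
        if v in state:
            return state[v]
        state[v] = "open"
        best = 0
        for w in graph.get(v, ()):
            best = max(best, 1 + visit(w))
        state[v] = best
        return best

    try:
        return max((visit(v) for v in graph), default=0)
    except ValueError:
        return None


if __name__ == "__main__":
    print(longest_path_length({"AAT", "ATT"}))  # acyclic toy code
    print(longest_path_length({"ATA", "TAT"}))  # contains a cycle
```

Returning `None` for a cyclic graph keeps the two situations apart: a number means the code passes the acyclicity test, `None` means it cannot be circular.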
Claim (1): Assume that \(p=d_1 \rightarrow b_1 \rightarrow \cdots \rightarrow b_k\), then any concatenation \(d_ib_i \in X\). Choose any trinucleotide \(c=s_1s_2s_3 \in X\). Then \((d_1b_1)\cdots (d_kb_k) (s_1s_2s_3)\in X^{k+1}\) and hence \((d_1b_1)\cdots (d_kb_k) s_1\) is a possible X-frame (for itself) with \(t_b=\epsilon \) and \(t_e=s_1\). Moreover, each concatenation \(b_id_{i+1}\) is also a trinucleotide in X, so \(d_1(b_1d_2)\cdots (b_{k-1}d_k)b_ks_1\) is a second possible X-frame with \(t_b=d_1\) and \(t_e=b_ks_1\). Thus, \(n_X \ge l_w(p)+2\) since the sequence \(d_1b_1 \cdots d_kb_ks_1\) has length \(l_w(p)+1\).
Now assume that \(b_1 \cdots b_k\) is a sequence of nucleotides with \(k \ge l_w(p)+2\) that has 2 different possible X-frames. We have to derive a contradiction to conclude that \(n_X=l_w(p)+2\). Assume that \(t_b u_1 \cdots u_l t_e\) and \(t_b' u_1'\cdots u_m't_e'\) with \(u_i, u_i' \in X\) and \(t_b,t_e, t_b',t_e' \in \left( \{ \epsilon \} \cup \mathcal {B}\cup \mathcal {B}^2 \right) \) are the 2 different possible X-frames. Obviously, \(\mid t_bt_e \mid \le 4\). If \(\mid t_bt_e \mid =4\) then by the difference of the 2 possible X-frames, we conclude that at least one of \(t_b'\) or \(t_e'\) has to have length \(\ge 3\), a contradiction to the definition of possible X-frame, or \(\mid t_b't_e' \mid \le 3\). Hence, w.l.o.g. we assume that \(\mid t_bt_e \mid \le 3\). Consequently, \(\mid u_1 \cdots u_l \mid \ge k-3 \ge l_w(p)+2-3=l_w(p)-1\) and hence \(\mid u_1 \cdots u_l \mid \ge l_w(p)\), since both \(\mid u_1 \cdots u_l \mid \) and \(l_w(p)\) are multiples of 3. We now have to distinguish cases:
(a) If \(\mid t_bt_e \mid \le 1\) then we even get \(\mid u_1 \cdots u_l \mid \ge k-1 \ge l_w(p)+2-1=l_w(p)+1\). Thus, the path associated with the 2 possible X-frames has word-length at least \(l_w(p)+1\), a contradiction to the maximality of \(l_w(p)\). In this case, the sequence \(u_1 \cdots u_l\) could contain the sequence \(u_1' \cdots u_m'\) as a subsequence.
(b) If \(\mid t_bt_e \mid \ge 2\) then the second possible X-frame is at least shifted by one with respect to the first possible X-frame, i.e., it must extend the sequence \(u_1 \cdots u_l\) to the left or to the right. In this case, the sequence \(u_1 \cdots u_l\) cannot contain the sequence \(u_1' \cdots u_m'\) as a subsequence. The path associated with the 2 possible X-frames has word-length at least \(\mid u_1 \cdots u_l \mid +1 \ge l_w(p)+1\), again a contradiction to the maximality of \(l_w(p)\).
Thus, \(n_X=l_w(p)+2\).
The case \(p= b_1 \rightarrow d_1 \rightarrow \cdots \rightarrow d_k\) is symmetric and can be similarly dealt with.
Claim (2): Assume that \(p=d_1 \rightarrow b_1 \rightarrow \cdots \rightarrow d_k\), then any concatenation \(d_ib_i \in X\). As in Claim (1), \((d_1b_1)\cdots (d_{k-1}b_{k-1})d_k\) is a possible X-frame (for itself) with \(t_b=\epsilon \) and \(t_e=d_k\). Moreover, each concatenation \(b_id_{i+1}\) is a trinucleotide in X, so \(d_1(b_1d_2)\cdots (b_{k-2}d_{k-1})(b_{k-1}d_k)\) is a second possible X-frame with \(t_b=d_1\) and \(t_e=\epsilon \). Thus, \(n_X \ge l_w(p)+1\) since the sequence \(d_1b_1 \cdots d_{k-1}b_{k-1}d_k\) has length \(l_w(p)\).
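As a worked instance of this construction (an illustration for the smallest case, not an additional assumption), take \(k=2\), i.e., \(p=d_1 \rightarrow b_1 \rightarrow d_2\) with \(l_w(p)=5\). The 2 possible X-frames of the sequence \(d_1b_1d_2\) then read
$$\begin{aligned}&(d_1b_1)\, d_2: \quad \quad t_b=\epsilon , \quad t_e=d_2; \\& d_1\, (b_1d_2): \quad \quad t_b=d_1, \quad t_e=\epsilon . \end{aligned}$$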
Now assume that \(b_1 \cdots b_k\) is a sequence of nucleotides and assume that \(k \ge l_w(p)+1\) but \(b_1 \cdots b_k\) has 2 different possible X-frames: \(t_b u_1 \cdots u_l t_e\) and \(t_b' u_1'\cdots u_m't_e'\) with \(u_i, u_i' \in X\) and \(t_b,t_e, t_b',t_e' \in \left( \{ \epsilon \} \cup \mathcal {B}\cup \mathcal {B}^2 \right) \). As in Claim (1), we assume w.l.o.g. that \(\mid t_bt_e \mid \le 3\). We distinguish cases:
(a) If \(\mid t_bt_e \mid =0\) then \(\mid u_1 \cdots u_l \mid \ge l_w(p)+1\) and \(u'_1 \cdots u'_m\) is a subsequence of \(u_1 \cdots u_l\). Thus, the path associated with the 2 possible X-frames has word-length \(l_w(p)+1\) with the associated word \( u_1 \cdots u_l\), a contradiction to the maximality of \(l_w(p)\).
(b) If \(\mid t_bt_e \mid =1\) then \(\mid u_1 \cdots u_l \mid \ge l_w(p)\). If the second possible X-frame is shifted by one with respect to the first one, then the path associated with the 2 possible X-frames has word-length \(l_w(p)+1\), again a contradiction to the maximality of \(l_w(p)\). If the second possible X-frame is shifted by two, then the path associated with the 2 possible X-frames has word-length \(l_w(p)\). However, in this case, the path starts with a dinucleotide and ends with a nucleotide, a contradiction to the structure of maximal paths which have to start and end with a dinucleotide.
(c) If \(\mid t_bt_e \mid =2\) then \(\mid u_1 \cdots u_l \mid \ge l_w(p)-1\). Again, we have to distinguish cases:
(i) \(\mid t_b \mid =2\) and \(\mid t_e \mid =0\). Then the associated path to the 2 possible X-frames has word-length \(l_w(p)\) and starts with a nucleotide but ends with a dinucleotide, a contradiction to the structure of maximal paths, or has word-length \(l_w(p)+1\), a contradiction to the maximality of \(l_w(p)\).
(ii) \(\mid t_b \mid =0\) and \(\mid t_e \mid =2\), as (i).
(iii) \(\mid t_b \mid =1\) and \(\mid t_e \mid =1\). As above, if the second possible X-frame is shifted by one, then the path associated with the 2 possible X-frames has word-length \(l_w(p)\), again starting with a nucleotide (\(u_1\)) and ending with a dinucleotide, a contradiction to the structure of maximal paths. If the second possible X-frame is shifted by two, then again the path associated with the 2 possible X-frames has word-length \(l_w(p)\), starting with a nucleotide (\(u'_1\)) and ending with a dinucleotide, which is the same contradiction.
(d) If \(\mid t_bt_e \mid =3\) then \(\mid u_1 \cdots u_l \mid \ge l_w(p)-2\). We distinguish two symmetric cases:
(i) \(\mid t_b \mid =2\) and \(\mid t_e \mid =1\). If the second possible X-frame is shifted by one, then the path associated with the 2 possible X-frames has word-length \(l_w(p)+1\), a contradiction to the maximality of \(l_w(p)\), or has word-length \(l_w(p)\) but starting with a nucleotide and ending with a dinucleotide, a contradiction to the structure of maximal paths. If the second possible X-frame is shifted by two, then either the path associated with the 2 possible X-frames has word-length \(l_w(p)+1\), a contradiction to the maximality of \(l_w(p)\), or has word-length \(l_w(p)-1\) starting with a nucleotide and ending with a nucleotide. But this case cannot exist unless the arrow-length of this path is at least the arrow-length of p, a contradiction to the maximality of p.
(ii) \(\mid t_b \mid =1\) and \(\mid t_e \mid =2\), as (i).
Claim (3): Assume that \(p=b_1 \rightarrow d_1 \rightarrow \cdots \rightarrow b_k\), then any concatenation \(b_id_i \in X\). Choose any 2 trinucleotides \(c=s_1s_2s_3, c'=s'_1s'_2s'_3 \in X\). Then \((s'_1s'_2s'_3)(b_1d_1)\cdots (d_kb_k) (s_1s_2s_3)\in X^{k+2}\) and hence \(s'_3(b_1d_1)\cdots (b_{k-1}d_{k-1})b_k s_1\) is a possible X-frame (for itself) with \(t_b=s'_3\) and \(t_e=b_ks_1\). Moreover, each concatenation \(d_ib_{i+1}\) is a trinucleotide in X, so \(s'_3b_1(d_1b_2)\cdots (d_{k-1}b_k)s_1\) is a second possible X-frame with \(t_b=s'_3b_1\) and \(t_e=s_1\). Thus, \(n_X \ge l_w(p)+3\) since the sequence \(s'_3b_1d_1 \cdots b_{k-1}d_{k-1}b_ks_1\) has length \(l_w(p)+2\).
Now assume that \(b_1 \cdots b_k\) is a sequence of nucleotides with \(k \ge l_w(p)+3\) but \(b_1 \cdots b_k\) has 2 different possible X-frames: \(t_b u_1 \cdots u_l t_e\) and \(t_b' u_1'\cdots u_m't_e'\) with \(u_i, u_i' \in X\) and \(t_b,t_e, t_b',t_e' \in \left( \{ \epsilon \} \cup \mathcal {B}\cup \mathcal {B}^2 \right) \). As in Claim (1), we conclude that w.l.o.g. \(\mid t_bt_e \mid \le 3\) and hence \(\mid u_1 \cdots u_l \mid \ge k-3 \ge l_w(p)+3-3=l_w(p)\). Similar arguments as above show that the path associated with the 2 possible X-frames has word-length greater than \(l_w(p)\), in contradiction to the maximality of p and \(l_w(p)\). \(\square \)
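The 2 possible X-frames constructed in Claim (3) can be reproduced on a toy example. The sketch below is in Python and makes a simplifying assumption: a shift counts as a possible X-frame as soon as all complete trinucleotides read in that shift belong to X, with no further condition on \(t_b\) and \(t_e\) beyond their length. The function name `possible_frames` and the two-element code are illustrative. The code \(\{AAT, ATT\}\) is not maximal and only serves to exhibit the path \(A \rightarrow AT \rightarrow T\).

```python
def possible_frames(seq, code):
    """Shifts |t_b| in {0, 1, 2} for which seq = t_b u_1...u_l t_e, u_i in code.

    Simplifying assumption: t_b and t_e are unconstrained apart from
    having length at most 2, and at least one complete u_i is required.
    """
    frames = []
    for shift in range(3):
        words = [seq[i:i + 3] for i in range(shift, len(seq) - 2, 3)]
        if words and all(w in code for w in words):
            frames.append(shift)
    return frames


if __name__ == "__main__":
    toy_code = {"AAT", "ATT"}
    # Path p = A -> AT -> T, so b_1 = A, d_1 = AT, b_2 = T.
    # With s' = s = AAT we get s'_3 = T and s_1 = A, hence the
    # sequence s'_3 b_1 d_1 b_2 s_1 = T A AT T A.
    print(possible_frames("TAATTA", toy_code))
    # A sequence that is a product of words of the toy code.
    print(possible_frames("AATATT", toy_code))
```

In the first call the shift 1 corresponds to \(t_b=s'_3\), \(t_e=b_2s_1\) and the shift 2 to \(t_b=s'_3b_1\), \(t_e=s_1\), matching the 2 frames of the proof for \(k=2\).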