Abstract
Self-complementary circular codes are involved in pairing genetic processes. A maximal \(C^3\) self-complementary circular code X of trinucleotides was identified in genes of bacteria, archaea, eukaryotes, plasmids and viruses (Michel in Life 7(20):1–16 2017, J Theor Biol 380:156–177, 2015; Arquès and Michel in J Theor Biol 182:45–58 1996). In this paper, self-complementary circular codes are investigated using the graph theory approach recently formulated in Fimmel et al. (Philos Trans R Soc A 374:20150058, 2016). A directed graph \(\mathcal {G}(X)\) associated with any code X mirrors the properties of the code. In the present paper, we demonstrate a necessary condition for the self-complementarity of an arbitrary code X in terms of graph theory. The same condition has been proven to be sufficient for codes which are circular and of large size \(\mid X \mid \ge 18\) trinucleotides, in particular for maximal circular codes (\(\mid X \mid = 20\) trinucleotides). For codes of small size \(\mid X \mid \le 16\) trinucleotides, some very rare counterexamples have been constructed. Furthermore, the length and the structure of the longest paths in the graphs associated with the self-complementary circular codes are investigated. It has been proven that the longest paths in such graphs determine the reading frame for the self-complementary circular codes. By applying this result, the reading frame in any arbitrary sequence of trinucleotides is retrieved after at most 15 nucleotides, i.e., 5 consecutive trinucleotides, from the circular code X identified in genes. Thus, an X motif of a length of at least 15 nucleotides in an arbitrary sequence of trinucleotides (not necessarily all of them belonging to X) uniquely defines the reading (correct) frame, an important criterion for analyzing the X motifs in genes in the future.
Notes
Recall that the union \(G_1\cup G_2\) of two graphs \(G_1=(V_1,E_1)\) and \(G_2=(V_2,E_2)\) is defined as \(G=(V_1\cup V_2, E_1\cup E_2)\) (Clark and Holton 1991).
Due to the self-complementarity of X, \(\mid X \mid \) must be even, but in contrast to circular codes, there are no self-complementary comma-free codes of sizes 18 or 20.
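The parity remark can be checked mechanically. The following is a minimal sketch, assuming that self-complementarity means closure under the reversed complement (A and T, C and G exchanged, word read backwards); the helper names are illustrative only and do not come from the paper. Since no trinucleotide equals its own reversed complement, the trinucleotides of a self-complementary code split into pairs, which forces an even size.

```python
from itertools import product

# Watson-Crick complement map on the alphabet {A, C, G, T}.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Complement every nucleotide and reverse the word."""
    return word.translate(COMPLEMENT)[::-1]


def is_self_complementary(code):
    """True if the code contains the reversed complement of each of its words."""
    return all(reverse_complement(t) in code for t in code)


# All 64 trinucleotides and those fixed by the reversed complement map.
ALL_TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]
FIXED_POINTS = [t for t in ALL_TRINUCLEOTIDES if reverse_complement(t) == t]
```

`FIXED_POINTS` is empty because the middle nucleotide of a fixed trinucleotide would have to be its own complement.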
References
Arquès DG, Michel CJ (1996) A complementary circular code in the protein coding genes. J Theor Biol 182:45–58
Clark J, Holton DA (1991) A first look at graph theory. World Scientific, New Jersey
Crick FH, Brenner S, Klug A, Pieczenik G (1976) A speculation on the origin of protein synthesis. Orig Life 7:389–397
Crick FH, Griffith JS, Orgel LE (1957) Codes without commas. Proc Natl Acad Sci USA 43:416–421
Eigen M, Schuster P (1978) The hypercycle. A principle of natural self-organization. Part C: The realistic hypercycle. Naturwissenschaften 65:341–369
El Soufi K, Michel CJ (2014) Circular code motifs in the ribosome decoding center. Comput Biol Chem 52:9–17
El Soufi K, Michel CJ (2015) Circular code motifs near the ribosome decoding center. Comput Biol Chem 59:158–176
El Soufi K, Michel CJ (2016) Circular code motifs in genomes of eukaryotes. J Theor Biol 408:198–212
Fimmel E, Michel CJ, Strüngmann L (2016) \(n\)-Nucleotide circular codes in graph theory. Philos Trans R Soc A 374:20150058
Fimmel E, Michel CJ, Strüngmann L (2017) Strong comma-free codes in genetic information. Bull Math Biol 79:1796–1819
Golomb SW, Delbrück M, Welch LR (1958a) Construction and properties of comma-free codes. Biol Medd K Dan Vidensk Selsk 23:1–34
Golomb SW, Gordon B, Welch LR (1958b) Comma-free codes. Can J Math 10:202–209
Ikehara K (2002) Origins of gene, genetic code, protein and life: comprehensive view of life systems from a GNC-SNS primitive genetic code hypothesis. J Biosci 27:165–186
Michel CJ (2012) Circular code motifs in transfer and 16S ribosomal RNAs: a possible translation code in genes. Comput Biol Chem 37:24–37
Michel CJ (2013) Circular code motifs in transfer RNAs. Comput Biol Chem 45:17–29
Michel CJ (2015) The maximal \(C^3\) self-complementary trinucleotide circular code \(X\) in genes of bacteria, eukaryotes, plasmids and viruses. J Theor Biol 380:156–177
Michel CJ (2017) The maximal \(C^3\) self-complementary trinucleotide circular code \(X\) in genes of bacteria, archaea, eukaryotes, plasmids and viruses. Life 7(20):1–16
Michel CJ, Nguefack Ngoune V, Poch O, Ripp R, Thompson JD (2017) Enrichment of circular code motifs in the genes of the yeast Saccharomyces cerevisiae. Life 7(52):1–20
Michel CJ, Pirillo G, Pirillo MA (2008) Varieties of comma-free codes. Comput Math Appl 55:989–996
Shepherd JCW (1981) Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary justification. Proc Natl Acad Sci USA 78:1596–1600
Trifonov EN (1987) Translation framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16S rRNA nucleotide sequences. J Mol Biol 194:643–652
Appendix
Proof of Theorem 4.7
Claim (1): Let \(l_{max}(X)=4\) and assume that \(l_1 \rightarrow d_1 \rightarrow l_2 \rightarrow d_2 \rightarrow l_3\) is a longest path in \(\mathcal {G}(X)\). Since the path is maximal, there is no trinucleotide of the form \(dl_1\) and no trinucleotide of the form \(l_3d\) in X. It follows that \(c(l_3)=l_1\) and \(d_1, d_2 \in \{l_2,c(l_2) \}^2\). Note that all the nucleotides \(l_1,l_2,l_3\) must be different by circularity. Thus, we have 4 possibilities for \(d_1, d_2\), namely \(l_2l_2\), \(l_2c(l_2)\), \(c(l_2)l_2\) and \(c(l_2)c(l_2)\). As \(l_2l_2l_2 \not \in X\) by circularity, we have the following options for the 2 trinucleotides \(d_1l_2 \in X\) and \(l_2d_2 \in X\): \(d_1l_2 \in \{ l_2c(l_2)l_2,\ c(l_2)l_2l_2,\ c(l_2)c(l_2)l_2 \}\) and \(l_2d_2 \in \{ l_2c(l_2)l_2,\ l_2l_2c(l_2),\ l_2c(l_2)c(l_2) \}\).
If \(d_1l_2\) or \(l_2d_2\) is equal to \(l_2c(l_2)l_2\) then self-complementarity yields \(c(l_2)l_2c(l_2) \in X\) and the word \(c(l_2)l_2c(l_2)l_2c(l_2)l_2\) contradicts circularity. Excluding the combinations \(c(l_2)l_2l_2\), \(l_2l_2c(l_2)\) and \(c(l_2)c(l_2)l_2\), \(l_2c(l_2)c(l_2)\) since the trinucleotides are obviously circular permutations of each other, only 2 combinations remain: \(c(l_2)l_2l_2\), \(l_2c(l_2)c(l_2)\) and \(c(l_2)c(l_2)l_2\), \(l_2l_2c(l_2)\). But also here, self-complementarity yields a contradiction to circularity since, for example, the complementary trinucleotide of \(c(l_2)c(l_2)l_2\) is in the same equivalence class as \(l_2l_2c(l_2)\).
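The last step of Claim (1) can be replayed for every choice of \(l_2\). The sketch below rests on two assumptions: the "complementary trinucleotide" is the reversed complement, and a circular code contains at most one trinucleotide from each class of circular permutations. The function names are illustrative. For both remaining combinations, closing the pair under the reversed complement always produces two different trinucleotides of the same class.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Complement every nucleotide and reverse the word."""
    return word.translate(COMPLEMENT)[::-1]


def same_class(u, v):
    """True if v is a circular permutation of u."""
    return len(u) == len(v) and v in u + u


def clashes(l2):
    """For the 2 remaining combinations, report whether the closure under
    the reversed complement holds two distinct words of one class."""
    c = reverse_complement(l2)  # complementary nucleotide c(l2)
    results = []
    for d1l2, l2d2 in ((c + l2 + l2, l2 + c + c), (c + c + l2, l2 + l2 + c)):
        closure = {d1l2, l2d2, reverse_complement(d1l2), reverse_complement(l2d2)}
        results.append(
            any(u != v and same_class(u, v) for u in closure for v in closure)
        )
    return results


CLASH_TABLE = {l2: clashes(l2) for l2 in "ACGT"}
```

Every entry of `CLASH_TABLE` is `[True, True]`, i.e., neither combination is compatible with circularity.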
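Claim (2): Let \(l_{max}(X)=6\) and assume that \(d_1 \rightarrow l_1 \rightarrow d_2 \rightarrow l_2 \rightarrow d_3 \rightarrow l_3 \rightarrow d_4 \) is a longest path in \(\mathcal {G}(X)\). By self-complementarity, there is the reversed complemented path, which runs through the nucleotides \(c(l_3)\), \(c(l_2)\), \(c(l_1)\) in this order, so that its middle nucleotide is \(c(l_2)\).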
Now, the middle nucleotides \(l_2\) and \(c(l_2)\) of the 2 paths are either the pair A and T, or C and G. Therefore, it suffices to show that there are paths \(A \rightarrow d \rightarrow T\) or \(T \rightarrow d \rightarrow A\) and \(C \rightarrow d \rightarrow G\) or \(G \rightarrow d \rightarrow C\) in \(\mathcal {G}(X)\), since then we obtain a path of length 8 by combining the 2 paths: e.g., if \(l_2 \rightarrow d \rightarrow c(l_2)\) is such a path, then the first half \(d_1 \rightarrow l_1 \rightarrow d_2 \rightarrow l_2\) of the given path, followed by \(l_2 \rightarrow d \rightarrow c(l_2)\) and by the second half of the reversed complemented path starting at \(c(l_2)\), has \(3+2+3=8\) arrows, contradicting \(l_{max}(X)=6\). Indeed, by maximality, the code X must contain exactly one trinucleotide of the class \(\{ ATT, TTA, TAT\}\) and its complementary trinucleotide as well as exactly one trinucleotide from the class \(\{ GCC, CCG, CGC\}\) and its complementary trinucleotide. It is easy to verify that in each case we obtain either a path of the form \(A \rightarrow d \rightarrow T\) or \(T \rightarrow d \rightarrow A\) and \(C \rightarrow d \rightarrow G\) or \(G \rightarrow d \rightarrow C\), e.g., if \(ATT \in X\) then also \(AAT \in X\) and we get the path \(A \rightarrow AT \rightarrow T\) in \(\mathcal {G}(X)\).
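The case check at the end of Claim (2) is small enough to enumerate. The sketch below assumes the edge convention suggested by the example \(A \rightarrow AT \rightarrow T\): a trinucleotide \(N_1N_2N_3\) contributes the arrows \(N_1 \rightarrow N_2N_3\) and \(N_1N_2 \rightarrow N_3\). The helper names are hypothetical. For each representative of the two classes, it looks for a two-arrow path between the complementary nucleotides using only the representative and its reversed complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Complement every nucleotide and reverse the word."""
    return word.translate(COMPLEMENT)[::-1]


def linking_dinucleotides(code, a, b):
    """Dinucleotides d with a -> d -> b, i.e., a+d and d+b both in the code."""
    return sorted(t[1:] for t in code if t[0] == a and t[1:] + b in code)


def check_class(representatives, a, b):
    """Map each representative to the two-arrow paths between a and b
    (in either direction) found in the graph of {t, reversed complement of t}."""
    found = {}
    for t in representatives:
        pair = {t, reverse_complement(t)}
        forward = [(a, d, b) for d in linking_dinucleotides(pair, a, b)]
        backward = [(b, d, a) for d in linking_dinucleotides(pair, b, a)]
        found[t] = forward + backward
    return found


AT_CASES = check_class(("ATT", "TTA", "TAT"), "A", "T")
CG_CASES = check_class(("GCC", "CCG", "CGC"), "C", "G")
```

Each of the six cases yields at least one path; for example, `AT_CASES["ATT"]` contains `("A", "AT", "T")`, the path named in the proof.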
Claim (3): Let \(l_{max}(X)=8\) and assume that \(l_1 \rightarrow d_1 \rightarrow l_2 \rightarrow d_2 \rightarrow l_3 \rightarrow d_3 \rightarrow l_4 \rightarrow d_4 \rightarrow l_5\) is a longest path in \(\mathcal {G}(X)\). Since there are only 4 nucleotides, 2 out of the 5 nucleotides \(l_1,l_2,l_3,l_4,l_5\) must be equal, which yields a cycle in \(\mathcal {G}(X)\) contradicting the circularity of X. \(\square \)
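The three claims can be illustrated on a concrete code. The sketch below uses the 20 trinucleotides usually listed for the code X of Arquès and Michel (1996); the listing is quoted from memory of the literature and should be checked against that reference. It also assumes the edge convention \(N_1 \rightarrow N_2N_3\) and \(N_1N_2 \rightarrow N_3\), and the function names are illustrative rather than taken from the paper. The search returns one longest path, or `None` when the graph has a cycle.

```python
# Assumed listing of the code X (Arquès and Michel 1996).
X = frozenset(
    "AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
    "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split()
)


def code_graph(code):
    """Edges of G(code): N1 -> N2N3 and N1N2 -> N3 for each N1N2N3."""
    edges = {}
    for t in code:
        edges.setdefault(t[0], set()).add(t[1:])
        edges.setdefault(t[:2], set()).add(t[2])
    return edges


def longest_path(code):
    """Return a longest path of G(code) as a list of vertices,
    or None if G(code) contains a cycle."""
    graph = code_graph(code)
    best = {}         # vertex -> longest path starting at that vertex
    on_stack = set()  # vertices of the current depth-first branch

    def visit(v):
        if v in best:
            return best[v]
        if v in on_stack:
            raise ValueError("cycle through " + v)
        on_stack.add(v)
        tail = []
        for w in sorted(graph.get(v, ())):
            candidate = visit(w)
            if len(candidate) > len(tail):
                tail = candidate
        on_stack.discard(v)
        best[v] = [v] + tail
        return best[v]

    try:
        return max((visit(v) for v in sorted(graph)), key=len)
    except ValueError:
        return None


PATH = longest_path(X)
ARROWS = len(PATH) - 1  # path length counted in arrows
```

For this listing the graph is acyclic and the path found has 8 arrows, starting and ending with a dinucleotide, in line with Claim (3), which rules out a longest path of length 8 that starts with a nucleotide.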
Proof of Theorem 5.11
Let \(X\subseteq \mathcal {B}^3\) be a maximal self-complementary circular code and \(\mathcal {G}(X)\) its associated graph. Since X is circular, \(\mathcal {G}(X)\) is acyclic, so it has a path \(p=p_{max}(X)\) of maximal length l(p).
Claim (1): Assume that \(p=d_1 \rightarrow b_1 \rightarrow \cdots \rightarrow b_k\), then any concatenation \(d_ib_i \in X\). Choose any trinucleotide \(c=s_1s_2s_3 \in X\). Then \((d_1b_1)\cdots (d_kb_k) (s_1s_2s_3)\in X^{k+1}\) and hence \((d_1b_1)\cdots (d_kb_k) s_1\) is a possible X-frame (for itself) with \(t_b=\epsilon \) and \(t_e=s_1\). Moreover, each concatenation \(b_id_{i+1}\) is also a trinucleotide in X, so \(d_1(b_1d_2)\cdots (b_{k-1}d_k)b_ks_1\) is a second possible X-frame with \(t_b=d_1\) and \(t_e=b_ks_1\). Thus, \(n_X \ge l_w(p)+2\) since the sequence \(d_1b_1 \cdots d_kb_ks_1\) has length \(l_w(p)+1\).
Now assume that \(b_1 \cdots b_k\) is a sequence of nucleotides and assume that \(k \ge l_w(p)+2\) but \(b_1 \cdots b_k\) has 2 different possible X-frames. We have to show a contradiction to conclude that \(n_X=l_w(p)+2\). Assume that \(t_b u_1 \cdots u_l t_e\) and \(t_b' u_1'\cdots u_m't_e'\) with \(u_i, u_i' \in X\) and \(t_b,t_e, t_b',t_e' \in \left( \{ \epsilon \} \cup \mathcal {B}\cup \mathcal {B}^2 \right) \) are the 2 different possible X-frames. Obviously, \(\mid t_bt_e \mid \le 4\). If \(\mid t_bt_e \mid =4\) then, since the 2 possible X-frames differ, either at least one of \(t_b'\) or \(t_e'\) has length \(\ge 3\), a contradiction to the definition of a possible X-frame, or \(\mid t_b't_e' \mid \le 3\). Hence, w.l.o.g. we assume that \(\mid t_bt_e \mid \le 3\). Consequently, \(\mid u_1 \cdots u_l \mid \ge k-3 \ge l_w(p)+2-3=l_w(p)-1\) and hence \(\mid u_1 \cdots u_l \mid \ge l_w(p)\), since both lengths are multiples of 3. We now have to distinguish cases:

(a) If \(\mid t_bt_e \mid \le 1\) then we even get \(\mid u_1 \cdots u_l \mid \ge k-1 \ge l_w(p)+2-1=l_w(p)+1\). Thus, the path associated with the 2 possible X-frames has word-length at least \(l_w(p)+1\), a contradiction to the maximality of \(l_w(p)\). In this case, the sequence \(u_1 \cdots u_l\) could contain the sequence \(u_1' \cdots u_m'\) as a subsequence.

(b) If \(\mid t_bt_e \mid \ge 2\) then the second possible X-frame is shifted by at least one with respect to the first possible X-frame, i.e., it must extend the sequence \(u_1 \cdots u_l\) to the left or to the right. In this case, the sequence \(u_1 \cdots u_l\) cannot contain the sequence \(u_1' \cdots u_m'\) as a subsequence. The path associated with the 2 possible X-frames has word-length at least \(\mid u_1 \cdots u_l \mid +1 \ge l_w(p)+1\), again a contradiction to the maximality of \(l_w(p)\).
Thus, \(n_X=l_w(p)+2\).
The case \(p= b_1 \rightarrow d_1 \rightarrow \cdots \rightarrow d_k\) is symmetric and can be similarly dealt with.
Claim (2): Assume that \(p=d_1 \rightarrow b_1 \rightarrow \cdots \rightarrow d_k\), then any concatenation \(d_ib_i \in X\). As in Claim (1), \((d_1b_1)\cdots (d_{k-1}b_{k-1})d_k\) is a possible X-frame (for itself) with \(t_b=\epsilon \) and \(t_e=d_k\). Moreover, each concatenation \(b_id_{i+1}\) is a trinucleotide in X, so \(d_1(b_1d_2)\cdots (b_{k-2}d_{k-1})(b_{k-1}d_k)\) is a second possible X-frame with \(t_b=d_1\) and \(t_e=\epsilon \). Thus, \(n_X \ge l_w(p)+1\) since the sequence \(d_1b_1 \cdots d_{k-1}b_{k-1}d_k\) has length \(l_w(p)\).
Now assume that \(b_1 \cdots b_k\) is a sequence of nucleotides and assume that \(k \ge l_w(p)+1\) but \(b_1 \cdots b_k\) has 2 different possible X-frames: \(t_b u_1 \cdots u_l t_e\) and \(t_b' u_1'\cdots u_m't_e'\) with \(u_i, u_i' \in X\) and \(t_b,t_e, t_b',t_e' \in \left( \{ \epsilon \} \cup \mathcal {B}\cup \mathcal {B}^2 \right) \). As in Claim (1), we assume w.l.o.g. that \(\mid t_bt_e \mid \le 3\). We distinguish cases:

(a) If \(\mid t_bt_e \mid =0\) then \(\mid u_1 \cdots u_l \mid \ge l_w(p)+1\) and \(u'_1 \cdots u'_m\) is a subsequence of \(u_1 \cdots u_l\). Thus, the path associated with the 2 possible X-frames has word-length \(l_w(p)+1\) with the associated word \( u_1 \cdots u_l\), a contradiction to the maximality of \(l_w(p)\).

(b) If \(\mid t_bt_e \mid =1\) then \(\mid u_1 \cdots u_l \mid \ge l_w(p)\). If the second possible X-frame is shifted by one with respect to the first one, then the path associated with the 2 possible X-frames has word-length \(l_w(p)+1\), again a contradiction to the maximality of \(l_w(p)\). If the second possible X-frame is shifted by two, then the path associated with the 2 possible X-frames has word-length \(l_w(p)\). However, in this case, the path starts with a dinucleotide and ends with a nucleotide, a contradiction to the structure of maximal paths which have to start and end with a dinucleotide.

(c) If \(\mid t_bt_e \mid =2\) then \(\mid u_1 \cdots u_l \mid \ge l_w(p)-1\). Again, we have to distinguish cases:

(i) \(\mid t_b \mid =2\) and \(\mid t_e \mid =0\). Then the path associated with the 2 possible X-frames has word-length \(l_w(p)\) and starts with a nucleotide but ends with a dinucleotide, a contradiction to the structure of maximal paths, or has word-length \(l_w(p)+1\), a contradiction to the maximality of \(l_w(p)\).

(ii) \(\mid t_b \mid =0\) and \(\mid t_e \mid =2\), as (i).

(iii) \(\mid t_b \mid =1\) and \(\mid t_e \mid =1\). As above, if the second possible X-frame is shifted by one, then the path associated with the 2 possible X-frames has word-length \(l_w(p)\), again starting with a nucleotide (\(u_1\)) and ending with a dinucleotide, a contradiction to the structure of maximal paths. If the second possible X-frame is shifted by two, then again the path associated with the 2 possible X-frames has word-length \(l_w(p)\), starting with a nucleotide (\(u'_1\)) and ending with a dinucleotide, the same contradiction.

(d) If \(\mid t_bt_e \mid =3\) then \(\mid u_1 \cdots u_l \mid \ge l_w(p)-2\). We distinguish two symmetric cases:

(i) \(\mid t_b \mid =2\) and \(\mid t_e \mid =1\). If the second possible X-frame is shifted by one, then the path associated with the 2 possible X-frames has word-length \(l_w(p)+1\), a contradiction to the maximality of \(l_w(p)\), or has word-length \(l_w(p)\) but starts with a nucleotide and ends with a dinucleotide, a contradiction to the structure of maximal paths. If the second possible X-frame is shifted by two, then the path associated with the 2 possible X-frames either has word-length \(l_w(p)+1\), a contradiction to the maximality of \(l_w(p)\), or has word-length \(l_w(p)-1\), starting with a nucleotide and ending with a nucleotide. But this case cannot exist unless the arrow-length of this path is at least the arrow-length of p, a contradiction to the maximality of p.

(ii) \(\mid t_b \mid =1\) and \(\mid t_e \mid =2\), as (i).

Claim (3): Assume that \(p=b_1 \rightarrow d_1 \rightarrow \cdots \rightarrow b_k\), then any concatenation \(b_id_i \in X\). Choose any 2 trinucleotides \(c=s_1s_2s_3, c'=s'_1s'_2s'_3 \in X\). Then \((s'_1s'_2s'_3)(b_1d_1)\cdots (d_kb_k) (s_1s_2s_3)\in X^{k+2}\) and hence \(s'_3(b_1d_1)\cdots (b_{k-1}d_{k-1})b_k s_1\) is a possible X-frame (for itself) with \(t_b=s'_3\) and \(t_e=b_ks_1\). Moreover, each concatenation \(d_ib_{i+1}\) is a trinucleotide in X, so \(s'_3b_1(d_1b_2)\cdots (d_{k-1}b_k)s_1\) is a second possible X-frame with \(t_b=s'_3b_1\) and \(t_e=s_1\). Thus, \(n_X \ge l_w(p)+3\) since the sequence \(s'_3b_1d_1 \cdots b_{k-1}d_{k-1}b_ks_1\) has length \(l_w(p)+2\).
Now assume that \(b_1 \cdots b_k\) is a sequence of nucleotides with \(k \ge l_w(p)+3\) but \(b_1 \cdots b_k\) has 2 different possible X-frames: \(t_b u_1 \cdots u_l t_e\) and \(t_b' u_1'\cdots u_m't_e'\) with \(u_i, u_i' \in X\) and \(t_b,t_e, t_b',t_e' \in \left( \{ \epsilon \} \cup \mathcal {B}\cup \mathcal {B}^2 \right) \). As in Claim (1), we conclude that w.l.o.g. \(\mid t_bt_e \mid \le 3\) and hence \(\mid u_1 \cdots u_l \mid \ge k-3 \ge l_w(p)+3-3=l_w(p)\). Similar arguments as above show that the path associated with the 2 possible X-frames has word-length greater than \(l_w(p)\), in contradiction to the maximality of p and \(l_w(p)\). \(\square \)
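The notion of a possible X-frame used throughout this proof can be explored by brute force. The sketch below makes two assumptions: the code X is the listing usually given for Arquès and Michel (1996), and a possible X-frame \(t_b u_1 \cdots u_l t_e\) is identified with the shift \(\mid t_b \mid \in \{0,1,2\}\) for which every complete trinucleotide read from that shift belongs to X. The function name and the test sequence are illustrative only.

```python
# Assumed listing of the code X (Arquès and Michel 1996).
X = frozenset(
    "AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
    "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split()
)


def possible_frames(seq, code=X):
    """Return the shifts s in {0, 1, 2} such that every complete
    trinucleotide of seq read from position s belongs to the code."""
    shifts = []
    for s in range(3):
        # Complete trinucleotides only; up to 2 trailing nucleotides form t_e.
        codons = [seq[i:i + 3] for i in range(s, len(seq) - 2, 3)]
        if all(c in code for c in codons):
            shifts.append(s)
    return shifts


# Word of a path with 8 arrows in G(X): CA G GT A AT T AC C AG.
AMBIGUOUS = "CAGGTAATTACCAG"
FRAMES_14 = possible_frames(AMBIGUOUS)
FRAMES_15 = {x: possible_frames(AMBIGUOUS + x) for x in "ACGT"}
```

The 14-nucleotide word is read both as CAG GTA ATT ACC AG and as CA GGT AAT TAC CAG, so `FRAMES_14` is `[0, 2]`. Appending any nucleotide leaves only shift 2, which illustrates, on one example, how a single additional nucleotide removes the ambiguity at length 15.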
Fimmel, E., Michel, C.J., Starman, M. et al. Self-complementary circular codes in coding theory. Theory Biosci. 137, 51–65 (2018). https://doi.org/10.1007/s12064-018-0259-4
DOI: https://doi.org/10.1007/s12064-018-0259-4
Keywords
 Self-complementary circular codes
 Graph properties
 Translation process
 Reading frame
 Genetic code