
Self-complementary circular codes in coding theory

Abstract

Self-complementary circular codes are involved in pairing genetic processes. A maximal \(C^3\) self-complementary circular code X of trinucleotides was identified in genes of bacteria, archaea, eukaryotes, plasmids and viruses (Michel in Life 7(20):1–16, 2017, J Theor Biol 380:156–177, 2015; Arquès and Michel in J Theor Biol 182:45–58, 1996). In this paper, self-complementary circular codes are investigated using the graph theory approach recently formulated in Fimmel et al. (Philos Trans R Soc A 374:20150058, 2016). A directed graph \(\mathcal {G}(X)\) associated with any code X mirrors the properties of the code. We demonstrate a necessary condition for the self-complementarity of an arbitrary code X in terms of graph theory. The same condition is proven to be sufficient for codes which are circular and of large size \(\mid X \mid \ge 18\) trinucleotides, in particular for maximal circular codes (\(\mid X \mid = 20\) trinucleotides). For codes of small size \(\mid X \mid \le 16\) trinucleotides, some very rare counterexamples have been constructed. Furthermore, the length and the structure of the longest paths in the graphs associated with the self-complementary circular codes are investigated. It is proven that the longest paths in such graphs determine the reading frame for the self-complementary circular codes. By applying this result, the reading frame in any arbitrary sequence of trinucleotides is retrieved after at most 15 nucleotides, i.e., 5 consecutive trinucleotides, from the circular code X identified in genes. Thus, an X motif of a length of at least 15 nucleotides in an arbitrary sequence of trinucleotides (not necessarily all of them belonging to X) uniquely defines the correct reading frame, an important criterion for analyzing the X motifs in genes in the future.



Notes

  1. Recall that the union \(G_1\cup G_2\) of two graphs \(G_1=(V_1,E_1)\) and \(G_2=(V_2,E_2)\) is defined as \(G=(V_1\cup V_2, E_1\cup E_2)\) (Clark and Holton 1991).

  2. Due to self-complementarity of X, \(\mid X \mid \) must be even, but in contrast to circular codes, there are no self-complementary comma-free codes of sizes 18 or 20.
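The parity remark in Note 2 can be checked mechanically. The following is a minimal sketch, assuming that self-complementarity means closure under the reversed complement (as the "reversed complemented path" in the Appendix suggests); the helper names and the toy code are hypothetical and not taken from the paper.

```python
# Hypothetical helpers for illustration; names are not from the paper.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(word: str) -> str:
    """Complement each nucleotide and reverse the order."""
    return "".join(COMPLEMENT[b] for b in reversed(word))

def is_self_complementary(code: set[str]) -> bool:
    """True if the code contains the reversed complement of each of its words."""
    return all(reverse_complement(w) in code for w in code)

# No trinucleotide equals its own reversed complement: the middle letter
# would have to be its own complement. Words therefore pair up, so a
# self-complementary code has even size.
ALL_TRINUCLEOTIDES = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]
assert all(reverse_complement(t) != t for t in ALL_TRINUCLEOTIDES)

# Toy code (an assumption for illustration, not a code from the paper).
toy_code = {"AAT", "ATT", "GCC", "GGC"}
print(is_self_complementary(toy_code), len(toy_code) % 2 == 0)
```

The check only covers self-complementarity and parity; it says nothing about circularity or comma-freeness.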

References

  • Arquès DG, Michel CJ (1996) A complementary circular code in the protein coding genes. J Theor Biol 182:45–58


  • Clark J, Holton DA (1991) A first look at graph theory. World Scientific, New Jersey


  • Crick FH, Brenner S, Klug A, Pieczenik G (1976) A speculation on the origin of protein synthesis. Orig Life 7:389–397


  • Crick FH, Griffith JS, Orgel LE (1957) Codes without commas. Proc Natl Acad Sci USA 43:416–421


  • Eigen M, Schuster P (1978) The hypercycle. A principle of natural self-organization. Part C: The realistic hypercycle. Naturwissenschaften 65:341–369


  • El Soufi K, Michel CJ (2014) Circular code motifs in the ribosome decoding center. Comput Biol Chem 52:9–17


  • El Soufi K, Michel CJ (2015) Circular code motifs near the ribosome decoding center. Comput Biol Chem 59:158–176


  • El Soufi K, Michel CJ (2016) Circular code motifs in genomes of eukaryotes. J Theor Biol 408:198–212


  • Fimmel E, Michel CJ, Strüngmann L (2016) \(n\)-Nucleotide circular codes in graph theory. Philos Trans R Soc A 374:20150058


  • Fimmel E, Michel CJ, Strüngmann L (2017) Strong comma-free codes in genetic information. Bull Math Biol 79:1796–1819


  • Golomb SW, Delbruck M, Welch LR (1958a) Construction and properties of comma-free codes. Biol Medd K Dan Vidensk Selsk 23:1–34


  • Golomb SW, Gordon B, Welch LR (1958b) Comma-free codes. Can J Math 10:202–209


  • Ikehara K (2002) Origins of gene, genetic code, protein and life: comprehensive view of life systems from a GNC-SNS primitive genetic code hypothesis. J Biosci 27:165–186


  • Michel CJ (2012) Circular code motifs in transfer and 16S ribosomal RNAs: a possible translation code in genes. Comput Biol Chem 37:24–37


  • Michel CJ (2013) Circular code motifs in transfer RNAs. Comput Biol Chem 45:17–29


  • Michel CJ (2015) The maximal \(C^3\) self-complementary trinucleotide circular code \(X\) in genes of bacteria, eukaryotes, plasmids and viruses. J Theor Biol 380:156–177


  • Michel CJ (2017) The maximal \(C^3\) self-complementary trinucleotide circular code \(X\) in genes of bacteria, archaea, eukaryotes, plasmids and viruses. Life 7(20):1–16


  • Michel CJ, Nguefack Ngoune V, Poch O, Ripp R, Thompson JD (2017) Enrichment of circular code motifs in the genes of the yeast Saccharomyces cerevisiae. Life 7(52):1–20


  • Michel CJ, Pirillo G, Pirillo MA (2008) Varieties of comma free codes. Comput Math Appl 55:989–996


  • Shepherd JCW (1981) Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary justification. Proc Natl Acad Sci USA 78:1596–1600


  • Trifonov EN (1987) Translation framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16S rRNA nucleotide sequences. J Mol Biol 194:643–652



Author information


Correspondence to Christian J. Michel.

Appendix


Proof of Theorem 4.7

Claim (1): Let \(l_{max}(X)=4\) and assume that \(l_1 \rightarrow d_1 \rightarrow l_2 \rightarrow d_2 \rightarrow l_3\) is a longest path in \(\mathcal {G}(X)\). Since the path is maximal, there is no trinucleotide of the form \(dl_1\) and no trinucleotide of the form \(l_3d\) in X. It follows that \(c(l_3)=l_1\) and \(d_1, d_2 \in \{l_2,c(l_2) \}^2\). Note that all the nucleotides \(l_1,l_2,l_3\) must be different by circularity. Thus, we have 4 possibilities for \(d_1, d_2\), namely \(l_2l_2\), \(l_2c(l_2)\), \(c(l_2)l_2\) and \(c(l_2)c(l_2)\). As \(l_2l_2l_2 \not \in X\) by circularity, we have the following options for the 2 trinucleotides \(d_1l_2 \in X\) and \(l_2d_2 \in X\)

$$\begin{aligned}&d_1l_2: \quad \quad l_2c(l_2)l_2 \quad c(l_2)l_2l_2 \quad c(l_2)c(l_2)l_2; \\& l_2d_2: \quad \quad l_2l_2c(l_2) \quad l_2c(l_2)l_2 \quad l_2c(l_2)c(l_2). \end{aligned}$$

If \(d_1l_2\) or \(l_2d_2\) is equal to \(l_2c(l_2)l_2\) then self-complementarity yields \(c(l_2)l_2c(l_2) \in X\), and the word \(c(l_2)l_2c(l_2)l_2c(l_2)l_2\), which decomposes into trinucleotides of X in more than one frame when written on a circle, contradicts circularity. The combinations \(c(l_2)l_2l_2\), \(l_2l_2c(l_2)\) and \(c(l_2)c(l_2)l_2\), \(l_2c(l_2)c(l_2)\) are excluded since the 2 trinucleotides in each combination are circular permutations of each other. Hence only 2 combinations remain: \(c(l_2)l_2l_2\), \(l_2c(l_2)c(l_2)\) and \(c(l_2)c(l_2)l_2\), \(l_2l_2c(l_2)\). Here too, self-complementarity yields a contradiction to circularity since, for example, the complementary trinucleotide of \(c(l_2)c(l_2)l_2\) is in the same equivalence class as \(l_2l_2c(l_2)\).
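To make the circularity contradiction concrete, take \(l_2=A\) and \(c(l_2)=T\), so that the two trinucleotides are ATA and TAT and the word is TATATA. The sketch below is a hypothetical helper, not the authors' method: it lists the rotations of a word for which the word, read on a circle, splits into trinucleotides of a given code. A circular code allows at most one such rotation.

```python
def frames_on_circle(word: str, code: set[str]) -> list[int]:
    """Return the rotations r in {0, 1, 2} for which the word, read on a
    circle starting at position r, splits into trinucleotides of `code`."""
    if len(word) % 3 != 0:
        raise ValueError("word length must be a multiple of 3")
    frames = []
    for r in range(3):
        rotated = word[r:] + word[:r]
        if all(rotated[i:i + 3] in code for i in range(0, len(rotated), 3)):
            frames.append(r)
    return frames

# Instance of the argument with l_2 = A and c(l_2) = T.
code = {"ATA", "TAT"}
print(frames_on_circle("TATATA", code))  # more than one frame: not circular
```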

Claim (2): Let \(l_{max}(X)=6\) and assume that \(d_1 \rightarrow l_1 \rightarrow d_2 \rightarrow l_2 \rightarrow d_3 \rightarrow l_3 \rightarrow d_4 \) is a longest path in \(\mathcal {G}(X)\). By self-complementarity, there is the reversed complemented path

$$\begin{aligned} \overleftarrow{{c(d_4)}} \rightarrow c(l_3) \rightarrow \overleftarrow{{c(d_{3})}} \rightarrow c(l_2) \rightarrow \overleftarrow{{c(d_2)}} \rightarrow c(l_1) \rightarrow \overleftarrow{{c(d_1)}}. \end{aligned}$$

Now, the middle nucleotides \(l_2\) and \(c(l_2)\) of the 2 paths are either the pair A and T, or C and G. Therefore, it suffices to show that there are paths \(A \rightarrow d \rightarrow T\) or \(T \rightarrow d \rightarrow A\) and \(C \rightarrow d \rightarrow G\) or \(G \rightarrow d \rightarrow C\) in \(\mathcal {G}(X)\); since then, we will obtain a path of length 8 combining the 2 paths, e.g.,

$$\begin{aligned} d_1 \rightarrow l_1 \rightarrow d_2 \rightarrow l_2 \rightarrow d \rightarrow c(l_2) \rightarrow \overleftarrow{{c(d_2)}} \rightarrow c(l_1) \rightarrow \overleftarrow{{c(d_1)}} \end{aligned}$$

contradicting \(l_{max}(X)=6\). However, by maximality, the code X must contain exactly one trinucleotide of the class \(\{ ATT, TTA, TAT\}\) and its complementary trinucleotide as well as exactly one trinucleotide from the class \(\{ GCC, CCG, CGC\}\) and its complementary trinucleotide. It is easy to verify that in each case we obtain either a path of the form \(A \rightarrow d \rightarrow T\) or \(T \rightarrow d \rightarrow A\) and \(C \rightarrow d \rightarrow G\) or \(G \rightarrow d \rightarrow C\), e.g., if \(ATT \in X\) then also \(AAT \in X\) and we get the path \(A \rightarrow AT \rightarrow T\) in \(\mathcal {G}(X)\).
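The case check described as easy to verify can be written out explicitly. The sketch below assumes the graph construction used in the example above (a trinucleotide \(b_1b_2b_3\) contributes the arrows \(b_1 \rightarrow b_2b_3\) and \(b_1b_2 \rightarrow b_3\)) and takes the complementary trinucleotide to be the reversed complement. All function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(word):
    """Complement each nucleotide and reverse the order."""
    return "".join(COMPLEMENT[b] for b in reversed(word))

def edges(code):
    """Arrows of the graph: b1b2b3 gives b1 -> b2b3 and b1b2 -> b3."""
    arrows = set()
    for w in code:
        arrows.add((w[0], w[1:]))
        arrows.add((w[:2], w[2]))
    return arrows

def two_step_paths(code, start, end):
    """All dinucleotides d with start -> d -> end in the graph of `code`."""
    arrows = edges(code)
    return sorted(d for (s, d) in arrows if s == start and (d, end) in arrows)

def check_class(equivalence_class, x, y):
    """For each representative t, list the paths x -> d -> y and y -> d -> x
    in the graph of the pair {t, reverse_complement(t)}."""
    report = {}
    for t in equivalence_class:
        pair = {t, reverse_complement(t)}
        report[t] = (two_step_paths(pair, x, y), two_step_paths(pair, y, x))
    return report

at = check_class(["ATT", "TTA", "TAT"], "A", "T")
cg = check_class(["GCC", "CCG", "CGC"], "C", "G")
for name, result in list(at.items()) + list(cg.items()):
    print(name, result)
```

Every representative yields at least one path in one of the two directions, which is all the argument requires.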

Claim (3): Let \(l_{max}(X)=8\) and assume that \(l_1 \rightarrow d_1 \rightarrow l_2 \rightarrow d_2 \rightarrow l_3 \rightarrow d_3 \rightarrow l_4 \rightarrow d_4 \rightarrow l_5\) is a longest path in \(\mathcal {G}(X)\). Since there are only 4 nucleotides, 2 out of the 5 nucleotides \(l_1,l_2,l_3,l_4,l_5\) must be equal, which yields a cycle in \(\mathcal {G}(X)\) contradicting the circularity of X. \(\square \)

Proof of Theorem 5.11

Let \(X\subseteq \mathcal {B}^3\) be a maximal self-complementary circular code and \(\mathcal {G}(X)\) its associated graph. Since X is circular, \(\mathcal {G}(X)\) is acyclic, so it has a path \(p=p_{max}(X)\) of maximal length l(p).
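The objects used in this proof can be computed directly. The following is a minimal sketch, assuming the graph construction in which a trinucleotide \(b_1b_2b_3\) contributes the arrows \(b_1 \rightarrow b_2b_3\) and \(b_1b_2 \rightarrow b_3\); the function names and the toy codes are hypothetical. It builds the graph, tests acyclicity, and returns one longest path.

```python
from functools import lru_cache

def build_graph(code):
    """Adjacency map of the graph: b1b2b3 gives b1 -> b2b3 and b1b2 -> b3."""
    graph = {}
    for w in code:
        for src, dst in ((w[0], w[1:]), (w[:2], w[2])):
            graph.setdefault(src, set()).add(dst)
            graph.setdefault(dst, set())
    return graph

def is_acyclic(graph):
    """Depth-first search with three colours; reaching a grey vertex is a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {v: WHITE for v in graph}

    def visit(v):
        colour[v] = GREY
        for nxt in graph[v]:
            if colour[nxt] == GREY:
                return False
            if colour[nxt] == WHITE and not visit(nxt):
                return False
        colour[v] = BLACK
        return True

    return all(colour[v] != WHITE or visit(v) for v in graph)

def longest_path(graph):
    """Vertices of one longest path; only meaningful if the graph is acyclic."""
    @lru_cache(maxsize=None)
    def best_from(v):
        tails = [best_from(nxt) for nxt in graph[v]]
        return (v,) + max(tails, key=len, default=())

    return max((best_from(v) for v in graph), key=len)

# Toy codes (assumptions for illustration, not codes from the paper).
g = build_graph({"AAT", "ATT"})
p = longest_path(g)
print(is_acyclic(g), p, len(p) - 1, len("".join(p)))  # arrows, word-length
print(is_acyclic(build_graph({"ACG", "CGA"})))         # circular permutations
```

Here the number of arrows of the path plays the role of l(p), and the length of the concatenated vertex labels plays the role of \(l_w(p)\).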

Claim (1): Assume that \(p=d_1 \rightarrow b_1 \rightarrow \cdots \rightarrow b_k\), then any concatenation \(d_ib_i \in X\). Choose any trinucleotide \(c=s_1s_2s_3 \in X\). Then \((d_1b_1)\cdots (d_kb_k) (s_1s_2s_3)\in X^{k+1}\) and hence \((d_1b_1)\cdots (d_kb_k) s_1\) is a possible X-frame (for itself) with \(t_b=\epsilon \) and \(t_e=s_1\). Moreover, each concatenation \(b_id_{i+1}\) is also a trinucleotide in X, so \(d_1(b_1d_2)\cdots (b_{k-1}d_k)b_ks_1\) is a second possible X-frame with \(t_b=d_1\) and \(t_e=b_ks_1\). Thus, \(n_X \ge l_w(p)+2\) since the sequence \(d_1b_1 \cdots d_kb_ks_1\) has length \(l_w(p)+1\).

Now assume that \(b_1 \cdots b_k\) is a sequence of nucleotides and assume that \(k \ge l_w(p)+2\) but \(b_1 \cdots b_k\) has 2 different possible X-frames. We have to show a contradiction to conclude that \(n_X=l_w(p)+2\). Assume that \(t_b u_1 \cdots u_l t_e\) and \(t_b' u_1'\cdots u_m't_e'\) with \(u_i, u_i' \in X\) and \(t_b,t_e, t_b',t_e' \in \left( \{ \epsilon \} \cup \mathcal {B}\cup \mathcal {B}^2 \right) \) are the 2 different possible X-frames. Obviously, \(\mid t_bt_e \mid \le 4\). If \(\mid t_bt_e \mid =4\) then by the difference of the 2 possible X-frames, we conclude that at least one of \(t_b'\) or \(t_e'\) has to have length \(\ge 3\), a contradiction to the definition of possible X-frame, or \(\mid t_b't_e' \mid \le 3\). Hence, w.l.o.g. we assume that \(\mid t_bt_e \mid \le 3\). Consequently, \(\mid u_1 \cdots u_l \mid \ge k-3 \ge l_w(p)+2-3=l_w(p)-1\) and hence, since both \(\mid u_1 \cdots u_l \mid \) and \(l_w(p)\) are multiples of 3 in this case, \(\mid u_1 \cdots u_l \mid \ge l_w(p)\). We now have to distinguish cases:

  (a) If \(\mid t_bt_e \mid \le 1\) then we even get \(\mid u_1 \cdots u_l \mid \ge k-1 \ge l_w(p)+2-1=l_w(p)+1\). Thus, the path associated with the 2 possible X-frames has word-length at least \(l_w(p)+1\), a contradiction to the maximality of \(l_w(p)\). In this case, the sequence \(u_1 \cdots u_l\) could contain the sequence \(u_1' \cdots u_m'\) as a subsequence.

  (b) If \(\mid t_bt_e \mid \ge 2\) then the second possible X-frame is shifted by at least one with respect to the first possible X-frame, i.e., it must extend the sequence \(u_1 \cdots u_l\) to the left or to the right. In this case, the sequence \(u_1 \cdots u_l\) cannot contain the sequence \(u_1' \cdots u_m'\) as a subsequence. The path associated with the 2 possible X-frames has word-length at least \(\mid u_1 \cdots u_l \mid +1 \ge l_w(p)+1\), again a contradiction to the maximality of \(l_w(p)\).

Thus, \(n_X=l_w(p)+2\).

The case \(p= b_1 \rightarrow d_1 \rightarrow \cdots \rightarrow d_k\) is symmetric and can be similarly dealt with.

Claim (2): Assume that \(p=d_1 \rightarrow b_1 \rightarrow \cdots \rightarrow d_k\), then any concatenation \(d_ib_i \in X\). As in Claim (1), \((d_1b_1)\cdots (d_{k-1}b_{k-1})d_k\) is a possible X-frame (for itself) with \(t_b=\epsilon \) and \(t_e=d_k\). Moreover, each concatenation \(b_id_{i+1}\) is a trinucleotide in X, so \(d_1(b_1d_2)\cdots (b_{k-2}d_{k-1})(b_{k-1}d_k)\) is a second possible X-frame with \(t_b=d_1\) and \(t_e=\epsilon \). Thus, \(n_X \ge l_w(p)+1\) since the sequence \(d_1b_1 \cdots d_{k-1}b_{k-1}d_k\), which has 2 different possible X-frames, has length \(l_w(p)\).

Now assume that \(b_1 \cdots b_k\) is a sequence of nucleotides and assume that \(k \ge l_w(p)+1\) but \(b_1 \cdots b_k\) has 2 different possible X-frames: \(t_b u_1 \cdots u_l t_e\) and \(t_b' u_1'\cdots u_m't_e'\) with \(u_i, u_i' \in X\) and \(t_b,t_e, t_b',t_e' \in \left( \{ \epsilon \} \cup \mathcal {B}\cup \mathcal {B}^2 \right) \). As in Claim (1), we assume w.l.o.g. that \(\mid t_bt_e \mid \le 3\). We distinguish cases:

  (a) If \(\mid t_bt_e \mid =0\) then \(\mid u_1 \cdots u_l \mid \ge l_w(p)+1\) and \(u'_1 \cdots u'_m\) is a subsequence of \(u_1 \cdots u_l\). Thus, the path associated with the 2 possible X-frames has word-length \(l_w(p)+1\) with the associated word \( u_1 \cdots u_l\), a contradiction to the maximality of \(l_w(p)\).

  (b) If \(\mid t_bt_e \mid =1\) then \(\mid u_1 \cdots u_l \mid \ge l_w(p)\). If the second possible X-frame is shifted by one with respect to the first one, then the path associated with the 2 possible X-frames has word-length \(l_w(p)+1\), again a contradiction to the maximality of \(l_w(p)\). If the second possible X-frame is shifted by two, then the path associated with the 2 possible X-frames has word-length \(l_w(p)\). However, in this case, the path starts with a dinucleotide and ends with a nucleotide, a contradiction to the structure of maximal paths which have to start and end with a dinucleotide.

  (c) If \(\mid t_bt_e \mid =2\) then \(\mid u_1 \cdots u_l \mid \ge l_w(p)-1\). Again, we have to distinguish cases:

      (i) \(\mid t_b \mid =2\) and \(\mid t_e \mid =0\). Then the path associated with the 2 possible X-frames has word-length \(l_w(p)\) and starts with a nucleotide but ends with a dinucleotide, a contradiction to the structure of maximal paths, or has word-length \(l_w(p)+1\), a contradiction to the maximality of \(l_w(p)\).

      (ii) \(\mid t_b \mid =0\) and \(\mid t_e \mid =2\), as (i).

      (iii) \(\mid t_b \mid =1\) and \(\mid t_e \mid =1\). As above, if the second possible X-frame is shifted by one, then the path associated with the 2 possible X-frames has word-length \(l_w(p)\), again starting with a nucleotide (\(u_1\)) and ending with a dinucleotide, a contradiction to the structure of maximal paths. If the second possible X-frame is shifted by two, then again the path associated with the 2 possible X-frames has word-length \(l_w(p)\), starting with a nucleotide (\(u'_1\)) and ending with a dinucleotide, the same contradiction.

  (d) If \(\mid t_bt_e \mid =3\) then \(\mid u_1 \cdots u_l \mid \ge l_w(p)-2\). We distinguish two symmetric cases:

      (i) \(\mid t_b \mid =2\) and \(\mid t_e \mid =1\). If the second possible X-frame is shifted by one, then the path associated with the 2 possible X-frames has word-length \(l_w(p)+1\), a contradiction to the maximality of \(l_w(p)\), or has word-length \(l_w(p)\) but starting with a nucleotide and ending with a dinucleotide, a contradiction to the structure of maximal paths. If the second possible X-frame is shifted by two, then either the path associated with the 2 possible X-frames has word-length \(l_w(p)+1\), a contradiction to the maximality of \(l_w(p)\), or has word-length \(l_w(p)-1\) starting with a nucleotide and ending with a nucleotide. But this case cannot exist unless the arrow-length of this path is at least the arrow-length of p, a contradiction to the maximality of p.

      (ii) \(\mid t_b \mid =1\) and \(\mid t_e \mid =2\), as (i).
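As a worked instance of the word-length bookkeeping in Claim (2), assume a longest path with 8 arrows that starts and ends with a dinucleotide. This is an assumption made here for illustration, suggested by the bound of 15 nucleotides stated in the abstract. Such a path has 5 dinucleotides and 4 nucleotides, so

$$\begin{aligned} p&=d_1 \rightarrow b_1 \rightarrow d_2 \rightarrow b_2 \rightarrow d_3 \rightarrow b_3 \rightarrow d_4 \rightarrow b_4 \rightarrow d_5, \\ l_w(p)&=5\cdot 2+4\cdot 1=14, \\ n_X&=l_w(p)+1=15, \end{aligned}$$

i.e., 5 consecutive trinucleotides.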

Claim (3): Assume that \(p=b_1 \rightarrow d_1 \rightarrow \cdots \rightarrow b_k\), then any concatenation \(b_id_i \in X\). Choose any 2 trinucleotides \(c=s_1s_2s_3, c'=s'_1s'_2s'_3 \in X\). Then \((s'_1s'_2s'_3)(b_1d_1)\cdots (d_kb_k) (s_1s_2s_3)\in X^{k+2}\) and hence \(s'_3(b_1d_1)\cdots (b_{k-1}d_{k-1})b_k s_1\) is a possible X-frame (for itself) with \(t_b=s'_3\) and \(t_e=b_ks_1\). Moreover, each concatenation \(d_ib_{i+1}\) is a trinucleotide in X, so \(s'_3b_1(d_1b_2)\cdots (d_{k-1}b_k)s_1\) is a second possible X-frame with \(t_b=s'_3b_1\) and \(t_e=s_1\). Thus, \(n_X \ge l_w(p)+3\) since the sequence \(s'_3b_1d_1 \cdots b_{k-1}d_{k-1}b_ks_1\) has length \(l_w(p)+2\).

Now assume that \(b_1 \cdots b_k\) is a sequence of nucleotides with \(k \ge l_w(p)+3\) but \(b_1 \cdots b_k\) has 2 different possible X-frames: \(t_b u_1 \cdots u_l t_e\) and \(t_b' u_1'\cdots u_m't_e'\) with \(u_i, u_i' \in X\) and \(t_b,t_e, t_b',t_e' \in \left( \{ \epsilon \} \cup \mathcal {B}\cup \mathcal {B}^2 \right) \). As in Claim (1), we conclude that w.l.o.g. \(\mid t_bt_e \mid \le 3\) and hence \(\mid u_1 \cdots u_l \mid \ge k-3 \ge l_w(p)+3-3=l_w(p)\). Similar arguments as above show that the path associated with the 2 possible X-frames has word-length greater than \(l_w(p)\), in contradiction to the maximality of p and \(l_w(p)\). \(\square \)
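The pairs of possible X-frames constructed in the three claims can be reproduced on a small example. The sketch below makes a simplifying assumption: \(t_b\) and \(t_e\) are constrained only by their length (at most 2), and at least one trinucleotide is required between them. The full definition of a possible X-frame may impose more. The function name and the toy code are hypothetical, and the toy code is self-complementary but not maximal, so it illustrates only the construction, not the theorem.

```python
def possible_frames(seq, code):
    """Offsets |t_b| in {0, 1, 2} such that seq = t_b u_1...u_l t_e with every
    u_i in `code`, l >= 1 and |t_e| <= 2. Simplifying assumption: t_b and t_e
    are unconstrained apart from their length."""
    frames = []
    for offset in range(3):
        body_len = (len(seq) - offset) // 3 * 3
        body = seq[offset:offset + body_len]
        if body and all(body[i:i + 3] in code for i in range(0, len(body), 3)):
            frames.append(offset)
    return frames

# Toy code whose graph has the longest path A -> AT -> T (word AATT),
# i.e. a path of the shape treated in Claim (3).
toy_code = {"AAT", "ATT"}

# Construction of Claim (3): s'_3 + b_1 d_1 b_2 + s_1 with s'_3 = T, s_1 = A.
sequence = "T" + "AATT" + "A"
print(possible_frames(sequence, toy_code))
```

The constructed sequence has length \(l_w(p)+2=6\) and admits two frames, T(AAT)TA and TA(ATT)A, which matches the lower bound argument of Claim (3).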


About this article


Cite this article

Fimmel, E., Michel, C.J., Starman, M. et al. Self-complementary circular codes in coding theory. Theory Biosci. 137, 51–65 (2018). https://doi.org/10.1007/s12064-018-0259-4


  • DOI: https://doi.org/10.1007/s12064-018-0259-4

Keywords

  • Self-complementary circular codes
  • Graph properties
  • Translation process
  • Reading frame
  • Genetic code