Introduction

Modern biomedical sciences frequently apply methods derived from physics, mathematics, and numerical analysis, which makes the field inherently interdisciplinary. The analysis of biological sequences (DNA, RNA, protein) is a case in point. Graphical representations are powerful tools here because they allow both graphical and numerical characterization of the sequences. The sequences are usually very long, and it is not obvious how to represent them. The questions of how to avoid degeneracy and how to express the features of the sequences both graphically and numerically have led to numerous methods.

In the present work, we introduce a new 3D graphical representation method. It is a 3D generalization of the 2D-dynamic representation of DNA sequences [1]. A 2D-dynamic graph represents a DNA sequence as a set of “material points” in a 2D space, with the distribution of the points determined by the sequence. We proposed the moments of inertia and the coordinates of the centers of mass of the 2D-dynamic graphs for the numerical characterization of DNA sequences [1]. We also considered the high-order moments of the mass-density distributions based on 2D-dynamic graphs as descriptors [2]. The mass overlaps and the angles between the X axis and the principal axis of inertia have also been used to describe the similarity/dissimilarity of DNA sequences [3].

Both our methods (2D- and 3D-dynamic representations) are based on a walk in a space, which is one of the common approaches in this field. The 2D graphical representation methods originated in visualizations of such walks [4–6]. Approaches based on a walk in a 3D space may be found in [7–11]. They differ in the basis vectors assigned to particular bases and in the numerical characterization of the graphs. Examples of various 3D graphical representation methods may be found in [12–23].

In the present work we model a DNA sequence as a set of “material points” in a 3D space. As a consequence, the sequence is characterized by dynamical quantities, e.g., moments of inertia, analogously to the 2D-dynamic representation. We have therefore retained the name ‘3D-dynamic representation of DNA sequences’. Using the new model, we construct classification diagrams.

Method

The proposed method is based on a walk in a 3D space. Each base in a sequence is represented by a material point in the 3D space, and an abstract mass is assigned to each point. The walk starts at the point with coordinates (0,0,0). In each step the current point is shifted by a unit vector. The bases are represented by the following unit vectors: A = (−1,0,1), G = (1,0,1), C = (0,1,1), and T = (0,−1,1). At the end of each vector we place a mass m = 1. The result is the 3D-dynamic graph: a set of material points with unit masses in the 3D space, whose distribution is determined by the sequence.
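The construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the step vectors are taken from the text, while the names `STEPS` and `dynamic_graph_3d` are hypothetical and the walk is assumed to start at the origin of the 3D space.

```python
# Step vectors as given in the text; all names here are illustrative.
STEPS = {
    "A": (-1, 0, 1),
    "G": (1, 0, 1),
    "C": (0, 1, 1),
    "T": (0, -1, 1),
}

def dynamic_graph_3d(sequence):
    """Return the (x, y, z) positions of the unit masses for a DNA sequence."""
    x = y = z = 0  # assumed starting point of the walk: the origin
    points = []
    for base in sequence.upper():
        dx, dy, dz = STEPS[base]  # KeyError for symbols other than A, C, G, T
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))  # a mass m = 1 sits at the end of each step
    return points

points = dynamic_graph_3d("ACGT")
```

Because every step raises z by one, the i-th point lies at height z = i, so no two points of the graph can coincide.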

The coordinates of the center of mass of the 3D-dynamic graph in the {X,Y,Z} coordinate system are defined as

$$ {\mu}_x=\frac{{\displaystyle {\sum}_i}\;{m}_i{x}_i}{{\displaystyle {\sum}_i}\;{m}_i},\kern2em {\mu}_y=\frac{{\displaystyle {\sum}_i}\;{m}_i{y}_i}{{\displaystyle {\sum}_i}\;{m}_i},\kern2em {\mu}_z=\frac{{\displaystyle {\sum}_i}\;{m}_i{z}_i}{{\displaystyle {\sum}_i}\;{m}_i}, $$
(1)

where \( x_i \), \( y_i \), \( z_i \) are the coordinates of the mass \( m_i \). Since \( m_i = 1 \) for all the points, the total mass of the sequence is \( N = \sum_i m_i \), where N is the length of the sequence. Then, the coordinates of the center of mass of the 3D-dynamic graph may be expressed as

$$ {\mu}_x=\frac{1}{N}{\displaystyle \sum_i}\;{x}_i,\kern2em {\mu}_y=\frac{1}{N}{\displaystyle \sum_i}\;{y}_i,\kern2em {\mu}_z=\frac{1}{N}{\displaystyle \sum_i}\;{z}_i. $$
(2)
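As a rough sketch of Eq. (2), the center of mass can be accumulated during the walk without storing the points. The function name `center_of_mass` is illustrative, and the walk is assumed to start at the origin with the step vectors given in the Method section.

```python
# Step vectors from the Method section; the function name is illustrative.
STEPS = {"A": (-1, 0, 1), "G": (1, 0, 1), "C": (0, 1, 1), "T": (0, -1, 1)}

def center_of_mass(sequence):
    """Return (mu_x, mu_y, mu_z) of the 3D-dynamic graph for unit masses."""
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must contain at least one base")
    x = y = z = 0              # current position of the walk
    sum_x = sum_y = sum_z = 0  # running sums of the point coordinates
    for base in sequence:
        dx, dy, dz = STEPS[base]
        x, y, z = x + dx, y + dy, z + dz
        sum_x, sum_y, sum_z = sum_x + x, sum_y + y, sum_z + z
    return (sum_x / n, sum_y / n, sum_z / n)

mu = center_of_mass("ACGT")  # (-0.5, 0.5, 2.5)
```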

The tensor of the moment of inertia is given by the matrix

$$ \widehat{I}=\left(\begin{array}{ccc}\hfill {I}_{xx}\hfill & \hfill {I}_{xy}\hfill & \hfill {I}_{xz}\hfill \\ {}\hfill {I}_{yx}\hfill & \hfill {I}_{yy}\hfill & \hfill {I}_{yz}\hfill \\ {}\hfill {I}_{zx}\hfill & \hfill {I}_{zy}\hfill & \hfill {I}_{zz}\hfill \end{array}\right) $$
(3)

with

$$ \begin{array}{c}\hfill {I}_{xx}={\displaystyle \sum_i}\;{m}_i\left[{\left({y}_i^{\mu}\right)}^2+{\left({z}_i^{\mu}\right)}^2\right],\hfill \\ {}\hfill {I}_{yy}={\displaystyle \sum_i}\;{m}_i\left[{\left({x}_i^{\mu}\right)}^2+{\left({z}_i^{\mu}\right)}^2\right],\hfill \\ {}\hfill {I}_{zz}={\displaystyle \sum_i}\;{m}_i\left[{\left({x}_i^{\mu}\right)}^2+{\left({y}_i^{\mu}\right)}^2\right],\hfill \\ {}\hfill {I}_{xy}={I}_{yx}=-{\displaystyle \sum_i}\;{m}_i{x}_i^{\mu }{y}_i^{\mu },\hfill \\ {}\hfill {I}_{xz}={I}_{zx}=-{\displaystyle \sum_i}\;{m}_i{x}_i^{\mu }{z}_i^{\mu },\hfill \\ {}\hfill {I}_{yz}={I}_{zy}=-{\displaystyle \sum_i}\;{m}_i{y}_i^{\mu }{z}_i^{\mu },\hfill \end{array} $$
(4)

where \( x_i^{\mu} \), \( y_i^{\mu} \), \( z_i^{\mu} \) are the coordinates of \( m_i \) in the Cartesian coordinate system whose origin is at the center of mass.
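A compact way to evaluate Eq. (4) for unit masses is to use the identity that the tensor equals the sum over all points of |r|² times the identity matrix minus the outer product of r with itself, where r is the position relative to the center of mass. The sketch below assumes NumPy is available; the name `inertia_tensor` is illustrative.

```python
import numpy as np

# Step vectors from the Method section; the function name is illustrative.
STEPS = {"A": (-1, 0, 1), "G": (1, 0, 1), "C": (0, 1, 1), "T": (0, -1, 1)}

def inertia_tensor(sequence):
    """Return the 3x3 inertia tensor of the 3D-dynamic graph (unit masses)."""
    steps = np.array([STEPS[base] for base in sequence], dtype=float)
    points = np.cumsum(steps, axis=0)        # positions of the masses
    centered = points - points.mean(axis=0)  # origin moved to the center of mass
    total_r2 = np.sum(centered ** 2)         # sum over points of x^2 + y^2 + z^2
    # Diagonal: sum of squares of the other two coordinates.
    # Off-diagonal: minus the sum of coordinate products.
    return total_r2 * np.eye(3) - centered.T @ centered

tensor = inertia_tensor("ACGT")
```

For the four-base example the result is symmetric with diagonal (6, 6, 2) and a single nonzero off-diagonal pair, I_xz = I_zx = −2.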

The eigenvalue problem of the tensor of inertia is defined as

$$ \widehat{I}{\omega}_k={I}_k{\omega}_k,\kern1.5em k=1,2,3, $$
(5)

where \( I_k \) are the eigenvalues and \( \omega_k \) the eigenvectors. The eigenvalues are obtained by solving the third-order secular equation

$$ \left|\begin{array}{ccc} {I}_{xx}-I & {I}_{xy} & {I}_{xz} \\ {I}_{yx} & {I}_{yy}-I & {I}_{yz} \\ {I}_{zx} & {I}_{zy} & {I}_{zz}-I \end{array}\right|=0. $$
(6)

The eigenvectors \( \omega_1 \), \( \omega_2 \), \( \omega_3 \) are orthonormal. Thus, they form a basis for a new coordinate system. The corresponding axes of this new system are denoted \( \Omega_1 \), \( \Omega_2 \), \( \Omega_3 \) and referred to as the principal axes. The eigenvalues \( I_1 \), \( I_2 \), \( I_3 \) are called the principal moments of inertia and are equal to the moments of inertia associated with rotations around the principal axes.
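In practice the secular equation need not be solved by hand. A minimal sketch, assuming NumPy: `numpy.linalg.eigh` handles symmetric matrices and returns real eigenvalues in ascending order together with orthonormal eigenvectors stored as columns. The text does not state how \( I_1 \), \( I_2 \), \( I_3 \) are ordered, so the ascending order used here is an assumption, and the function name is illustrative. The example tensor is the one obtained for the four-base sequence ACGT with the step vectors of the Method section.

```python
import numpy as np

def principal_moments_and_axes(tensor):
    """Return (moments, axes); column k of `axes` is the eigenvector for moments[k]."""
    tensor = np.asarray(tensor, dtype=float)
    # eigh is intended for symmetric matrices: eigenvalues come back sorted
    # in ascending order and the eigenvectors are orthonormal.
    moments, axes = np.linalg.eigh(tensor)
    return moments, axes

# Inertia tensor of the graph of "ACGT" (illustrative input).
tensor = np.array([[6.0, 0.0, -2.0],
                   [0.0, 6.0, 0.0],
                   [-2.0, 0.0, 2.0]])
moments, axes = principal_moments_and_axes(tensor)
```

Note that the sign of each eigenvector is arbitrary, so any quantity derived from the axes is defined only up to that choice.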

The relative orientation of the new and old coordinate systems may be described by the cosines of properly defined angles. Let \( M_1 \), \( M_2 \), and \( M_3 \) denote, respectively, the planes (X,Y), (X,Z), and (Y,Z). Similarly, \( N_1 \), \( N_2 \), \( N_3 \) stand for the planes \( (\Omega_1,\Omega_2) \), \( (\Omega_1,\Omega_3) \), \( (\Omega_2,\Omega_3) \), respectively. For the characterization of the 3D-dynamic graphs we use the cosines of the angles between the planes of the two systems of coordinates:

$$ {C}_{ij}\equiv \cos\left({M}_i,{N}_j\right),\kern1.5em i,j=1,2,3. $$
(7)
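One way to evaluate Eq. (7) is to take the angle between two planes as the angle between their unit normals. The sketch below follows that reading, which is an assumption: the text does not spell out how the angles or their signs are defined. The normals of \( M_1 \), \( M_2 \), \( M_3 \) are then the Z, Y, and X axes, and the normals of \( N_1 \), \( N_2 \), \( N_3 \) are \( \omega_3 \), \( \omega_2 \), \( \omega_1 \). NumPy is assumed and the function name is illustrative.

```python
import numpy as np

def plane_cosines(axes):
    """Return the 3x3 array C with C[i-1, j-1] = cos(M_i, N_j).

    `axes` holds the orthonormal eigenvectors omega_1, omega_2, omega_3 as columns.
    """
    axes = np.asarray(axes, dtype=float)
    # Unit normals of M1 = (X,Y), M2 = (X,Z), M3 = (Y,Z): the Z, Y, X axes.
    old_normals = np.array([[0.0, 0.0, 1.0],
                            [0.0, 1.0, 0.0],
                            [1.0, 0.0, 0.0]])
    # Unit normals of N1, N2, N3: omega_3, omega_2, omega_1 (one per row).
    new_normals = axes[:, [2, 1, 0]].T
    # Dot products of unit vectors are the cosines of the enclosed angles.
    return old_normals @ new_normals.T

cosines = plane_cosines(np.eye(3))  # principal axes aligned with X, Y, Z
```

With this convention the signs of the entries depend on the arbitrary orientation of the eigenvectors, so a fixed sign rule (or absolute values) would be needed before comparing different sequences.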

It is also convenient to use square roots of the normalized principal moments of inertia:

$$ {r}_1=\sqrt{\frac{I_1}{N}},\kern2em {r}_2=\sqrt{\frac{I_2}{N}},\kern2em {r}_3=\sqrt{\frac{I_3}{N}}. $$
(8)
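The quantities in Eq. (8) have the dimension of length and correspond to radii of gyration about the principal axes. A minimal sketch of the computation, with an illustrative function name; the example input is the set of principal moments of the four-point graph of ACGT under the step vectors of the Method section.

```python
import math

def normalized_radii(principal_moments, n):
    """Return r_k = sqrt(I_k / N) for each principal moment I_k."""
    if n <= 0:
        raise ValueError("sequence length N must be positive")
    return tuple(math.sqrt(moment / n) for moment in principal_moments)

# Principal moments of the graph of "ACGT" (N = 4), in ascending order.
example_moments = (4 - 2 * math.sqrt(2), 6.0, 4 + 2 * math.sqrt(2))
radii = normalized_radii(example_moments, 4)
```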

As the descriptors of the 3D-dynamic graphs we take:

  • The coordinates of the centers of mass of the graphs,

  • The principal moments of inertia of the graphs,

  • The values of \( C_{ij} \).

Results and discussion

The new approach has been applied to the histone H4 coding sequences of the species listed in Table 1 and to the alpha globin coding sequences of the species listed in Table 4. All histone H4 coding sequences have length N = 312, and all alpha globin coding sequences have length N = 429.

Table 1 Coordinates of the centers of mass of the graphs representing histone H4 coding sequences

Some examples of 3D-dynamic graphs are shown in Fig. 1.

Fig. 1 Examples of 3D-dynamic graphs: No. 3 (M60749, former gene ID HSHISAD) and No. 6 (M12277, former gene ID TAH4091); see Table 1

Figure 2 shows the 2D-dynamic graph for the same sequence (No. 3 in Table 1) as in Fig. 1. 2D-dynamic graphs remove the degeneracy present in the Nandy plots [5]. This degeneracy comes from so-called repetitive walks (walks performed back and forth along the same trace). Because the points of the 2D-dynamic graphs can carry different masses, repetitive walks can be recognized both graphically and numerically (the descriptors depend on masses different from 1). However, the 2D-dynamic graphs still do not retain the history of the sequence. Introducing the third dimension avoids self-overlapping of the graph.

Fig. 2 2D-dynamic graph: No. 3 (M60749)

Numerically, each graph is characterized by descriptors. The values of the descriptors considered in this work are shown in Tables 1, 2, 3, 4, 5, and 6. Due to the choice of the unit vectors representing the four bases, \( \mu_x \) and \( \mu_y \) give information about the relative number of particular bases in the sequences, whereas \( \mu_z \) depends only on the length of the sequence. The values of \( \mu_x \) and \( \mu_y \) shown in Tables 1 and 4 are identical to those for the 2D-dynamic graphs of the same sequences [1]. New information is contained in the other descriptors (Tables 2, 3, 5, and 6). The descriptors are very sensitive: they register even a single-base difference between two sequences. Sequence No. 6 in Table 4 (EF605407) differs by two bases from the sequence (MMAGL1) used in the calculations in [1]: T in MMAGL1 is replaced by C in EF605407 at position 132, and A in MMAGL1 is replaced by G in EF605407 at position 366. The change from T to C increases \( \mu_y \), and the change from A to G increases \( \mu_x \): \( \mu_x = 15.49 \), \( \mu_y = 14.80 \) for MMAGL1, and \( \mu_x = 15.79 \), \( \mu_y = 16.19 \) for EF605407.
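Both statements about the centers of mass can be checked by hand from Eq. (2). The following worked check assumes, as in the Method section, that the i-th mass lies at height \( z_i = i \) and that replacing the base at position p shifts that point and every later point by the difference of the two step vectors, which is 2 along Y for T to C and 2 along X for A to G:

$$ {\mu}_z=\frac{1}{N}\sum_{i=1}^{N} i=\frac{N+1}{2},\kern2em \Delta {\mu}_y=\frac{2\left(N-p+1\right)}{N}\;\left(\mathrm{T}\to \mathrm{C}\right),\kern2em \Delta {\mu}_x=\frac{2\left(N-p+1\right)}{N}\;\left(\mathrm{A}\to \mathrm{G}\right). $$

For N = 429 and p = 132 this gives \( \Delta\mu_y = 2\cdot 298/429 \approx 1.39 \), in agreement with 16.19 − 14.80, and for p = 366 it gives \( \Delta\mu_x = 2\cdot 64/429 \approx 0.30 \), in agreement with 15.79 − 15.49. The first relation yields \( \mu_z = 156.5 \) for N = 312 and \( \mu_z = 215 \) for N = 429.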

Table 2 Principal moments of inertia of the graphs and cosines of the angles relative to \( M_1 \) representing histone H4 coding sequences
Table 3 Cosines of the angles relative to \( M_2 \) and \( M_3 \) representing histone H4 coding sequences
Table 4 Coordinates of the centers of mass of the graphs representing alpha globin coding sequences
Table 5 Principal moments of inertia of the graphs and cosines of the angles relative to \( M_1 \) representing alpha globin coding sequences
Table 6 Cosines of the angles relative to \( M_2 \) and \( M_3 \) representing alpha globin coding sequences

The descriptors have been used to construct the classification diagrams shown in Figs. 3, 4, 5, 6, 7, and 8. Figure 3 shows the classification diagram \( \frac{\mu_x}{r_1} \)–\( \frac{\mu_y}{r_2} \)–\( \frac{\mu_z}{r_3} \). Histone H4 coding sequences are represented in the figure by crosses and alpha globin coding sequences by triangles. The crosses and the triangles are located in different parts of the diagram, which are separated in the figure by a plane.

Fig. 3 Classification diagram \( \frac{\mu_x}{r_1} \)–\( \frac{\mu_y}{r_2} \)–\( \frac{\mu_z}{r_3} \)
Fig. 4 Classification diagram \( C_{11} \)–\( C_{12} \)–\( C_{13} \)
Fig. 5 Classification diagram \( C_{21} \)–\( C_{22} \)–\( C_{23} \)
Fig. 6 Classification diagram \( C_{31} \)–\( C_{32} \)–\( C_{33} \)
Fig. 7 Classification diagram \( \frac{\mu_x}{I_3} \)–\( \frac{\mu_y}{I_3} \)–\( \frac{\mu_z}{I_3} \)
Fig. 8 Classification diagram \( \frac{\mu_x}{r_1} \)–\( \frac{\mu_y}{r_2} \)–\( \frac{\mu_z}{r_3} \)

Using the present approach one can also create very detailed classification diagrams (in this case, for histone H4 coding sequences of evolutionarily similar organisms). The similarity matrix for the histone H4 coding sequences obtained with the standard Clustal W approach was given in [3] (all similarity values are greater than or equal to 78%). The considered sequences are rather similar to each other, and it is difficult to find a property that distinguishes between different species. A good test of a new method is therefore whether it yields descriptors that cluster the sequences of evolutionarily similar organisms: plants and vertebrates in the case of histone H4 coding sequences. Most descriptors give larger similarity values between the chicken sequences (No. 1 and 2 in Table 1) and the plant sequences than between the chicken sequences and those of vertebrates. Using the 2D-dynamic representation we found some properties that classify the sequences into plants and vertebrates [24]. In the present work, we find more descriptors that give a similar classification.

The histone H4 coding sequences of plants are represented by full squares and those of vertebrates by empty circles in Figs. 4, 5, 6, 7, and 8. Clustering of the sequences representing evolutionarily similar organisms is obtained for the parameters \( C_{ij} \), i, j = 1, 2, 3 (Figs. 4, 5, and 6) and for the descriptors composed of moments of inertia, coordinates of the centers of mass of the graphs, and the coefficients \( r_i \), i = 1, 2, 3 (Figs. 7 and 8). Figure 4 corresponds to i = 1, j = 1, 2, 3; Fig. 5 to i = 2, j = 1, 2, 3; and Fig. 6 to i = 3, j = 1, 2, 3.

The descriptors representing the sequences of plants and of vertebrates are located in different parts of the diagrams. In order to visualize the classifications, the clusters of descriptors corresponding to different species have been separated by planes.

Summarizing, both approaches (2D- and 3D-dynamic representations) are examples of graphical representation methods. The very popular methods based on sequence alignment give rather limited information about the similarity/dissimilarity of the sequences. Their degeneracy is relatively high: the same similarity value is obtained whether T, C, G, or A bases align. Graphical representation methods make it possible to consider different aspects of similarity separately, both graphically and numerically. The computing time of these methods is low.

The 3D-dynamic graphs are generalizations of the 2D-dynamic graphs. The descriptors used for the characterization of the graphs are also related to the dynamics. The proposed descriptors of the 3D-dynamic graphs lead to new classification diagrams for the considered data, analogously to the 2D-dynamic graphs [24]. The descriptors proposed for both 2D- and 3D-dynamic graphs are therefore good, reliable, and sensitive tools for similarity/dissimilarity analysis of DNA sequences. One advantage of the 3D-dynamic graphs is that they retain the history of the sequences: consecutive bases are represented by consecutive parts of the graph, and the 3D graph never overlaps with itself. The future applications of the 3D method, both as a graphical and as a numerical tool, therefore seem promising.