Descriptors of 2D-dynamic graphs as a classification tool of DNA sequences

A new tool for the classification of DNA sequences is introduced. The method is based on 2D-dynamic graphs and their descriptors. Using descriptors built from the centers of mass, the moments of inertia, and the angles between the x axis and the principal axis of inertia of the 2D-dynamic graphs, one can obtain classification diagrams in which similar sequences are clustered in separate areas.

graphical representations have been created. In particular, 2D methods, which are easy to visualize, are of interest. A review of graphical representation methods may be found in [1,24,25].
The 2D graphical representations of DNA sequences have been used in many applications, including phylogenetic relationships of coronavirus gene sequences [26], long range palindromes [27], characterization of avian flu neuraminidase genes [28], and the study of plant germplasm identifiers [29], among others. The first order descriptors have also been used to propose a scale for grading the toxicity of chemicals [30] and for the classification of SNP genes [31]. In this paper we propose to use the improved descriptors in our 2D-dynamic representation model to construct classification diagrams.
This study of the classification of DNA sequences continues our earlier works [1,9,32], in which we constructed classification diagrams based on other methods. In these diagrams different groups of sequences are located in different parts of the plots. We have also shown that, using these diagrams, sequences which differ by only one base can be distinguished [1].

Methods
In the present work we use a graphical representation of DNA sequences that we call the 2D-dynamic representation. In this method a DNA sequence is represented by a 2D graph described in [8]. The name of the representation is borrowed from another area of science because of the form of the descriptors (numerical characteristics) representing these graphs. The sequence is represented by material-like points which are treated as rigid bodies, as in Newtonian dynamics. We have proposed several descriptors related to the 2D-dynamic graphs: centers of mass [8], moments of inertia of the graphs [8], moments of the mass-density distribution [33,34], and angles between the x axis and the principal axis of inertia of the graphs [35]. These descriptors were the basis for the creation of similarity measures between the sequences. We also performed a similarity/dissimilarity analysis using mass-overlaps of the 2D-dynamic graphs [35].
In the present work we construct the classification diagrams using descriptors built from the coordinates of the center of mass ($\mu_x$, $\mu_y$), the principal moments of inertia ($I_{11}$, $I_{22}$), and the angle between the x axis and the principal axis of inertia of the 2D-dynamic graph ($\alpha$).
We define the coordinates of the center of mass in the same way as in dynamics:

$$\mu_x = \frac{\sum_i m_i x_i}{\sum_i m_i}, \qquad \mu_y = \frac{\sum_i m_i y_i}{\sum_i m_i},$$

where $x_i$, $y_i$ are the coordinates of the mass $m_i$ in the Cartesian coordinate system for which the point (0,0) is the origin, the same for all the sequences. The total mass of the graph is equal to the sum of the masses, $\sum_i m_i$, i.e. it is equal to the length of the sequence. The moment of inertia tensor is also defined as in dynamics (see also [8]); its principal moments are $I_{11}$ and $I_{22}$. The descriptors $D_k^{\gamma}$ are labeled by $\gamma = x, y$ and $k = 1, 2$.
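The quantities named above can be illustrated with a short numerical sketch. It is not the implementation used in [8]; it assumes unit masses by default, assumes that the moment of inertia tensor is taken about the center of mass, labels the principal moments in ascending order ($I_{11} \le I_{22}$), and measures $\alpha$ for the principal axis belonging to the smaller moment. The function name `graph_descriptors` and the returned keys are hypothetical.

```python
import numpy as np


def graph_descriptors(x, y, m=None):
    """Center of mass, principal moments of inertia and axis angle of 2D point masses."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumption: every point carries unit mass unless masses are given.
    m = np.ones_like(x) if m is None else np.asarray(m, dtype=float)

    # Center of mass, as in dynamics.
    total_mass = m.sum()
    mu_x = (m * x).sum() / total_mass
    mu_y = (m * y).sum() / total_mass

    # Assumption: the inertia tensor is taken about the center of mass.
    dx, dy = x - mu_x, y - mu_y
    i_xy = -(m * dx * dy).sum()
    tensor = np.array([[(m * dy**2).sum(), i_xy],
                       [i_xy, (m * dx**2).sum()]])

    # eigh returns the eigenvalues of a symmetric matrix in ascending order,
    # with the matching eigenvectors as columns.
    moments, axes = np.linalg.eigh(tensor)

    # Assumption: alpha is the angle between the x axis and the principal axis
    # with the smaller moment; an axis has no direction, so the angle is
    # folded into the interval [-pi/2, pi/2).
    vx, vy = axes[:, 0]
    alpha = (np.arctan2(vy, vx) + np.pi / 2) % np.pi - np.pi / 2

    return {
        "total_mass": float(total_mass),
        "mu_x": float(mu_x),
        "mu_y": float(mu_y),
        "I_11": float(moments[0]),
        "I_22": float(moments[1]),
        "alpha": float(alpha),
    }
```

For four unit masses placed at x = 0, 1, 2, 3 on the x axis, the sketch gives a center of mass at (1.5, 0), a vanishing moment about the axis along the points, and a moment of 5 about the perpendicular axis.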
The descriptors are related to particular properties of the graphs, and their interpretation is intuitive and analogous to that in dynamics. The moments of inertia are associated with rotations about the principal axes. If the mass is concentrated close to the axis of rotation, the moment of inertia is small and it is easier to accelerate the spinning of the body. If the mass is dispersed, the moment of inertia is large and the acceleration of spinning is more difficult. Thus, these descriptors carry information about the concentration of masses around the axes. The location of the center of mass of the 2D-dynamic graph depends on the number of particular bases in the sequence. Each base is represented by a 2D unit vector in the (x, y) plane: A = (−1,0), G = (1,0), C = (0,1), T = (0,−1). Since the graph is obtained using a walk in 2D space [8], the location of the graph depends on the relative number of particular bases. Thus if, for example, the number of A bases is larger than the number of G bases, then the graph is shifted towards negative x values by the appropriate amount.
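The walk and the resulting shift of the center of mass can be sketched as follows. This is a minimal illustration rather than the procedure of [8]: it assumes that the walk starts at the origin and that each step deposits a unit mass at its end point, so that the total mass equals the sequence length. The names `STEPS`, `walk_2d` and `center_of_mass` are hypothetical.

```python
# Unit vectors assigned to the bases, as given in the text.
STEPS = {"A": (-1, 0), "G": (1, 0), "C": (0, 1), "T": (0, -1)}


def walk_2d(sequence):
    """Return the end point of every step of the 2D walk for a DNA sequence.

    Assumption: the walk starts at (0, 0) and each step places a unit mass
    at its end point. Symbols other than A, G, C, T raise a KeyError.
    """
    x = y = 0
    points = []
    for base in sequence.upper():
        dx, dy = STEPS[base]
        x += dx
        y += dy
        points.append((x, y))
    return points


def center_of_mass(points):
    """Center of mass of unit masses located at the given points."""
    n = len(points)
    mu_x = sum(px for px, _ in points) / n
    mu_y = sum(py for _, py in points) / n
    return mu_x, mu_y
```

For the toy sequence `"AAG"` the walk visits (−1,0), (−2,0), (−1,0), and the center of mass has a negative x coordinate, in line with the excess of A over G bases.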
In the present work we consider diagrams $D_k^{\gamma}$ − $D_l^{\beta}$, k = l. We show that using these diagrams one can classify different groups of DNA sequences. We also use a diagram $\alpha$ − $I_{22}$ for some detailed classification (see subsequent section).

Results and discussion
The descriptors for histone H4 coding sequences and alpha globin coding sequences of different species have been calculated using the values of the centers of mass, the moments of inertia, and the angles α between the x axis and the principal axes of the 2D-dynamic graphs obtained in our earlier works [8,35]. The descriptors have been used to construct the classification diagrams. Some examples of the 2D-dynamic graphs for the sequences under consideration are shown in Figs. 1, 2. Figures 3, 4 show the classification diagrams based on the descriptors defined in Eq. 9 of the 2D-dynamic graphs. We observe that the descriptors corresponding to histone H4 coding sequences are located in different parts of the diagrams than the descriptors corresponding to alpha globin coding sequences. Each point corresponds to a different species (for details about histone H4 coding sequences see [1] and about alpha globin coding sequences see [8]).
[Figure caption: Squares correspond to histone H4 coding sequences and circles correspond to alpha globin coding sequences.]
[...] sequences of plants and of vertebrates we have already performed using moments of the mass-density distributions of the 2D-dynamic graphs [1] and using the descriptors of Four-Component Spectral Representation of DNA sequences [1,32].
Summarizing, the variety of graphical representations of DNA sequences makes it possible to consider different properties of the sequences. Different aspects of similarity can be compared. It is interesting that ideas brought from different areas of science can be mixed and, in effect, reveal various aspects of similarity of the DNA sequences. In particular, the Four-Component Spectral Representation [9] only visually resembles a molecular spectrum. Likewise, 2D-dynamic graphs are not real dynamical objects. However, using methods and terminology from other fields one can obtain a convenient and intuitive tool for the classification of DNA sequences.
[Figure caption: The symbols are the same as in Fig. 3.]