Abstract
A new 3D graphical representation of DNA sequences is introduced. This representation is called the 3D-dynamic representation. It is a generalization of the 2D-dynamic representation. The sequences are represented by sets of “material points” in the 3D space. The resulting 3D-dynamic graphs are treated as rigid bodies. The descriptors characterizing the graphs are analogous to the ones used in classical dynamics. The classification diagrams derived from this representation are presented and discussed. Due to the third dimension, “the history of the graph” can be recognized graphically because the 3D-dynamic graph does not overlap with itself. Specific parts of the graphs correspond to specific parts of the sequence. This feature is essential for graphical comparisons of the sequences. Numerically, both the 2D and the 3D approaches are sensitive: a difference in a single base between two sequences can be identified and correctly described (one can identify which base) by both methods.
Introduction
In modern biomedical sciences, methods derived from physics, mathematics, and numerical analysis are frequently applied, which makes this branch of science interdisciplinary. The analysis of biological sequences (DNA, RNA, protein) in particular combines methods from several fields. Graphical representations are powerful tools here because they allow both graphical and numerical characterization of the sequences. The sequences are usually very long, and it is not obvious how to represent them. The questions of how to avoid degeneracy and how to express the features of the sequences both graphically and numerically have led to numerous methods.
In the present work, we introduce a new 3D graphical representation method. The proposed method is a 3D generalization of the 2D-dynamic representation of DNA sequences [1]. The 2D-dynamic graphs represent the DNA sequences. They are composed of “material points” distributed in a 2D space, and their distribution is determined by the sequence. We proposed the moments of inertia and the coordinates of the centers of mass of the 2D-dynamic graphs for the numerical characterization of the DNA sequences [1]. We also considered the high-order moments of the mass-density distributions based on 2D-dynamic graphs as descriptors [2]. The mass overlaps and the angles between the X axis and the principal axis of inertia are also used for the description of similarity/dissimilarity of the DNA sequences [3].
Both our methods (2D and 3D-dynamic representations) are based on a walk in a space, which is one of the common approaches in this field. The 2D graphical representation methods originated in visualizations of these walks [4–6]. Approaches based on a walk in a 3D space may be found in [7–11]. They differ in the basis vectors assigned to particular bases and in the numerical characterizations of the graphs. Examples of various 3D graphical representation methods may be found in [12–23].
In the present work we model a DNA sequence as a set of “material points” in the 3D space. As a consequence, the sequence is characterized by dynamical quantities, e.g., moments of inertia, as in the 2D-dynamic representation. Therefore we retained the name ‘3D-dynamic representation of DNA sequences’. Using the new model we construct the classification diagrams.
Method
The proposed method is based on the convention of a walk in a 3D space. A base in a sequence is represented by a material point in the 3D space. To each point an abstract mass is assigned. We start the walk at the point with coordinates (0,0,0). In each step this point is shifted by a unit vector. We represent the bases by the following unit vectors: A = (−1,0,1), G = (1,0,1), C = (0,1,1), and T = (0,−1,1). At the end of each vector we locate a mass m = 1. As a consequence, the 3D-dynamic graph is obtained. It consists of material points with unit masses in the 3D space. The distribution of the points in the space is determined by the sequence.
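The walk described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the names `STEPS` and `dynamic_graph_3d` are hypothetical, and the sketch assumes that the starting point itself carries no mass, so that a sequence of length N yields N points. Note that the four step vectors have Euclidean length √2; the sketch uses them exactly as listed in the text.

```python
# Step vectors as listed in the text; every base advances the walk by 1 along Z.
STEPS = {"A": (-1, 0, 1), "G": (1, 0, 1), "C": (0, 1, 1), "T": (0, -1, 1)}


def dynamic_graph_3d(sequence):
    """Return the (x, y, z) positions of the unit masses, one per base."""
    x = y = z = 0  # the walk starts at the origin (assumed to carry no mass)
    points = []
    for base in sequence.upper():
        dx, dy, dz = STEPS[base]  # raises KeyError for symbols other than A, C, G, T
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))  # a mass m = 1 sits at the end of each step
    return points


points = dynamic_graph_3d("ATGC")
print(points)  # [(-1, 0, 1), (-1, -1, 2), (0, -1, 3), (0, 0, 4)]
```

Because the z coordinate of the i-th point equals i, no two points of the graph can coincide.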
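The coordinates of the center of mass of the 3D-dynamic graph in the {X,Y,Z} coordinate system are defined as
\[ \mu_x = \frac{\sum_i m_i x_i}{\sum_i m_i}, \qquad \mu_y = \frac{\sum_i m_i y_i}{\sum_i m_i}, \qquad \mu_z = \frac{\sum_i m_i z_i}{\sum_i m_i}, \]
where \( x_i \), \( y_i \), \( z_i \) are the coordinates of the mass \( m_i \). Since \( m_i = 1 \) for all the points, the total mass of the sequence is \( N = \sum_i m_i \), where \( N \) is the length of the sequence. Then, the coordinates of the center of mass of the 3D-dynamic graph may be expressed as
\[ \mu_x = \frac{1}{N}\sum_i x_i, \qquad \mu_y = \frac{1}{N}\sum_i y_i, \qquad \mu_z = \frac{1}{N}\sum_i z_i. \]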
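For unit masses the center of mass is simply the mean of each coordinate. The following sketch illustrates this; `walk` and `center_of_mass` are hypothetical helper names, and the walk convention (start at the origin, one unit mass at the end of each step) is the one assumed from the Method text.

```python
STEPS = {"A": (-1, 0, 1), "G": (1, 0, 1), "C": (0, 1, 1), "T": (0, -1, 1)}


def walk(sequence):
    """Positions of the unit masses of the 3D-dynamic graph."""
    pos, points = (0, 0, 0), []
    for base in sequence:
        pos = tuple(p + s for p, s in zip(pos, STEPS[base]))
        points.append(pos)
    return points


def center_of_mass(points):
    """(mu_x, mu_y, mu_z) for unit masses: the mean of each coordinate."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


mu = center_of_mass(walk("ATGC"))
print(mu)  # (-0.5, -0.5, 2.5)
```

Since the i-th point has z = i, the third coordinate is always (N + 1)/2, which is why it carries information about the sequence length only.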
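The tensor of the moment of inertia is given by the matrix
\[ \hat{I} = \begin{pmatrix} I_{xx} & I_{xy} & I_{xz} \\ I_{yx} & I_{yy} & I_{yz} \\ I_{zx} & I_{zy} & I_{zz} \end{pmatrix} \]
with
\[ I_{xx} = \sum_i m_i \left[ (y_i^{\mu})^2 + (z_i^{\mu})^2 \right], \quad I_{yy} = \sum_i m_i \left[ (x_i^{\mu})^2 + (z_i^{\mu})^2 \right], \quad I_{zz} = \sum_i m_i \left[ (x_i^{\mu})^2 + (y_i^{\mu})^2 \right], \]
\[ I_{xy} = I_{yx} = -\sum_i m_i x_i^{\mu} y_i^{\mu}, \quad I_{xz} = I_{zx} = -\sum_i m_i x_i^{\mu} z_i^{\mu}, \quad I_{yz} = I_{zy} = -\sum_i m_i y_i^{\mu} z_i^{\mu}, \]
where \( x_i^{\mu} = x_i - \mu_x \), \( y_i^{\mu} = y_i - \mu_y \), \( z_i^{\mu} = z_i - \mu_z \) are the coordinates of \( m_i \) in the Cartesian coordinate system for which the origin has been selected at the center of mass.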
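A minimal pure-Python sketch of the inertia tensor for unit masses follows. It assumes the standard rigid-body definition (diagonal terms sum the squares of the two other centered coordinates, off-diagonal terms are the negated sums of products); `walk` and `inertia_tensor` are hypothetical names introduced for illustration.

```python
STEPS = {"A": (-1, 0, 1), "G": (1, 0, 1), "C": (0, 1, 1), "T": (0, -1, 1)}


def walk(sequence):
    """Positions of the unit masses of the 3D-dynamic graph."""
    pos, points = (0, 0, 0), []
    for base in sequence:
        pos = tuple(p + s for p, s in zip(pos, STEPS[base]))
        points.append(pos)
    return points


def inertia_tensor(points):
    """3x3 inertia tensor of unit masses about their center of mass."""
    n = len(points)
    mu = [sum(p[k] for p in points) / n for k in range(3)]
    tensor = [[0.0] * 3 for _ in range(3)]
    for p in points:
        r = [p[k] - mu[k] for k in range(3)]  # centered coordinates
        r2 = sum(c * c for c in r)
        for a in range(3):
            for b in range(3):
                # I_ab = sum_i (delta_ab * |r_i|^2 - r_ia * r_ib)
                tensor[a][b] += (r2 if a == b else 0.0) - r[a] * r[b]
    return tensor


tensor = inertia_tensor(walk("ATGC"))
for row in tensor:
    print(row)
```

The tensor is symmetric by construction, and its trace equals twice the sum of the squared distances of the points from the center of mass.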
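The eigenvalue problem of the tensor of inertia is defined as
\[ \hat{I}\, \omega_k = I_k\, \omega_k, \qquad k = 1, 2, 3, \]
where \( I_k \) are the eigenvalues and \( \omega_k \) are the eigenvectors. The eigenvalues are obtained by solving the third-order secular equation
\[ \det\left( \hat{I} - I \hat{E} \right) = 0, \]
where \( \hat{E} \) is the 3 × 3 unit matrix.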
The eigenvectors \( \omega_1 \), \( \omega_2 \), \( \omega_3 \) are orthonormal. Thus, they form a basis for a new coordinate system. The corresponding axes of this new system are denoted \( \Omega_1 \), \( \Omega_2 \), \( \Omega_3 \) and referred to as the principal axes. The eigenvalues \( I_1 \), \( I_2 \), \( I_3 \) are called the principal moments of inertia and are equal to the moments of inertia associated with the rotations around the principal axes.
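In practice the secular equation need not be solved by hand: a symmetric eigenvalue solver returns the principal moments and axes directly. The sketch below uses NumPy's `numpy.linalg.eigh`; the helper names are hypothetical, and the ascending ordering of the eigenvalues is NumPy's convention, which may differ from the labeling of the principal moments used in the paper.

```python
import numpy as np

STEPS = {"A": (-1, 0, 1), "G": (1, 0, 1), "C": (0, 1, 1), "T": (0, -1, 1)}


def walk(sequence):
    """Positions of the unit masses as an (N, 3) float array."""
    return np.cumsum([STEPS[b] for b in sequence], axis=0).astype(float)


def inertia_tensor(points):
    """Inertia tensor of unit masses about their center of mass."""
    r = points - points.mean(axis=0)
    # sum_i (|r_i|^2 * E - r_i r_i^T), written with matrix products
    return np.sum(r * r) * np.eye(3) - r.T @ r


def principal_moments_and_axes(points):
    """Eigenvalues (ascending) and orthonormal eigenvectors (as columns)."""
    return np.linalg.eigh(inertia_tensor(points))


tensor = inertia_tensor(walk("ATGC"))
moments, axes = principal_moments_and_axes(walk("ATGC"))
print(moments)
print(axes)
```

The sign of each eigenvector is arbitrary, so any quantity derived from the axes should either be sign-independent or fix a sign convention explicitly.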
The relative orientation of the new and old coordinate systems may be described by the cosines of properly defined angles. Let \( M_1 \), \( M_2 \), and \( M_3 \) denote, respectively, the planes (X,Y), (X,Z), and (Y,Z). Similarly, \( N_1 \), \( N_2 \), \( N_3 \) stand for the planes \( (\Omega_1, \Omega_2) \), \( (\Omega_1, \Omega_3) \), \( (\Omega_2, \Omega_3) \), respectively. For the characterization of the 3D-dynamic graphs we use the cosines of the angles between the planes of the two systems of coordinates:
\[ C_{ij} = \cos\left( M_i, N_j \right), \qquad i, j = 1, 2, 3. \]
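The angle between two planes equals the angle between their normals, so the nine cosines can be read off from the eigenvector matrix. The sketch below is one possible reading of the definition, not a confirmed reproduction of the authors' procedure: the function name `plane_cosines` is hypothetical, the eigenvalues are taken in NumPy's ascending order, and no sign convention is imposed on the eigenvectors, so individual cosines are determined only up to sign.

```python
import numpy as np

STEPS = {"A": (-1, 0, 1), "G": (1, 0, 1), "C": (0, 1, 1), "T": (0, -1, 1)}


def principal_axes(sequence):
    """Orthonormal principal axes (columns) of the 3D-dynamic graph."""
    points = np.cumsum([STEPS[b] for b in sequence], axis=0).astype(float)
    r = points - points.mean(axis=0)
    tensor = np.sum(r * r) * np.eye(3) - r.T @ r
    _, axes = np.linalg.eigh(tensor)
    return axes


def plane_cosines(axes):
    """C[i, j]: cosine between the normals of planes M_(i+1) and N_(j+1)."""
    # Normals of (X,Y), (X,Z), (Y,Z) are the Z, Y, X unit vectors.
    normals_old = np.eye(3)[[2, 1, 0]]
    # Normals of (O1,O2), (O1,O3), (O2,O3) are the third, second, first axes.
    normals_new = axes[:, [2, 1, 0]].T
    return normals_old @ normals_new.T


axes = principal_axes("ATGCGTACCTGA")
C = plane_cosines(axes)
print(C)
```

If a sign-independent descriptor is preferred, `np.abs(C)` can be used instead; which choice matches the published tables is not stated in the text.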
It is also convenient to use square roots of the normalized principal moments of inertia:
\[ r_i = \sqrt{\frac{I_i}{N}}, \qquad i = 1, 2, 3. \]
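A short sketch of these quantities follows, under the assumption that "normalized" means divided by the total mass N (for unit masses, this makes each value the radius of gyration about the corresponding principal axis). The helper names are hypothetical, and the eigenvalues are again returned in ascending order.

```python
import numpy as np

STEPS = {"A": (-1, 0, 1), "G": (1, 0, 1), "C": (0, 1, 1), "T": (0, -1, 1)}


def principal_moments(sequence):
    """Principal moments of inertia of the 3D-dynamic graph (ascending)."""
    points = np.cumsum([STEPS[b] for b in sequence], axis=0).astype(float)
    r = points - points.mean(axis=0)
    return np.linalg.eigvalsh(np.sum(r * r) * np.eye(3) - r.T @ r)


def normalized_roots(sequence):
    """sqrt(I_k / N) for k = 1, 2, 3; N is assumed to be the total mass."""
    moments = principal_moments(sequence)
    # Clip tiny negative round-off values that can occur for collinear graphs.
    return np.sqrt(np.clip(moments, 0.0, None) / len(sequence))


r = normalized_roots("ATGC")
print(r)
```

Dividing by N makes the values comparable between sequences of different lengths, which is presumably why they appear in the denominators of the classification-diagram coordinates.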
As the descriptors of the 3D-dynamic graphs we take:
- The coordinates of the centers of mass of the graphs,
- The principal moments of inertia of the graphs,
- The values of \( C_{ij} \).
Results and discussion
The new approach has been applied to histone H4 coding sequences of different species listed in Table 1 and to alpha globin coding sequences of different species listed in Table 4. The length of all histone H4 coding sequences is N = 312 and that of all alpha globin coding sequences is N = 429.
Some examples of 3D-dynamic graphs are shown in Fig. 1.
Figure 2 shows the 2D-dynamic graph for the same sequence (No. 3 in Table 1) as in Fig. 1. 2D-dynamic graphs remove the degeneracy present in the Nandy plots [5]. This degeneracy comes from the so-called repetitive walks (walks performed back and forth along the same trace). Because the points of the 2D-dynamic graphs can carry different masses, the repetitive walks can be recognized both graphically and numerically (the descriptors depend on masses different from 1). However, the 2D-dynamic graphs still do not retain the history of the sequence. By introducing the third dimension one avoids self-overlapping of the graph.
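The difference between the two representations can be illustrated with a short repetitive walk. The sketch assumes that the 2D walk uses the x and y components of the 3D step vectors and that a revisited 2D point accumulates mass; the function names are hypothetical.

```python
from collections import Counter

STEPS = {"A": (-1, 0, 1), "G": (1, 0, 1), "C": (0, 1, 1), "T": (0, -1, 1)}


def walk_3d(sequence):
    """Positions of the unit masses of the 3D-dynamic graph."""
    pos, points = (0, 0, 0), []
    for base in sequence:
        pos = tuple(p + s for p, s in zip(pos, STEPS[base]))
        points.append(pos)
    return points


def masses_2d(sequence):
    """Assumed 2D-dynamic graph: drop z and add up masses at revisited points."""
    return Counter((x, y) for x, y, _ in walk_3d(sequence))


seq = "AGAG"  # a back-and-forth (repetitive) walk
print(masses_2d(seq))          # two points, each visited twice
print(len(set(walk_3d(seq))))  # 4: every base has its own point in 3D
```

In 2D the four bases collapse onto two points of mass 2, so the order of visits is lost; in 3D each base keeps its own point because z grows by 1 at every step.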
Numerically, each graph is characterized by descriptors. The values of the descriptors considered in this work are shown in Tables 1, 2, 3, 4, 5, and 6. Due to the choice of the unit vectors representing the four bases, \( \mu_x \) and \( \mu_y \) give information about the relative number of particular bases in the sequences, and \( \mu_z \) contains information about the lengths of the sequences only. The values of \( \mu_x \) and \( \mu_y \) shown in Tables 1 and 4 are identical to those for the 2D-dynamic graphs of the same sequences [1]. New information is contained in the other descriptors (Tables 2, 3, 5, and 6). The descriptors are very sensitive: they correctly identify a single-base difference between two sequences. Sequence No. 6 in Table 4 (EF605407) differs by two bases from the sequence (MMAGL1) used in the calculations in [1]. The base T in MMAGL1 is replaced by C in EF605407 at position 132, and the base A in MMAGL1 is replaced by G in EF605407 at position 366. As a consequence of the change from T to C, \( \mu_y \) increased, and as a consequence of the change from A to G, \( \mu_x \) increased: \( \mu_x = 15.49 \), \( \mu_y = 14.80 \) for MMAGL1, and \( \mu_x = 15.79 \), \( \mu_y = 16.19 \) for EF605407.
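The size of these shifts can be checked by a short calculation. Replacing A by G (or T by C) at position p changes the corresponding step component from −1 to +1, which moves all points from p to N by 2 units, so the mean coordinate grows by 2(N − p + 1)/N. The sketch below evaluates this for N = 429 and confirms it on a synthetic sequence; the sequence is an arbitrary stand-in, not MMAGL1 or EF605407, and the function names are hypothetical.

```python
STEP_XY = {"A": (-1, 0), "G": (1, 0), "C": (0, 1), "T": (0, -1)}


def mean_xy(sequence):
    """(mu_x, mu_y) of the walk; identical for the 2D and 3D graphs."""
    x = y = sum_x = sum_y = 0
    for base in sequence:
        dx, dy = STEP_XY[base]
        x, y = x + dx, y + dy
        sum_x, sum_y = sum_x + x, sum_y + y
    return sum_x / len(sequence), sum_y / len(sequence)


def substitution_shift(n, position):
    """Growth of mu_x (A->G) or mu_y (T->C) for a change at 1-based position."""
    return 2 * (n - position + 1) / n


N = 429
print(round(substitution_shift(N, 366), 2))  # 0.3  (reported: 15.79 - 15.49)
print(round(substitution_shift(N, 132), 2))  # 1.39 (reported: 16.19 - 14.80)

# Synthetic check: force A at 366 and T at 132, then apply both substitutions.
original = list(("ACGT" * 108)[:N])
original[366 - 1], original[132 - 1] = "A", "T"
mutant = original.copy()
mutant[366 - 1], mutant[132 - 1] = "G", "C"
dx = mean_xy(mutant)[0] - mean_xy(original)[0]
dy = mean_xy(mutant)[1] - mean_xy(original)[1]
```

Both computed shifts agree with the differences between the reported values to two decimal places, and the position dependence shows why the descriptors identify where in the sequence a change occurred as well as which base changed.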
The descriptors have been used for the construction of the classification diagrams shown in Figs. 3, 4, 5, 6, 7, and 8. Figure 3 shows the classification diagram \( \frac{\mu_x}{r_1} \)–\( \frac{\mu_y}{r_2} \)–\( \frac{\mu_z}{r_3} \). The descriptors of the histone H4 coding sequences are represented in the figure by crosses and those of the alpha globin coding sequences by triangles. The crosses and the triangles are located in different parts of the diagram. In the figure these parts are separated by a plane.
Using the present approach one can also create very detailed classification diagrams (in this case, for histone H4 coding sequences of evolutionarily similar organisms). The similarity matrix for the histone H4 coding sequences, obtained with the standard Clustal W approach, was given in [3] (all similarity values are at least 78%). The considered sequences are rather similar to each other, and it is difficult to find a property that distinguishes between different species. A good test of a new method is therefore to find descriptors whose values cluster for sequences of evolutionarily similar organisms: plants and vertebrates in the case of histone H4 coding sequences. Most of the descriptors give larger similarity values between the sequences of chicken (Nos. 1 and 2 in Table 1) and the sequences of plants than between the sequences of chicken and those of vertebrates. Using the 2D-dynamic representation we found some properties that in effect classify the sequences into plants and vertebrates [24]. In the present work, we find more descriptors that give a similar classification.
The histone H4 coding sequences of plants are represented by full squares, and those of vertebrates by empty circles, in Figs. 4, 5, 6, 7, and 8. A clusterization of the sequences representing evolutionarily similar organisms is obtained for the parameters \( C_{ij} \), \( i, j = 1, 2, 3 \) (Figs. 4, 5, and 6) and for the descriptors composed of moments of inertia, coordinates of centers of mass of the graphs, and the coefficients \( r_i \), \( i = 1, 2, 3 \) (Figs. 7 and 8). Figure 4 corresponds to \( i = 1 \), \( j = 1, 2, 3 \), Fig. 5 to \( i = 2 \), \( j = 1, 2, 3 \), and Fig. 6 to \( i = 3 \), \( j = 1, 2, 3 \).
The descriptors representing the sequences of plants and of vertebrates are located in different parts of the diagrams. In order to visualize the classifications, the clusters of descriptors corresponding to different species have been separated by planes.
Summarizing, both approaches (2D and 3D-dynamic representations) are examples of graphical representation methods. The very popular methods based on the alignment of the sequences give rather limited information about similarity/dissimilarity of the sequences. Their degeneracy is relatively high: the same similarity value is obtained whether the aligned bases are T, C, G, or A. Using graphical representation methods one can consider different aspects of similarity separately, both graphically and numerically. The computing time of these methods is low.
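The degeneracy of alignment-type scores can be illustrated with a toy comparison. The sketch uses a plain fraction of identical positions for ungapped, equal-length sequences as a stand-in for an alignment score; this is a simplification and not Clustal W. The sequences and function names are hypothetical.

```python
STEPS = {"A": (-1, 0, 1), "G": (1, 0, 1), "C": (0, 1, 1), "T": (0, -1, 1)}


def identity(a, b):
    """Fraction of identical positions in two ungapped, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def center_of_mass(sequence):
    """(mu_x, mu_y, mu_z) of the 3D-dynamic graph with unit masses."""
    pos, sums = [0, 0, 0], [0, 0, 0]
    for base in sequence:
        pos = [p + s for p, s in zip(pos, STEPS[base])]
        sums = [t + p for t, p in zip(sums, pos)]
    return tuple(t / len(sequence) for t in sums)


reference = "ACGTACGT"
variant_1 = "GCGTACGT"  # A -> G at position 1
variant_2 = "ACGTACGC"  # T -> C at position 8

print(identity(reference, variant_1), identity(reference, variant_2))  # 0.875 0.875
print(center_of_mass(reference))  # (-0.5, 0.5, 4.5)
print(center_of_mass(variant_1))  # (1.5, 0.5, 4.5)
print(center_of_mass(variant_2))  # (-0.5, 0.75, 4.5)
```

Both variants have the same identity with the reference, yet their centers of mass move along different axes and by different amounts, reflecting which base changed and where.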
The 3D-dynamic graphs are generalizations of the 2D-dynamic graphs. The descriptors used for the characterization of the graphs are also related to dynamics. The proposed descriptors of the 3D-dynamic graphs lead to new classification diagrams for the considered data, as was the case for the 2D-dynamic graphs [24]. Therefore the descriptors proposed for both 2D and 3D-dynamic graphs are good, reliable, and sensitive tools for similarity/dissimilarity analysis of DNA sequences. The 3D-dynamic graphs retain the history of the sequences, and this is one of their advantages. The consecutive bases in the sequences are represented by the appropriate parts of the 3D-dynamic graphs (the 3D graph never overlaps with itself). Therefore the future applications of the 3D method, both as a graphical and as a numerical tool, seem promising.
References
Bielińska-Wąż D, Clark T, Wąż P, Nowak W, Nandy A (2007) 2D-dynamic representation of DNA sequences. Chem Phys Lett 442:140–144
Bielińska-Wąż D, Nowak W, Wąż P, Nandy A, Clark T (2007) Distribution moments of 2D-graphs as descriptors of DNA sequences. Chem Phys Lett 443:408–413
Bielińska-Wąż D, Wąż P, Clark T (2007) Similarity studies of DNA sequences using genetic methods. Chem Phys Lett 445:68–73
Gates MA (1985) Simpler DNA sequence representations. Nature 316:219
Nandy A (1994) A new graphical representation and analysis of DNA sequence structure. I: Methodology and application to globin genes. Curr Sci 66:309–314
Leong PM, Morgenthaler S (1995) Random walk and gap plots of DNA sequences. Comput Appl Biosci 11:503–507
Hamori E, Ruskin J (1983) H Curves, a novel method of representation of nucleotide series especially suited for long DNA sequences. J Biol Chem 258:1318–1327
Randić M, Vračko M, Nandy A, Basak SC (2000) On 3-D graphical representation of DNA primary sequences and their numerical characterization. J Chem Inf Comp Sci 40:1235–1244
Li C, Wang J (2004) On a 3-D Representation of DNA Primary Sequences. Comb Chem High Throughput Screen 7:23–27
Yao Y, Nan X, Wang T (2005) Analysis of similarity/dissimilarity of DNA sequences based on a 3-D graphical representation. Chem Phys Lett 411:248–255
Yang Y, Zhang Y, Jia M, Li C, Meng L (2013) Comb Chem High Throughput Screen 16:585–589
Yuan C, Liao B, Wang T (2003) New 3D graphical representation of DNA sequences and their numerical characterization. Chem Phys Lett 379:412–417
Zhang C-T, Zhang R, Ou H-Y (2003) The Z curve database: a graphic representation of genome sequences. Bioinformatics 19:593–599
Liao B, Wang T (2004) 3-D graphical representation of DNA sequences and their numerical characterization. J Mol Struct Theochem 681:209–212
Liao B, Wang T (2004) Analysis of similarity/dissimilarity of DNA sequences based on 3-D graphical representation. Chem Phys Lett 388:195–200
Liao B, Zhang Y, Ding K, Wang TJ (2005) Analysis of similarity/dissimilarity of DNA sequences based on a condensed curve representation. J Mol Struct Theochem 717:199–203
Cao Z, Liao B, Li R (2008) A group of 3D graphical representation of DNA sequences based on dual nucleotides. Int J Quantum Chem 108:1485–1490
Pesek I, Žerovnik J (2008) A numerical characterization of modified Hamori curve representation of DNA sequences. MATCH Commun Math Comput Chem 60:301–312
Chen W, Liao B, Xiang X, Zhu W (2009) An Improved Binary Representation of DNA Sequences and Its Applications. MATCH Commun Math Comput Chem 61:767–780
Cao Z, Li R, Chen W (2010) A 3D graphical representation of DNA sequence based on numerical coding method. Int J Quantum Chem 110:975–985
Yu J-F, Wang J-H, Sun X (2010) Analysis of similarities/dissimilarities of DNA sequences based on a novel graphical representation. MATCH Commun Math Comput Chem 63:493–512
Li Y, Qin Y, Zheng X, Zhang Y (2012) Three-unit semicircles curve: A compact 3D graphical representation of DNA sequences based on classifications of nucleotides. Int J Quantum Chem 112:2330–2335
Jafarzadeh N, Iranmanesh A (2013) C-curve: a novel 3D graphical representation of DNA sequence based on codons. Math Biosci 241:217–224
Wąż P, Bielińska-Wąż D, Nandy A (2014) Descriptors of 2D-dynamic graphs as a classification tool of DNA sequences. J Math Chem 52:132–140
Additional information
This paper belongs to a Topical Collection on the occasion of Prof. Tim Clark’s 65th birthday
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
About this article
Cite this article
Wąż, P., Bielińska-Wąż, D. 3D-dynamic representation of DNA sequences. J Mol Model 20, 2141 (2014). https://doi.org/10.1007/s00894-014-2141-8