1 Introduction

In modern biology, DNA sequences from a wide range of species, from humans to simple cells, are being sequenced and stored in DNA data banks as big data streams. It is difficult to process these DNA streams to classify and identify species from whole sequences. A main task of current genomic research [1, 2] is to obtain more biological information by processing and analyzing DNA sequences from multiple angles and at multiple levels [4,5,6,7]. In recent years, biological gene data have been processed and used in a variety of ways, such as gene feature extraction and gene sequence location [7,8,9].

The variant map is an emerging technique that uses four symbols as a meta-structure to process random sequences, ranging from cryptographic sequences and DNA sequences [3, 10] to ECG signals. Multiple statistical probability distributions are generated from selected sequences and represented as 2D or 3D visual maps. This scheme represents whole data sequences compactly and visualizes them effectively, and the mapping results may be useful for exploring nonlinear complex behaviors of whole genomes. A whole DNA sequence of a night monkey has been mapped on variant maps [11].

In this chapter, a special scheme is proposed to show a series of mapping results from a selected gene sequence of a capuchin monkey.

2 Process Model

A. Architecture

The architecture of the process model is shown in Fig. 1a. The process model consists of five parts: input, processing, measurement, projection, and output. Three of these parts are modules: Processing, Measurement, and Projection.

Fig. 1 Architecture of the mapping scheme (a)–(c). a Architecture; b Measurement module; c Projection module

Input: A DNA sequence

Output: A 2D map

Modules: Processing, Measurement, and Projection

Process: In the Processing module, the selected DNA sequence is divided sequentially into multiple segments of a fixed length m. In the Measurement module, the four symbols {A, C, G, T} are counted in each segment, which transfers the segment sequence into a measuring sequence of four measures. In the Projection module, the combinations X: {AT} and Y: {AG} are selected to determine a projection position from each set of four measures, and the whole measuring sequence is projected to form a 2D map.

B. Processing Module

The input DNA sequence is cut sequentially into segments of a fixed length m, generating a sequence of segments.

Input: a DNA sequence

Output: a sequence of segments
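The segmentation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_segments` is hypothetical, and the source does not say what happens to a trailing remainder shorter than m, so the sketch simply discards it.

```python
def split_segments(sequence: str, m: int) -> list[str]:
    """Cut `sequence` into consecutive, non-overlapping segments of length m.

    Assumption: a trailing remainder shorter than m is discarded, so that
    every segment has exactly m symbols.
    """
    if m <= 0:
        raise ValueError("segment length m must be positive")
    n_full = len(sequence) // m  # number of complete segments
    return [sequence[i * m:(i + 1) * m] for i in range(n_full)]


segments = split_segments("ACGTACGTAC", 4)
print(segments)  # ['ACGT', 'ACGT']
```

In this toy input the last two symbols ("AC") do not fill a complete segment and are dropped.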

C. Measurement Module

In this module, shown in Fig. 1b, the four symbols {A, C, G, T} are counted in each segment. Each count is an integer between 0 and m, and the counts transfer the segment sequence into a measuring sequence of four measures.

Input: a sequence of segments

Output: a sequence of four measures
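A minimal sketch of the counting step follows. The names `count_bases` and `measure` are hypothetical, and the order (A, C, G, T) of the four measures is an assumption made for the example; symbols other than these four are ignored.

```python
from collections import Counter

BASES = "ACGT"  # assumed fixed order of the four measures


def count_bases(segment: str) -> tuple[int, ...]:
    """Return the counts (num(A), num(C), num(G), num(T)) for one segment."""
    counts = Counter(segment.upper())
    return tuple(counts[base] for base in BASES)


def measure(segments: list[str]) -> list[tuple[int, ...]]:
    """Transfer a sequence of segments into a sequence of four measures."""
    return [count_bases(segment) for segment in segments]


measures = measure(["AACG", "TTGA"])
print(measures)  # [(2, 1, 1, 0), (1, 0, 1, 2)]
```

For a segment containing only the four bases, the four counts add up to the segment length m.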

D. Projection Module

The projection module is shown in Fig. 1c and has two units: Position and Projecting. For each set of four measures, the two axis positions are determined by X(AT) and Y(AG), respectively. When all measures have been processed, the resulting 2D histogram gives a statistical distribution that forms the 2D map.

Input: a sequence of four measures

Output: a 2D map
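The projection step can be illustrated as follows. This sketch assumes that the four measures arrive in the order (A, C, G, T) and that the raw counts num(AT) and num(AG), each between 0 and m, are used directly as bin indices. The function name `project` and the nested-list layout of the histogram are hypothetical choices.

```python
def project(measures, m):
    """Accumulate four-measure tuples into an (m + 1) x (m + 1) histogram.

    hist[y][x] is the number of segments with num(AT) == x and num(AG) == y.
    Assumes the counts in each tuple sum to at most m.
    """
    hist = [[0] * (m + 1) for _ in range(m + 1)]
    for a, c, g, t in measures:
        x = a + t  # X position: num(AT) = num(A) + num(T)
        y = a + g  # Y position: num(AG) = num(A) + num(G)
        hist[y][x] += 1
    return hist


hist = project([(2, 1, 1, 0), (2, 1, 1, 0), (0, 0, 0, 4)], m=4)
print(hist[3][2], hist[0][4])  # 2 1
```

Segments with identical (num(AT), num(AG)) values fall into the same cell, so each cell value records how many segments share that position.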

3 Details

A. Relevant Parameters

m: segment length

V: a combination of two bases, V ∈ {AT, AG}

$$ {\text{num}}\left( {AT} \right) = {\text{num}}\left( A \right) + {\text{num}}\left( T \right) $$
$$ {\text{num}}\left( {AG} \right) = {\text{num}}\left( A \right) + {\text{num}}\left( G \right) $$
$$ P_{v} = {\text{num}}\left( V \right)/m $$

\( P_{v} \): the proportion of a base or combinatorial base in a segment of length m

\( (X_{{P_{AT} }} ,Y_{{P_{AG} }} ) \): a pair of XY mapping positions.

B. Parameter in Module

Since the quality of the generated maps depends on the number of projection points, a refined map requires a large number of coordinate points. The mapping projection superposes these coordinate points in a 2D histogram, which is displayed as a color map.

C. Measurement Module

  • m: subsection length of a DNA sequence

  • num(AT) = num(A)+num(T)

  • V: AT or AG, V ∈ {AT, AG}.

  • \( P_{v} \): The proportion of AT or AG over the subsection length m.

  • \( P_{v} = {\text{num}}\left( V \right)/m \)

  • \( P_{AT} \): The proportion of AT

  • \( P_{AG} \): The proportion of AG

  • \( \left( {X_{{P_{AT} }}^{i} ,Y_{{P_{AG} }}^{j} } \right) \): a pair of XY mapping coordinates. i, j are different subsections.
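As a worked example of these definitions, consider a subsection of length m = 10 with the content ATGGCATTAC. This subsection is chosen only for illustration and is not taken from the source data. It contains 3 A, 2 C, 2 G, and 3 T, so that:

$$ {\text{num}}\left( {AT} \right) = 3 + 3 = 6,\quad {\text{num}}\left( {AG} \right) = 3 + 2 = 5 $$
$$ P_{AT} = 6/10 = 0.6,\quad P_{AG} = 5/10 = 0.5 $$

This subsection therefore contributes one point at the coordinate (0.6, 0.5).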

D. Parameter in Module

The proportions of AT and AG in each subsection are calculated by dividing the corresponding counts by the subsection length m. The two proportions form a coordinate \( \left( {X_{{P_{AT} }}^{i} ,Y_{{P_{AG} }}^{j} } \right) \), which maps to a point on the two-dimensional graph.

The mapping relation for the X and Y axes:

$$ X:P_{AT} $$
$$ Y:P_{AG} $$

A distinct graph requires a large number of coordinate points, and only a sufficiently long DNA sequence yields enough coordinate points for a clear projection. The graphics projection module completes the superposition of these coordinate points.
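The whole chain, from a sequence to superposed coordinate points, might be sketched as below. The function name `variant_map` and the use of a `Counter` keyed by (P_AT, P_AG) are assumptions made for illustration, as is the choice to skip an incomplete final subsection. The sketch is not the authors' implementation.

```python
from collections import Counter


def variant_map(sequence: str, m: int) -> Counter:
    """Map each length-m subsection to (P_AT, P_AG) and count coincident points."""
    points = Counter()
    # Step through complete subsections only (assumption: remainder is skipped).
    for start in range(0, len(sequence) - m + 1, m):
        segment = sequence[start:start + m].upper()
        num_a = segment.count("A")
        p_at = (num_a + segment.count("T")) / m  # X: P_AT
        p_ag = (num_a + segment.count("G")) / m  # Y: P_AG
        points[(p_at, p_ag)] += 1  # superposition of identical coordinates
    return points


demo = variant_map("AACGAACGTTTT", m=4)
print(demo[(0.5, 0.75)], demo[(1.0, 0.0)])  # 2 1
```

The value stored for each coordinate is the number of subsections that land there. A rendering step, not shown here, could translate that number into a pixel color.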

4 Results Display

4.1 Maps on Various Segmented Length

Maps for different segment lengths are shown in Fig. 2a–l for m = {20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 150, 200}, Fig. 3a–f for m = {54, 56, 58, 60, 62, 64}, Fig. 4a–d for m = {59, 60, 61, 62}, and Fig. 5 for m = 60, respectively.

Fig. 2 Variant maps of Cebus capucinus on various segmented lengths (a)–(l) m = {20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 150, 200}. a m = 20; b m = 30; c m = 40; d m = 50; e m = 60; f m = 70; g m = 80; h m = 90; i m = 100; j m = 120; k m = 150; l m = 200

Fig. 3 Variant maps of Cebus capucinus on various segmented lengths (a)–(f) m = {54, 56, 58, 60, 62, 64}. a m = 54; b m = 56; c m = 58; d m = 60; e m = 62; f m = 64

Fig. 4 Variant maps of Cebus capucinus on various segmented lengths (a)–(d) m = {59, 60, 61, 62}. a m = 59; b m = 60; c m = 61; d m = 62

Fig. 5 Variant map of Cebus capucinus on segmented length m = 60

In each map, pixels of similar color indicate clusters containing a similar number of segments.

4.2 Brief Analysis

Fig. 2 shows that maps for m < 50 have more symmetric properties than maps for larger m. As the segmented length changes, significant patterns appear in the region m = 54–64 shown in Fig. 3, and a finer range of lengths is shown in Fig. 4.

By visual observation, the map for m = 60 shows the best effect.

5 Conclusion

Using the proposed mapping scheme, it is feasible to transfer a whole DNA sequence into a color map with significant visual features. Beyond the mapping method and the selected functions, a set of sample maps for various segmented lengths illustrates colorful distributions as variant maps.

By checking symmetric information among different maps, it is possible to identify specific spatial features under different configurations.

Since this is an initial step toward mapping a whole DNA sequence, further research and exploration are required.