Optimized Context Weighting for the Compression of the Un-repetitive Genome Sequence Fragment

The details of context weighting are discussed first, and formulas describing the weighting process are given. During weighting, the weighting cost influences the performance of context-modeling-based entropy coding; moreover, context weighting relies on the counting vectors joined in the weighting. Meanwhile, the equivalence between context weighting and the weighting of description lengths implies that the weights can be optimized, and the corresponding methods are discussed in detail in this paper. Finally, the proposed approach is applied to compress such simple genome sequences. A large number of experimental results indicate that the compression results are better than those of other existing algorithms.


Introduction
Due to the wide application of rapid DNA sequencing technology, the memory required to store genome data has grown larger and larger over the past decade [1]. To store these data more efficiently, compression algorithms have been proposed to enhance storage efficiency: by reducing the redundancy among the bases of a genome sequence, compression efficiency can be improved. Previous research on genome sequence compression focused mainly on applying existing compression algorithms to reduce the storage requirement of genome sequence data. However, not all existing compression algorithms are suitable for this task. Because the information contained in the genome sequences of different species is not yet sufficiently understood, it is unknown which bases could be dropped or ignored during compression. In other words, only lossless compression algorithms can be suggested for this application.
Broadly speaking, three types of genome compression algorithms can be employed to achieve our compression objective. The first is the substitutional algorithm based on LZ77; the second is the so-called referential algorithm; the last is the compression algorithm based on entropy coding. Grumbach and Tahi [2] proposed BioCompress, based on LZ77, to compress genome sequences, and its modified versions [3,4] illustrate the basic pattern of such genome sequence compression algorithms. Subsequent algorithms paid more attention to improving the coding of non-repetitive sequences and of sequences not found in the dictionary. In Grumbach and Tahi [3], the non-repetitive sequences are compressed by entropy coding with a second-order Markov model. In Deorowicz et al. [4], a further improvement was proposed in which context-modeling-based entropy coding substitutes for the Markov model of [3] when coding non-repetitive sequences. In fact, these algorithms rely greatly on their dictionaries, and the coding cost includes the cost of coding the dictionary. The dictionary can be initialized to improve compression performance, but this does not achieve much better efficiency.
In recent years, ''referential'' genome sequence compression algorithms [5,6] have attracted more and more researchers, because they achieve a compression efficiency on mammalian genome sequences that neither substitutional algorithms nor entropy coding technologies can attain. Meanwhile, these algorithms cooperate very well with rapid DNA sequencing technology. When a new genome sequence is to be compressed, it is first aligned with a reference sequence, and the locations where the bases of the sequence to be coded differ from those of the reference are searched. Then the position where the two sequences start to differ and the length denoting the number of differing bases are encoded. For decoding, both the sender and the receiver must have the reference sequence; with the encoded difference information, the receiver can reconstruct the objective sequence. In this case, much less content needs to be coded and transmitted, so high compression efficiency can be achieved. However, these algorithms rely heavily on the amount of repetitive sequence contained in the genome, which implies that very different compression results are obtained for genome sequences from different species. For some species, including human, the genome contains many repetitive sequences, and the corresponding compression results are excellent. Moreover, if the accuracy of the repetitive sequence alignment is enhanced, the compression efficiency can be further increased; for this reason, many algorithms [7-9] pay attention to improving the alignment accuracy. In particular, the compression ratio for human genome sequences in [9] can reach 80:1. However, these results rely on the large number of repetitive sequences contained in the objective sequences: almost 90% of a complete human genome sequence is repetitive, which ensures that the algorithm in [9] maintains a high compression ratio, since the content that needs to be coded is small enough. But for some simple species, such as microorganisms, only about 20% or less of a complete sequence is repetitive, so the referential algorithm cannot play its role sufficiently. Thus, previous research [10-13] on microbial genome sequence compression suggests context-modeling-based entropy coding instead of the referential algorithm.
The work in [14] illustrates the method for compressing genome sequences with context modeling and gives approaches for selecting which bases serve as conditions when constructing the conditional probability distributions. In particular, the ''expert models'' (XM) of [21] are an efficient context-modeling method: each expert comes from a different model, such as a copy model or Markov models of orders 1 and 2. Context weighting is then suggested to merge various context models into the probability distribution used to assign the code words, as in [14,15]. However, how to determine the weights remains a problem to be tackled.
Actually, during context weighting, the conditional probability distributions are estimated from their corresponding counting vectors. The description length is an important parameter characterizing the statistical performance of a context model: if the description length of the weighted counting vector is small, the weighting operation can be considered correct. Approaches to calculate the description length are proposed in [16,17]. Inspired by this, the weights can also be optimized under the criterion of minimizing the description length.
We first discuss context weighting and describe the weighting operation. Then the weighting cost is discussed in detail, and the relation between context weighting and the cost is described with formulas and tables. Finally, the proposed method is used to compress some genome sequences.

Context Weighting
A genome sequence x_s, ..., x_0 is drawn from the alphabet x_i ∈ {A, T, G, C}. It should be noted that only genome sequences are our objective sequences; therefore the ambiguity symbol N (an undetermined base) is not considered in our compression algorithm. During context modeling, the conditional distributions P(x_t | x_{t-1}, ..., x_{t-K}) are constructed from the context x_{t-1}, ..., x_{t-K} of x_t with order K. Let P(x_t | s_i^c) denote the probability distribution of the i-th context model, with context event s_i^c and weight w_i, and let N be the total number of distributions joined in the weighting. Then the context weighting is represented as

P(x_t | S) = sum_{i=1}^{N} w_i P(x_t | s_i^c), with sum_{i=1}^{N} w_i = 1.   (1)

Each P(x_t | s_i^c) is estimated from its counting vector CV_i = (a_0^(i), ..., a_{I-1}^(i)), where a_j^(i) is the number of training symbols with value j observed under s_i^c, I is the alphabet size (I = 4 here), and n_i = sum_j a_j^(i). The weighted counting vector is

CV = sum_{i=1}^{N} w_i CV_i.   (2)

The distribution P(x_t | S) can then be re-written in terms of CV as (3) and, with the Laplace estimator, is calculated by (4):

P(x_t = j | S) = b_j / n,   (3)
P(x_t = j | S) = (b_j + 1) / (n + I),   (4)

where b_j = sum_i w_i a_j^(i) and n = sum_j b_j. According to Chen and Chen [18], L_1, the description length of CV_1, can be calculated by (5):

L_1 = log[ (n_1 + I - 1)! / ( (I - 1)! prod_j a_j^(1)! ) ].   (5)

L_2, the description length of CV_2, is calculated similarly. When Stirling's formula

log m! ≈ m log m - m + (1/2) log(2π m)   (6)

is used to approximate the factorials, and log is the natural logarithm, L_1 and L_2 can be calculated (for I = 4) with (7) and (8) respectively:

L_1 ≈ (n_1 + 3) log(n_1 + 3) - sum_j a_j^(1) log a_j^(1) - 3 + (1/2) log[ (n_1 + 3) / prod_j a_j^(1) ] + r,   (7)
L_2 ≈ (n_2 + 3) log(n_2 + 3) - sum_j a_j^(2) log a_j^(2) - 3 + (1/2) log[ (n_2 + 3) / prod_j a_j^(2) ] + r,   (8)

where r = -log 3! - 3 log sqrt(2π) is a constant. The description length L of the weighted counting vector CV can be calculated in the same way as (5). To simplify the representation, let c_i denote the ratio between the weighted total number of training bases in CV_i (corresponding to P(x_t | s_i^c)) and the total number of bases in the weighted counting vector CV, and let t_j^(i) denote the ratio between the weighted number of training bases with value j in CV_i and the number of bases with value j in CV. For instance, for two vectors the parameters c_1 and t_2^(1) are calculated as

c_1 = w_1 n_1 / (w_1 n_1 + w_2 n_2),  t_2^(1) = w_1 a_2^(1) / (w_1 a_2^(1) + w_2 a_2^(2)).

Apparently, once all weights w_i are given, all parameters c_i and t_j^(i) are determined. Expanding L with (6) and substituting these parameters, the leading terms give

L ≈ w_1 L_1 + w_2 L_2 + sum_j b_j sum_{i=1}^{2} t_j^(i) log( t_j^(i) / w_i ) - n sum_{i=1}^{2} c_i log( c_i / w_i ),   (9)

in which the first two terms equal the weighted sum of the individual description lengths. Let Q denote the remaining terms of (9):

Q = sum_j b_j sum_{i=1}^{2} t_j^(i) log( t_j^(i) / w_i ) - n sum_{i=1}^{2} c_i log( c_i / w_i ).   (10)

Then L can be represented as

L ≈ w_1 L_1 + w_2 L_2 + Q.   (11)

Actually, if the number of counting vectors is more than two, a similar formula for the description length of the weighted counting vector can be obtained. It is given in (12), where N denotes the number of I-ary counting vectors:

L ≈ sum_{i=1}^{N} w_i L_i + sum_{j=0}^{I-1} b_j sum_{i=1}^{N} t_j^(i) log( t_j^(i) / w_i ) - n sum_{i=1}^{N} c_i log( c_i / w_i ).   (12)

The parameters c_i and t_j^(i) are determined by (13) and (14) respectively:

c_i = w_i n_i / sum_{k=1}^{N} w_k n_k,   (13)
t_j^(i) = w_i a_j^(i) / sum_{k=1}^{N} w_k a_j^(k).   (14)

After derivation, formula (12) can be represented by (15),

L ≈ sum_{i=1}^{N} w_i L_i + Q,   (15)

where the weighting cost Q is determined by (16):

Q = sum_{j=0}^{I-1} b_j sum_{i=1}^{N} t_j^(i) log( t_j^(i) / w_i ) - n sum_{i=1}^{N} c_i log( c_i / w_i ).   (16)

Here Q is referred to as the weighting cost; most of the time, Q is small enough to be ignored in the weighting process. From (15) it is concluded that context weighting is almost equivalent to weighting the description lengths of the counting vectors. Meanwhile, if the weighting cost is ignored, the weights can be optimized from the linear expression (15). As an example, two counting vectors are used below to illustrate the optimization method.
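As a numerical sketch of these formulas, the description length (5) and the weighting cost Q of (15)-(16) can be evaluated directly. The Python snippet below is our own illustration, not the paper's implementation; log-factorials are computed with `math.lgamma`, and natural logarithms are used as in (7) and (8). Note that Q is computed here as the exact difference L - sum_i w_i L_i rather than by the leading-order expression (16).

```python
import math

def description_length(counts):
    # Description length (in nats) of an I-ary count vector, Eq. (5):
    # log[(n + I - 1)! / ((I - 1)! * prod_j a_j!)], evaluated via log-gamma
    # so that non-integer (weighted) counts are also accepted.
    n, I = sum(counts), len(counts)
    L = math.lgamma(n + I) - math.lgamma(I)
    for a in counts:
        L -= math.lgamma(a + 1)
    return L

def weighting_cost(cvs, weights):
    # Q = L(weighted vector) - sum_i w_i * L(CV_i), cf. Eqs. (15)-(16).
    I = len(cvs[0])
    merged = [sum(w * cv[j] for w, cv in zip(weights, cvs)) for j in range(I)]
    return description_length(merged) - sum(
        w * description_length(cv) for w, cv in zip(weights, cvs))
```

For two similar binary vectors such as (400, 30) and (450, 30) with equal weights, Q is a small fraction of a nat, while for dissimilar vectors such as (50, 600) and (500, 40) it reaches hundreds of nats; this matches the behavior reported in the experiments below.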

The Optimization of Weights
Considering formula (15) and ignoring the weighting cost, the description length L can be represented approximately as

L ≈ w_1 L_1 + w_2 L_2.   (17)

The least-squares (LS) algorithm is then suggested to optimize the weights w_1 and w_2. Let L = (L_1, L_2)^T denote the observation vector consisting of the description lengths of all counting vectors joined in the weighting operation, and let W = (w_1, w_2)^T be the corresponding weight vector. Then the estimated value of the description length can be written in vector form as

L_hat = W^T L.   (18)

However, to minimize L by the LS algorithm, the corresponding ideal value L* should be given first. To tackle this problem, note that according to Wu and Zhai [19], the description length can also be represented as

L = n H(x_t | S) + D,   (19)

where H(x_t | S) is the entropy of the conditional probability distribution P(x_t | S), n is the number of coded symbols, and D represents the model cost. The value n H(x_t | S) is considered the ideal value of L; therefore L* = n H(x_t | S) is taken as the objective of the minimization. Let e = L_hat - L* denote the difference between the estimate and the ideal value. The optimizing objective is

min_W f(W) = E[e^2] = E[(W^T L - L*)^2],   (20)

where f(W) is the cost function related to W. The minimization of f(W) is obtained by solving ∂f(W)/∂W = 0, whose solution can directly be represented as

W = R^{-1} d,   (21)

where R = E[L L^T] is the correlation matrix of the observation vector L and d = E[L* L] is the correlation vector between L and L*. After solving these equations, the optimized weights are obtained, and the coding distribution follows from (4).
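A minimal numerical sketch of the solution W = R^{-1} d for two context models follows. This code is our own illustration, not the paper's implementation; the expectations in R and d are replaced by sample averages over a set of observed description lengths and their ideal values, and R is assumed non-singular.

```python
def optimize_weights(obs, targets):
    # obs:     list of (L1, L2) description-length observations
    # targets: list of corresponding ideal lengths L* = n * H(x_t | S)
    # Solves R W = d with R = E[L L^T], d = E[L* L], cf. Eq. (21),
    # using a closed-form 2x2 linear solve.
    m = len(obs)
    r11 = sum(l1 * l1 for l1, _ in obs) / m
    r12 = sum(l1 * l2 for l1, l2 in obs) / m
    r22 = sum(l2 * l2 for _, l2 in obs) / m
    d1 = sum(t * l1 for (l1, _), t in zip(obs, targets)) / m
    d2 = sum(t * l2 for (_, l2), t in zip(obs, targets)) / m
    det = r11 * r22 - r12 * r12   # R assumed non-singular
    return ((r22 * d1 - r12 * d2) / det,
            (r11 * d2 - r12 * d1) / det)
```

If the ideal lengths happen to be an exact linear combination of the observed lengths, the solver recovers the combining coefficients. In an actual coder one would presumably renormalize the resulting weights to sum to one before using them in (1); that step is our assumption, not stated above.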

Experiment Design
To evaluate the performance of our optimization method for the context weighting problem, the proposed algorithm is employed in microbial genome sequence compression. Four experiments are designed on the corresponding genome sequences to evaluate various aspects of our algorithm. The testing genome sequences used in our experiments are listed as follows:

CHMPXX: Marchantia polymorpha chloroplast genome DNA
CHNTXX: Nicotiana tabacum chloroplast genome DNA
HEHCMVCG: Human cytomegalovirus strain AD169, complete genome
HUMDYSTROP: Homo sapiens dystrophin (DMD) gene, intron 44
HUMGHCSA: Human growth hormone (GH1 and GH2) and chorionic somatomammotropin (CS-1, CS-2 and CS-5) genes, complete cds
HUMHBB: Human beta globin region on chromosome 11
HUMHPRTB: Human hypoxanthine phosphoribosyltransferase gene, complete cds
MPOMTCG: Marchantia polymorpha mitochondrion, complete genome
SCCHRIII: S. cerevisiae chromosome III, complete DNA sequence
VACCG: Vaccinia virus, complete genome
These genome sequences are all obtained from NCBI [17].
In experiment 1, we examine the value of the weighting cost Q in (11). The purpose of this experiment is to illustrate that the weighting cost is usually small enough to be ignored in the coding process. Two counting vectors are obtained for each of several binary sources and weighted, and the corresponding weighting costs are calculated by (10).
There are three cases when two counting vectors are weighted: (1) they are similar to each other; (2) they are not similar; (3) they are similar to each other but also close to the uniform distribution. Three groups of counting vectors are weighted, one for each case, where (a_0, a_1) denotes a counting vector and a_0, a_1 denote the numbers of symbols with values 0 and 1, respectively. For comparison, the results for the three pairs of counting vectors are listed in Table 1, together with the description lengths of the respective counting vectors. In this experiment, both weights are set to 0.5.
Obviously, if two counting vectors are similar, the weighting cost is small enough to be ignored in the optimization process. Furthermore, when two similar counting vectors are given, whatever weights are chosen, the corresponding weighting cost stays small. To illustrate this, the counting vectors (400, 30) and (450, 30) are weighted with weights ranging from 0.01 to 0.99. The values of the corresponding weighting costs are given in Fig. 1: as the weights change, the maximum weighting cost remains below 0.5. This implies that whatever the weights are, the weighting cost stays small. The weighting costs for different weights of the other two groups of counting vectors, (50, 600) with (500, 40) and (200, 290) with (300, 360), are given in Figs. 2 and 3, respectively.
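The sweep of Fig. 1 can be reproduced numerically. The snippet below is an illustrative sketch of our own (not the paper's code), using natural-log description lengths computed via log-gamma as in (5) and (7):

```python
import math

def dl(counts):
    # Description length (nats) of a count vector, Eq. (5), via log-gamma.
    n, I = sum(counts), len(counts)
    return math.lgamma(n + I) - math.lgamma(I) - sum(math.lgamma(a + 1) for a in counts)

def cost(cv1, cv2, w):
    # Weighting cost Q for two vectors with weights (w, 1 - w), cf. Eq. (10).
    merged = [w * a + (1 - w) * b for a, b in zip(cv1, cv2)]
    return dl(merged) - w * dl(cv1) - (1 - w) * dl(cv2)

# Sweep w = 0.01, 0.02, ..., 0.99 for the similar pair, as in Fig. 1.
max_cost = max(cost((400, 30), (450, 30), k / 100) for k in range(1, 100))
```

Over the whole sweep the cost for this similar pair stays below 0.5 nats, consistent with the observation above; substituting the dissimilar pair (50, 600) and (500, 40) yields costs that are orders of magnitude larger.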
The results indicate that for dissimilar vectors the weighting cost is not small unless one weight equals 0, meaning that no context weighting should be applied.
Finally, the result for the last group of counting vectors is given in Fig. 3. The results in Figs. 1 and 3 indicate that if two counting vectors are similar, the weighting cost is small no matter what distribution the counting vectors follow.
Fig. 1: The weighting costs with different weights for counting vectors (400, 30) and (450, 30).
Fig. 2: The weighting costs with different weights for counting vectors (50, 600) and (500, 40).

For comparison, the three curves are drawn in one image, given in Fig. 4. To show the third curve more clearly, the weighting cost of case 3 is amplified 100 times.
Meanwhile, observations from Figs. 1 and 2 and Table 1 show that if two counting vectors are close enough, the weighting cost becomes even smaller than each of the description lengths of the two counting vectors; in this case the weighting cost can be ignored in the weighting process. If the counting vectors are not similar, the corresponding weighting cost is too large to be ignored. However, if two counting vectors are similar but the weighted counting vector is nearly uniformly distributed, the resulting description length becomes large; in this case context weighting should not be applied even though the weighting cost is small.
Actually, these conclusions from the observations can also be extended to the case of more than two counting vectors. According to formula (16), the weighting cost for different groups of counting vectors can be calculated easily.
Fig. 3: The weighting costs with different weights for counting vectors (200, 290) and (300, 360).
Fig. 4: The weighting costs for the three cases; the black curve denotes case 1, the red curve case 2, and the blue curve case 3.

Based on the discussion above, we now consider the weighting cost when three or more counting vectors are weighted; counting vectors in the I-ary case are also discussed.
First, some counting vectors in the binary case are given for our experiments. Four situations describe their similarity.
Situation 1: the three counting vectors are all similar.
Situation 2: the three counting vectors are all dissimilar.
Situation 3: two counting vectors are similar, but the third is not similar to them.
Situation 4: the three counting vectors are similar to each other but also close to the uniform distribution.
To evaluate the corresponding weighting costs, four groups of counting vectors, one per situation, are used in our experiments. The weighting costs depend on the values of the weights; in our experiments, the three weights w_1, w_2 and w_3 each traverse the range 0.01 to 0.99. Nine representative results are listed in Table 2, together with the description lengths of the three counting vectors, denoted L_1, L_2 and L_3 in the order given above; L_w denotes the description length of the weighted vector. This notation is also used in the following tables.
Table 2 shows that however the weights change, the weighting cost is small enough to be ignored while the optimization algorithm adjusts the weights. For situation 2, however, the results differ from those of situation 1.
Similarly, the corresponding results for situation 2 are listed in Table 3. Obviously, the weighting cost is large whatever the weights are; in this case, context weighting should not be executed.
For situation 3, the corresponding results are obtained in the same way and listed in Table 4.
The results in Table 4 are similar to those in Table 3: the weighting cost increases when dissimilar counting vectors are weighted. However, there is an interesting result in the last line of Table 4: if the weight of the dissimilar ('unlike') vector is small enough, the weighting cost also becomes small enough to be ignored. Namely, in this case the optimization can still be achieved under situation 3. Actually, in this paper, our algorithm assigns the smallest weight to the unlike counting vector in order to ensure that the other weights can be optimized.
For situation 4, the corresponding results are listed in Table 5.
Table 5 shows that although the weighting costs are all small, the corresponding description lengths are large; hence context weighting should not be executed.
Then three counting vectors in the 3-ary case are used to test the behavior of the weighting cost. Only the results for situations 1-3 above are given here. The first group of counting vectors is (500, 30, 35), (520, 35, 44) and (600, 44, 52).
The description lengths and the weighting cost are given as the results; the abbreviations are the same as above (Table 6).
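As an illustration (our own computation, with equal weights w_i = 1/3 chosen for simplicity rather than taken from the table), the weighting cost of this first group can be evaluated with the description length of Eq. (5):

```python
import math

def dl(counts):
    # Description length (nats) of an I-ary count vector, Eq. (5).
    n, I = sum(counts), len(counts)
    return math.lgamma(n + I) - math.lgamma(I) - sum(math.lgamma(a + 1) for a in counts)

# First group of 3-ary counting vectors from the text, equal weights.
group = [(500, 30, 35), (520, 35, 44), (600, 44, 52)]
w = [1 / 3] * 3
merged = [sum(wi * cv[j] for wi, cv in zip(w, group)) for j in range(3)]
q = dl(merged) - sum(wi * dl(cv) for wi, cv in zip(w, group))  # weighting cost Q
```

The resulting cost is larger than in the comparable binary examples but remains tiny relative to the individual description lengths (which are hundreds of nats), consistent with the claim that it can be ignored for similar 3-ary vectors.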
In the 3-ary case, the weighting cost is larger than in the binary case, but it can still be ignored in the optimization process. Meanwhile, if the three counting vectors are not similar, the same result as in the binary case is obtained, as listed in Table 7 for the second group of counting vectors. Table 7 contains a useful observation: when the three counting vectors are dissimilar to each other, the description length and the weighting cost cannot be optimized at the same time; in this case, context weighting should not be executed.
Finally, the third group of counting vectors is (700, 10, 30), (680, 15, 40) and (20, 720, 50), of which only two are similar. According to the discussion above for the binary case, if the weight of the unlike vector is small, the optimization can still be executed. The corresponding 3-ary results are listed in Table 8.
Obviously, the same result as in the binary case is obtained again: the description length and the weighting cost can be minimized by modifying the exact values of the weights. Namely, the weights are optimized by minimizing both the description length and the weighting cost, which is evidence that our optimization algorithm plays its role. Furthermore, in the last two lines of Table 8, a small change in the weight of the unlike counting vector influences the weighting cost considerably. The most practicable method in this situation is to exclude the unlike counting vector from the context weighting, i.e., to set its weight to 0. From these experiments and results, two conclusions can be summarized: first, the optimization of the weights based on the proposed algorithm is feasible; second, context weighting should be executed only if the weighting cost can be ignored. In previous context weighting studies, once the weights were given, the weighting took place throughout the whole coding process; the above analysis shows that this increases the code length. Although these experiments are designed for binary sources, the conclusions also apply to non-binary cases. Therefore, the proposed algorithm can be used in genome sequence compression to improve the coding results.
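The second conclusion can be turned into a simple coding-time rule. The sketch below is hypothetical (the threshold value 0.5 is illustrative, not from the paper): the counting vectors are weighted only when the weighting cost is negligible; otherwise the single vector with the minimum description length is coded instead.

```python
import math

def dl(counts):
    # Description length (nats) of a count vector, Eq. (5).
    n, I = sum(counts), len(counts)
    return math.lgamma(n + I) - math.lgamma(I) - sum(math.lgamma(a + 1) for a in counts)

def choose_coding_vector(cvs, weights, q_threshold=0.5):
    # Execute context weighting only when the cost Q is negligible;
    # otherwise skip weighting and code with the minimum-description-length
    # vector. q_threshold is an illustrative value, not from the paper.
    I = len(cvs[0])
    merged = [sum(w * cv[j] for w, cv in zip(weights, cvs)) for j in range(I)]
    q = dl(merged) - sum(w * dl(cv) for w, cv in zip(weights, cvs))
    if q < q_threshold:
        return merged              # weighting executed
    return min(cvs, key=dl)        # weighting skipped
```

A fuller version would also reject weighting when the merged vector is close to uniform (situation 4 above), since there the cost is small but the description length itself grows.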
In experiment 2, we evaluate the performance of the proposed weighting algorithm. Three pairs of counting vectors, obtained by counting the bases of a microbial genome sequence, are weighted by the algorithm in [20] and by the proposed algorithm. In [20], the weights are obtained from the average code length, whereas the proposed algorithm optimizes the weights using the description length; moreover, once calculated, the weights in [20] are not further optimized. The description lengths of the weighted counting vectors obtained by the two algorithms are listed in Table 9: the second and third columns contain the weights and description lengths obtained by the algorithm in [20], and the fourth and fifth columns those obtained by the proposed algorithm.
The description lengths obtained by the proposed algorithm are obviously shorter than those obtained by the algorithm in [20]. Especially in case 2, although the counts in the two counting vectors are distributed singularly, the weighting in [20] makes the counts in the weighted vector symmetrically distributed, resulting in a longer description length. In our algorithm, by contrast, the counting vector with the minimum description length is selected for coding, i.e., the weighting is not executed.
In experiment 3, the proposed context weighting method is employed to compress the microbial genome sequences. The genome sequences MPOMTCG and VACCG are used as the sources, and three context templates with different orders are constructed, in which x(i) is the current symbol to be encoded and x(i - k), k = 1, ..., 8 are the symbols already encoded (Table 10).
Meanwhile, to accelerate the coding process, the updating period is set to 100, which means the weights are re-calculated or optimized after every 100 encoded bases.
For comparison, the same context models are also weighted by the algorithms in [20,21], and the results of the two algorithms are listed in Table 11.
The results in Table 11 indicate that the proposed algorithm is better than the algorithm in [20]; the improvement in coding performance results from the optimization of the weights.
Finally, the proposed algorithm is employed to compress the benchmark genome sequences used in [15,20-22]. The same context models as those used in these algorithms are applied in our experiments. The coding results are given in Table 12. For brevity, our algorithm is referred to as OPGC.
The compared algorithms are BioCompress-2 (BioC) [12], GenCompress (GenC) [6], DNACompress (DNAC) [7], DNAPack (DNAP) [4], CDNA [15], GeMNL [14] and XM [20]. For most of the tested genome sequences, our algorithm obtains better coding results, and the average result is better than that of all the compared algorithms listed. Finally, the complexity of our algorithm should be considered. Comparing against algorithms such as BioC, GenC and DNAC is not meaningful, since our compression method is quite different from them; therefore, the reference algorithm is XM. We implemented XM in Matlab 7.0 on a computer with a 2.6 GHz CPU and 4 GB RAM. The times to compress four genome sequences, CHNTXX, VACCG, MPOMTCG and CHMPXX, are given in Fig. 5; here the updating period of our algorithm is set to 150 bases.
Our algorithm is obviously slower than XM, which results from the optimization process; however, the time is worth spending to obtain the optimized results. In practical applications, the arithmetic encoder can be accelerated and the execution time of the proposed algorithm can be shortened.

In conclusion, context weighting can be improved by optimizing the weights; this is one significant conclusion of our work. The weighting cost influences the efficiency of context weighting, and the relation between the weighting operation and the weighting cost has been illustrated. A large number of experiments demonstrate that the proposed algorithm achieves better results and that the weighting cost can be kept under control.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Table 1
The comparison of the weighting cost

Table 2
The weighting cost for three similar counting vectors

Table 3
The weighting cost for three dissimilar counting vectors

Table 5
The weighting cost for the situation 4

Table 6
The weighting cost for the first group counting vectors

Table 7
The weighting cost for dissimilar counting vectors

Table 8
The weighting cost for situation 3

Table 9
The comparison of the weights and description lengths by two algorithms

Table 12
The comparison of the compression results from existing algorithms