Understanding the statistical properties of genomic sequences helps recreate the dynamical processes that led to the current DNA structure, and helps identify gene-related diseases such as cancer and Alzheimer's disease. For instance, based on the view that non-coding DNA exhibits long-range correlations [1–6], Li [7] proposed an expansion-modification model of gene evolution. The model incorporates the two basic features of DNA evolution: (i) sequence elongation due to gene duplication and (ii) mutations. It can be shown that the limiting sequence created by this dynamical process exhibits a long-range correlation structure, as attested by a 1/f^α spectrum, where the exponent α is a function of the probability of mutation. To understand the relationship between the DNA correlation structure and possible gene aberrations, Dodin et al. [8] designed a simple correlation function intended to visualize the regular patterns encountered in DNA sequences. This function is used to revisit the intriguing question of triplet repeats, with the aim of providing a visual estimate of the propensity of genes to be highly expressed and/or to lead to possible aberrant structures formed upon strand slippage.
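The two operations of the expansion-modification model can be sketched on a binary alphabet as follows; the mutation probability, the number of generations, and the function name are illustrative assumptions rather than Li's exact formulation.

```python
import random

def expansion_modification(seq, p_mut, generations, rng=random.Random(0)):
    """Binary sketch of Li's expansion-modification dynamics: at each
    generation every symbol either mutates (flips) with probability p_mut,
    or is duplicated (expanded into two copies) otherwise."""
    for _ in range(generations):
        nxt = []
        for s in seq:
            if rng.random() < p_mut:
                nxt.append(1 - s)      # modification: flip the symbol
            else:
                nxt.extend([s, s])     # expansion: duplicate the symbol
        seq = nxt
    return seq

seq = expansion_modification([0, 1], p_mut=0.1, generations=10)
```

Since every symbol produces at least one offspring, the sequence never shrinks, and for small p_mut its length grows roughly as (2 − p_mut) per generation, producing the long sequences on which the 1/f^α spectrum is observed.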

Statistical analysis of genomic sequences has, however, long relied on signal processing techniques for stationary signals (correlation and power spectrum) [2, 4, 9, 10], and on statistical tools for slowly-varying trends within stationary signals (detrended fluctuation analysis, or DFA) [1, 5, 6]. Stationarity can be argued to be a valid assumption for time series of short duration. However, the assumption rapidly loses its credibility for the enormous databases maintained by various genome projects. Standard statistical tests (e.g., Priestley's test for stationarity) have been used to verify that genomic sequences are not stationary, and that the nature of their non-stationarity varies and is often more complex than a simple trend [11, 12]. Subsequently, more recent analyses of genomic data [1] have relied on time-varying power spectral methods (the evolutionary spectrum and periodogram) to capture the statistical characteristics of genomic sequences [11, 12]. The main difficulty with time-varying spectral methods is that they are extremely unstable and very noisy. Typically, the power spectrum and the evolutionary spectrum are averaged over time in order to obtain smooth, less jittery curves. Moreover, as was pointed out in [13], the evolutionary spectrum is restricted to the class of oscillatory processes. A stochastic process, X(t), is oscillatory if it has a representation of the form

X(t) = ∫ A(t, λ) e^{2iπλt} dZ(λ),   (1)

where Z(λ) is an orthogonal-increment process, and the evolutionary power spectrum of the process is defined by P(t, λ) = |A(t, λ)|^2. This definition of the evolutionary power spectrum has the following disadvantages [13]:

(i) It is not uniquely defined for a given non-stationary process.

(ii) The estimation procedure for the evolutionary spectrum depends greatly on the nature of the amplitude function A(t, λ), which is not known a priori.

(iii) An increase in the number of observations does not provide added information on the local behavior of the evolutionary spectrum, and thus does not improve estimation accuracy.

We propose to model non-stationary genomic sequences by a time-dependent autoregressive moving average (TD-ARMA) process. Cramér [14] showed that a non-stationary process still possesses a Wold decomposition in terms of its innovation and its generating system. However, the linear system generating the non-stationary signal, x(t), when driven by the innovation, w(t), is no longer shift-invariant; the parameters of the impulse response, h_u(t), of this system are time-dependent, so that

x(t) = ∑_{u=0}^{∞} h_u(t) w(t − u).   (2)

The existence of a time-varying ARMA representation of this process is ensured by two theorems due, independently, to Grenier [15] and to Huang and Aggarwal [16]. The uniqueness of the TD-ARMA representation is obtained by constraining the ARMA model to be invertible, but this leads to conditions on the time-varying impulse response {h_u(t)} and its inverse (namely, to be absolutely summable at any time t) that cannot be easily expressed in terms of the time-dependent coefficients of the ARMA model. In this paper, we estimate the time-dependent coefficients of the general TD-ARMA model using mean-squares, least-squares and recursive least-squares algorithms. The mean-squares estimation leads to generalized Yule-Walker-type equations [15]. Once the non-stationary parameters are estimated (as time series), we use them to provide a basis for statistical inference by defining an index of randomness, which quantitatively assesses how close the non-stationary signal is to white noise. Our simulation results on various gene sequences show that (i) neither the coding nor the non-coding segments of a gene are random, and (ii) the coding segments are "closer" to random sequences than the non-coding segments. Our results support the view that both coding and non-coding sequences are not random [9, 11, 12, 17–20], and refute the stationary analyses which maintain that non-coding DNA sustains long-range correlations whereas coding DNA behaves like random sequences [1–3, 5, 6, 10].


Numerical representation of genomic sequences

Converting the DNA sequence into a digital signal offers the opportunity to apply powerful signal processing methods to the handling and analysis of genomic information. This is, however, not an easy task, as the analysis results might depend on the chosen map. Various numerical mappings have been adopted in the literature. To cite a few: Peng et al. [1] construct a one-dimensional map of nucleotide sequences onto a walk, u(i), which they termed the "DNA walk". The DNA walk is defined by the rule that the walker steps up (u(i) = +1) if a pyrimidine resides at position i, and steps down (u(i) = -1) otherwise. Voss [9] represents a DNA sequence by four binary indicator sequences, which indicate the locations of the four nucleotides A, T, C and G. Berthelsen et al. [21] proposed a two-dimensional representation of DNA sequences, characterized by a Hausdorff dimension (also called fractal dimension) that is invariant under (i) complementarity, (ii) reflection symmetry, (iii) compatibility and (iv) substitution symmetry of A↔T and C↔G. The corresponding embedding assignment is given by A = (-1, 0), T = (1, 0), C = (0, -1) and G = (0, 1). In this paper, since we are interested in time-dependent ARMA modeling of one-dimensional non-stationary genomic sequences, we adopt the widely used "DNA walk" map proposed by Peng et al. [1]. The DNA walk provides a nice graphical representation for each gene. For instance, Figure 1 shows the structure of the Human gene 276 located in chromosome 1, and its DNA walk is displayed in Fig. 2.
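As a minimal sketch of this map, assuming the purine-pyrimidine convention stated above (+1 for the pyrimidines C and T, −1 for the purines A and G); the function name is an illustrative assumption:

```python
import numpy as np

def dna_walk(seq):
    """Map a DNA string to the Peng et al. purine-pyrimidine walk:
    step +1 at pyrimidines (C or T), step -1 at purines (A or G),
    then accumulate the steps into the walk u(i)."""
    steps = np.array([1 if base in "CT" else -1 for base in seq.upper()])
    return np.cumsum(steps)

walk = dna_walk("ATCGTT")   # steps: -1, +1, +1, -1, +1, +1
```

The cumulative sum, rather than the raw ±1 steps, is what is plotted in Fig. 2 and fed to the TD-ARMA model below.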

Figure 1
figure 1

Gene Structure. Gene structure of the Human gene 276 located in chromosome 1: The boxes correspond to the exons (coding regions), and the lines between them represent the introns (non-coding regions). The total length of the gene is N = 8208 bases, including 1536 coding bases and 6672 non-coding bases.

Figure 2
figure 2

DNA Walk. DNA walk of the Human gene 276.

Time-dependent ARMA model

Grenier [22] showed that a discrete non-stationary signal, {x[n]}, can be represented by a finite-order time-varying ARMA process of the form

x[n] + ∑_{i=1}^{p} a_i[n − i] x[n − i] = w[n] + ∑_{i=1}^{q} b_i[n − i] w[n − i],   n = 0, ⋯, N − 1,   (3)

where N is the length of the sequence x[n], a_i[n] and b_i[n] are the time-dependent model parameters, p and q are the model orders, and w[n] is a white noise process. Observe that the coefficients a_i[n] and b_i[n] appear with an argument n − i that depends on i. This choice is purely arbitrary, since any time origin can be chosen without restraining the generality of the model. We assume that the time-dependent coefficients a_i[n] and b_i[n] can be expressed as linear combinations of some basis functions {f_k[n]}_{k=0}^{m},

a_i[n] = ∑_{k=0}^{m} c_{i,k} f_k[n],   (4)

b_i[n] = ∑_{k=0}^{m} d_{i,k} f_k[n].   (5)

The advantage of the basis parametrization is clear: the identification of the time-dependent coefficients a_i[n] and b_i[n] reduces to the identification of the constant coefficients {c_{i,k}}_{k=0}^{m} and {d_{i,k}}_{k=0}^{m}, and therefore the linear non-stationary problem reduces to a linear time-invariant problem. The basis functions {f_k[n]}_{k=0}^{m} need not be limited to the standard choices of Legendre, Fourier, or prolate spheroidal bases, but can also take advantage of any prior information, such as the presence of a jump in the coefficients at a known instant [22].
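As a concrete illustration of Eqs. (4) and (5), the sketch below evaluates one time-dependent coefficient from its constant weights; the default power basis f_k[n] = (n/N)^k and all identifiers are illustrative assumptions, not the paper's choice.

```python
import numpy as np

def td_coefficient(c, n_grid, basis=None):
    """Evaluate a time-dependent coefficient a_i[n] = sum_k c_{i,k} f_k[n]
    from its constant weights c = (c_{i,0}, ..., c_{i,m}).
    By default uses the power basis f_k[n] = (n/N)^k; Legendre or Fourier
    bases would be handled the same way."""
    n_grid = np.asarray(n_grid, dtype=float)
    if basis is None:
        N = n_grid.max() if n_grid.max() > 0 else 1.0
        basis = [lambda n, k=k: (n / N) ** k for k in range(len(c))]
    F = np.column_stack([f(n_grid) for f in basis])  # F[n, k] = f_k[n]
    return F @ np.asarray(c)                         # a_i[n] on the grid

# a[n] = 0.5 - 0.2*(n/N): a linearly drifting AR coefficient
a = td_coefficient([0.5, -0.2], np.arange(8))
```

The point of the parametrization is visible here: the whole time-varying curve a[n] is summarized by the two constants (0.5, −0.2), which is what the estimators below actually solve for.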

Estimation of the time-dependent ARMA coefficients

From Eqs. (4) and (5), the unknown parameters of the TD-ARMA model are the weights of the linear combinations defining each time-varying parameter. This linearity is the key to the algorithms proposed in this paper. We derive mean-squares, least-squares and recursive least-squares solutions to estimate the time-dependent coefficients {a_i[n]}_{i=1}^{p} and {b_j[n]}_{j=1}^{q}.

Mean-squares estimation

Define the process

v[n] = x[n] + ∑_{i=1}^{p} a_i[n − i] x[n − i] = w[n] + ∑_{i=1}^{q} b_i[n − i] w[n − i],   n = 0, ⋯, N − 1,   (6)

and define the vector

X[n] = [f_0[n] x[n], ⋯, f_m[n] x[n]]^t,   (7)

where the superscript t stands for the transpose of a vector or a matrix. It is possible to derive for this process orthogonality conditions similar to those of the stationary ARMA model [23]. Observe that the process v[n], defined in Eq. (6), is orthogonal to [w[n − q − 1], w[n − q − 2], ⋯]; hence, it is orthogonal to x[n − q − i] for all i > 0, and orthogonal to X[n − q − i] for all i > 0 [22]. This orthogonality condition leads to a generalized Yule-Walker equation [22]

E( [X[n − q − 1]^t, ⋯, X[n − q − p]^t]^t [X[n − 1]^t, ⋯, X[n − p]^t] ) θ = −E( [X[n − q − 1]^t, ⋯, X[n − q − p]^t]^t x[n] ).   (8)

Although the process x[n] is non-stationary, the stationarity and ergodicity of the process w[n], together with the linearity of the model, allow us to replace the expectation in Eq. (8) by a summation over time. However, although consistent, the above estimator is often considered a poor one [22].

Least-squares estimation

Equations (4) and (5) can be written in vector format as follows

a_i[n] = f^t[n] c_i,   and   b_i[n] = f^t[n] d_i,


where

f[n] = [f_0[n], ⋯, f_m[n]]^t,   c_i = [c_{i,0}, ⋯, c_{i,m}]^t,   d_i = [d_{i,0}, ⋯, d_{i,m}]^t.


Define also

u^t[n] = x[n] f^t[n],   and   v^t[n] = w[n] f^t[n].

Then, we have

a_i[n − i] x[n − i] = u^t[n − i] c_i,
b_i[n − i] w[n − i] = v^t[n − i] d_i.

Using this vector notation, Eq. (3) can be written as

x[n] + u^t[n − 1] c_1 + ⋯ + u^t[n − p] c_p = w[n] + v^t[n − 1] d_1 + ⋯ + v^t[n − q] d_q,   (9)

or, equivalently,

x[n] + φ^t[n] θ = w[n],   (10)

where φt[n] is the row vector

φ^t[n] = [u^t[n − 1], ⋯, u^t[n − p], −v^t[n − 1], ⋯, −v^t[n − q]],


θ = [c_1, ⋯, c_p, d_1, ⋯, d_q]^t.

Observe that the vector θ contains all the unknowns of the TD-ARMA model. Writing Eq. (10) for n = 0, 1, ⋯, N - 1 leads to

x = Φ θ + w,   (11)


where Φ = [−φ^t[0]; ⋯; −φ^t[N − 1]] is the matrix whose rows are −φ^t[n], x = [x[0], ⋯, x[N − 1]]^t, and w = [w[0], ⋯, w[N − 1]]^t.

The least-squares solution of Eq. (11) is given by

θ = (Φ^t Φ)^{−1} Φ^t x.   (12)

To overcome the computational complexity associated with the least-squares estimation (involving in particular the inversion of a square (m + 1)(p + q) × (m + 1)(p + q) matrix), we opted for a recursive least-squares estimation as follows.
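To make the regression of Eqs. (11) and (12) concrete, the sketch below fits the pure autoregressive special case (q = 0), for which the regressor involves only the observed sequence and no unobserved innovations. The power basis f_k[n] = (n/N)^k and all identifiers are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fit_td_ar_ls(x, p, m):
    """Least-squares fit of a time-dependent AR(p) model (the q = 0 case of
    the TD-ARMA model): x[n] + sum_i a_i[n-i] x[n-i] = w[n], with
    a_i[n] = sum_k c_{i,k} (n/N)^k on an illustrative power basis.
    Returns the (p, m+1) array of basis weights c_{i,k}."""
    N = len(x)
    f = lambda n: (n / N) ** np.arange(m + 1)            # basis vector f[n]
    rows, targets = [], []
    for n in range(p, N):
        # regressor row: u^t[n-i] = x[n-i] f^t[n-i] for i = 1..p, stacked
        rows.append(np.concatenate([x[n - i] * f(n - i) for i in range(1, p + 1)]))
        targets.append(-x[n])                            # x[n] = -phi^t[n] theta + w[n]
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return theta.reshape(p, m + 1)

# recover a slowly drifting AR(1) coefficient a[n] = -0.9 + 0.4*(n/N)
rng = np.random.default_rng(0)
N = 4000
x = np.zeros(N)
for n in range(1, N):
    a_n = -0.9 + 0.4 * ((n - 1) / N)
    x[n] = -a_n * x[n - 1] + rng.standard_normal()
c = fit_td_ar_ls(x, p=1, m=1)
```

The recovered weights c should approximate (−0.9, 0.4): the time-varying coefficient is estimated through a single constant-parameter regression, which is exactly the reduction that the basis parametrization buys.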

Recursive least-squares estimation

The recursive least squares algorithm is summarized as [24]

θ̂[n] = θ̂[n − 1] + L[n] { x[n] + φ^t[n] θ̂[n − 1] },   (13)

L[n] = − P[n − 1] φ[n] / (1 + φ^t[n] P[n − 1] φ[n]),   (14)

P[n] = P[n − 1] − P[n − 1] φ[n] φ^t[n] P[n − 1] / (1 + φ^t[n] P[n − 1] φ[n]).   (15)

The initial conditions can be chosen arbitrarily.
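A minimal sketch of one step of this recursion, under the sign convention x[n] = −φ^t[n]θ + w[n] used in the text; the optional forgetting factor lam and the identifiers are assumptions added for illustration, not part of the paper's algorithm.

```python
import numpy as np

def rls_step(theta, P, phi, x_n, lam=1.0):
    """One recursive least-squares update for the model
    x[n] = -phi^t[n] theta + w[n], so the a-priori error is
    x[n] + phi^t[n] theta. lam = 1.0 gives the plain recursion above."""
    Pphi = P @ phi
    denom = lam + phi @ Pphi
    L = -Pphi / denom                       # gain L[n], Eq. (14)
    err = x_n + phi @ theta                 # a-priori residual
    theta = theta + L * err                 # coefficient update, Eq. (13)
    P = (P - np.outer(Pphi, Pphi) / denom) / lam   # covariance update, Eq. (15)
    return theta, P

# track a constant AR(1) coefficient (m = 0 basis) as a sanity check
rng = np.random.default_rng(1)
N, a_true = 3000, -0.7
x = np.zeros(N)
for n in range(1, N):
    x[n] = -a_true * x[n - 1] + rng.standard_normal()
theta, P = np.zeros(1), np.eye(1) * 100.0   # arbitrary initial conditions
for n in range(1, N):
    theta, P = rls_step(theta, P, np.array([x[n - 1]]), x[n])
```

With the large initial P, the recursion converges to the batch least-squares solution without ever forming or inverting the (m + 1)(p + q) × (m + 1)(p + q) matrix.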

Index of randomness

Over the past decade, there has been a stream of conflicting papers about the long-range power-law correlations detected in eukaryotic DNA [1–3, 5, 6, 9–12, 17–20]. The controversy is generated by conflicting views that either advocate that non-coding DNA sustains long-range correlations whereas coding DNA behaves like random sequences [1–3, 5, 6, 10], or maintain that both coding and non-coding DNA exhibit long-range power-law correlations [9, 11, 12, 17–20]. Based on the analysis of the time-dependent power spectrum of genomic sequences, Bouaynaya and Schonfeld [11, 12] showed that the statistical differences between coding and non-coding sequences are more subtle than previously concluded using stationary analysis tools. In fact, they found that both coding and non-coding sequences are non-random; however, coding sequences are "whiter" than non-coding sequences.

We propose to quantitatively assess the degree of randomness of both coding and non-coding sequences using the time-dependent ARMA coefficients a_i[n] and b_i[n]. Consider the system function, H(z), of a stationary ARMA model (whose coefficients a_i and b_i are constant, i.e., independent of time). We know that

H(z) = ( ∑_{k=0}^{q} b_k z^{−k} ) / ( ∑_{k=0}^{p} a_k z^{−k} ) = ( ∏_{k=1}^{q} (1 − r_k z^{−1}) ) / ( ∏_{k=1}^{p} (1 − p_k z^{−1}) ),   (16)

where {r_k}_{k=1}^{q} (resp. {p_k}_{k=1}^{p}) are the zeros (resp. poles) of the system function. From the fact that a stationary white noise process has a flat spectrum, we observe that the closer (in L2 distance) the zeros and poles are, the flatter the spectrum of the process is. Following the same reasoning locally for non-stationary processes, we define the curve of randomness, CR[n], of the non-stationary process x[n] by

CR[n] = min_{(r_k[n], p_k[n])} ( (1/q) ∑_{k=1}^{q} |r_k[n] − p_k[n]| + (1/(p − q)) ∑_{k=q+1}^{p} |p_k[n]| ),   if p > q;

CR[n] = min_{(r_k[n], p_k[n])} ( (1/p) ∑_{k=1}^{p} |r_k[n] − p_k[n]| + (1/(q − p)) ∑_{k=p+1}^{q} |r_k[n]| ),   if q > p;

CR[n] = min_{(r_k[n], p_k[n])} ( (1/p) ∑_{k=1}^{p} |r_k[n] − p_k[n]| ),   if p = q.   (17)

where the minimum is taken over all pairings (r_k[n], p_k[n]). Observe that the case p < q is obtained from the case p > q by interchanging the roles of r_k and p_k, and the indices p and q. The curve of randomness defined in Eq. (17) is a measure of how close the zeros and poles of the system function are; therefore, it is a measure of how flat the system function is, and of how close the process is to white noise. The index of randomness, IR(p, q), of a TD-ARMA(p, q) model is then defined as the average of the curve of randomness, i.e.,

IR(p, q) = (1/N) ∑_{n=0}^{N−1} CR[n].   (18)

In particular, the index of randomness of a TD-ARMA(1,1) model (x[n] + a[n − 1] x[n − 1] = w[n] + b[n − 1] w[n − 1]) is given by

IR(1, 1) = (1/N) ∑_{n=0}^{N−1} |a[n] − b[n]|.   (19)

Observe that the index of randomness of a white noise process is equal to zero. We say that the sequence x_1[n], with index of randomness IR_1, is more random than the sequence x_2[n], with index of randomness IR_2, if IR_1 < IR_2.
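For the TD-ARMA(1,1) case, the index of randomness reduces to a one-line computation, since the model's single zero (−b[n]) and pole (−a[n]) are exactly |a[n] − b[n]| apart. A minimal sketch (the function name is an assumption):

```python
import numpy as np

def index_of_randomness_11(a, b):
    """IR(1,1) = (1/N) sum_n |a[n] - b[n]| for a TD-ARMA(1,1) model:
    the zero (-b[n]) and pole (-a[n]) coincide exactly when a[n] = b[n],
    i.e. when the local spectrum is flat (white)."""
    a, b = np.asarray(a), np.asarray(b)
    return np.mean(np.abs(a - b))

# a white-like sequence (a == b) scores 0; separated zero/pole score > 0
ir_white = index_of_randomness_11([0.5, 0.4], [0.5, 0.4])
ir_col = index_of_randomness_11([0.5, 0.4], [0.1, 0.0])
```

A lower value thus means a locally flatter system function, which is how the coding/non-coding comparison in Table 1 is read.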


All genome sequences considered in this paper were extracted from the NIH website, and the algorithms were implemented in MATLAB. The DNA sequences were mapped to numerical sequences using the purine-pyrimidine DNA walk proposed in [1]. In our simulations, the recursive least-squares algorithm was found to best estimate the time-dependent coefficients of the TD-ARMA model; we used the MATLAB function rarmax, which implements the recursive least-squares algorithm for TD-ARMA models. The orders p and q of the model were determined experimentally as follows: for each genomic sequence, we computed 100 TD-ARMA models corresponding to the orders (1, 1) up to (10, 10), and chose the model that minimizes the average squared error between the actual and the fitted sequences. Our extensive simulations on various DNA sequences from different organisms show that most sequences are best fitted with low-order TD-ARMA models, e.g., TD-ARMA(1,1), TD-ARMA(1,2), and TD-ARMA(2,1).

Figure 3 shows the DNA walk of the Human gene 276 and its fitted TD-ARMA(1,1) sequence; observe that the TD-ARMA(1,1) model fits this gene sequence accurately. The estimated time-varying coefficients a[n] and b[n] are displayed in Fig. 4 for both the coding and non-coding regions of this gene; their statistical differences are not apparent from the plots of the coefficient time-series alone. The curves of randomness of the coding and non-coding regions are displayed in Fig. 5.

Table 1 shows the index of randomness of various gene sequences. For a concise presentation, the column titles have been abbreviated as follows: "C. length" (resp. "N.C. length") denotes the length (in base pairs) of the coding (resp. non-coding) segment of the gene; the total length of the gene is the sum of the lengths of its coding and non-coding regions. "C. (p, q)" (resp. "N.C. (p, q)") denotes the optimal TD-ARMA orders (p, q) for the coding (resp. non-coding) region of the gene. Finally, "C. IR" (resp. "N.C. IR") is the index of randomness of the coding (resp. non-coding) segment of the gene.

Observe that, in all considered genes, the indices of randomness of both the coding and non-coding segments are strictly positive, and the index of randomness of the coding region is consistently lower than that of the non-coding region (recall that the index of randomness of white noise is zero). These observations lead to two important conclusions: (i) neither the coding nor the non-coding sequences are random, as attested by an index of randomness greater than zero; and (ii) the coding sequences are "whiter" than the non-coding sequences. This conclusion contradicts previous work on the statistical correlation of DNA sequences, which, based on stationary time-series analysis, presumed that coding DNA behaves like a random sequence [13, 5, 6, 10], and supports the conflicting view that neither coding nor non-coding sequences are random [11, 12, 9, 17-20]. In particular, our conclusion is in accordance with the evolutionary periodogram analysis conducted in [11, 12].
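For concreteness, the purine-pyrimidine DNA walk used to obtain the numerical sequences can be sketched as follows. This is a minimal Python illustration of the standard construction (purines A, G step +1; pyrimidines C, T step -1; the walk is the running sum), not the MATLAB code used in the paper; the function name dna_walk is ours.

```python
import numpy as np

def dna_walk(seq):
    """Map a DNA string to its cumulative purine-pyrimidine walk.

    Purines (A, G) contribute +1 and pyrimidines (C, T) contribute -1;
    the walk is the cumulative sum of these steps, as in [1].
    """
    steps = np.array([1 if base in "AG" else -1 for base in seq.upper()])
    return np.cumsum(steps)

# Toy example: a short (hypothetical) nucleotide string
walk = dna_walk("ATGGCCATT")
```

The resulting walk is the numerical signal to which the TD-ARMA model is fitted.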

Figure 3

TD-ARMA modeling. TD-ARMA modeling of the Human gene 276: The blue signal is the DNA walk, and the red signal is the TD-ARMA(1,1) fitted signal. The TD-ARMA(1,1) model accurately fits the genomic signal.

Figure 4

TD-ARMA coefficients estimation. Estimation of the TD-ARMA(1,1) coefficients of the Human gene 276. The TD-ARMA(1,1) model is given by x[n] + a[n-1] x[n-1] = w[n] + b[n-1] w[n-1]. The blue and black (resp. red and green) curves depict the time series a[n] (resp. b[n]) for the coding and non-coding regions of the gene, respectively.
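The recursion in the caption, and the recursive least-squares tracking of its coefficients, can be illustrated with a small self-contained sketch. This is a generic extended-RLS (pseudo-linear regression) implementation written by us for illustration, under our own assumptions (constant coefficients a = -0.5, b = 0.3 as a sanity check, unit-variance Gaussian innovations); it is not the MATLAB rarmax routine used in the paper.

```python
import numpy as np

def simulate_td_arma11(a, b, rng):
    """Simulate x[n] + a[n-1] x[n-1] = w[n] + b[n-1] w[n-1], w[n] white Gaussian."""
    N = len(a)
    w = rng.standard_normal(N)
    x = np.zeros(N)
    x[0] = w[0]
    for n in range(1, N):
        x[n] = -a[n - 1] * x[n - 1] + w[n] + b[n - 1] * w[n - 1]
    return x

def rls_arma11(x, lam=0.99):
    """Track theta = [a, b] by extended recursive least squares.

    One-step predictor: x_hat[n] = theta . phi[n] with regressor
    phi[n] = [-x[n-1], e[n-1]], where e is the prediction error
    (the pseudo-linear regression behind recursive ARMA estimators);
    lam < 1 is a forgetting factor that allows tracking slow variation.
    """
    theta = np.zeros(2)
    P = 1e3 * np.eye(2)
    e_prev = 0.0
    est = np.zeros((len(x), 2))
    for n in range(1, len(x)):
        phi = np.array([-x[n - 1], e_prev])
        e = x[n] - phi @ theta               # a priori prediction error
        K = P @ phi / (lam + phi @ P @ phi)  # RLS gain
        theta = theta + K * e
        P = (P - np.outer(K, phi @ P)) / lam
        e_prev = x[n] - phi @ theta          # a posteriori error feeds the regressor
        est[n] = theta
    return est

rng = np.random.default_rng(0)
N = 3000
a = np.full(N, -0.5)   # hypothetical constant coefficients for the sanity check
b = np.full(N, 0.3)
x = simulate_td_arma11(a, b, rng)
est = rls_arma11(x, lam=0.999)
```

With time-varying a[n] and b[n], a smaller forgetting factor trades estimation variance for tracking speed.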

Figure 5

Curve of randomness. The curves of randomness of the coding and non-coding regions of the Human gene 276 are shown in blue and red, respectively. The index of randomness of the coding sequence is equal to 1.0603, whereas that of the non-coding sequence is equal to 1.0627.

Table 1 Index of Randomness of the Coding and Non-Coding segments of Various Gene Sequences


In this paper, we modelled non-stationary genomic sequences by a time-dependent autoregressive moving average (TD-ARMA) model. By expressing the time-dependent coefficients as linear combinations of parametric basis functions, we transformed a linear non-stationary problem into a linear time-invariant one. We then proposed three methods to estimate the time-dependent coefficients: mean-square, least-squares, and recursive least-squares algorithms. Based on the estimated TD-ARMA coefficients, we defined an index of randomness to quantify the degree of randomness of both coding and non-coding sequences. We found that neither coding nor non-coding sequences are random; however, the lower index of randomness of the coding regions attests that coding sequences are "whiter" than non-coding sequences. These results corroborate the evolutionary periodogram analysis of genomic sequences performed in [11] and [12], and refute the conclusion of stationary analyses that coding DNA behaves like a random sequence.
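The basis-expansion step described above can be illustrated on a first-order special case. The sketch below is our own illustration under stated assumptions (a hypothetical TD-AR(1) process with a linearly drifting coefficient, and the polynomial basis {1, n/N}); it shows how writing a[n] as a linear combination of known basis functions reduces the time-varying estimation problem to an ordinary least-squares fit of time-invariant weights.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000
t = np.arange(N) / N
a_true = -0.3 - 0.4 * t          # hypothetical linearly drifting AR coefficient

# Simulate the TD-AR(1) process x[n] + a[n] x[n-1] = w[n]
w = rng.standard_normal(N)
x = np.zeros(N)
x[0] = w[0]
for n in range(1, N):
    x[n] = -a_true[n] * x[n - 1] + w[n]

# Basis functions f_k(n): here the polynomial basis {1, n/N}
F = np.column_stack([np.ones(N), t])   # shape (N, K)

# With a[n] = sum_k c_k f_k(n), the model x[n] = -a[n] x[n-1] + w[n]
# becomes linear in the time-invariant weights c_k:
Phi = -F[1:] * x[:-1, None]            # regressors, shape (N-1, K)
c, *_ = np.linalg.lstsq(Phi, x[1:], rcond=None)

a_hat = F @ c                          # reconstructed time-varying coefficient
```

The recovered weights c approximate the true expansion coefficients (-0.3, -0.4), and a_hat reconstructs the full coefficient trajectory from only K time-invariant parameters.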