1 Introduction

This paper proposes a new method to specify linear models for vectors of time series which draws on ideas from two different sources: subspace methods, which originated in the system identification literature, and the time series analysis approach, built on the seminal work of Box and Jenkins (1970).

Since the pioneering work of Ho and Kalman (1966), the term “system identification” has referred to the use of statistical and algebraic tools to fit a State-Space (SS) model to a dataset without a priori restrictions. Much of the recent work in this area follows the subspace methods approach: a family of system identification procedures that rely heavily on canonical correlation and Least Squares (LS) methods. For a survey see, e.g., Qin (2006).

On the other hand, modern time series analysis is critically influenced by the classical Box and Jenkins (1970) book and subsequent works. The philosophical contribution of this methodology can be synthesized in three basic principles: (i) “let the data speak,” meaning that the specification of reduced form models should be based on the observable properties of the sample; (ii) “parameterize with parsimony,” meaning that models with few parameters should be preferred, all other aspects being equal; and (iii) “all models are wrong, but some are useful,” meaning that all specifications must be considered tentative and subject to a structured process of critical revision and diagnostic testing. We subscribe to these principles and have implemented them in the foundations of our approach.

On these bases, our method is structured in three sequential steps:


Determine the dynamic order of the system (or McMillan degree), defined as the minimum number of dynamic components required to realize the system output. We do this by using subspace methods to estimate a sequence of SS models over a predetermined range of orders, and then choosing the best one by means of Likelihood Ratio (LR) tests and information criteria (ICs). This simple decision, similar to the VAR order selection, is enough to fit a preliminary “Step-1 model” to the dataset, which might be overparameterized but useful to cover simple needs.


Determine the dynamic order of each individual time series, also known as the Kronecker indices (KIs). We do this by applying a new subspace algorithm, named SUBEST1, which estimates the parameters of the Luenberger Canonical Form corresponding to a given vector of KIs. We use this algorithm to estimate the models corresponding to all the combinations of KIs that add up to the previously determined system order. The model with the best ICs (the “Step-2 model”) is therefore a canonical representation of the time series and will often be more parsimonious than the Step-1 model. In the univariate case this step is not required, because the system order coincides with the KI of the single series and the Step-1 model is, therefore, canonical.


The third and final step consists of refining the models obtained in the previous steps, for example by computing Maximum-Likelihood Estimates (MLE) or by pruning non-significant parameters. Such a “Step-3 model” might be more parsimonious than the previously fitted models but, on the other hand, requires more effort and human decision-making.

The computational procedures employed to implement Step 2, as well as the distinction between Steps 1, 2 and 3, are original contributions to the literature on canonical modeling, although closely related to the works of Hannan and Deistler (1988), Deistler et al. (1995), Bauer and Deistler (1999) and Bauer (2005c).

In comparison with mainstream econometric procedures, our proposal has three specific advantages, as it:

  1. ...provides a unified modeling approach for single and multiple time series, as the same model-building process is followed in both cases,

  2. ...is optionally automatic, meaning that each model is characterized by a few key integer parameters to be chosen following a clearly structured process based on a limited set of sample statistics, so the modeling decisions can be adopted either automatically or by a human analyst, and

  3. ...is scalable, meaning that: (i) it provides a possibly crude but statistically adequate Step-1 model, which may be enough to satisfy simple analytical needs; (ii) this model can be made parsimonious, if required, by enforcing a canonical structure in Step-2; and (iii) additional efficiency can be achieved in Step-3.

The structure of the paper is as follows. Section 2 presents the notation and reviews some ideas on VARMA and SS systems. It also introduces subspace methods and their asymptotic properties. Section 3 outlines the proposed methodology in the relatively simple case of stationary and non-seasonal time series, and Sect. 4 extends it to nonstationary and/or seasonal series. We do this by adding to the previous schema two specification phases, one to determine the number of unit roots and cointegrating relations, and another to identify the dynamic orders of the seasonal and non-seasonal sub-systems. Sections 5 and 6 illustrate the practical application of this approach to single and multiple time series, respectively, with benchmark datasets taken from the literature. Last, Sect. 7 compares the proposed method with alternative methods, provides some concluding remarks and indicates how to obtain free MATLAB code implementing all the algorithms required by our proposal.

2 Notation and previous results

In this section we present the notation employed and the main previous results upon which our work is based. We begin by defining VARMA and SS systems. Then, we introduce subspace methods and finally discuss the relation between canonical VARMA models (named echelon) and their equivalent SS representation.

2.1 VARMA models

Much work in applied time series analysis is based on the linear dynamic model:

$$\begin{aligned} \varvec{\bar{F}}(B)\textbf{z}_t \, = \, \varvec{\bar{L}}(B) \textbf{a}_t, \end{aligned}$$

where \(\textbf{z}_t\in {\mathbb {R}}^m\) is observed for \(t=1,\ldots ,T\); \(\textbf{a}_t \in {\mathbb {R}}^m\) is an innovation vector such that \(\textbf{a}_t \sim iid(\textbf{0},\varvec{\Sigma _a})\); B denotes the backshift operator, such that for any sequence \(\omega _t\): \(B^i\omega _t=\omega _{t-i}\), \(i=0,\pm 1,\pm 2,\ldots \) and, finally, the factors \(\varvec{\bar{F}}(B)\) and \(\varvec{\bar{L}}(B)\) in (1) are given by:

$$\begin{aligned} \varvec{\bar{F}}(B)=\sum _{j=0}^p\varvec{\bar{F}_j}B^j, \quad \varvec{\bar{L}}(B)=\sum _{j=0}^q\varvec{\bar{L}_j}B^j. \end{aligned}$$

Model (1)–(2) is assumed to be left coprime, with the roots of \(\varvec{\bar{F}}(B)\) and \(\varvec{\bar{L}}(B)\) greater than unity in modulus, so \(\textbf{z}_t\) is stationary and invertible.

If we multiply both sides of Eq. (1) by an arbitrary nonsingular matrix, we obtain an observationally equivalent model for \(\textbf{z}_t\). Therefore, left coprimeness is not enough to identify the model, and achieving identifiability requires additional constraints. For example, normalising \(\bar{\textbf{F}}_0=\bar{\textbf{L}}_0=\textbf{I}\) yields the standard VARMA(p,q) introduced by Quenouille (1957). Indeed, for a given transfer function, several VARMA(p,q) models exist, even under left coprimeness. For more details, the equivalence classes of VARMA systems are characterized in Hannan and Deistler (1988), Theorem 2.2.1.

An interesting alternative to a standard VARMA is the VARMA echelon, or VARMA\(_{E}\), representation. System (1)–(2) is in echelon form if, denoting by \(\bar{F}_{kl}(B)\) and \(\bar{L}_{kl}(B)\) the kl-th elements of \(\varvec{\bar{F}}(B)\) and \(\varvec{\bar{L}}(B)\), respectively, the polynomial factors in (1) are uniquely defined by:

$$\begin{aligned} \bar{F}_{kk}(B)&=1+\sum _{i=1}^{n_k}\bar{F}_{kk}(i)B^i, \quad \text {for }k=1,\ldots ,m, \end{aligned}$$
$$\begin{aligned} \bar{F}_{kl}(B)&=\sum _{i=n_k-n_{kl}+1}^{n_k}\bar{F}_{kl}(i)B^i, \quad \text {for }k\ne l, \end{aligned}$$
$$\begin{aligned} \bar{L}_{kl}(B)&=\sum _{i=1}^{n_k}\bar{L}_{kl}(i)B^i, \quad \text {with } \bar{L}_{kl}(0) = \bar{F}_{kl}(0), \quad \text {for }k,l=1,\ldots ,m, \end{aligned}$$

where the multi-index \(\alpha = (n_1,n_2,\ldots ,n_m)\), previously named KI, represents the dynamic order of each series, and:

$$\begin{aligned} n_{kl}=\left\{ \begin{array}{ll} \min (n_k+1,n_l) &{} \text {for} \quad l \le k\\ \min (n_k,n_l) &{} \text {for} \quad l > k \end{array} \right\} \quad k,l=1,2,\ldots ,m. \end{aligned}$$

The KIs uniquely define an echelon canonical VARMA form by means of Eqs. (3a)–(3c) and (4); see Hannan and Deistler (1988), Theorem 2.5.1. Broadly speaking, these indices determine the structure of zeros, ones and free parameters of the echelon form. By canonical, we mean a unique representation (commonly with the minimum number of parameters) within the class of output-equivalent VARMA models. Another property of the KIs is that \(\sum _{k=1}^{m}n_k=|\alpha |=n\), meaning that the echelon form distributes the dynamic order among the different time series so that their sum equals the McMillan degree, denoted by n. Specific details about echelon forms can be found in Hannan and Deistler (1988), Chapter 2, and Lütkepohl (2005), Definition 12.2.
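As an illustration, the zero/free-parameter structure implied by Eqs. (3a)–(3c) and (4) can be computed mechanically from a KI vector. The following sketch (our own helper names, not part of the paper's algorithms) counts the free coefficients for the KI vector \(\alpha =(1,0)\) used in the examples below:

```python
import numpy as np

def n_kl(alpha, k, l):
    """Eq. (4): number of free lags in the off-diagonal entry F_kl(B)."""
    if l <= k:
        return min(alpha[k] + 1, alpha[l])
    return min(alpha[k], alpha[l])

def echelon_structure(alpha):
    """Free AR coefficients per (k, l) entry of F(B), per Eqs. (3a)-(3b)."""
    m = len(alpha)
    free = np.zeros((m, m), dtype=int)
    for k in range(m):
        for l in range(m):
            free[k, l] = alpha[k] if k == l else n_kl(alpha, k, l)
    return free

def free_param_count(alpha):
    """Total free coefficients: AR terms plus the m * n_k MA terms of Eq. (3c)."""
    m = len(alpha)
    return int(echelon_structure(alpha).sum() + m * sum(alpha))

# For alpha = (1, 0): F_11 has one free lag, F_21 one (its lag-0 term,
# tied to L_21(0)), and the MA part contributes two more -- four in total
# (cf. Remark 2 below).
```

This reproduces, for \(\alpha =(1,0)\), the four free coefficients of the echelon form discussed in the example that follows.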

To clarify the exposition above, the following example compares two equivalent VARMA representations of the same transfer function with \(\alpha =(1,0)\). We will use this model in different examples throughout this paper.

2.1.1 Example 1: A VARMA(1,1) model vs. its equivalent VARMA\(_{E}\)

Consider the left coprime standard VARMA(1,1) model:

$$\begin{aligned} (\textbf{I}+\varvec{\bar{F}_1}B)\textbf{z}_t=(\textbf{I}+\varvec{\bar{L}_1}B)\textbf{a}_t,\quad \textbf{a}_t\sim iid N(\textbf{0},\varvec{\Sigma _a}), \end{aligned}$$


$$\begin{aligned} \varvec{\bar{F}_1}=\begin{bmatrix} -\,0.50&{}0\\ 0.25&{}0\end{bmatrix}, \quad \varvec{\bar{L}_1}=\begin{bmatrix} -\,0.30&{}0.20\\ 0.15&{}-0.10\end{bmatrix}, \quad \varvec{\Sigma _a}=\begin{bmatrix} 1.00&{}0.40\\ 0.40&{}1.00\end{bmatrix}. \end{aligned}$$

As \(\alpha =(1,0)\), its equivalent VARMA\(_E\) form is defined as:

$$\begin{aligned} (\mathbf {F_0}+\mathbf {F_1}B)\textbf{z}_t=(\mathbf {L_0}+\mathbf {L_1}B)\textbf{a}_t, \end{aligned}$$


$$\begin{aligned} \mathbf {F_0=L_0}=\begin{bmatrix} 1&{}0\\ 0.50&{}1\end{bmatrix}, \quad \mathbf {F_1}=\begin{bmatrix} -\,0.50&{}0\\ 0&{}0\end{bmatrix}, \quad \mathbf {L_1}=\begin{bmatrix} -\,0.30&{}0.20\\ 0&{}0\end{bmatrix}, \end{aligned}$$

and \(\varvec{\Sigma _a}\) coincides with the one in (5), because both models are observationally equivalent and, therefore, have the same innovations.

This example motivates the following remarks:

Remark 1

It is easy to see the output equivalence of models (5) and (6), as pre-multiplying both sides of (6) by \({{\textbf {F}}_{\textbf {0}}^{\textbf {-1}}}\) yields (5).Footnote 1
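This equivalence is easy to verify numerically. The sketch below (matrices copied from (5) and (6)) checks that pre-multiplying the echelon factors by \(\textbf{F}_0^{-1}\) recovers the standard VARMA coefficients:

```python
import numpy as np

# Standard VARMA(1,1) coefficients from (5)
F1_bar = np.array([[-0.50, 0.00], [0.25, 0.00]])
L1_bar = np.array([[-0.30, 0.20], [0.15, -0.10]])

# Echelon coefficients from (6)
F0 = np.array([[1.00, 0.00], [0.50, 1.00]])   # = L0
F1 = np.array([[-0.50, 0.00], [0.00, 0.00]])
L1 = np.array([[-0.30, 0.20], [0.00, 0.00]])

# Pre-multiplying (6) by F0^{-1} must recover the standard form (5)
F0_inv = np.linalg.inv(F0)
assert np.allclose(F0_inv @ F1, F1_bar)
assert np.allclose(F0_inv @ L1, L1_bar)
```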

Remark 2

The standard VARMA specification (5) has six coefficients while its equivalent echelon form (6) only has four, parsimony being one of the advantages of canonical over non-canonical formulations.

Remark 3

As \(\alpha =(1,0)\) in (6), each scalar in the vector corresponds to the highest dynamic order for the corresponding series. Therefore, the order of the whole system is \(n=1\).

Remark 4

When all the KIs are equal, and only in this case, the standard VARMA model (1)–(2) has an echelon structure and, accordingly, is a canonical representation. Otherwise, the echelon form has fewer parameters than the equivalent standard VARMA which, accordingly, is not canonical.

Remark 5

Models with different KI vectors are not nested in general, so they cannot be compared using LR tests. Their relative adequacy is often assessed by means of ICs, see, e.g., Hannan and Kavalieris (1984a); Poskitt (1992); Lütkepohl and Poskitt (1996).

As the above remarks point out, when the KIs are different, one can still obtain the equivalent standard VARMA representation for any echelon form by pre-multiplying the latter by an appropriate unimodular transformation. Nonetheless, Gevers (1985) shows that, in general, the standard VARMA representation (1)–(2) implies a McMillan degree of \(m \times \max \{p,q\}\) and, therefore, only transfer functions whose system order is a multiple of m can be adequately represented in that way. Additional details on the nesting structure of the KIs can be found in Hannan and Deistler (1988), Section 2.6. These facts are fundamental for our paper, and motivate Sects. 3.1 and 3.2.

2.2 State-Space systems

The linear time-invariant SS representation employed in this paper is:

$$\begin{aligned} \textbf{x}_{t+1}&= \varvec{\Phi } \textbf{x}_t + \textbf{K} \textbf{a}_t , \end{aligned}$$
$$\begin{aligned} \textbf{z}_t&= \textbf{H}\textbf{x}_t+ \textbf{a}_t , \end{aligned}$$

where \(\textbf{z}_t\) and \(\textbf{a}_t\) are as in (1)–(2). We assume that the SS system is:

  A1. ...strictly minimum-phase, that is, \(|\lambda _i(\varvec{\Phi }-\textbf{KH})|<1,\forall i=1,...,n\), where \(\lambda _i(.)\) denotes the i-th eigenvalue,

  A2. ...minimal, which implies that the pair \((\varvec{\Phi },\textbf{H})\) is observable and \((\varvec{\Phi },\textbf{K})\) is controllable, and

  A3. ...stable, so \(|\lambda _i(\varvec{\Phi })|<1,\forall i=1,...,n\).

In comparison with mainstream alternatives, model (7a)–(7b) has a unique error term affecting both the state and the observation equations. This model is sometimes known as the Prediction Error or Innovations Form (IF).Footnote 2 We will use the latter name throughout the paper. The importance of the IF lies in the fact that

  1. ...it is a general representation, because any SS system with distinct unobserved inputs in the state and observation equations can be written in an equivalent IF (see Hannan and Deistler 1988, Chapter 1). Additionally, Casals et al. (1999) present an algorithm to compute the matrices of an IF equivalent to any SS model by solving an exactly determined system of equations, and

  2. ...it has clear computational advantages for likelihood computation and signal extraction, because its states can be estimated with an uncertainty that converges to zero under nonrestrictive assumptions, see Casals et al. (2013).

Most of these results are the core of Chapters 1 and 2 in Hannan and Deistler (1988), and have also been empirically discussed in Casals et al. (2016), Chapters 2, 5 and 9.

2.3 Subspace methods

Subspace methods estimate the parameters of the IF (7a)–(7b) by applying generalized LS to a rank-deficient system, see e.g. Qin (2006); Favoreel et al. (2000). To do this, we write the model in subspace form by substituting \(\textbf{a}_t\) from (7b) into (7a) and solving the resulting recursion. This yields:

$$\begin{aligned} \textbf{x}_{t}=(\varvec{\Phi }-\textbf{KH})^{t-1} \textbf{x}_1+\sum _{j=1}^{t-1} (\varvec{\Phi }-\textbf{KH})^{t-1-j}\textbf{K}\textbf{z}_{j}, \end{aligned}$$

where the state at time t depends on its initial value, \(\textbf{x}_1\), as well as on the past values of the output.

Substituting (8) into the observation Eq. (7b) results in:

$$\begin{aligned} \textbf{z}_{t}=\textbf{H}\varvec{\Phi }^{t-1} \textbf{x}_1+ \textbf{H}\sum _{j=1}^{t-1}\varvec{\Phi }^{t-1-j} \textbf{Ka}_{j}+\textbf{a}_{t}. \end{aligned}$$
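The recursions (8)–(9) can be checked numerically. The following sketch (illustrative stable and strictly minimum-phase matrices of our choosing) simulates a small IF and confirms that the closed-form expression (8) reproduces the state obtained by iterating (7a)–(7b):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 2, 2, 50
# Illustrative stable, strictly minimum-phase IF matrices (not estimated values)
Phi = np.array([[0.8, 0.2], [-0.4, -0.5]])
K = np.array([[1.5, 0.0], [-0.2, -0.8]])
H = np.eye(m)

# Simulate (7a)-(7b); x[0] plays the role of x_1
a = rng.standard_normal((T, m))
x = np.zeros((T + 1, n))
x[0] = rng.standard_normal(n)
z = np.zeros((T, m))
for t in range(T):
    z[t] = H @ x[t] + a[t]                     # (7b)
    x[t + 1] = Phi @ x[t] + K @ a[t]           # (7a)

# Closed form (8): x_t from x_1 and the past outputs z_1, ..., z_{t-1}
Abar = Phi - K @ H
def state_from_past(t):                        # t is 1-based, as in the text
    xt = np.linalg.matrix_power(Abar, t - 1) @ x[0]
    for j in range(1, t):
        xt = xt + np.linalg.matrix_power(Abar, t - 1 - j) @ K @ z[j - 1]
    return xt
```

Evaluating `state_from_past(t)` for any t matches the simulated state to machine precision.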

Equations (8) and (9) can be written in matrix form as:

$$\begin{aligned} \textbf{X}_f&=\varvec{\mathcal {L}}_x^i \textbf{X}_p + \varvec{\mathcal {L}}_z \textbf{Z}_p, \end{aligned}$$
$$\begin{aligned} \textbf{Z}_f&=\varvec{\mathcal {O}}_i\textbf{X}_f + \textbf{V}_{i}\textbf{A}_f, \end{aligned}$$

where i is a user-defined index representing the number of past block rows, p. For simplicity, we will assume hereafter that the scalars p and f are equal to i.Footnote 3 For now, the only requirement on this value is that \(i>n\). This dimensioning implies that the \(\varvec{\mathcal {L}}\) matrices in (10a) are:

$$\begin{aligned} \varvec{\mathcal {L}}_x&:= (\varvec{\Phi }-\textbf{KH})_{n \times n}, \end{aligned}$$
$$\begin{aligned} \varvec{\mathcal {L}}_z&:= \big [\varvec{\mathcal {L}}_x^{i-1}\textbf{K} \quad \ldots \quad \varvec{\mathcal {L}}_x\textbf{K} \quad \textbf{K}\big ]_{n \times im}. \end{aligned}$$

Given these choices, \(\textbf{Z}_f\) and \(\textbf{Z}_p\) are block-Hankel matrices whose rows are \([\textbf{z}_{t}^{\intercal },\textbf{z}_{t+1}^{\intercal }, \dots ,\textbf{z}_{t+f-1}^{\intercal }]^{^{\intercal }}\) and \([\textbf{z}_{t-p}^{\intercal }, \textbf{z}_{t-p+1}^{\intercal }, \dots , \textbf{z}_{t-1}^{\intercal }]^{^{\intercal }}\), respectively, and each column is characterized by \(t=p+1,\dots ,T-f+1\). Therefore, the dimension of both matrices is \(im \times (T-2i+1)\). The block-Hankel matrix \(\textbf{A}_f\) has the same general structure as \(\textbf{Z}_f\), with \(\textbf{a}_t\) instead of \(\textbf{z}_t\), while \(\textbf{X}_p:=[\textbf{x}_1, \textbf{x}_2, \ldots , \textbf{x}_{T-p-f+1}]\) and \(\textbf{X}_f:=[\textbf{x}_{p+1}, \textbf{x}_{p+2}, \ldots , \textbf{x}_{T-f+1}]\), which have n rows (from assumption A2) and the same number of columns as \(\textbf{Z}_f\), are the sequences of past and future states, respectively.
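The construction of these block-Hankel matrices is purely mechanical. A minimal sketch (the helper name `block_hankel` is ours) that builds \(\textbf{Z}_p\) and \(\textbf{Z}_f\) from a sample and respects the stated dimensions:

```python
import numpy as np

def block_hankel(z, i, past=True):
    """Build Z_p (past=True) or Z_f (past=False) with p = f = i block rows.
    Columns correspond to t = i+1, ..., T-i+1, as in the text."""
    T, m = z.shape
    cols = T - 2 * i + 1
    out = np.empty((i * m, cols))
    for r in range(i):
        for c in range(cols):
            t0 = c + r if past else c + i + r     # 0-based time index
            out[r * m:(r + 1) * m, c] = z[t0]
    return out

T, m, i = 30, 2, 4
z = np.random.default_rng(1).standard_normal((T, m))
Zp = block_hankel(z, i, past=True)
Zf = block_hankel(z, i, past=False)
```

Both matrices come out with dimension \(im \times (T-2i+1)\), and shifting one block row down equals shifting one column right, which is the defining Hankel property.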

Finally, matrices \(\varvec{\mathcal {O}}_i\) and \(\textbf{V}_i\) are known functions of the original parameter matrices, \(\varvec{\Phi }\), \(\textbf{K}\) and \(\textbf{H}\), see Section 2 of Garcia-Hiernaux et al. (2010), such that:

$$\begin{aligned}{} & {} {\varvec{\mathcal {O}}}_i:= \begin{bmatrix} \textbf{H}^{^{\intercal }}&(\mathbf {H \varvec{\Phi }})^{^{\intercal }}&(\mathbf {H \varvec{\Phi }}^2)^{^{\intercal }}&\ldots&(\mathbf {H \varvec{\Phi }}^{i-1})^{^{\intercal }} \end{bmatrix}^{\intercal }_{i \cdot m \times n}, \end{aligned}$$
$$\begin{aligned}{} & {} \textbf{V}_i:= \begin{bmatrix} \textbf{I}_m &{} \textbf{0} &{} \textbf{0} &{}\ldots &{} \textbf{0}\\ \textbf{HK} &{} \textbf{I}_m &{} \textbf{0} &{} \ldots &{} \textbf{0}\\ \mathbf {H\varvec{\Phi } K} &{} \textbf{HK} &{} \textbf{I}_m &{} \ldots &{} \textbf{0} \\ \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ \mathbf {H\varvec{\Phi }}^{i-2} \textbf{K} &{} \mathbf {H\varvec{\Phi }}^{i-3} \textbf{K} &{} \mathbf {H\varvec{\Phi }}^{i-4} \textbf{K} &{} \ldots &{} \textbf{I}_m \end{bmatrix}_{i \cdot m}. \end{aligned}$$

From assumption A1 and for sufficiently large i, \(\varvec{\mathcal {L}}^i_x\) tends to the null matrix. Consequently, substituting Eq. (10a) into (10b) gives:

$$\begin{aligned} \textbf{Z}_f \simeq \varvec{\Theta }_i \textbf{Z}_p + \textbf{V}_{i}\textbf{A}_f, \end{aligned}$$

where \(\varvec{\Theta }_i:= \varvec{\mathcal {O}}_i\varvec{\mathcal {L}}_z\).

Equation (15) is the basis of subspace methods. It is a regression model which can be consistently estimated by LS because the columns in \(\textbf{V}_{i}\textbf{A}_f\) are uncorrelated with the regressor \(\textbf{Z}_p\) (from assumption A3).

Note that, for sufficiently large i, the matrix \(\varvec{\Theta }_i\) has reduced rank n, smaller than its dimension \(im \times im\). Estimating (15) therefore requires reduced-rank LS. The estimation problem can be written as:

$$\begin{aligned} \min _{\{\varvec{\Theta }_i\}}&\quad \Big \Vert \textbf{W}_1\big (\textbf{Z}_f-\varvec{\Theta }_i\textbf{Z}_p\big )\Big \Vert _F^2, \qquad \text {such that rank}(\varvec{\Theta }_i)=n, \end{aligned}$$

where \(\Vert \cdot \Vert _F\) is the Frobenius norm. This problem is solved by computing:

$$\begin{aligned} \varvec{\hat{\Theta }}_i = \textbf{Z}_f \textbf{Z}_p^{^{\intercal }} (\textbf{Z}_p \textbf{Z}_p^{^{\intercal }})^{-1}, \end{aligned}$$

and then applying a singular value decomposition (SVD, denoted as \(\overset{svd}{=}\)) to a weighted version of \(\varvec{\hat{\Theta }}_i\) to deal with the rank reduction. Van Overschee and De Moor (1995) proved that several subspace methods are equivalent to computing:

$$\begin{aligned} \textbf{W}_1 \varvec{\hat{\Theta }}_i \textbf{W}_2 \overset{svd}{=}\ \textbf{U}_{n} \textbf{S}_{n} \textbf{V}_{n}^{^{\intercal }} + \hat{\textbf{E}}_n, \end{aligned}$$

where \(\hat{\textbf{E}}_n\) denotes the approximation error arising from the use of a system order n, and \(\textbf{W}_1\), \(\textbf{W}_2\) are two nonsingular weighting matrices whose composition depends on the specific algorithm employed. While \(\textbf{W}_2 = (\textbf{Z}_p \textbf{Z}_p^{^{\intercal }})^{\frac{1}{2}}\) is common to most subspace methods (at least for systems without observable inputs), \(\textbf{W}_1\) differs across them; see Table 1.

Table 1 Weighting matrix \(\textbf{W}_1\) in Eq. (18) in different subspace algorithms

Table 1 shows that MOESP and N4SID subspace algorithms use the identity as weighting matrix. In contrast, when CVA is used, the singular values obtained from the SVD decomposition coincide with the canonical correlations between \(\textbf{Z}_f\) and \(\textbf{Z}_p\).

Alternatively, the weighting matrix used in this paper is an estimate of the inverse of the square root of the error covariance matrix, where \(\varvec{\Pi }^{\perp }_{\textbf{Z}_p}\) denotes the projection onto the orthogonal complement of \(\textbf{Z}_{p}\). As we will discuss later (see Sect. 2.3.3), adequately choosing these weightings is crucial for some asymptotic results.
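The chain (17)–(18) can be sketched compactly: an unrestricted LS estimate followed by a weighted SVD truncated at rank n. The function below is a minimal illustration with our own naming, not the paper's implementation; identity weights correspond to the MOESP/N4SID rows of Table 1:

```python
import numpy as np

def reduced_rank_theta(Zf, Zp, n, W1=None, W2=None):
    """Unrestricted LS (17) followed by the weighted rank-n SVD truncation (18)."""
    im = Zf.shape[0]
    W1 = np.eye(im) if W1 is None else W1
    W2 = np.eye(im) if W2 is None else W2
    Theta = Zf @ Zp.T @ np.linalg.inv(Zp @ Zp.T)       # eq. (17)
    U, s, Vt = np.linalg.svd(W1 @ Theta @ W2)
    Theta_n = (np.linalg.inv(W1) @ U[:, :n] @ np.diag(s[:n])
               @ Vt[:n] @ np.linalg.inv(W2))           # keep n singular values
    return Theta_n, s

# Demo on random data with identity weights
rng = np.random.default_rng(2)
Zp_d = rng.standard_normal((6, 40))
Zf_d = rng.standard_normal((6, 40))
Theta_n, s = reduced_rank_theta(Zf_d, Zp_d, n=2)
```

By construction the returned matrix has rank n, which is the constraint imposed in (16).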

From Eq. (18), the matrices in (7a)–(7b) are usually estimated following one of two main approaches. The first uses the extended observability matrix and is known as the shift invariance approach, as it exploits the shift invariance property of \(\varvec{\mathcal {O}}_i\). The second uses the estimated state sequence and is usually termed the state approach.

2.3.1 Shift invariance approach

The extended observability matrix in (13) can be estimated from (18) as \(\hat{\varvec{\mathcal {O}}}_i = \textbf{W}_1^{-1}\textbf{U}_n \textbf{S}_n^{1/2}\), using that \(\varvec{\Theta }_i:= \varvec{\mathcal {O}}_i \varvec{\mathcal {L}}_z\). Note that if a different factorization were used, the resulting estimates would be observationally equivalent.

From Arun and Kung (1990) and Verhaegen (1994), we can obtain the matrix \(\varvec{\Phi }\) in (7a)–(7b) as:

$$\begin{aligned} \varvec{\hat{\Phi }} = \underline{\hat{\varvec{\mathcal {O}}}_{i}^{\dag }} \overline{\hat{\varvec{\mathcal {O}}}_{i}}, \end{aligned}$$

where \(\underline{\hat{\varvec{\mathcal {O}}}_{i}}\) and \(\overline{\hat{\varvec{\mathcal {O}}}_{i}}\) result from removing the last and first m rows, respectively, from \(\hat{\varvec{\mathcal {O}}}_{i}\), and \(\textbf{A}^{\dag }\) denotes the Moore–Penrose pseudo-inverse of \(\textbf{A}\) (see, e.g., Golub and Van Loan 1996). An estimate of \(\textbf{H}\) is obtained from the first m rows of \(\hat{\varvec{\mathcal {O}}}_{i}\).

Last, from the SVD (18) we also get:

$$\begin{aligned} \hat{\varvec{\mathcal {L}}}_z = \textbf{S}_n^{1/2}\textbf{V}^{^{\intercal }}_n\textbf{W}_2^{-1}, \end{aligned}$$

and \(\hat{\textbf{K}}\) can be easily obtained using (12).Footnote 4
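Putting Eqs. (18)–(20) together, the shift invariance approach can be sketched as follows. The code builds an exact \(\varvec{\Theta }_i=\varvec{\mathcal {O}}_i\varvec{\mathcal {L}}_z\) from known matrices and recovers \(\varvec{\Phi }\) up to a similarity transformation; the function name and the test matrices are illustrative, not the paper's:

```python
import numpy as np

def shift_invariance_estimates(Theta, n, m, W1, W2):
    """Recover (Phi, H, Lz) from the weighted SVD (18) via eqs. (19)-(20)."""
    U, s, Vt = np.linalg.svd(W1 @ Theta @ W2)
    sqrtS = np.diag(np.sqrt(s[:n]))
    O_i = np.linalg.inv(W1) @ U[:, :n] @ sqrtS       # extended observability
    Lz = sqrtS @ Vt[:n] @ np.linalg.inv(W2)          # eq. (20)
    H = O_i[:m]                                      # first m rows of O_i
    Phi = np.linalg.pinv(O_i[:-m]) @ O_i[m:]         # eq. (19), shift invariance
    return Phi, H, Lz

# Check on an exact Theta_i = O_i Lz built from known system matrices
Phi0 = np.array([[0.8, 0.2], [-0.4, -0.5]])
K0 = np.array([[1.5, 0.0], [-0.2, -0.8]])
H0 = np.eye(2)
i, m, n = 4, 2, 2
O0 = np.vstack([H0 @ np.linalg.matrix_power(Phi0, r) for r in range(i)])
Ab = Phi0 - K0 @ H0
Lz0 = np.hstack([np.linalg.matrix_power(Ab, i - 1 - j) @ K0 for j in range(i)])
Phi_hat, H_hat, Lz_hat = shift_invariance_estimates(O0 @ Lz0, n, m,
                                                    np.eye(i * m), np.eye(i * m))
# Phi_hat equals Phi0 up to a similarity transformation: same eigenvalues
```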

2.3.2 State approach

This second approach is based on the estimation of the state sequence, see e.g., Larimore (1990); Van Overschee and De Moor (1994). These algorithms estimate the model parameters by solving a simple set of LS problems.

From \(\hat{\varvec{\mathcal {L}}}_z\) obtained in (20), we can compute the state sequence as:

$$\begin{aligned} \hat{\textbf{X}}_f = \hat{\varvec{\mathcal {L}}}_z \textbf{Z}_p, \end{aligned}$$

and similarly, \(\hat{\textbf{X}}_{f^+} = \hat{\varvec{\mathcal {L}}}_z \textbf{Z}_{p^+}\), where \(\textbf{Z}_{p^+}\) is like \(\textbf{Z}_p\), but adding a ‘\(+1\)’ to all the subscripts.Footnote 5 Once the state sequence is approximated, the estimates \((\hat{\textbf{H}},\varvec{\hat{\Phi }},\hat{\textbf{K}})\) are obtained as follows:

$$\begin{aligned} \varvec{\hat{H}} = \textbf{Z}_{f_{1}} \hat{\textbf{X}}_{f}^{^{\intercal }} {\big [ \hat{\textbf{X}}_{f} \hat{\textbf{X}}_{f}^{^{\intercal }} \big ]}^{-1}, \end{aligned}$$

where \(\textbf{Z}_{f_{1}}\) denotes the first block row (i.e., the first m rows) of \(\textbf{Z}_{f}\). Building on these estimates, we compute the residuals \(\varvec{\hat{A}}_{f_{1}} = \textbf{Z}_{f_{1}}-\hat{\textbf{H}}\hat{\textbf{X}}_f\) and, finally, get:

$$\begin{aligned} \begin{bmatrix} \varvec{\hat{\Phi }}&\varvec{\hat{K}} \end{bmatrix} = \hat{\textbf{X}}_{f^+} \begin{bmatrix} \hat{\textbf{X}}_{f}^{^{\intercal }}&\varvec{\hat{A}}_{f_{1}}^{^{\intercal }} \end{bmatrix} {\Bigg [ \begin{bmatrix} \hat{\textbf{X}}_{f} \\ \varvec{\hat{A}}_{f_{1}} \end{bmatrix} \begin{bmatrix} \hat{\textbf{X}}_{f}^{^{\intercal }}&\varvec{\hat{A}}_{f_{1}}^{^{\intercal }} \end{bmatrix} \Bigg ]}^{-1}. \end{aligned}$$
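Equations (22)–(23) are ordinary LS regressions once the states are approximated. A minimal sketch (our own naming; we use pseudo-inverses for numerical safety, and the noiseless sanity check is ours, not the paper's):

```python
import numpy as np

def state_approach(Zf1, Xf, Xf_plus):
    """State-approach LS steps: eq. (22) for H, residuals, then eq. (23)
    for (Phi, K). Zf1 is the first block row of Zf; Xf, Xf_plus come
    from eq. (21) and its one-period-ahead analogue."""
    H = Zf1 @ Xf.T @ np.linalg.pinv(Xf @ Xf.T)          # eq. (22)
    A1 = Zf1 - H @ Xf                                   # residuals
    R = np.vstack([Xf, A1])
    PhiK = Xf_plus @ R.T @ np.linalg.pinv(R @ R.T)      # eq. (23)
    n = Xf.shape[0]
    return H, PhiK[:, :n], PhiK[:, n:]

# Noiseless sanity check: with exact states, H and Phi are recovered exactly
rng = np.random.default_rng(3)
Phi0 = np.array([[0.5, 0.1], [0.0, 0.3]])
H0 = np.array([[1.0, 0.0], [0.2, 1.0]])
Xf = rng.standard_normal((2, 60))
H_hat, Phi_hat, K_hat = state_approach(H0 @ Xf, Xf, Phi0 @ Xf)
```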

2.3.3 Asymptotic properties

This section briefly discusses the asymptotic properties of subspace methods.

The estimates resulting from the main subspace methods are known to be strongly consistent and asymptotically normal under mild assumptions. Bauer (2005a) surveys the literature on this subject and gives conditions for consistency and asymptotic normality for many subspace methods. This is done for the system (7a7b) and also when observable inputs are included.

An effort has also been made to study the efficiency of these estimators, e.g., by comparing their asymptotic variances. In contrast with consistency and asymptotic normality, here we find some differences among the methods. Bauer and Jansson (2000) and Bauer and Ljung (2002) provide numerical approximations of the asymptotic variance of the estimators for several algorithms. The most relevant result in this line is presented by Bauer (2005b), who proves that the CVA weighting matrices (see Table 1), together with the state approach (see Sect. 2.3.2), lead to estimates that are asymptotically equivalent to (pseudo) MLE,Footnote 6 this being the only asymptotic efficiency result currently available in the literature.Footnote 7

The algorithm proposed in this paper, named SUBEST1 (see Sect. 3.2 for details), follows a two-stage approach with the weighting matrix \(\textbf{W}_2\) given in Table 1, so the theorem in Bauer (2005b) cannot be directly applied and we do not have a formal proof of its asymptotic efficiency. For this reason, we conduct a simulation in which the mean-square errors of SUBEST1 and ML (Casals et al. 1999) estimates are computed. Additionally, the results obtained in Deistler et al. (1995) for other subspace methods and the same data generating process (DGP) are included for comparison. The model (7a)–(7b) considered in the simulation is:

$$\begin{aligned} \varvec{\Phi }=\begin{bmatrix} 0.8&{}0.2\\ -\,0.4&{}-\,0.5 \end{bmatrix}, \quad \textbf{K}=\begin{bmatrix} 1.5&{}0\\ -\,0.2&{}-\,0.8 \end{bmatrix}, \quad \textbf{H}=\begin{bmatrix} 1&{}0\\ 0&{}1 \end{bmatrix}, \end{aligned}$$

with \(\textbf{a}_t \sim iid N(\textbf{0},\textbf{I})\). Following Deistler et al. (1995), the exercise is performed for \(T=2000,4000,8000,16000\) and \(R=500\) replications for each sample size, so that the results are comparable. We consider two measures for the quality of SUBEST1 estimates:

$$\begin{aligned} M(\text {SUBEST1}) = \frac{T}{R}\sum _{i=1}^R\left( \hat{\theta }^{\text {SUBEST1}}_i-\theta _i \right) ^2, \end{aligned}$$


$$\begin{aligned} D(\text {SUBEST1}) = \frac{T}{R}\sum _{i=1}^R\left( \hat{\theta }^{\text {SUBEST1}}_i-\hat{\theta }^{\text {ML}}_i \right) ^2 \end{aligned}$$

where \(\hat{\theta }^{\text {SUBEST1}}_i, \hat{\theta }^{\text {ML}}_i\) denote the SUBEST1 and ML estimates, while \(\theta _i\) denotes the true values given in (24). The quantity M(.) is T times the sample mean-squared error of the estimate. For asymptotically efficient methods, M(.) converges to the Cramer-Rao Bound (CRB) which is a lower bound for the asymptotic variance of the parameter estimates. On the other hand, D(.) is T times the mean-squared difference between the corresponding method and ML estimates. For an asymptotically efficient method, \(D(.) \rightarrow 0\) as \(T \rightarrow \infty \).
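For concreteness, both statistics are straightforward to compute from a set of replicated estimates; the helper names below are ours:

```python
import numpy as np

def M_stat(est, true_value, T):
    """T times the sample MSE of an estimator over R replications."""
    est = np.asarray(est, dtype=float)
    return T * np.mean((est - true_value) ** 2)

def D_stat(est, est_ml, T):
    """T times the mean-squared difference between a method and ML estimates."""
    est = np.asarray(est, dtype=float)
    est_ml = np.asarray(est_ml, dtype=float)
    return T * np.mean((est - est_ml) ** 2)
```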

The results of the simulation are summarized in Table 2 and Fig. 1. Table 2 shows the average of the M(.) and D(.) statistics over all the parameters in (24) for SUBEST1, the echelon subspace estimator (ECH) and CVA, computed for the same example by Deistler et al. (1995), together with the corresponding CRB. The results suggest that SUBEST1, contrary to ECH, yields estimates that are asymptotically as efficient as those obtained by ML or CVA; see Table 2. Figure 1 shows that M(SUBEST1) \(\rightarrow \) CRB and D(SUBEST1) \(\rightarrow 0\) as \(T \rightarrow \infty \). This conclusion was confirmed by many other simulations (see the online Appendix at https://doi.org/10.100/s00362-023-01451-4), and no counterexamples were found. This result supports the use of SUBEST1 in the modeling method presented in the rest of the paper.

Fig. 1 Results of the simulation of model (24). Values for ECH and CVA come from Peternell (1995); see Table 2 for more detail

Table 2 Simulation results for model (24)

2.4 Relationships between SS and VARMA\(_E\) representations

As with VARMA models, the IF of a given dynamic system is not unique. Note that, for any nonsingular matrix \(\textbf{T}\), applying the transformation \(\textbf{x}_t^*=\textbf{T}^{-1}\textbf{x}_t\), \(\varvec{\Phi }^*=\textbf{T}^{-1}\varvec{\Phi }\textbf{T}\), \(\textbf{K}^*=\textbf{T}^{-1}\textbf{K}\), \(\textbf{H}^*=\textbf{H}\textbf{T}\) to any IF yields an alternative output-equivalent representation.
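This output equivalence is easy to verify numerically: any similarity transformation leaves the impulse responses \(\textbf{H}\varvec{\Phi }^{j-1}\textbf{K}\) unchanged. A small check with an illustrative system and an arbitrary nonsingular \(\textbf{T}\) of our choosing:

```python
import numpy as np

# Illustrative IF matrices (any stable example works)
Phi = np.array([[0.8, 0.2], [-0.4, -0.5]])
K = np.array([[1.5, 0.0], [-0.2, -0.8]])
H = np.eye(2)

# An arbitrary nonsingular change of state coordinates T
T_mat = np.array([[2.0, 1.0], [0.0, 1.0]])
Ti = np.linalg.inv(T_mat)
Phi_s, K_s, H_s = Ti @ Phi @ T_mat, Ti @ K, H @ T_mat

# Both representations share the impulse responses H Phi^{j-1} K
for j in range(1, 6):
    assert np.allclose(H @ np.linalg.matrix_power(Phi, j - 1) @ K,
                       H_s @ np.linalg.matrix_power(Phi_s, j - 1) @ K_s)
```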

A canonical IF representation is characterized by two elements: a specific structure of the transition matrix \(\varvec{\Phi }\) and a unique transformation matrix \(\textbf{T}\). The main property of canonical representations is that they realize the system output as a function of a unique parameter set and, therefore, are exactly identified.Footnote 8

Hannan and Deistler (1988) show the equivalence between the VARMA\(_E\) and the echelon IF by analyzing the structure of linear dependence relations between the rows of the Hankel matrix \(\mathcal {H}=\varvec{\mathcal {O}}\varvec{\mathcal {C}}\), where \(\varvec{\mathcal {O}}\) is the infinite version of (13) and \(\varvec{\mathcal {C}}:=[\textbf{K},\varvec{\Phi }\textbf{K},\varvec{\Phi }^2\textbf{K},...]\); see Hannan and Deistler (1988), Theorems 2.5.1 and 2.5.2.

In this paper we use the SS Luenberger Canonical Form (LCF).Footnote 9 This representation is convenient because the equivalent VARMA\(_E\) representation is easy to obtain from it. In particular, Proposition 2 in Casals et al. (2012) proves that if the coefficients of an IF process are known, then: (1) the KI vector can be derived from the observability matrix (13), and (2) the VARMA\(_E\) representation can be obtained from a change of coordinates that transforms any IF into its corresponding LCF. We will use this result in Sect. 3.2.

3 Modeling stationary datasets

This section details the main steps of our modeling strategy, which consists of: (i) determining the system order, (ii) estimating the KIs compatible with the previous decision, and (iii) refining the resulting model.

3.1 Step-1: Estimating the system order and a preliminary model

In our approach, the first step consists of determining the minimum dynamic order (or McMillan degree) required to realize \(\textbf{z}_t\).

The following remarks and examples may help to clarify this concept:

  1. A model with \(n=0\) would be static, as its output could be realized with zero dynamic components;

  2. ...when \(n=3\) the observable time series can be realized as a linear combination of three first-order dynamic components;

  3. ...the order of a standard VARMA(p,q) process is \(m \times \max \{\,p,q\}\); and

  4. ...the output of an n-order system can also be realized by systems of order \(n+1\), \(n+2\), etc. Hence, a unique characterization of the McMillan degree requires the dimension of the system to be minimal (see assumption A2).
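The McMillan degree can be characterized as the rank of the block-Hankel matrix of impulse responses. The sketch below computes this rank for the VARMA(1,1) model (5) of Example 1 and recovers \(n=1\), consistent with \(\alpha =(1,0)\):

```python
import numpy as np

# AR and MA coefficients of the standard VARMA(1,1) in (5)
F1b = np.array([[-0.50, 0.00], [0.25, 0.00]])
L1b = np.array([[-0.30, 0.20], [0.15, -0.10]])

# Impulse responses: Psi_1 = L1b - F1b and Psi_j = -F1b Psi_{j-1} for j >= 2
Psi = [L1b - F1b]
for _ in range(4):
    Psi.append(-F1b @ Psi[-1])

# The rank of the block-Hankel matrix of impulse responses is the McMillan degree
Hk = np.block([[Psi[0], Psi[1], Psi[2]],
               [Psi[1], Psi[2], Psi[3]],
               [Psi[2], Psi[3], Psi[4]]])
```

The computed rank is 1 even though the standard VARMA(1,1) representation suggests \(m \times \max \{p,q\} = 2\), which is precisely the redundancy the echelon form removes.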

The literature typically determines n by estimating a sequence of SS models over a predetermined search space for the McMillan degree, \(n = 0, 1, 2...\) These models can be efficiently estimated using subspace techniques, see Favoreel et al. (2000); Bauer (2005a), and compared with each other by means of: (a) LR tests, see Tsay (1989a); Tiao and Tsay (1989), (b) information criteria, see Akaike (1974); Schwarz (1978); Hannan and Quinn (1979), or (c) canonical correlation criteria, see Bauer (2001); Garcia-Hiernaux et al. (2012).

We determine the McMillan degree automatically by following the modal strategy suggested by Garcia-Hiernaux et al. (2012), which consists in determining n as the modal value of the orders chosen by a selection of the methods mentioned above. We refer to this procedure as NID (N-IDentification) and the following example illustrates its application.
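Operationally, the modal choice is just a mode over the per-criterion orders. The sketch below (Python; the tie-break rule is our assumption, not part of the published NID definition) illustrates it with orders like those reported in Table 3:

```python
from collections import Counter

def nid_modal_order(chosen_orders):
    """Pick the system order as the modal value of the orders chosen by
    the individual criteria (AIC, SBC, HQ, SVC, NIDC, LR test, ...).
    Ties are broken in favor of the smaller order; this tie-break rule
    is an assumption, not part of the published NID definition."""
    counts = Counter(chosen_orders)
    top = max(counts.values())
    return min(k for k, v in counts.items() if v == top)

# Orders as in Table 3 of the text: only AIC overestimates n.
print(nid_modal_order([2, 1, 1, 1, 1, 1]))  # prints 1
```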

3.1.1 Example 2: Estimating the Step-1 model

Consider again the VARMA(1,1) model (5) used in Example 1. We simulate 100 observations of this process and apply to them the previously mentioned NID algorithm, obtaining the results displayed in Table 3, where the acronyms AIC, SBC and HQ denote the Akaike (1974), Schwarz (1978) and Hannan and Quinn (1979) information criteria, respectively; \(SVC_{\Omega _2}\) and NIDC denote the canonical correlation-based criteria proposed by Garcia-Hiernaux et al. (2012); and \(\chi ^2\)(5%) denotes the p-value of the LR test for the significance of the canonical correlations proposed by Tsay (1989a). The 5% value in parentheses indicates that the automatic test rejects the null if the p-value is smaller than 5%. Finally, the orders automatically chosen by each criterion are displayed in the last row of Table 3. Note that the correct order (\(n=1\)) was the modal choice, and the only criterion that overestimates n is the AIC.

Table 3 Output from NID when applied to \(\textbf{z}_t\) in model (5)

After determining the system order, it is easy to apply any of the subspace methods described in Sect. 2.3 to estimate the parameters of an n-order IF (7a)–(7b). In this case, we use a shift invariance approach with \(\textbf{W}_1=(\textbf{Z}_f \varvec{\Pi }^{\perp }_{\textbf{Z}_p} \textbf{Z}_f^{^{\intercal }})^{-\frac{1}{2}}\) and \(i=5\).Footnote 10 We will call this a “Step-1 model”, and the estimated parameter matrices for this sample are:

$$\begin{aligned} \varvec{\hat{\Phi }}= & {} 0.519, \quad \varvec{\hat{K}}=\begin{bmatrix} 0.251&0.258\end{bmatrix}, \quad \varvec{\hat{H}}=\begin{bmatrix}0.640\\ -0.292\end{bmatrix},\\ \varvec{\hat{\Sigma }}_a= & {} \begin{bmatrix}0.924 &{}0.376\\ 0.376 &{} 1.001 \end{bmatrix}. \end{aligned}$$
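As an illustration, the estimated Step-1 model can be simulated directly. The following minimal sketch (Python with NumPy; the coefficient values come from the display above, everything else is illustrative) generates a sample assuming the usual innovations-form recursion \(\textbf{x}_{t+1}=\varvec{\Phi }\textbf{x}_t+\textbf{K}\textbf{a}_t\), \(\textbf{z}_t=\textbf{H}\textbf{x}_t+\textbf{a}_t\):

```python
import numpy as np

# Estimated Step-1 innovations form (values from the display above).
Phi = np.array([[0.519]])
K   = np.array([[0.251, 0.258]])
H   = np.array([[0.640], [-0.292]])
Sigma_a = np.array([[0.924, 0.376], [0.376, 1.001]])

def simulate_if(T, rng):
    """Simulate z_t = H x_t + a_t, x_{t+1} = Phi x_t + K a_t."""
    L = np.linalg.cholesky(Sigma_a)   # draw a_t with covariance Sigma_a
    x = np.zeros((1,))
    z = np.empty((T, 2))
    for t in range(T):
        a = L @ rng.standard_normal(2)
        z[t] = (H @ x) + a
        x = Phi @ x + K @ a
    return z

z = simulate_if(100, np.random.default_rng(0))
print(z.shape)  # prints (100, 2)
```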

Chapter 9 in Casals et al. (2016) describes a simple numerical procedure to rewrite an IF as an equivalent balanced VARMA(n,n) model.Footnote 11 Applying this procedure to the previous IF returns the standard VARMA(1,1) model:

$$\begin{aligned} \varvec{\bar{F}}_0=\varvec{\bar{L}}_0=\textbf{I}, \quad \varvec{\hat{\bar{F}}}_1=\begin{bmatrix} -0.519&{}0\\ 0.237&{}0\end{bmatrix}, \quad \varvec{\hat{\bar{L}}}_1=\begin{bmatrix} -0.236&{}0.165\\ 0.120&{}-0.075\end{bmatrix}, \end{aligned}$$

whose values are very close to the true coefficients in (5).

In the univariate case, the Step-1 model is a canonical representation which can only be refined by pruning non-significant parameters, see Sect. 3.3. In the multivariate case, assuring a canonical representation requires further refinement. We discuss this issue in the next section. However, even if this Step-1 model may be somewhat overparameterized, it might be sufficient to achieve the objectives of the model-building exercise. In that case, the modeling process could stop here.

3.2 Step-2: Estimating the KIs and their corresponding canonical model

We already mentioned that IF and VARMA representations are interchangeable. In particular, Casals et al. (2012) show how to obtain the VARMA\(_E\) representation of any IF when the matrices of the system and the KI vector, \(\alpha \), are known.

However, the equivalence issue remains unsolved when the IF coefficients are not known, but estimated from a finite sample. In that case, the estimated observability matrix does not present the linear dependence structure implied by \(\alpha \), so the similarity transformation proposed in Casals et al. (2012) does not lead to the LCF and, hence, to the equivalent VARMA\(_E\).

To solve this issue, we devised SUBEST1, a novel subspace algorithm that estimates the LCF model corresponding to any given KI vector \(\alpha \). The estimated parameters resulting from SUBEST1 coincide with those in a VARMA\(_E\) form. The algorithm follows a two-stage procedure. Initial estimates of the system matrices (using a shift invariance approach, see Sect. 2.3.1) are used to obtain a transformation matrix \(\varvec{\hat{T}}\). We then use this matrix to transform the states and finally apply a state approach (see Sect. 2.3.2) to estimate the echelon form. As far as we know, this procedure is new in the literature and is the main contribution of the paper.

To describe the steps of SUBEST1, we will use the subscript notation {i:j}, meaning that any matrix \(\textbf{A}_{i:j}\) is the partition of \(\textbf{A}\) that includes the rows from i to j.


Step-1. Assume that n is known or consistently estimated using the procedures presented in Sect. 3.1. Estimate \(\varvec{\Phi }\), \(\textbf{K}\) and \(\textbf{H}\) using the subspace methods presented in Sect. 2.3.1 and the system order n. We denote these estimates by \(\hat{\varvec{\Phi }}\), \(\hat{\textbf{K}}\) and \(\hat{\textbf{H}}\).


Step-2. For a given KI vector \(\alpha \), apply Proposition 2 in Casals et al. (2012) to obtain the transformation matrix \(\textbf{T}\) such that \(\varvec{\Phi }^*=\textbf{T}^{-1}\varvec{\Phi }\textbf{T}\) and \(\textbf{H}^*=\textbf{H}\textbf{T}\) present the LCF structure when \(\alpha \), \(\varvec{\Phi }\) and \(\textbf{H}\) are known. Note that, even if the \(\alpha \) imposed as a constraint is the true KI, this does not lead to the ones/zeros structure of the LCF, as \(\textbf{T}\) is computed with \(\varvec{\hat{H}}\) and \(\varvec{\hat{\Phi }}\), and not with the true (unknown) matrices of the system.


Step-3. Obtain the states \(\textbf{X}_f\) and \(\textbf{X}_{f^+}\) from the model in Step-1, as explained in Sect. 2.3.2, and denote these estimates by \(\hat{\textbf{X}}_f\) and \(\hat{\textbf{X}}_{f^+}\), respectively. Then compute the sequences of states, \(\textbf{X}^*_{f}\) and \(\textbf{X}^*_{f^+}\), that correspond to the LCF by applying the similarity transformation \(\textbf{T}\) as \(\hat{\textbf{X}}^*_{f}=\textbf{T}^{-1}\hat{\textbf{X}}_{f}\) and \(\hat{\textbf{X}}^*_{f^+}=\textbf{T}^{-1}\hat{\textbf{X}}_{f^+}\).


Step-4. The LCF is defined by the matrices \(\varvec{\Phi }^*\) and \(\textbf{H}^*\) as:

$$\begin{aligned} \varvec{\Phi }^*=\begin{bmatrix} \textbf{F}_1 &{} \textbf{Q}_1 &{} \textbf{0} &{}\ldots &{} \textbf{0}\\ \textbf{F}_2 &{} \textbf{0} &{} \textbf{Q}_2 &{}\ldots &{} \textbf{0}\\ \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots \\ \textbf{F}_{n_{\max }-1} &{} \textbf{0} &{} \textbf{0} &{}\ldots &{} \textbf{Q}_{n_{\max }-1} \\ \textbf{F}_{n_{\max }} &{} \textbf{0} &{} \textbf{0} &{}\ldots &{} \textbf{0} \end{bmatrix} \quad \text {and} \quad \textbf{H}^*=\big [\textbf{F}_{0} \quad \textbf{0} \quad \textbf{0} \big ], \end{aligned}$$

where \(\varvec{\Phi }^*\) is a companion matrix such that each \(\textbf{F}_j\) block \((j=1,...,n_{\max })\) has a number of rows equal to the number of KIs greater than or equal to j, \(\bar{m}=\sum _{k=1}^m \min \{n_k,1\}\) columns and some null elements. In fact, the (k,l)-th element of \(\textbf{F}_{j}\) will be nonzero only if \(j \in \big [n_k-n_{kl}+1, n_k\big ]\), where \(n_{kl}\) was defined in (4). Each \(\textbf{Q}_k\) block is a zeros/ones matrix, with as many columns as the number of KIs greater than or equal to \(k+1\). If the endogenous variables are sorted according to their corresponding KI, the structure of \(\textbf{Q}_k\) will be \(\textbf{Q}_k=\big [\textbf{I}_{k+1} \quad \textbf{0} \big ]^{^{\intercal }}\), where \(\textbf{I}_{k+1}\) is an identity matrix with the same number of rows as \(\textbf{F}_{k+1}\).

Regarding \(\textbf{H}^*\), \(\textbf{F}_{0}\) is an \(m \times \bar{m}\) matrix, such that the rows corresponding to components with a nonzero KI can be organized in an \(\bar{m} \times \bar{m}\) lower triangular matrix with ones in the main diagonal.

Therefore, \(\varvec{\Phi }^*\) and \(\textbf{H}^*\) can be written as:

$$\begin{aligned} \varvec{\Phi }^*=\big [\varvec{\Phi }^*_{1:\bar{m}}~\varvec{\Phi }^*_{\bar{m}+1:n}\big ], \quad \textbf{H}^*=\begin{bmatrix} \bar{\textbf{H}}^*_1+\textbf{I} &{} \textbf{0} \\ \textbf{H}^*_2 &{} \textbf{0} \end{bmatrix} \end{aligned}$$

where \(\bar{\textbf{H}}^*_1=\textbf{H}^*_1 - \textbf{I}\) and matrices \(\varvec{\Phi }^*_{1:\bar{m}}\), \(\bar{\textbf{H}}^*_1\) and \(\textbf{H}^*_2\) include parameters to be estimated and zeros. On the other hand, the matrix \(\varvec{\Phi }^*_{\bar{m}+1:n}\) is invariant and is only made up of ones and zeros. Therefore, estimating \(\varvec{\Phi }^*\) and \(\textbf{H}^*\) boils down to estimating some elements in their first \(\bar{m}\) columns, i.e., some elements in \(\varvec{\Phi }^*_{1:\bar{m}}\), \(\bar{\textbf{H}}^*_1\) and \(\textbf{H}^*_2\).
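To make the layout concrete, the following sketch (Python with NumPy; our own illustration, not library code) builds the skeleton of \(\varvec{\Phi }^*\) for a given KI vector, marking free parameters with NaN. It assumes the variables are sorted by decreasing KI and, for brevity, omits the finer within-block zero pattern driven by the \(n_{kl}\) indices:

```python
import numpy as np

def lcf_skeleton(alpha):
    """Skeleton of the LCF transition matrix for a KI vector alpha.
    Free parameters are NaN; the Q_k blocks are exact ones/zeros.
    Simplification: the within-block zeros of the F_j blocks (driven
    by the n_kl indices) are not imposed here."""
    alpha = sorted(alpha, reverse=True)
    n, n_max = sum(alpha), max(alpha)
    # rows of each F_j block: number of KIs >= j
    r = [sum(1 for a in alpha if a >= j) for j in range(1, n_max + 1)]
    mbar = r[0]                      # number of nonzero KIs
    Phi = np.zeros((n, n))
    top = 0
    for j in range(n_max):
        Phi[top:top + r[j], :mbar] = np.nan          # F_{j+1} block
        if j < n_max - 1:                            # Q_{j+1} = [I 0]^T
            col = mbar + sum(r[1:j + 1])
            Phi[top:top + r[j + 1], col:col + r[j + 1]] = np.eye(r[j + 1])
        top += r[j]
    return Phi

Phi = lcf_skeleton((2, 1, 1))   # n = 4, mbar = 3
print(int(np.isnan(Phi).sum()))  # prints 12 free entries in the F blocks
```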


Step-5. Building on the estimates of the states obtained in Step-3, and the LCF structure of matrices \(\varvec{\Phi }^*\) and \(\textbf{H}^*\) determined in Step-4, estimate the LCF corresponding to the given KI vector \(\alpha \). To this end, formulate the SS model:

$$\begin{aligned} \textbf{X}^*_{f^+}&=\big [\varvec{\Phi }^*_{1:\bar{m}}~\varvec{\Phi }^*_{\bar{m}+1:n}\big ] \begin{bmatrix} \textbf{X}^*_{f,1:\bar{m}}\\ \textbf{X}^*_{f,\bar{m}+1:n} \end{bmatrix}+\textbf{K}^*\textbf{A}_{f_1}, \end{aligned}$$
$$\begin{aligned} \textbf{Z}_{f_1}&=\begin{bmatrix} \bar{\textbf{H}}^*_1+\textbf{I} &{} \textbf{0} \\ \textbf{H}^*_2 &{} \textbf{0} \end{bmatrix} \begin{bmatrix} \textbf{X}^*_{f,1:\bar{m}}\\ \textbf{X}^*_{f,\bar{m}+1:n} \end{bmatrix} + \textbf{A}_{f_1}, \end{aligned}$$

where \(\textbf{Z}_{f_1}\) and \(\textbf{A}_{f_1}\) are defined in Eq. (23). Rearranging terms in (29a)–(29b) yields:

$$\begin{aligned} \tilde{\textbf{X}}^*_{f^+}&=\varvec{\Phi }^*_{1:\bar{m}}\textbf{X}^*_{f,1:\bar{m}}+\textbf{K}^*\textbf{A}_{f_1} \end{aligned}$$
$$\begin{aligned} \tilde{\textbf{Z}}_{f_1}&=\begin{bmatrix} \bar{\textbf{H}}^*_1 \\ \textbf{H}^*_2 \end{bmatrix}\textbf{X}^*_{f,1:\bar{m}}+\textbf{A}_{f_1}, \end{aligned}$$

where \(\tilde{\textbf{X}}^*_{f^+} = \textbf{X}^*_{f^+} - \varvec{\Phi }^*_{\bar{m}+1:n}\textbf{X}^*_{f,\bar{m}+1:n}\) and \(\tilde{\textbf{Z}}_{f_1} = \textbf{Z}_{f_1} - [ \textbf{I} \quad \textbf{0} ]^{^{\intercal }} \textbf{X}^*_{f,1:\bar{m}}\). Notice that estimating \(\varvec{\Phi }^*_{1:\bar{m}}\), \(\bar{\textbf{H}}^*_1\), \(\textbf{H}^*_2\) and \(\textbf{K}^*\) comes down to applying Eqs. (22)–(23), but substituting \(\textbf{Z}_{f_1}\), \(\hat{\textbf{X}}_{f^+}\) and \(\hat{\textbf{X}}_{f}\) with, respectively, \(\tilde{\textbf{Z}}_{f_1}\), \(\hat{\textbf{X}}^*_{f^+}\) and \(\hat{\textbf{X}}^*_{f}\), where the last two matrices were obtained in Step-3.

Therefore, the estimates of \(\varvec{\Phi }^*\) and \(\textbf{H}^*\) given in (28), obtained as the LS solution of (30a)–(30b), keep the LCF structure.Footnote 12
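The LS step just described amounts to two linear regressions on the first \(\bar{m}\) rows of the transformed state sequence. A minimal sketch (Python with NumPy; array names are ours, columns index time as in the text, and the estimation of \(\textbf{K}^*\) from the residuals is omitted):

```python
import numpy as np

def ls_step(Xtil_fp, Ztil_f1, X_f_top):
    """LS solution of (30a)-(30b): regress the adjusted future states
    and outputs on the first mbar rows of the transformed states."""
    def ls(Y, X):
        # Solve B such that B @ X approximates Y in the LS sense.
        return Y @ X.T @ np.linalg.pinv(X @ X.T)
    Phi_1m = ls(Xtil_fp, X_f_top)     # estimates Phi*_{1:mbar}
    H_blocks = ls(Ztil_f1, X_f_top)   # stacks Hbar*_1 and H*_2
    return Phi_1m, H_blocks

# Toy check with noiseless data (n = 2, mbar = 1, N = 50):
rng = np.random.default_rng(1)
X = rng.standard_normal((1, 50))
true_phi = np.array([[0.5], [0.2]])
Phi_hat, _ = ls_step(true_phi @ X, np.zeros((2, 50)), X)
assert np.allclose(Phi_hat, true_phi)
```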

In practice, \(\alpha \) is not known. However, it can be empirically determined bearing in mind that \(|\alpha |=n\), and that models with different \(\alpha \) are not nested in general, see Example 1, Remarks 3 and 5. These remarks motivate our proposal, which consists in estimating the models corresponding to all the combinations of KIs which add up to the system order, already specified in the previous phase, and choosing the model that optimizes a given information criterion. This idea can be formalized by the following minimization problem:

$$\begin{aligned} \min _{(n_1,\ldots ,n_m)} \quad&\log |\hat{\varvec{\Sigma }}(n_1,\ldots ,n_m) |+C(T)d(n_1,\ldots ,n_m) \nonumber \\ \text {s.t.} \qquad&|\alpha |=\hat{n}, \end{aligned}$$

where \(\hat{\varvec{\Sigma }} (n_1,\ldots ,n_m)\) is an estimate for the error covariance matrix, C(T) is a penalty function that determines which information criterion is to be optimized and \(d(n_1,\ldots ,n_m)\) is the number of parameters of the corresponding echelon form. The usual alternatives are considered for the penalty term C(T), i.e., 2/T (AIC), \(\log (T)/T\) (SBC), and \(2\log (\log (T))/T\) (HQ).
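For reference, the objective function in (31) with these penalties can be coded directly; the sketch below (Python with NumPy; the function name is ours) evaluates one candidate model:

```python
import numpy as np

def ic_value(Sigma_hat, d, T, criterion="SBC"):
    """Information criterion in (31): log|Sigma_hat| + C(T) * d,
    with the penalty functions listed in the text."""
    C = {"AIC": 2 / T,
         "SBC": np.log(T) / T,
         "HQ": 2 * np.log(np.log(T)) / T}[criterion]
    return np.log(np.linalg.det(Sigma_hat)) + C * d

# Identity error covariance, 5 parameters, T = 100 observations:
print(round(ic_value(np.eye(2), 5, 100, "AIC"), 3))  # prints 0.1
```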

Some authors have previously developed procedures to estimate \(\alpha \) and its associated echelon form. For instance, Hannan and Rissanen (1982); Hannan and Kavalieris (1984a); Tsay (1989b); Poskitt (1992, 2016); Nsiri and Roy (1992); Lütkepohl (2005), among others, deal with the problem using a VARMA representation. To the best of our knowledge, only Peternell (1995); Peternell et al. (1996) propose subspace methods for this purpose. However, most of these methods are not statistically efficient in general, see Deistler et al. (1995), and some are computationally expensive or difficult to automate (Nsiri and Roy 1992). If SUBEST1 is asymptotically as efficient as MLE, as we conjectured in Sect. 2.3.3 based on simulation results, this would make it an attractive procedure, combining computational simplicity with statistical optimality.

On this basis, the methodology proposed here returns both the KIs resulting from the optimization problem (31) and the estimates of the associated LCF. It is structured in three steps:

  1. Determine the system order, n, by means of the NID procedure.

  2. Compute all the KI vectors which add up to n, and estimate their corresponding canonical models using SUBEST1.

  3. Select the KIs associated with the model with the best IC values.
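Step 2 requires enumerating every KI vector compatible with the estimated order. A minimal sketch (Python; the function name is ours):

```python
from itertools import product

def ki_candidates(n, m):
    """All KI vectors (n_1, ..., n_m) of nonnegative integers that add
    up to the system order n: the feasible set of problem (31)."""
    return [v for v in product(range(n + 1), repeat=m) if sum(v) == n]

# For n = 2 and m = 2 there are three candidate models to estimate:
print(ki_candidates(2, 2))  # prints [(0, 2), (1, 1), (2, 0)]
```

Each candidate would then be estimated with SUBEST1 and the IC-minimizing model retained; the feasible set has \(\binom{n+m-1}{m-1}\) elements, far fewer than an unconstrained search over all KI vectors.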

To illustrate the performance of the algorithm, we simulated 5000 replications of a trivariate system with \(\alpha =(2,1,1)\), which is treated as unknown. Table 4 shows the relative frequency with which each \(\alpha \) was estimated. Note that the procedure performs well even for relatively small samples and, as expected, its performance improves as T increases. Moreover, the relative number of cases in which the system order is misspecified (last three \(\alpha \) vectors in Table 4) is very small, so the remarkable reduction in the number of models to be estimated by constraining \(|\alpha |=n\) does not have a significant cost although, obviously, this will depend on the DGP. Additionally, Table 5 presents the DGP and the average estimates of its parameters for different sample sizes.

Table 4 KI estimates using the algorithm in Sect. 3.2
Table 5 Estimate of an echelon form using the algorithm in Sect. 3.2

3.3 Step-3: Refining the Step-2 model

The third and final step consists in refining the canonical model obtained in Step-1, in the univariate case, or in Step-2, in the multivariate case, by different procedures such as:

  1. Applying NID to the model residuals. If the model is statistically adequate, the order identified should be \(n=0\). A higher McMillan degree would imply that the residuals have a dynamic structure, i.e., they are autocorrelated, so the model should be re-specified.

  2. Estimating the models by Gaussian ML for improved efficiency, if the normality assumption holds.

  3. Excluding non-significant parameters to achieve further parsimony. Estimation and testing in this phase can be done by ML methods or iterative subspace-based techniques (see Garcia-Hiernaux et al. 2009).

Note that Step-3 seeks parsimony. While this is a desirable feature, it is somewhat balanced by the complex procedures and modeling decisions required.

We illustrate Steps 2 and 3 with the same data used in Example 2.

3.3.1 Example 3: Estimating and refining a canonical representation

We applied the procedure described in Sect. 3.2 to determine \(\alpha \) in the simulated data already used in Example 2, with the constraint that the system order is \(n=1\). Again we set \(i=\log (T)=5\) in SUBEST1, which returns an SBC value of 5.90 and 6.02 for \(\alpha =(1,0)\) and \(\alpha =(0,1)\), respectively. Thus, the final model with \(\alpha =(1,0)\), written in VARMA\(_E\) form and estimated by Gaussian ML, is:

$$\begin{aligned} \varvec{\hat{F}}_0&=\varvec{\hat{L}}_0=\begin{bmatrix} \underset{(-)}{1.0} &{} \underset{(-)}{0}\\ \underset{(.153)}{0.464} &{} \underset{(-)}{1.0}\end{bmatrix}, \quad \varvec{\hat{F}}_1=\begin{bmatrix} \underset{(.111)}{-0.510} &{} \underset{(-)}{0}\\ \underset{(-)}{0} &{} \underset{(-)}{0}\end{bmatrix}, \quad \varvec{\hat{L}}_1=\begin{bmatrix} \underset{(.112)}{-0.272} &{} \underset{(.074)}{0.170}\\ \underset{(-)}{0} &{} \underset{(-)}{0}\end{bmatrix},\\ \varvec{\hat{\Sigma }}_a&=\begin{bmatrix} 0.921 &{} 0.377\\ 0.377 &{} 0.998\end{bmatrix}, \end{aligned}$$

very similar to (6). The standard errors are in parentheses. The residual diagnostics did not reveal any symptom of misspecification.

4 Nonstationary and seasonal processes

We will now discuss the extension of the previous analysis to nonstationary and seasonal variables. Instead of directly modeling the nonstationary process (as in Bauer 2005c), we follow the traditional strategy of identifying the integration order of each variable (or the cointegration rank and vectors) and then applying the corresponding transformation to induce stationarity.

4.1 Nonstationarity

Assume now that the system output, \(\textbf{y}_t \in {{\mathbb {R}}^m}\), is nonstationary. The methods described in Sect. 3 require stationarity, so we will first proceed to determine how to transform \(\textbf{y}_t\) into a stationary vector \(\textbf{z}_t\).

The literature provides many tools to this end, ranging from graphical assessment (Box et al. 2015), to univariate unit root testing (Dickey and Fuller 1981; Kwiatkowski et al. 1992), and cointegration testing (Johansen 1991). Subspace approaches have also been applied to determine the stationary transformation. In this case, the methods build on analyzing the canonical correlation coefficients (\(\sigma _j\), \(j=1,\ldots ,i\)) between the past and future information subspaces (\(\textbf{Z}_p\) and \(\textbf{Z}_f\), respectively, see Sect. 2.3). Bauer and Wagner (2002) and Bauer (2005c) extend the stationary theory to analyze I(1) and cointegrated variables within this framework. In a later work, Garcia-Hiernaux et al. (2010) discuss nonstationarity and cointegration analysis in the same line. The main idea in these papers is that the estimated \(\sigma _j\) corresponding to unit roots converge to their true values (unity) much faster than the other canonical correlation coefficients. This property, known as superconsistency, allows one to distinguish both kinds of correlations.

Here we use the latter method because it is simple and can be implemented within the aforementioned NID procedure. Specifically, Garcia-Hiernaux et al. (2010) propose the following unit root criterion (URC) to identify the \(\sigma _j\) corresponding to unit roots:

$$\begin{aligned} URC=1-\hat{\sigma }^2_j-G(T,i,d) \le 0. \end{aligned}$$

This criterion concludes that \(\sigma _j=1\) when the distance between \(\hat{\sigma }_j^2\) and unity is smaller than (or equal to) the penalty function, G(T, i, d), which depends on the sample size, T, the row dimension of \(\textbf{Z}_p\), i, and the number of unit roots we wish to evaluate, d. This threshold is computed through Monte Carlo simulations.
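The decision rule in (32) is straightforward to code. The sketch below (Python) takes the penalty as an input, since the actual \(G(T,i,d)\) values are tabulated by Monte Carlo in the original reference; the value 0.05 used here is purely illustrative:

```python
def urc_unit_root(sigma2_hat, G):
    """Unit root criterion (32): decide sigma_j = 1 when
    1 - sigma_j^2 - G(T, i, d) <= 0. The penalty G must be supplied
    (tabulated by Monte Carlo in the original reference)."""
    return 1.0 - sigma2_hat - G <= 0.0

# With a squared canonical correlation of .9925^2 and an illustrative
# penalty (NOT a tabulated G value), the criterion flags a unit root:
print(urc_unit_root(0.9925 ** 2, G=0.05))  # prints True
```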

In the univariate case, \(m=1\), we apply the URC to determine d, so that \(z_t=\nabla ^d y_t\) is stationary. We could then proceed as described in Sect. 3 with \(z_t\).

In the multivariate case, \(m>1\), the procedure becomes more complex as potential cointegrating relations come into play. In particular, when \(\textbf{y}_t\) is made up of m I(1) processes, the cointegrating rank is \(c = m-d\), where d is the number of unit roots of the multivariate process. On this basis, we determine the cointegrating rank using the following algorithm:

  1. Check that the series are I(1), applying the URC to every single series.

  2. Estimate the number of unit roots of the multivariate process, \(\hat{d}\), applying the URC to the multivariate process.

  3. Calculate the cointegrating rank as the difference between the system dimension and the number of unit roots estimated in the previous step, i.e., \(\hat{c} = m - \hat{d}\).
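The three steps above can be sketched as follows (Python; the unit root counts are assumed to come from a previous URC analysis, and the function name is ours):

```python
def cointegrating_rank(unit_roots_by_series, d_multivariate):
    """Steps 1-3: if every series is I(1), the cointegrating rank is
    c = m - d, with d the number of unit roots of the joint process."""
    if any(u != 1 for u in unit_roots_by_series):
        raise ValueError("the procedure assumes all series are I(1)")
    m = len(unit_roots_by_series)
    return m - d_multivariate

# Flour price example of Sect. 6.1: three I(1) series and three joint
# unit roots, hence no cointegration.
print(cointegrating_rank([1, 1, 1], 3))  # prints 0
```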

When the procedure above suggests a cointegrating relation (\(\hat{c}>0\)), Garcia-Hiernaux et al. (2010) show how to estimate the cointegrating matrix of \(\textbf{y}_t\). This matrix represents the linear combinations of \(\textbf{y}_t\) which are stationary and, after applying this transformation, we proceed as described in Sect. 3.

4.2 Seasonality

The methodology in Sect. 3 can be extended to model the multiplicative seasonal structure of a process by using systems in cascade; see Chapter 7 in Casals et al. (2016). However, as in the previous section, the adequate transformation to induce stationarity must be decided first. We do not consider seasonal cointegration here, so we will address this problem in the univariate case.

Assume that a given series, \(y_t\), has a seasonal dynamic structure of period s, meaning that there are s observations per seasonal cycle. Following Box et al. (2015), we denote the regular difference operator by \(\nabla =1-B\), and the seasonal difference operator by \(\nabla _s=1-B^s\). In seasonal processes, we would first compute the seasonal sum of \(y_t\), defined as:

$$\begin{aligned} w_t=(1+B+B^2+\cdots +B^{s-1})y_t, \end{aligned}$$

and then apply NID to detect unit roots at the zero frequency in \(w_t\), \(\nabla w_t\), \(\nabla ^2 w_t\)... If a single unit root in \(w_t\) is detected, the stationary transformation would be the seasonal difference:

$$\begin{aligned} z_t=\nabla (1+B+B^2+\cdots +B^{s-1})y_t= \nabla _s \,y_t. \end{aligned}$$

Similarly, detecting two unit roots in \(w_t\) would imply that the stationary transformation is:

$$\begin{aligned} z_t=\nabla ^2 (1+B+B^2+\cdots +B^{s-1})y_t=\nabla \nabla _s\,y_t, \end{aligned}$$

and so on. After the stationary transformation is identified and applied, one must run the NID procedure to determine, first, regular stationary structures of order \(n=0,1,2,\dots \), and then seasonal structures in the range \(n=0,s,2s,3s,\ldots \)
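The seasonal sum and the candidate stationary transformations are easy to compute. The sketch below (Python with NumPy; function names are ours) also verifies numerically the identity \(\nabla (1+B+\cdots +B^{s-1})=\nabla _s\) used above:

```python
import numpy as np

def seasonal_sum(y, s):
    """w_t = (1 + B + ... + B^{s-1}) y_t, a moving sum of s terms."""
    return np.convolve(y, np.ones(s), mode="valid")

def diff(y, k=1):
    """Apply the regular difference operator nabla = 1 - B, k times."""
    return np.diff(y, n=k)

# One unit root in w_t  =>  z_t = nabla w_t = nabla_s y_t:
y = np.arange(24, dtype=float)
assert np.allclose(diff(seasonal_sum(y, 12)), y[12:] - y[:-12])
```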

After fitting univariate models to each series in a dataset, one can cope with a vector of seasonal time series by: (a) filtering out the seasonal structure of each series using, e.g., the exact decomposition in Casals et al. (2002), and then (b) modeling the seasonally adjusted vector using the approach described in Sects. 3 and 4.1. After doing this, one could also estimate a final model combining the regular specification with a seasonal factor, given by seasonal AR, MA and transformation matrices whose main diagonals contain the corresponding factors of the univariate seasonal models.

Finally, as mentioned above, we do not address the case of seasonal cointegration. Recently, Bauer and Buschmeier (2021) went further in this direction, proving that CVA, see Sect. 2.3, provides consistent estimators for long-run and short-run dynamics, even in the presence of seasonal unit roots. They consider multivariate systems and, accordingly, propose new tests for the cointegrating rank at seasonal frequencies. Their simulations show that CVA performs better than some alternatives in specific settings (small samples, large system dimension).

This shows the flexibility of subspace methods to model vectors of time series, but a deeper discussion of this issue is beyond the scope of this paper.

5 Univariate modeling examples

We will now illustrate the performance of the methodology proposed in the previous sections when applied to build univariate models. Section 5.1 presents a step-by-step analysis of the famous series G of International Airline Passengers (Box et al. 2015), a popular benchmark for practical cases emphasizing nonstationarity and seasonality. Section 5.2 applies the modal automatic strategy described in Sect. 3.1 to nine benchmark series commonly used in the literature.

5.1 Airline passenger series

There is a consensus in the literature that the monthly (log) Airline Passenger series, \(z_t=\log (y_t)\), is adequately represented by an IMA(1,1) \(\times \) IMA(1,1)\(_{12}\) model. We will show now that our methodology leads to this model.

Following the steps described in Sect. 4.2, we first compute the seasonal sum, \(w_t=(1+B+...+B^{s-1}) z_t\), where \(s=12\), and apply to it the procedure for unit root detection, finding: (a) at least one unit root in \(w_t\) (Table 6), (b) at least one unit root in \(\nabla w_t\) (Table 7), and (c) zero unit roots in \(\nabla ^2 w_t\). These decisions seem quite reliable, as the first canonical correlation coefficients for \(\nabla w_t\) and \(\nabla ^2 w_t\) are .9925 and .9565, respectively, so the unit root criterion presented in (32) clearly supports the decision in each case.

Table 6 Output from NID when applied to \(\nabla w_t\)
Table 7 Output from NID when applied to \(\nabla ^2 w_t\)

Therefore, an adequate stationary transformation is \(\nabla ^2 w_t=\nabla \nabla _{12} \, z_t\), which coincides with the standard result in the literature. Note that Table 7 also determines the order of the regular system, which is set at \(\hat{n}=1\).

The next step consists in determining the order of the seasonal subsystem. To this end, we apply NID to \(\nabla \nabla _{12} \, z_t\), setting the length of the seasonal cycle to \(s=12\). By doing so, we obtain the results shown in Table 8. Accordingly, the dynamic order of the seasonal subsystem is set to \(\hat{n}_s=1\) (see Garcia-Hiernaux et al. 2012, for the use of NID with seasonal series).

Table 8 Determination of the seasonal subsystem order for \(\nabla \nabla _{12} z_t\) with NID

After this preliminary specification analysis, we apply SUBEST1 to the stationary transformed series, setting the orders of the regular and seasonal subsystems to the previously determined values: \(\hat{n}=1\) and \(\hat{n}_s=1\). This yields the balanced model:

$$\begin{aligned} (1-\underset{(0.21)}{0.06}B)(1-\underset{(0.17)}{0.08}B^{12}) \nabla \nabla _{12} \, z_t=(1-\underset{(0.21)}{0.43}B) (1-\underset{(0.16)}{0.58}B^{12})\hat{a}_{1t}, \end{aligned}$$

where the figures in parentheses are the parameter standard errors, obtained by computing the corresponding information matrix. The estimate for the percent standard deviation of the error is \(\hat{\sigma }_{a_1}=3.68\%\).

To achieve the highest efficiency with Gaussian data and to exclude some non-significant parameters, one can compute ML estimates using the previous values as initial conditions for the iterative optimization.Footnote 13 In this case, this provides the following result:

$$\begin{aligned} \nabla \nabla _{12} \, z_t= (1-\underset{(0.08)}{0.40}B)(1-\underset{(0.08)}{0.56}B^{12}) \hat{a}_{2t}, \quad \hat{\sigma }_a=3.67\%, \end{aligned}$$

which essentially coincides with the results in the literature.

Finally, one can check for autocorrelation by applying NID to the residuals of this tentative model. If we do so, setting the frequencies at \(s=1\) and \(s=12\) to check for additional regular or seasonal dynamic structures, we obtain Tables 9 and 10, whose results show no evidence of misspecification.

Table 9 Output from NID when applied to \(\hat{a}_{2t}\) in Eq. (37)
Table 10 Determination of the seasonal subsystem order for \(\hat{a}_{2t}\) in Eq. (37)

5.2 Semi-automatic identification for real and simulated time series

Now we apply our methodology to nine benchmark time series taken from the literature, using the modal strategy described in Sect. 3.1. Table 11 describes these series, while Tables 12 and 13 summarize the main results. Note that:

  1. The stationary transformation was correctly determined in all the cases.

  2. When working with seasonal series (series B, G and H) the final results coincide essentially with the models suggested by the literature, with two minor issues:

     (a) Seasonality in series G is weak, so it was not detected until we applied NID to the residuals of the initial regular model;

     (b) In the model for series H, the MLE of the seasonal MA parameter converged to unity, which implies a deterministic seasonality.

  3. All the models in the column “NID+SUBEST1+ML” in Table 13 coincide with the specifications proposed by the literature.

Regarding the performance of the modal strategy, there was perfect agreement among all the criteria employed except for series D, where SBC suggests a first-order process, while the remaining procedures suggest the “consensus” second-order specification.

Table 11 Datasets employed to estimate the models in Tables 12 and 13
Table 12 Step-1 univariate models, resulting from the successive application of NID and SUBEST1. Standard errors are in parentheses
Table 13 Univariate models, resulting from the successive application of NID, SUBEST1 and ML. Standard errors are in parentheses

6 Multivariate modeling examples

Now we will show two examples of multivariate modeling, chosen because of the different cointegration properties of the final models. The first one is not cointegrated, while the second has one nonstationary common factor.

6.1 Flour price indices in three cities

The first exercise is based on three monthly series of flour price indexes in Buffalo, Minneapolis and Kansas City, recorded from August 1972 to November 1980, which will be denoted by \(\textbf{y}_t\). This dataset is adequate to test the “Law of one price”, as flour is a quite homogeneous product and these cities are close enough to allow for product flows.

Figure 2 displays the profile of these series. Note that they show a high degree of comovement, so they could be cointegrated. Despite this graphical evidence, previous analyses reject the null of cointegration for this dataset; see Tiao and Tsay (1989); Lütkepohl and Poskitt (1996). This makes it a good benchmark to test for false positives in cointegration testing.

Fig. 2 Logs of flour price indexes in Buffalo, Minneapolis and Kansas City. Monthly data from August 1972 to November 1980

Applying NID to each individual series indicates that all of them are I(1). Again, the first canonical correlation is close to unity in all three series, so this decision seems solid. To check for cointegration, we apply NID to the vector \(\log (\textbf{y}_t)\), again detecting three unit roots for the multivariate process. Here the first three canonical correlations are .99, .94 and .88, respectively. Although the unit root criterion presented in (32) indicates three unit roots, note that the last one is a borderline case. In any case, cointegration is rejected, see Table 14.

Table 14 Output from NID when applied to \(\log (\textbf{y}_t)\)

Table 14 shows no evidence of any dynamic structure beyond the unit roots. This could be due to the fact that nonstationary components often dominate other, weaker dynamic components, whose detection requires a previous stationarity-inducing transformation. For this reason, we apply the NID algorithm again to the stationary transformation \(\nabla \log (\textbf{y}_t)\), obtaining the results in Table 15. In this case there is a tie, as three criteria detect a first-order structure while the other three conclude that the order is zero. When this happens, we prefer the higher dynamic order, because an overfitting error can be easily corrected by pruning insignificant parameters. Therefore, we choose \(\hat{n}=1\).

Table 15 Output from NID when applied to \(\nabla \log (\textbf{y}_t)\)

After deciding the system order, SUBEST1 (with \(i=3\)) provides estimates for \(\alpha \) and its corresponding VARMA\(_E\). In this case, AIC, SBC and HQ coincide in choosing \(\hat{\alpha }=(1,0,0)\). Accordingly, the estimated model is:

$$\begin{aligned} \varvec{\hat{F}}(B)\nabla \log (\textbf{y}_t)=\varvec{\hat{L}}(B)\varvec{\hat{a}_t}, \end{aligned}$$


$$\begin{aligned} \varvec{\hat{F}}(B)&= \begin{bmatrix} 1+{0.39}B&{}{0}&{}{0}\\ -{0.71}&{}{1}&{}{0}\\ -{0.45}&{}{0}&{}{1}\\ \end{bmatrix}\quad \varvec{\hat{L}}(B)=\begin{bmatrix} 1-{0.82}B&{}{1.08B}&{}{0.09B}\\ -{0.71}&{}{1}&{}{0}\\ -{0.45}&{}{0}&{}{1}\\ \end{bmatrix}, \nonumber \\ \varvec{\hat{\Sigma }_a}(\%)&= \begin{bmatrix} 0.21&{}-&{}-\\ 0.22&{}0.25&{}-\\ 0.22&{}0.24&{}0.28 \end{bmatrix}, \end{aligned}$$

where the noise covariance estimates are expressed as percentages. We then estimate the previous structure by ML, obtaining:

$$\begin{aligned} \varvec{\hat{F}}(B)&= \begin{bmatrix} 1+\underset{(0.17)}{0.41}B & \underset{(-)}{0} & \underset{(-)}{0}\\ -\underset{(0.11)}{0.73} & \underset{(-)}{1} & \underset{(-)}{0}\\ -\underset{(0.20)}{0.49} & \underset{(-)}{0} & \underset{(-)}{1} \end{bmatrix}\quad \varvec{\hat{L}}(B)= \begin{bmatrix} 1-\underset{(0.46)}{0.94}B & \underset{(0.37)}{1.32}B & \underset{(-)}{0}\\ -\underset{(0.11)}{0.73} & \underset{(-)}{1} & \underset{(-)}{0}\\ -\underset{(0.20)}{0.49} & \underset{(-)}{0} & \underset{(-)}{1} \end{bmatrix} \\ \varvec{\hat{\Sigma }_a}(\%)&= \begin{bmatrix} \underset{(0.03)}{0.21} & - & -\\ \underset{(0.03)}{0.22} & \underset{(0.03)}{0.25} & -\\ \underset{(0.03)}{0.21} & \underset{(0.04)}{0.24} & \underset{(0.04)}{0.28} \end{bmatrix} \end{aligned}$$

where the (1,3)-element in \(\varvec{\hat{L}}\) was found to be non-significant and was therefore pruned.

Finally, Table 16 summarizes the output of NID when applied to the residuals of model (40), which shows no symptom of misspecification. The \(\textbf{Q}\) matrix of Ljung and Box (1978), computed with 7 degrees of freedom, confirms this conclusion:

$$\begin{aligned} \mathbf {Q_{(7)}}=\begin{bmatrix} 3.5 & 3.3 & 3.2\\ 4.6 & 4.3 & 4.9\\ 6.9 & 5.6 & 4.0 \end{bmatrix}. \end{aligned}$$
Table 16 Output from NID when applied to \(\varvec{\hat{a}_t}^\star \)
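A portmanteau matrix like \(\mathbf{Q}_{(7)}\) above can be tabulated by applying the Ljung-Box formula to the lagged cross-correlations of each residual pair. The sketch below is one plausible convention for the off-diagonal entries (the paper does not spell out its exact construction, so treat the details as an assumption):

```python
import numpy as np

def ljung_box_matrix(resid, h=7):
    """Elementwise Ljung-Box portmanteau matrix for a residual panel.

    resid : (T, m) matrix of residual series; h : number of lags.
    Entry (i, j) applies the Ljung-Box statistic
    T(T+2) * sum_k r_k(i,j)^2 / (T-k) to the lag-k cross-correlations
    r_k(i, j) between series i (leading) and series j (lagging).
    """
    e = np.asarray(resid, float)
    T, m = e.shape
    e = e - e.mean(0)
    s = e.std(0)
    Q = np.zeros((m, m))
    for k in range(1, h + 1):
        # lag-k cross-correlation matrix of the residual panel
        r = (e[k:].T @ e[:-k]) / ((T - k) * np.outer(s, s))
        Q += r ** 2 / (T - k)
    return T * (T + 2) * Q
```

Under the null of white-noise residuals, each entry is approximately \(\chi^2\) with \(h\) degrees of freedom, so values like those in \(\mathbf{Q}_{(7)}\) are far below conventional critical levels.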

The VARMA echelon model (40) implies that the three series share a first-order stationary AR component and receive the effect of the innovations in Buffalo and Minneapolis at \(t-1\). This structure is represented by seven free parameters, so it is parsimonious in comparison with, e.g., the nine parameters of the VAR(1) specification proposed by Tiao and Tsay (1989).

On the other hand, note that the error covariance matrix in Equation (40) implies that the disturbances in this model are highly correlated. This fact would explain the comovement observed in Fig. 2 and suggests that, in this case, the “Law of one price” could be described, not by common trends, but by common shocks affecting prices in different locations.

6.2 US interest rates

We now model four monthly short-term US interest rates, from August 1985 to January 2003, previously analyzed by Martin-Manjon and Treadway (1997). These are the Fed's Target Rate (\(TO_t\)), the effective rate (\(TE_t\)), and the secondary market yields for 3-month (\(T3m_t\)) and 6-month (\(T6m_t\)) T-bills. Figure 3 shows the profile of these series.

Fig. 3

Short-term interest rates. Monthly data from August 1985 to January 2003

We determine the integration order of each individual variable by applying NID. The unit root criterion clearly indicates that each individual series is I(1). The second step in the analysis consists of determining the dynamic order for the vector of variables by applying NID again. In this case, the procedure detects one unit root and, therefore, three cointegrating relationships; see Table 17.

Table 17 Output from NID when applied to \(\mathbf {Z_{1t}}^\star \)

NID also provides estimates for the coefficients of the cointegrating equations, so the stationarity-inducing transformation of the data is given by:

$$\begin{aligned} \textbf{z}_{t}=\begin{bmatrix} \nabla & 0 & 0 & 0 \\ -1.01 & 1 & 0 & 0 \\ -0.92 & 0 & 1 & 0 \\ -0.93 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} TO_t \\ TE_t \\ T3m_t\\ T6m_t \end{bmatrix} \end{aligned}$$

Therefore, cointegration operates between pairs of variables, and the stationary combinations are close to the spreads between each rate and the Fed's target. Note also that the structure of Equation (42) has a clear interpretation in terms of economic controllability, as setting the value of the Target Rate, \(TO_t\), implies choosing the long-run mean values of \(TE_t\), \(T3m_t\) and \(T6m_t\). Applying NID to the stationary-transformed vector, \(\textbf{z}_{t}\), detects four stationary dynamic components and no additional unit roots, see Table 18.
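In code, the transformation in Equation (42) amounts to differencing the target rate and forming the three spreads. This sketch drops the first observation lost to the difference, which is an alignment choice of ours rather than anything prescribed by the paper:

```python
import numpy as np

def stationary_transform(rates, beta=(1.01, 0.92, 0.93)):
    """Apply the stationarity-inducing map of Equation (42).

    rates : (T, 4) array with columns TO, TE, T3m, T6m.
    beta  : estimated cointegrating coefficients from the text.
    Returns a (T-1, 4) array: the differenced target rate and the
    spreads TE - 1.01*TO, T3m - 0.92*TO, T6m - 0.93*TO, aligned so
    every row refers to the same time index t = 2, ..., T.
    """
    r = np.asarray(rates, float)
    to = r[:, 0]
    z1 = np.diff(to)                        # first row of (42): nabla TO_t
    spreads = r[:, 1:] - np.outer(to, beta)  # remaining rows: spreads
    return np.column_stack([z1, spreads[1:]])
```

Each output column is stationary under the cointegration structure detected by NID: the first by differencing, the others as near-spreads against the target rate.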

Table 18 Output from NID when applied to \(\textbf{z}_{t}\)

In this situation, the next steps in the analysis would be: (a) determining the KI vector \(\alpha \), (b) estimating the final model, and (c) performing diagnostic checks. We do not include these results here because they do not add much to what was shown in the previous examples.

7 Discussion

The model-building approach proposed in this paper has substantial advantages in comparison with its main alternatives.

7.1 Comparison with the VAR methodology

The procedure to fit a Step-1 model described in Sect. 3.1 is similar to standard VAR model-building, see Lütkepohl (2005), as it relies on LS methods, explores a succession of linear models with increasing orders, and compares the resulting models in terms of both LR tests and ICs. However, there are three important differences.

First, in VAR modeling the exploration of possible dynamic orders is not systematic, because it skips some intermediate dynamic orders. To see this, consider a vector of three time series. A standard VAR analysis would consider vector autoregressions of orders 0, 1, 2..., whose corresponding dynamic orders would be 0, \(1 \times 3=3\), \(2 \times 3=6\)... In comparison, our method explores the whole sequence of non-negative integer orders (0, 1, 2,...) up to a user-specified upper limit. Because of this, standard VAR modeling is prone to over- or under-specifying the system order.

Second, the VAR approach considers only autoregressive structures, while Step-1 models have both AR and MA terms. Accordingly, if the DGP has some moving average structure, our approach should provide a more parsimonious representation, even if we do not proceed to Step-2.

Third, while a Step-1 model is comparable to a VAR in terms of the modeling decisions required (just choosing a scalar order), our approach provides a way to achieve further parsimony by enforcing a canonical structure (i.e., fitting a Step-2 model) without much additional complexity. In contrast, the VAR methodology does not provide such an option.

7.2 Comparison with VARMA specification methods

In comparison with alternative VARMA specification procedures, such as those in Jenkins and Alavi (1981), Tiao and Box (1981) and Reinsel (1997), our methodology:

1. ...provides a unified modeling procedure for single and multiple time series, because the same basic decisions are required in both cases;

2. ...is scalable, meaning that the analyst may choose between a quick non-parsimonious Step-1 model or the more efficient Step-2 and Step-3 models;

3. ...is easy to implement in automatic mode; and

4. ...offers a flexible approach to the parsimony principle, which can be enforced in the later stages of the analysis by adopting an efficient canonical structure (Step-2) and pruning non-significant parameters (Step-3).

On the other hand, standard VARMA specification methods: (a) do not enforce coherence with univariate model-building, (b) are difficult to implement in automatic mode, (c) are not scalable, as they do not provide cost-efficient relaxations of the parsimony requirement, and, in contradiction with this, (d) may yield an overparameterized model, as they do not ensure that the final model is canonical.

7.3 Comparison with other methods using the Kronecker indices

The closest benchmarks for our proposal are the alternative methods to determine the KIs and, hence, to specify a VARMA\(_E\). This literature is rich and includes, among many works, those of Hannan and Kavalieris (1984b), Poskitt (1992), Lütkepohl and Poskitt (1996), Tsay (1989b), Li and Tsay (1998) and Poskitt (2016).

The common feature of all these methods is that they directly identify the KIs \(\alpha =(n_1,n_2,\ldots ,n_m)\), while we first determine the system order, n, and only then estimate and compare all the models with a sum of KIs equal to n. Our approach has several advantages:

1. Determining the system order is useful in itself, as it allows one to specify a crude but fast Step-1 model which may be adequate in some cases (see Sect. 3.1), thus avoiding the need to determine \(\alpha \).

2. The dynamic order can be obtained with both ICs and LR tests, because lower-order specifications are nested in higher-order ones. On the other hand, models with different \(\alpha \) are non-nested in general, so they can only be compared with ICs.

3. Limiting the search space to the KIs compatible with a previously estimated system order provides faster and more robust performance than searching in an unbounded space.
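The nesting argument in point 2 can be sketched with a standard LR test. In this illustrative function, the degrees of freedom are the number of extra free parameters between the nested specifications; the names and call signature are ours:

```python
from scipy.stats import chi2

def lr_test(loglik_small, loglik_big, df):
    """Likelihood-ratio test for nested models (e.g., order n vs n+1).

    Valid here because lower-order specifications are nested in
    higher-order ones; models with different KI vectors are generally
    non-nested and must be compared with ICs instead.
    """
    stat = 2.0 * (loglik_big - loglik_small)  # LR statistic
    return stat, chi2.sf(stat, df)            # p-value from chi-square tail
```

A small p-value rejects the restricted (lower-order) model in favor of the larger one; with non-nested KI vectors this chi-square calibration no longer applies, which is why those comparisons fall back on ICs.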

Besides these advantages, our proposal adds extensions for nonstationary time series, cointegration and seasonality, not treated by other VARMA\(_E\) approaches.

7.4 Comparison with the SCM approach

Tiao and Tsay (1989) propose a structured method to specify a canonical model. They build on the so-called “Scalar Component” (SC), which is the basis of the family of “Scalar Component Models” (SCM). These concepts are analogous, but not equivalent, to those of KIs and VARMA\(_E\). In a nutshell, they build on a canonical correlation analysis to find a contemporaneous linear transformation for the vector process that: (a) reveals simplifying structures, (b) achieves parametric parsimony, and (c) identifies alternative “exchangeable” parameterizations.

Athanasopoulos et al. (2012) compare comprehensively the SC and KI-based approaches at theoretical, experimental, and empirical levels. They conclude that:

1. VARMA\(_E\) and SC models are connected in theory.

2. The greatest advantage of echelon forms is that their construction can be automated, something that is impossible to achieve within the SC identification process.

3. The experimental comparison, performed via Monte Carlo experiments, shows that both procedures work well in identifying some common VARMA DGPs.

4. SC VARMA models are more flexible because their maximum autoregressive order does not have to be the same as the order of the moving average component, while these orders are constrained to be equal for a VARMA\(_E\) form and, perhaps for this reason,

5. ...the empirical out-of-sample forecast evaluation shows that SC VARMA models provide better forecasts than VARMA\(_E\) models which, in turn, provide better forecasts than VAR models.

These conclusions are fully consistent with our own experience. In the framework of this paper, we add significance tests in Step-3 to relax the constraint on the autoregressive and moving average orders implied by the echelon form, so our Step-3 and SC models could potentially have the same forecasting performance. Confirming this possibility would require an in-depth forecasting power comparison, which exceeds the scope of this paper.

On the other hand, the methodology proposed here is scalable and includes extensions for nonstationary time series, cointegration and seasonality, not treated by the SCM approach.

7.5 Software implementation

An important disadvantage of our method in comparison with some alternatives is that the software required is not commonly available. To address this shortcoming, we implemented all the computational procedures described in this paper in the \(E^4\) functions NID and SUBEST1.

\(E^4\) is a MATLAB toolbox for time series modeling, which can be downloaded at: www.ucm.es/e-4. The source code for all the functions in the toolbox is freely provided under the terms of the GNU General Public License. This site also includes a complete user manual and other reference materials.