Abstract
We propose a new method to specify linear models for vectors of time series with some convenient properties. First, it provides a unified modeling approach for single and multiple time series, as the same decisions are required in both cases. Second, it is scalable, meaning that it provides a quick preliminary model, which can be refined in subsequent modeling phases if required. Third, it is optionally automatic, because the specification depends on a few key parameters which can be determined either automatically or by human decision. And last, it is parsimonious, as it allows one to choose and impose a canonical structure by a novel estimation procedure. Several examples with simulated and real data illustrate its application in practice.
1 Introduction
This paper proposes a new method to specify linear models for vectors of time series, which draws on ideas from two different sources: subspace methods, which originated in the system identification literature, and the time series analysis approach built on the seminal work of Box and Jenkins (1970).
Since the pioneering work of Ho and Kalman (1966), the term “system identification” refers to the use of statistical and algebraic tools to fit a State-Space (SS) model to a dataset without a priori restrictions. Much of the recent work in this area follows the subspace methods approach: a family of system identification procedures that rely heavily on canonical correlation and Least Squares (LS) methods. For a survey see, e.g., Qin (2006).
On the other hand, modern time series analysis is critically influenced by the classical Box and Jenkins (1970) book and subsequent works. The philosophical contribution of this methodology can be synthesized in three basic principles: (i) “let the data speak,” meaning that the specification of reduced form models should be based on the observable properties of the sample; (ii) “parameterize with parsimony,” meaning that models with few parameters should be preferred, all other aspects being equal; and (iii) “all models are wrong, but some are useful,” meaning that all specifications must be considered tentative and subject to a structured process of critical revision and diagnostic testing. We subscribe to these principles and implement them in the foundations of our approach.
On these bases, our method is structured in three sequential steps:
Step 1:

Determine the dynamic order of the system (or McMillan degree), defined as the minimum number of dynamic components required to realize the system output. We do this by using subspace methods to estimate a sequence of SS models over a predetermined range of orders, and then choosing the best one by means of Likelihood Ratio (LR) tests and information criteria (ICs). This simple decision, similar to VAR order selection, is enough to fit a preliminary “Step 1 model” to the dataset, which might be overparameterized but is useful to cover simple needs.
Step 2:

Determine the dynamic order of each individual time series, also known as its Kronecker index (KI). We do this by applying a new subspace algorithm, named SUBEST1, which estimates the parameters of the Luenberger Canonical Form corresponding to a given vector of KIs. We use this algorithm to estimate the models corresponding to all the combinations of KIs which add up to the previously determined system order. The model with the best ICs (“Step 2 model”) is therefore a canonical representation for the time series, and will often be more parsimonious than the Step 1 model. In the univariate case this step is not required, because the system order coincides with the KI of the single series and the Step 1 model is, therefore, canonical.
Step 3:

The third and final step consists in refining the models obtained in previous steps, for example by computing Maximum-Likelihood Estimates (MLE) or by pruning non-significant parameters. Such a “Step 3 model” might be more parsimonious than the previously fitted models but, on the other hand, requires more effort and human decision-making.
The computational procedures employed to implement Step 2, as well as the distinction between Steps 1, 2 and 3, are original contributions to the literature on canonical modeling, although closely related to the works of Hannan and Deistler (1988), Deistler et al. (1995), Bauer and Deistler (1999) and Bauer (2005c).
In comparison with mainstream econometric procedures, our proposal has three specific advantages, as it:

1.
...provides a unified modeling approach for single and multiple time series, as the same model-building process is followed in both cases,

2.
...is optionally automatic, meaning that each model is characterized by a few key integer parameters chosen following a clearly structured process based on a limited set of sample statistics; hence, the modeling decisions can be made automatically or by a human analyst,

3.
...is scalable, meaning that: (i) it provides a possibly crude but statistically adequate Step 1 model, which may be enough to satisfy simple analytical needs; (ii) this model can be made parsimonious if required, by enforcing a canonical structure in Step 2; and (iii) additional efficiency can be achieved in Step 3.
The structure of the paper is as follows. Section 2 presents the notation and reviews some ideas on VARMA and SS systems. It also introduces the subspace methods and their asymptotic properties. Section 3 outlines the proposed methodology in the relatively simple case of stationary and non-seasonal time series, and Sect. 4 extends it to non-stationary and/or seasonal series. We do this by adding to the previous schema two specification phases, to determine the number of unit roots and cointegrating relations and to identify the dynamic orders of the seasonal and non-seasonal subsystems. Sections 5 and 6 illustrate the practical application of this approach to single and multiple time series, respectively, with benchmark datasets taken from the literature. Last, Sect. 7 compares the proposed method with alternative methods, provides some concluding remarks and indicates how to obtain free MATLAB code implementing all the algorithms required by our proposal.
2 Notation and previous results
In this Section we present the notation employed and the main previous results upon which our work is based. We begin by defining VARMA and SS systems. Then, we introduce the subspace methods and finally discuss the relation between canonical VARMA models (named echelon) and their equivalent SS representation.
2.1 VARMA models
Much work in applied time series analysis is based on the linear dynamic model:
$$\begin{aligned} \varvec{\bar{F}}(B)\textbf{z}_t=\varvec{\bar{L}}(B)\textbf{a}_t, \end{aligned}$$
(1)
where \(\textbf{z}_t\in {\mathbb {R}}^m\) is observed for \(t=1,\ldots ,T\); \(\textbf{a}_t \in {\mathbb {R}}^m\) is an innovation vector such that \(\textbf{a}_t \sim iid(\textbf{0},\varvec{\Sigma _a})\); B denotes the backshift operator, such that for any sequence \(\omega _t\): \(B^i\omega _t=\omega _{t-i}\), \(i=0,\pm 1,\pm 2,\ldots \) and, finally, factors \(\varvec{\bar{F}}(B)\) and \(\varvec{\bar{L}}(B)\) in (1) are given by:
$$\begin{aligned} \varvec{\bar{F}}(B)=\varvec{\bar{F}}_0+\varvec{\bar{F}}_1B+\cdots +\varvec{\bar{F}}_pB^p, \qquad \varvec{\bar{L}}(B)=\varvec{\bar{L}}_0+\varvec{\bar{L}}_1B+\cdots +\varvec{\bar{L}}_qB^q. \end{aligned}$$
(2)
Model (1–2) is assumed to be left coprime, with the roots of \(\det \varvec{\bar{F}}(B)\) and \(\det \varvec{\bar{L}}(B)\) greater than unity in modulus, so that \(\textbf{z}_t\) is stationary and invertible.
If we multiplied both sides of Eq. (1) by an arbitrary nonsingular matrix we would obtain an observationally equivalent model for \(\textbf{z}_t\): for a given transfer function, several VARMA(p,q) models exist, even under left coprimeness. Therefore, left coprimeness is not enough to identify the model, and achieving identifiability requires additional constraints. For example, normalising \(\bar{\textbf{F}}_0=\bar{\textbf{L}}_0=\textbf{I}\) yields the standard VARMA(p,q) form introduced by Quenouille (1957). A detailed treatment of the equivalence classes of VARMA systems is given in Hannan and Deistler (1988), Theorem 2.2.1.
An interesting alternative to a standard VARMA is the VARMA echelon, or VARMA\(_{E}\), representation. System (1–2) is in echelon form if, denoting by \(\bar{F}_{kl}(B)\) and \(\bar{L}_{kl}(B)\) the (k,l)-th elements of \(\varvec{\bar{F}}(B)\) and \(\varvec{\bar{L}}(B)\), respectively, the polynomial factors in (1) are uniquely defined by:
where the multi-index \(\alpha = (n_1,n_2,\ldots ,n_m)\), whose components are the KIs mentioned above, represents the dynamic order of each series, and:
The KIs uniquely define an echelon canonical VARMA form by means of Eqs. (3a–4); see Hannan and Deistler (1988), Theorem 2.5.1. Broadly speaking, these indices determine the structure of zeros, ones and free parameters of the echelon form. By canonical, we mean a unique representation (commonly with the minimum number of parameters) within the output-equivalent VARMA class. Another property of the KIs is that \(\sum _{k=1}^{m}n_k=n\), meaning that the echelon form distributes the dynamic order among the different time series so that their sum equals the McMillan degree, denoted by n. Specific details about echelon forms can be found in Hannan and Deistler (1988), Chapter 2, and Lütkepohl (2005), Definition 12.2.
To clarify the exposition above, the following example compares two equivalent VARMA representations of the same transfer function with \(\alpha =(1,0)\). We will use this model in different examples throughout this paper.
2.1.1 Example 1: A VARMA(1,1) model vs. its equivalent VARMA\(_{E}\)
Consider the left coprime standard VARMA(1,1) model:
with,
As \(\alpha =(1,0)\), its equivalent VARMA\(_E\) form is defined as:
where,
and \(\varvec{\Sigma _a}\) coincides with the one in (5), because both models are observationally equivalent and, therefore, have the same innovations.
This example motivates the following remarks:
Remark 1
It is easy to see the output equivalence of models (5) and (6), as premultiplying both sides of (6) by \(\textbf{F}_0^{-1}\) yields (5).^{Footnote 1}
Remark 2
The standard VARMA specification (5) has six coefficients while its equivalent echelon form (6) only has four, parsimony being one of the advantages of canonical over non-canonical formulations.
Remark 3
As \(\alpha =(1,0)\) in (6), each component of the vector corresponds to the highest dynamic order of the corresponding series. Therefore, the order of the whole system is \(n=1\).
Remark 4
When all the KIs are equal, and only in this case, the standard VARMA model (1–2) has an echelon structure and, accordingly, is a canonical representation. Otherwise, the echelon form has fewer parameters than the equivalent standard VARMA which, accordingly, is not canonical.
Remark 5
Models with different KI vectors are not nested in general, so they cannot be compared using LR tests. Their relative adequacy is often assessed by means of ICs, see, e.g., Hannan and Kavalieris (1984a); Poskitt (1992); Lütkepohl and Poskitt (1996).
As the above remarks point out, when the KIs are not all equal one can generally obtain the equivalent standard VARMA representation of any echelon form by premultiplying it by an appropriate unimodular transformation. Nonetheless, Gevers (1985) shows that, in general, the standard VARMA representation (1–2) implies a McMillan degree of \(m \times \max \{p,q\}\) and, therefore, only transfer functions whose system order is a multiple of m can be adequately represented that way. Additional details on the nesting structure of the KIs can be found in Hannan and Deistler (1988), Section 2.6. These facts are fundamental for our paper, and motivate Sects. 3.1 and 3.2.
2.2 StateSpace systems
The linear time-invariant SS representation employed in this paper is:
$$\begin{aligned} \textbf{x}_{t+1}&=\varvec{\Phi }\textbf{x}_t+\textbf{K}\textbf{a}_t, \end{aligned}$$
(7a)
$$\begin{aligned} \textbf{z}_t&=\textbf{H}\textbf{x}_t+\textbf{a}_t, \end{aligned}$$
(7b)
where \(\textbf{x}_t\in {\mathbb {R}}^n\) is the state vector and \(\textbf{z}_t\) and \(\textbf{a}_t\) are as in (1–2). We assume that the SS system is:

A1.
...strictly minimum-phase, that is, \(|\lambda _i(\varvec{\Phi }-\textbf{K}\textbf{H})|<1,\forall i=1,...,n\), where \(\lambda _i(.)\) denotes the ith eigenvalue,

A2.
...minimal, which implies that the pair \((\varvec{\Phi },\textbf{H})\) is observable and \((\varvec{\Phi },\textbf{K})\) is controllable, and

A3.
...stable, so: \(|\lambda _i(\varvec{\Phi })|<1,\forall i=1,...,n\).
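These assumptions are easy to verify numerically for a candidate parameter set. The sketch below (Python/NumPy, an illustration rather than part of the MATLAB toolbox mentioned later; the function name is ours) checks A1–A3 for given matrices \(\varvec{\Phi }\), \(\textbf{K}\) and \(\textbf{H}\):

```python
import numpy as np

def check_if_assumptions(Phi, K, H):
    """Check A1-A3 for an innovations-form model (Phi, K, H).

    A1 (strict minimum phase): all eigenvalues of Phi - K H inside the unit circle.
    A2 (minimality): observability and controllability matrices have full rank n.
    A3 (stability): all eigenvalues of Phi inside the unit circle.
    """
    n = Phi.shape[0]
    a1 = bool(np.all(np.abs(np.linalg.eigvals(Phi - K @ H)) < 1))
    a3 = bool(np.all(np.abs(np.linalg.eigvals(Phi)) < 1))
    # Stack H Phi^j (observability) and Phi^j K (controllability) for j = 0..n-1
    O = np.vstack([H @ np.linalg.matrix_power(Phi, j) for j in range(n)])
    C = np.hstack([np.linalg.matrix_power(Phi, j) @ K for j in range(n)])
    a2 = np.linalg.matrix_rank(O) == n and np.linalg.matrix_rank(C) == n
    return a1, a2, a3
```

For instance, a system with \(n=1\), \(m=2\) and a scalar \(\varvec{\Phi }\) inside the unit circle satisfies all three conditions.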
In comparison with mainstream alternatives, model (7a–7b) has a unique error term affecting both the state and observation equations. This model is sometimes known as the Prediction Error or Innovations Form (IF).^{Footnote 2} We will use the latter denomination throughout the paper. The importance of the IF lies in the fact that

1.
...it is a general representation, because any SS system with different unobserved inputs can be written in an equivalent IF (7a–7b) (see Hannan and Deistler 1988, Chapter 1). Additionally, Casals et al. (1999) present an algorithm to compute the matrices of an IF equivalent to any SS model by solving an exactly determined system of equations, and

2.
...it has clear computational advantages for likelihood computation and signal extraction, because its states can be estimated with an uncertainty that converges to zero under nonrestrictive assumptions, see Casals et al. (2013).
Most of these results are the core of Chapters 1 and 2 in Hannan and Deistler (1988), and have also been empirically discussed in Casals et al. (2016), Chapters 2, 5 and 9.
2.3 Subspace methods
Subspace methods are used to estimate the parameters in the IF (7a–7b) by applying generalized LS to a rank-deficient system, see e.g. Qin (2006); Favoreel et al. (2000). To do this, we need to write the model (7a–7b) in subspace form by substituting \(\textbf{a}_t\) from (7b) into (7a) and solving the resulting recursion. This yields:
where the states at time t depend on their initial value, \(\textbf{x}_1\), as well as on the past values of the output.
Substituting (8) into the observation Eq. (7b) results in:
Equations (8) and (9) can be written in matrix form as:
where i is a user-defined index which represents the number of past block rows, p. For simplicity, we will assume hereafter that the scalars p and f are equal to i.^{Footnote 3} For now, the only requirement on this value is that \(i>n\). This dimensioning implies that the \(\varvec{\mathcal {L}}\) matrices in (10a) are:
Second, given these previous choices, \(\textbf{Z}_f\) and \(\textbf{Z}_p\) are block-Hankel matrices whose columns are \([\textbf{z}_{t}^{\intercal },\textbf{z}_{t+1}^{\intercal }, \dots ,\textbf{z}_{t+f-1}^{\intercal }]^{\intercal }\) and \([\textbf{z}_{t-p}^{\intercal }, \textbf{z}_{t-p+1}^{\intercal }, \dots , \textbf{z}_{t-1}^{\intercal }]^{\intercal }\), respectively, with each column characterized by \(t=p+1,\dots ,T-f+1\). Therefore, the dimension of both matrices is \(im \times (T-2i+1)\). The block-Hankel matrix \(\textbf{A}_f\) has the same general structure as \(\textbf{Z}_f\), with \(\textbf{a}_t\) instead of \(\textbf{z}_t\), while \(\textbf{X}_p:=[\textbf{x}_1, \textbf{x}_2, \ldots , \textbf{x}_{T-p-f+1}]\) and \(\textbf{X}_f:=[\textbf{x}_{p+1}, \textbf{x}_{p+2}, \ldots , \textbf{x}_{T-f+1}]\), which have n rows (from assumption A2) and the same number of columns as \(\textbf{Z}_f\), are the sequences of past and future states, respectively.
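As an illustration, the block-Hankel construction just described can be coded in a few lines (Python/NumPy sketch with zero-based indexing; the function name is ours):

```python
import numpy as np

def past_future_hankel(Z, i):
    """Build Z_p and Z_f from an m x T data matrix Z, with p = f = i.

    Z_f stacks [z_t; z_{t+1}; ...; z_{t+i-1}] and Z_p stacks
    [z_{t-i}; ...; z_{t-1}], for t = i+1, ..., T-i+1, so both are
    (i*m) x (T - 2*i + 1) block-Hankel matrices.
    """
    m, T = Z.shape
    ncols = T - 2 * i + 1
    Zf = np.vstack([Z[:, i + j: i + j + ncols] for j in range(i)])
    Zp = np.vstack([Z[:, j: j + ncols] for j in range(i)])
    return Zp, Zf
```

With a scalar series (m = 1) of length T = 6 and i = 2, each matrix has \(2 \times 3\) entries, and the Hankel (constant anti-diagonal) pattern is easy to see by inspection.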
Finally, matrices \(\varvec{\mathcal {O}}_i\) and \(\textbf{V}_i\) are known functions of the original parameter matrices, \(\varvec{\Phi }\), \(\textbf{K}\) and \(\textbf{H}\), see Section 2 of GarciaHiernaux et al. (2010), such that:
From assumption A1 and for sufficiently large i, \(\varvec{\mathcal {L}}^i_x\) tends to the null matrix. Consequently, substituting Eq. (10a) into (10b) gives:
$$\begin{aligned} \textbf{Z}_f=\varvec{\Theta }_i\textbf{Z}_p+\textbf{V}_i\textbf{A}_f, \end{aligned}$$
(15)
where \(\varvec{\Theta }_i:= \varvec{\mathcal {O}}_i\varvec{\mathcal {L}}_z\).
Equation (15) is the basis of subspace methods. It is a regression model which can be consistently estimated by LS because the columns in \(\textbf{V}_{i}\textbf{A}_f\) are uncorrelated with the regressor \(\textbf{Z}_p\) (from assumption A3).
Note that, for sufficiently large i, the matrix \(\varvec{\Theta }_i\) has reduced rank n, which is smaller than its dimension \(im \times im\). Estimating (15) therefore requires reduced-rank LS. The estimation problem can be written as:
where \(\Vert \cdot \Vert _F\) is the Frobenius norm. This problem is solved by computing:
and then applying an SVD (denoted as \(\overset{svd}{=}\)) to a weighted version of \(\varvec{\hat{\Theta }}_i\), to deal with the rank reduction. Van Overschee and De Moor (1995) proved that several subspace methods are equivalent to computing:
where \(\hat{\textbf{E}}_n\) denotes the approximation error arising from the use of a system order n, and \(\textbf{W}_1\), \(\textbf{W}_2\) are two nonsingular weighting matrices whose composition depends on the specific algorithm employed. While \(\textbf{W}_2 = (\textbf{Z}_p \textbf{Z}_p^{^{\intercal }})^{\frac{1}{2}}\) is common to most subspace methods (for systems without observable inputs), the choice of \(\textbf{W}_1\) differs; see Table 1.
Table 1 shows that the MOESP and N4SID subspace algorithms use the identity as weighting matrix. In contrast, when CVA is used, the singular values obtained from the SVD coincide with the canonical correlations between \(\textbf{Z}_f\) and \(\textbf{Z}_p\).
Alternatively, the weighting matrix used in this paper is an estimate of the inverse of the square root of the error covariance matrix, where \(\varvec{\Pi }^{\perp }_{\textbf{Z}_p}\) is the projection onto the orthogonal complement of \(\textbf{Z}_{p}\). As we will discuss later (see Sect. 2.3.3), adequately choosing these weightings is crucial for some asymptotic results.
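The reduced-rank LS problem and the weighted SVD just described can be sketched as follows (Python/NumPy; the function names are ours, and the weights follow the choice described above: an inverse square root of the error covariance for \(\textbf{W}_1\) and \((\textbf{Z}_p \textbf{Z}_p^{\intercal })^{1/2}\) for \(\textbf{W}_2\)):

```python
import numpy as np

def sym_power(A, p):
    """Power of a symmetric positive-definite matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * w ** p) @ V.T

def weighted_reduced_rank(Zf, Zp, n):
    """Unrestricted LS estimate of Theta_i, then a weighted SVD truncated to rank n."""
    Theta = Zf @ Zp.T @ np.linalg.inv(Zp @ Zp.T)              # LS solution
    # Projection onto the orthogonal complement of the row space of Zp
    P = np.eye(Zp.shape[1]) - Zp.T @ np.linalg.solve(Zp @ Zp.T, Zp)
    W1 = sym_power(Zf @ P @ Zf.T, -0.5)                       # inverse sqrt of error cov.
    W2 = sym_power(Zp @ Zp.T, 0.5)
    U, s, Vt = np.linalg.svd(W1 @ Theta @ W2)
    # Rank-n factor: an estimate of the extended observability matrix O_i
    O_hat = np.linalg.inv(W1) @ U[:, :n] @ np.diag(np.sqrt(s[:n]))
    return Theta, s, O_hat
```

With data simulated from a first-order system, the first singular value dominates the rest, which is the basis of the order-selection criteria discussed later.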
From Eq. (18), matrices in (7a–7b) are usually estimated following two main approaches. The first one uses the extended observability matrix, sometimes known as shift invariance approach, as it uses the shift invariance property of \(\varvec{\mathcal {O}}_i\). The second uses the estimates of the states sequence, and is usually termed state approach.
2.3.1 Shift invariance approach
The extended observability matrix in (13) can be estimated from (18) as \(\hat{\varvec{\mathcal {O}}}_i = \textbf{W}_1^{-1}\textbf{U}_n \textbf{S}_n^{1/2}\), where we use that \(\varvec{\Theta }_i:= \varvec{\mathcal {O}}_i \varvec{\mathcal {L}}_z\). Note that if a different factorization were used, the resulting estimates would be observationally equivalent.
From Arun and Kung (1990) and Verhaegen (1994), we can obtain the matrix \(\varvec{\Phi }\) in (7a–7b) as:
where \(\underline{\hat{\varvec{\mathcal {O}}}_{i}}\) and \(\overline{\hat{\varvec{\mathcal {O}}}_{i}}\) result from removing the last and first m rows, respectively, from \(\hat{\varvec{\mathcal {O}}}_{i}\), and \(\textbf{A}^{\dag }\) denotes the Moore–Penrose pseudo-inverse of \(\textbf{A}\) (see, e.g., Golub and Van Loan 1996). An estimate of \(\textbf{H}\) is obtained from the first m rows of \(\hat{\varvec{\mathcal {O}}}_{i}\).
Last, from the SVD (18) we also get:
and \(\hat{\textbf{K}}\) can be easily obtained using (12).^{Footnote 4}
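The shift-invariance recovery of \(\varvec{\Phi }\) and \(\textbf{H}\) just described can be sketched as follows (Python/NumPy; an illustration with our own helper name):

```python
import numpy as np

def shift_invariance(O_hat, m):
    """Recover Phi and H from an estimated extended observability matrix.

    O_hat is (i*m) x n. Since the submatrix obtained by dropping the first
    m rows equals the submatrix obtained by dropping the last m rows times
    Phi, a pseudo-inverse yields Phi; the first m rows of O_hat give H.
    """
    O_under = O_hat[:-m, :]                    # remove last m rows
    O_over = O_hat[m:, :]                      # remove first m rows
    Phi = np.linalg.pinv(O_under) @ O_over     # LS solution of O_under @ Phi = O_over
    H = O_hat[:m, :]
    return Phi, H
```

When the observability matrix is exact, the recovery is exact as well; with an estimated \(\hat{\varvec{\mathcal {O}}}_{i}\) the pseudo-inverse returns the LS fit.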
2.3.2 State approach
This second approach is based on the estimation of the state sequence, see e.g., Larimore (1990); Van Overschee and De Moor (1994). These algorithms estimate the model parameters by solving a simple set of LS problems.
From \(\hat{\varvec{\mathcal {L}}}_z\) obtained in (20), we can compute the state sequence as:
and similarly, \(\hat{\textbf{X}}_{f^+} = \hat{\varvec{\mathcal {L}}}_z \textbf{Z}_{p^+}\), where \(\textbf{Z}_{p^+}\) is like \(\textbf{Z}_p\), but adding a ‘\(+1\)’ to all the subscripts.^{Footnote 5} Once the state sequence is approximated, the estimates \((\hat{\textbf{H}},\varvec{\hat{\Phi }},\hat{\textbf{K}})\) are obtained as follows:
where \(\textbf{Z}_{f_{1}}\) denotes the first block row of \(\textbf{Z}_{f}\). Building on these estimates, we compute the residuals \(\varvec{\hat{A}}_{f_{1}} = \textbf{Z}_{f_{1}}-\hat{\textbf{H}}\hat{\textbf{X}}_f\) and, finally, get:
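These LS steps can be sketched as follows (Python/NumPy; an illustration under our own naming, not the authors' implementation):

```python
import numpy as np

def state_approach(Xf, Xf_plus, Zf1):
    """Estimate (H, Phi, K) from state sequences by two LS regressions.

    Zf1:     first block row of Z_f (the observations z_t).
    Xf:      estimated states x_t;  Xf_plus: estimated states x_{t+1}.
    """
    H = Zf1 @ np.linalg.pinv(Xf)                          # z_t = H x_t + a_t
    A = Zf1 - H @ Xf                                      # innovation estimates
    PhiK = Xf_plus @ np.linalg.pinv(np.vstack([Xf, A]))   # x_{t+1} = Phi x_t + K a_t
    n = Xf.shape[0]
    return H, PhiK[:, :n], PhiK[:, n:], A
```

Feeding the routine exact simulated states from a first-order system recovers the true parameters up to sampling error.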
2.3.3 Asymptotic properties
This section briefly discusses the asymptotic properties of subspace methods.
The estimates resulting from the main subspace methods are known to be strongly consistent and asymptotically normal under mild assumptions. Bauer (2005a) surveys the literature on this subject and gives conditions for consistency and asymptotic normality for many subspace methods. This is done for the system (7a–7b) and also when observable inputs are included.
An effort has also been made to study the efficiency of these estimators, e.g., by comparing their asymptotic variances. In contrast with consistency and asymptotic normality, here we find some differences among the methods. Bauer and Jansson (2000) and Bauer and Ljung (2002) provide numerical approximations of the asymptotic variance of the estimators for several algorithms. The most relevant result in this line is presented by Bauer (2005b), who proves that the CVA weighting matrices (see Table 1), together with the state approach (see Sect. 2.3.2), lead to estimates that are asymptotically equivalent to (pseudo) MLE,^{Footnote 6} this being the only asymptotic efficiency result currently available in the literature.^{Footnote 7}
The algorithm suggested in this paper, named SUBEST1 (see Sect. 3.2 for details), follows a two-stage approach with the weighting matrix \(\textbf{W}_2\) given in Table 1. Therefore, the theorem in Bauer (2005b) cannot be directly applied and we do not have a formal proof of its asymptotic efficiency. For this reason, we conduct a simulation where the mean-square errors of SUBEST1 and ML (Casals et al. 1999) estimates are computed. Additionally, the results obtained in Deistler et al. (1995) for other subspace methods and the same data generating process (DGP) are included for comparison. The model (7a–7b) considered in the simulation is:
with \(\textbf{a}_t \sim iid N(\textbf{0},\textbf{I})\). Following Deistler et al. (1995), the exercise is performed for \(T=2000,4000,8000,16000\) and \(R=500\) replications for each sample size, so that the results are comparable. We consider two measures for the quality of SUBEST1 estimates:
and
where \(\hat{\theta }^{\text {SUBEST1}}_i, \hat{\theta }^{\text {ML}}_i\) denote the SUBEST1 and ML estimates, while \(\theta _i\) denotes the true values given in (24). The quantity M(.) is T times the sample mean-squared error of the estimate. For asymptotically efficient methods, M(.) converges to the Cramér-Rao Bound (CRB), which is a lower bound for the asymptotic variance of the parameter estimates. On the other hand, D(.) is T times the mean-squared difference between the corresponding method's and the ML estimates. For an asymptotically efficient method, \(D(.) \rightarrow 0\) as \(T \rightarrow \infty \).
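For concreteness, the two quality measures can be computed as follows (Python/NumPy sketch; function names are ours):

```python
import numpy as np

def M(estimates, theta_true, T):
    """T times the sample mean-squared error of the estimates of one parameter."""
    e = np.asarray(estimates, dtype=float)
    return T * np.mean((e - theta_true) ** 2)

def D(estimates, ml_estimates, T):
    """T times the mean-squared difference between a method's and the ML estimates."""
    e = np.asarray(estimates, dtype=float)
    ml = np.asarray(ml_estimates, dtype=float)
    return T * np.mean((e - ml) ** 2)
```

Both statistics are computed per parameter over the R replications and then averaged across parameters.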
The results of the simulation are summarized in Table 2 and Fig. 1. Table 2 shows the average of the M(.) and D(.) statistics for all the parameters in (24), for SUBEST1, the echelon subspace estimator (ECH) and CVA, computed for the same example by Deistler et al. (1995), and, finally, the corresponding CRB. The results suggest that SUBEST1, contrary to ECH, yields estimates that are asymptotically as efficient as those obtained by ML or CVA; see Table 2. Figure 1 shows that M(SUBEST1) \(\rightarrow \) CRB and D(SUBEST1) \(\rightarrow 0\) as \(T \rightarrow \infty \). This conclusion was confirmed by many other simulations (see the online Appendix at https://doi.org/10.100/s00362023014514), and no counterexamples have been found. These results support the use of SUBEST1 in the modeling method presented in the rest of the paper.
2.4 Relationships between SS and VARMA\(_E\) representations
As with VARMA models, the IF of a given dynamic system is not unique. Note that, for any nonsingular matrix T, applying the transformation \(\textbf{x}_t^*=\textbf{T}^{-1}\textbf{x}_t\), \(\varvec{\Phi }^*=\textbf{T}^{-1}\varvec{\Phi }\textbf{T}\), \(\varvec{\Gamma }^*=\textbf{T}^{-1}\varvec{\Gamma }\), \(\textbf{E}^*=\textbf{T}^{-1}\textbf{E}\), \(\textbf{H}^*=\textbf{H}\textbf{T}\) to any IF yields an alternative output-equivalent representation.
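Output equivalence under such a change of basis is easy to verify by simulation. The sketch below (Python/NumPy; function names are ours) applies \(\textbf{T}\) to the IF matrices \((\varvec{\Phi },\textbf{K},\textbf{H})\) and feeds both forms the same innovations, producing identical outputs:

```python
import numpy as np

def transform_if(Phi, K, H, T):
    """Change of basis x* = T^{-1} x applied to an innovations form."""
    Ti = np.linalg.inv(T)
    return Ti @ Phi @ T, Ti @ K, H @ T

def simulate_if(Phi, K, H, a, x0):
    """Run the IF recursion z_t = H x_t + a_t, x_{t+1} = Phi x_t + K a_t."""
    x = x0.copy()
    z = np.empty((H.shape[0], a.shape[1]))
    for t in range(a.shape[1]):
        z[:, [t]] = H @ x + a[:, [t]]
        x = Phi @ x + K @ a[:, [t]]
    return z
```

Since \(\textbf{H}\textbf{T}\,\textbf{T}^{-1}\textbf{x}_t=\textbf{H}\textbf{x}_t\), the outputs of the two parameterizations coincide exactly.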
A canonical IF representation is characterized by two elements, a specific structure of the transition matrix \(\varvec{\Phi }\) and a unique transformation matrix T. The main property of canonical representations is that they realize the system output as a function of a unique parameter set and, therefore, are exactly identified.^{Footnote 8}
Hannan and Deistler (1988) show the equivalence between the VARMA\(_E\) and the echelon IF by analyzing the structure of linear dependence relations between the rows of the Hankel matrix \(\mathcal {H}=\varvec{\mathcal {O}}\varvec{\mathcal {C}}\), where \(\varvec{\mathcal {O}}\) is the infinite version of (13) and \(\varvec{\mathcal {C}}:=[\textbf{K},\varvec{\Phi }\textbf{K},\varvec{\Phi }^2\textbf{K},...]\); see Hannan and Deistler (1988), Theorems 2.5.1 and 2.5.2.
In this paper we use the SS Luenberger Canonical Form (LCF).^{Footnote 9} This representation is convenient because it is easy to obtain from it the equivalent VARMA\(_E\) representation. Particularly, Proposition 2 in Casals et al. (2012) proves that if the coefficients of an IF process are known, then: (1) the KI vector can be derived from the observability matrix (13), and (2) the VARMA\(_E\) representation can be obtained from a change of coordinates that transforms any IF into its corresponding LCF. We will use this result in Sect. 3.2.
3 Modeling stationary datasets
This section details the main steps of our modeling strategy, which consist in: (i) determining the system order, (ii) estimating the KIs compatible with the previous decision, and (iii) refining the resulting model.
3.1 Step 1: Estimating the system order and a preliminary model
In our approach, the first step consists in determining the minimum dynamic order (or McMillan degree) required to realize \(\textbf{z}_t\).
The following remarks and examples may help to clarify this concept:

1.
A model with \(n=0\) would be static, as its output could be realized by 0 dynamic components;

2.
...when \(n=3\) the observable time series can be realized as a linear combination of three firstorder dynamic components;

3.
...the order of a standard VARMA(p,q) process is \(m \times \max \{\,p,q\}\); and,

4.
...the output of an n-order system can also be realized by systems of order \(n+1\), \(n+2\)... Hence, a unique characterization of the McMillan degree requires the dimension of the system to be minimal (see assumption A2).
The literature typically determines n by estimating a sequence of SS models over a predetermined search space for the McMillan degree, \(n = 0, 1, 2...\) These models can be efficiently estimated using subspace techniques, see Favoreel et al. (2000); Bauer (2005a), and compared with each other by means of: (a) LR tests, see Tsay (1989a); Tiao and Tsay (1989), (b) information criteria, see Akaike (1974); Schwarz (1978); Hannan and Quinn (1979), or (c) canonical correlation criteria, see Bauer (2001); GarciaHiernaux et al. (2012).
We determine the McMillan degree automatically by following the modal strategy suggested by GarciaHiernaux et al. (2012), which consists in determining n as the modal value of the orders chosen by a selection of the methods mentioned above. We refer to this procedure as NID (N-IDentification) and the following example illustrates its application.
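The modal strategy can be sketched as follows (Python; this is our own simplified illustration of the modal vote, using only the three classical ICs rather than the full NID criterion set):

```python
import numpy as np
from collections import Counter

def pick_order(logliks, n_params, T, criteria=("AIC", "SBC", "HQ")):
    """Modal order choice: each information criterion votes for the candidate
    order j (j = 0, 1, 2, ...) minimizing its value, and the system order is
    taken as the mode of those votes.

    logliks[j] and n_params[j] are the log-likelihood and number of free
    parameters of the model estimated with order n = j.
    """
    penalty = {"AIC": 2.0, "SBC": np.log(T), "HQ": 2.0 * np.log(np.log(T))}
    votes = {}
    for c in criteria:
        ic = [-2.0 * ll + penalty[c] * k for ll, k in zip(logliks, n_params)]
        votes[c] = int(np.argmin(ic))
    modal_n = Counter(votes.values()).most_common(1)[0][0]
    return modal_n, votes
```

In practice NID also includes canonical-correlation criteria and LR tests in the vote, as illustrated in the next example.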
3.1.1 Example 2: Estimating the Step 1 model
Consider again the VARMA(1,1) model (5) used in Example 1. We simulate 100 observations of this process and apply to them the previously mentioned NID algorithm, obtaining the results displayed in Table 3, where the acronyms AIC, SBC and HQ denote the Akaike (1974), Schwarz (1978) and Hannan and Quinn (1979) information criteria, respectively, \(SVC_{\Omega _2}\) and NIDC denote the canonical correlation-based criteria proposed by GarciaHiernaux et al. (2012) and, last, \(\chi ^2\)(5%) denotes the p-value of the LR test for significance of the canonical correlations proposed by Tsay (1989a). The 5% value in parentheses indicates that automatic testing rejects the null if the p-value is smaller than 5%. Finally, the orders automatically chosen by each criterion are displayed in the last row of Table 3. Note that the correct order (\(n=1\)) was the modal choice, and the only criterion that overestimates n is the AIC.
After determining the system order, it is easy to apply any of the subspace methods described in Sect. 2.3 to estimate the parameters of an n-order IF (7a–7b). In this case, we use a shift invariance approach with \(\textbf{W}_1=(\textbf{Z}_f \varvec{\Pi }^{\perp }_{\textbf{Z}_p} \textbf{Z}_f^{^{\intercal }})^{-\frac{1}{2}}\) and \(i=5\).^{Footnote 10} We will call this a “Step 1 model”, and the estimated parameter matrices for this sample are:
Chapter 9 in Casals et al. (2016) describes a simple numerical procedure to write an IF as an equivalent balanced VARMA(n,n) model.^{Footnote 11} Applying this procedure to the previous IF returns the standard VARMA(1,1) model:
whose values are very close to the true coefficients in (5).
In the univariate case, the Step 1 model is a canonical representation which can only be refined by pruning non-significant parameters, see Sect. 3.3. In the multivariate case, assuring a canonical representation requires further refinement; we discuss this issue in the next section. However, even if this Step 1 model may be somewhat overparameterized, it might be sufficient to achieve the objectives of the model-building exercise, in which case the modeling process could stop here.
3.2 Step 2: Estimating the KIs and the corresponding canonical model
We already mentioned that IF and VARMA representations are interchangeable. In particular, Casals et al. (2012) show how to obtain the VARMA\(_E\) representation of any IF when the matrices of the system and the KI, \(\alpha \), are known.
However, the equivalence issue remains unsolved when the IF coefficients are not known but estimated from a finite sample. In that case the estimated observability matrix does not present the linear dependence structure implied by \(\alpha \), so the similarity transformation proposed in Casals et al. (2012) does not lead to the LCF and, hence, to the equivalent VARMA\(_E\).
To solve this issue, we devised SUBEST1, a novel subspace algorithm that estimates the LCF model corresponding to any given KI vector \(\alpha \). The estimated parameters resulting from SUBEST1 coincide with those of a VARMA\(_E\) form. The algorithm follows a two-stage procedure: initial estimates of the system matrices (obtained with a shift invariance approach, see Sect. 2.3.1) are used to compute a transformation matrix \(\varvec{\hat{T}}\); we then use this matrix to transform the states and, finally, apply a state approach (see Sect. 2.3.2) to estimate the echelon form. As far as we know, this procedure is new in the literature and is the main contribution of the paper.
To describe the steps of SUBEST1, we will use the subindex notation {i:j}, meaning that any matrix \(\textbf{A}_{i:j}\) is the partition of \(\textbf{A}\) that includes the rows from i to j.
Step 1:

Assume that n is known or consistently estimated using the procedures presented in Sect. 3.1. Estimate \(\varvec{\Phi }\), \(\textbf{K}\) and \(\textbf{H}\) using the subspace methods presented in Sect. 2.3.1 and the system order n. We denote these estimates by \(\hat{\varvec{\Phi }}\), \(\hat{\textbf{K}}\) and \(\hat{\textbf{H}}\).
Step 2:

For a given KI vector \(\alpha \), apply Proposition 2 in Casals et al. (2012) to obtain the transformation matrix \(\textbf{T}\) such that \(\varvec{\Phi }^*=\textbf{T}^{-1}\varvec{\Phi }\textbf{T}\) and \(\textbf{H}^*=\textbf{H}\textbf{T}\) present the LCF structure when \(\alpha \), \(\varvec{\Phi }\) and \(\textbf{H}\) are known. Note that, even if the \(\alpha \) imposed as a constraint is the true KI vector, this does not lead to the exact ones/zeros structure of the LCF, as \(\textbf{T}\) is computed with \(\varvec{\hat{H}}\) and \(\varvec{\hat{\Phi }}\), and not with the true (unknown) matrices of the system.
Step 3:

Obtain the states \(\textbf{X}_f\) and \(\textbf{X}_{f^+}\) from the model estimated in Step 1, as explained in Sect. 2.3.2, and denote these estimates by \(\hat{\textbf{X}}_f\) and \(\hat{\textbf{X}}_{f^+}\), respectively. Then compute the state sequences \(\hat{\textbf{X}}^*_{f}\) and \(\hat{\textbf{X}}^*_{f^+}\) corresponding to the LCF by applying the similarity transformation \(\textbf{T}\): \(\hat{\textbf{X}}^*_{f}=\textbf{T}^{-1}\hat{\textbf{X}}_{f}\) and \(\hat{\textbf{X}}^*_{f^+}=\textbf{T}^{-1}\hat{\textbf{X}}_{f^+}\).
 Step 4:

The LCF is defined by the matrices \(\varvec{\Phi }^*\) and \(\textbf{H}^*\) as:
$$\begin{aligned} \varvec{\Phi }^*=\begin{bmatrix} \textbf{F}_1 &{} \textbf{Q}_1 &{} \textbf{0} &{}\ldots &{} \textbf{0}\\ \textbf{F}_2 &{} \textbf{0} &{} \textbf{Q}_2 &{}\ldots &{} \textbf{0}\\ \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots \\ \textbf{F}_{n_{\max }-1} &{} \textbf{0} &{} \textbf{0} &{}\ldots &{} \textbf{Q}_{n_{\max }-1} \\ \textbf{F}_{n_{\max }} &{} \textbf{0} &{} \textbf{0} &{}\ldots &{} \textbf{0} \end{bmatrix} \quad \text {and} \quad \textbf{H}^*=\big [\textbf{F}_{0} \quad \textbf{0} \quad \textbf{0} \big ]. \end{aligned}$$(27)where \(\varvec{\Phi }^*\) is a companion matrix such that each \(\textbf{F}_j\) block \((j=1,...,n_{\max })\) has a number of rows equal to the number of KIs greater than or equal to j, \(\bar{m}=\sum _{k=1}^m \min \{n_k,1\}\) columns and some null elements. In fact, the (k, l)th element of \(\textbf{F}_{j}\) will be nonzero only if \(j \in \big [n_k-n_{kl}+1, n_k\big ]\), where \(n_{kl}\) was defined in (4). Each \(\textbf{Q}_k\) block is a zeros/ones matrix, with as many columns as the number of KIs greater than or equal to \(k+1\). If the endogenous variables are sorted according to their corresponding KI, the structure of \(\textbf{Q}_k\) will be \(\textbf{Q}_k=\big [\textbf{I}_{k+1} \quad \textbf{0} \big ]^{^{\intercal }}\), where \(\textbf{I}_{k+1}\) is an identity matrix with the same number of rows as \(\textbf{F}_{k+1}\).
Regarding \(\textbf{H}^*\), \(\textbf{F}_{0}\) is an \(m \times \bar{m}\) matrix, such that the rows corresponding to components with a nonzero KI can be organized in an \(\bar{m} \times \bar{m}\) lower triangular matrix with ones in the main diagonal.
Therefore, \(\varvec{\Phi }^*\) and \(\textbf{H}^*\) can be written as:
$$\begin{aligned} \varvec{\Phi }^*=\big [\varvec{\Phi }^*_{1:\bar{m}}~\varvec{\Phi }^*_{\bar{m}+1:n}\big ], \quad \textbf{H}^*=\begin{bmatrix} \bar{\textbf{H}}^*_1+\textbf{I} &{} \textbf{0} \\ \textbf{H}^*_2 &{} \textbf{0} \end{bmatrix} \end{aligned}$$(28)where \(\bar{\textbf{H}}^*_1=\textbf{H}^*_1 - \textbf{I}\) and the matrices \(\varvec{\Phi }^*_{1:\bar{m}}\), \(\bar{\textbf{H}}^*_1\) and \(\textbf{H}^*_2\) include parameters to be estimated and zeros. On the other hand, the matrix \(\varvec{\Phi }^*_{\bar{m}+1:n}\) is invariant and is only made up of ones and zeros. Therefore, estimating \(\varvec{\Phi }^*\) and \(\textbf{H}^*\) boils down to estimating some elements in their first \(\bar{m}\) columns, i.e., some elements in \(\varvec{\Phi }^*_{1:\bar{m}}\), \(\bar{\textbf{H}}^*_1\) and \(\textbf{H}^*_2\).
 Step 5:

Building on the estimates of the states obtained in Step 3, and the LCF structure of the matrices \(\varvec{\Phi }^*\) and \(\textbf{H}^*\) determined in Step 4, estimate the LCF corresponding to the given KI vector \(\alpha \). To this end, formulate the SS model:
$$\begin{aligned} \textbf{X}^*_{f^+}&=\big [\varvec{\Phi }^*_{1:\bar{m}}~\varvec{\Phi }^*_{\bar{m}+1:n}\big ] \begin{bmatrix} \textbf{X}^*_{f,1:\bar{m}}\\ \textbf{X}^*_{f,\bar{m}+1:n} \end{bmatrix}+\textbf{K}^*\textbf{A}_{f_1}, \end{aligned}$$(29a)$$\begin{aligned} \textbf{Z}_{f_1}&=\begin{bmatrix} \bar{\textbf{H}}^*_1+\textbf{I} &{} \textbf{0} \\ \textbf{H}^*_2 &{} \textbf{0} \end{bmatrix} \begin{bmatrix} \textbf{X}^*_{f,1:\bar{m}}\\ \textbf{X}^*_{f,\bar{m}+1:n} \end{bmatrix} + \textbf{A}_{f_1}, \end{aligned}$$(29b)where \(\textbf{Z}_{f_1}\) and \(\textbf{A}_{f_1}\) are defined in Eq. (23). Rearranging terms in (29a–29b) yields:
$$\begin{aligned} \tilde{\textbf{X}}^*_{f^+}&=\varvec{\Phi }^*_{1:\bar{m}}\textbf{X}^*_{f,1:\bar{m}}+\textbf{K}^*\textbf{A}_{f_1} \end{aligned}$$(30a)$$\begin{aligned} \tilde{\textbf{Z}}_{f_1}&=\begin{bmatrix} \bar{\textbf{H}}^*_1 \\ \textbf{H}^*_2 \end{bmatrix}\textbf{X}^*_{f,1:\bar{m}}+\textbf{A}_{f_1}, \end{aligned}$$(30b)where \(\tilde{\textbf{X}}^*_{f^+} = \textbf{X}^*_{f^+} - \varvec{\Phi }^*_{\bar{m}+1:n}\textbf{X}^*_{f,\bar{m}+1:n}\) and \(\tilde{\textbf{Z}}_{f_1} = \textbf{Z}_{f_1} - [ \textbf{I} \quad \textbf{0} ]^{^{\intercal }} \textbf{X}^*_{f,1:\bar{m}}\). Notice that estimating \(\varvec{\Phi }^*_{1:\bar{m}}\), \(\bar{\textbf{H}}^*_1\), \(\textbf{H}^*_2\) and \(\textbf{K}^*\) comes down to applying Eqs. (22–23), but substituting \(\textbf{Z}_{f_1}\), \(\hat{\textbf{X}}_{f^+}\) and \(\hat{\textbf{X}}_{f}\) by, respectively, \(\tilde{\textbf{Z}}_{f_1}\), \(\hat{\textbf{X}}^*_{f^+}\) and \(\hat{\textbf{X}}^*_{f}\), where the last two matrices were obtained in Step 3.
Therefore, the estimates of \(\varvec{\Phi }^*\) and \(\textbf{H}^*\) given in (28) and obtained as the LS solution of (30a–30b) keep the LCF structure.^{Footnote 12}
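To make the layout in (27) concrete, the invariant skeleton of \(\varvec{\Phi }^*\) can be built mechanically from a KI vector \(\alpha \). The following sketch (in Python; `lcf_layout` is a hypothetical helper written for illustration, not part of SUBEST1) marks the free entries of the \(\textbf{F}_j\) blocks with NaN and fills in the ones/zeros \(\textbf{Q}_k\) blocks, assuming the endogenous variables are sorted by decreasing KI:

```python
import numpy as np

def lcf_layout(alpha):
    """Build the invariant ones/zeros skeleton of Phi* in (27) for a KI
    vector alpha, leaving the free F_j blocks as NaN placeholders.
    This sketches the layout only; the free parameters themselves are
    estimated by the LS problem (30a-30b)."""
    alpha = sorted(alpha, reverse=True)  # sort KIs in decreasing order
    n = sum(alpha)                       # system order
    n_max = max(alpha)
    # r[j-1] = number of KIs >= j, i.e., the row count of block F_j
    r = [sum(1 for a in alpha if a >= j) for j in range(1, n_max + 1)]
    m_bar = r[0]                         # number of nonzero KIs
    phi = np.zeros((n, n))
    row = 0
    for j in range(n_max):
        phi[row:row + r[j], :m_bar] = np.nan          # free block F_{j+1}
        if j < n_max - 1:
            col = m_bar + sum(r[1:j + 1])             # offset of Q_{j+1}
            # Q_{j+1} = [I  0]^T: identity with as many rows as F_{j+2}
            phi[row:row + r[j + 1], col:col + r[j + 1]] = np.eye(r[j + 1])
        row += r[j]
    return phi
```

For \(\alpha =(2,1,1)\), for example, this returns a \(4\times 4\) companion skeleton whose only fixed nonzero entry is the single one of \(\textbf{Q}_1\); the first \(\bar{m}=3\) columns are left free, in agreement with (28).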
In practice, \(\alpha \) is not known. However, it can be empirically determined bearing in mind that the KIs in \(\alpha \) must add up to \(n\), and that models with different \(\alpha \) are not nested in general, see Example 1, Remarks 3 and 5. These remarks motivate our proposal, which consists in estimating the models corresponding to all the combinations of KIs which add up to the system order, already specified in the previous phase, and choosing the model that optimizes a given information criterion. This idea can be formalized by the following minimization problem:
$$\begin{aligned} \min _{n_1,\ldots ,n_m \,:\, \sum _{k=1}^{m} n_k=n} \; \log \big | \hat{\varvec{\Sigma }} (n_1,\ldots ,n_m) \big | + C(T)\, d(n_1,\ldots ,n_m), \end{aligned}$$(31)
where \(\hat{\varvec{\Sigma }} (n_1,\ldots ,n_m)\) is an estimate of the error covariance matrix, C(T) is a penalty function that determines which information criterion is optimized, and \(d(n_1,\ldots ,n_m)\) is the number of parameters of the corresponding echelon form. The usual alternatives are considered for the penalty term C(T), i.e., 2/T (AIC), \(\log (T)/T\) (SBC), and \(2\log (\log (T))/T\) (HQ).
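For concreteness, the objective in (31) can be evaluated for one candidate KI vector as follows. This is a minimal sketch with a hypothetical name, `info_criterion`; the covariance estimate and parameter count are assumed to be supplied by the estimation step:

```python
import numpy as np

def info_criterion(sigma_hat, d, T, penalty="SBC"):
    """Evaluate log|Sigma_hat| + C(T) d, where C(T) is the penalty of the
    chosen criterion: AIC (2/T), SBC (log(T)/T) or HQ (2 log(log(T))/T)."""
    C = {"AIC": 2.0 / T,
         "SBC": np.log(T) / T,
         "HQ": 2.0 * np.log(np.log(T)) / T}[penalty]
    # slogdet avoids overflow/underflow in the determinant computation
    sign, logdet = np.linalg.slogdet(np.asarray(sigma_hat))
    return logdet + C * d
```

The candidate minimizing this value is selected. Note that for moderate T the penalties order as AIC < HQ < SBC, so SBC favors the most parsimonious specifications.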
Some authors have previously developed procedures to estimate \(\alpha \) and its associated echelon form. For instance, Hannan and Rissanen (1982); Hannan and Kavalieris (1984a); Tsay (1989b); Poskitt (1992, 2016); Nsiri and Roy (1992); Lütkepohl (2005), among others, deal with the problem using a VARMA representation. To the best of our knowledge, only Peternell (1995) and Peternell et al. (1996) propose subspace methods to this end. However, most of these methods are not statistically efficient in general, see Deistler et al. (1995), and some are computationally expensive or difficult to automate (Nsiri and Roy 1992). If SUBEST1 is asymptotically as efficient as MLE, as we conjectured in Sect. 2.3.3 based on simulation results, this would make it an attractive procedure, with both computational simplicity and statistical optimality.
On this basis, the methodology proposed here returns both the KIs resulting from the optimization problem (31) and the estimates of its associated LCF. It is structured in three steps:

1.
Determine the system order, n, by means of the NID procedure.

2.
Compute all the KI vectors which add up to n, and estimate their corresponding canonical models using SUBEST1.

3.
Select the KIs associated with the model that achieves the best ICs.
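The candidate set in step 2 above is just the set of compositions of n into m nonnegative parts, which is easy to enumerate; `ki_candidates` is a hypothetical helper name used only for illustration:

```python
from itertools import product

def ki_candidates(n, m):
    """All KI vectors (n_1, ..., n_m) of nonnegative integers adding up
    to the system order n: the search space of problem (31)."""
    return [alpha for alpha in product(range(n + 1), repeat=m)
            if sum(alpha) == n]
```

For n = 4 and m = 3 this yields only 15 candidates, a much smaller set than an unconstrained search over all KI combinations.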
To illustrate the performance of the algorithm, we simulated 5000 replications of a trivariate system with \(\alpha =(2,1,1)\), which is treated as unknown. Table 4 shows the relative frequency with which each \(\alpha \) was estimated. Note that the procedure performs well even for relatively small samples and, as expected, its performance improves as T increases. Moreover, the relative number of cases in which the system order is misspecified (last three \(\alpha \) vectors in Table 4) is very small, so the remarkable reduction in the number of models to be estimated by constraining the KIs to add up to \(n\) does not have a significant cost although, obviously, this will depend on the DGP. Additionally, Table 5 presents the DGP and the average estimates of its parameters for different sample sizes.
3.3 Step 3: Refining the Step 2 model
The third and final step consists in refining the canonical model obtained in Step 1, in the univariate case, or in Step 2, in the multivariate case, by different procedures such as:

1.
Applying NID to the model residuals. If the model is statistically adequate, the order identified should be \(n=0\). A higher McMillan degree would imply that the residuals have a dynamic structure, i.e., they are autocorrelated, so the model should be respecified.

2.
Estimating the models by Gaussian ML for improved efficiency, if the normality assumption holds.

3.
Excluding nonsignificant parameters to achieve further parsimony. Estimation and testing in this phase can be done by ML methods or iterative subspace-based techniques (see Garcia-Hiernaux et al. 2009).
Note that Step 3 seeks parsimony. While this is a desirable feature, it is somewhat balanced by the more complex procedures and modeling decisions required.
We illustrate Steps 2 and 3 with the same data used in Example 2.
3.3.1 Example 3: Estimating and refining a canonical representation
We applied the procedure described in Sect. 3.2 to determine \(\alpha \) in the simulated data already used in Example 2, with the constraint that the system order is \(n=1\). Again we set \(i=\log (T)=5\) in SUBEST1, which returns SBC values of 5.90 and 6.02 for \(\alpha =(1,0)\) and \(\alpha =(0,1)\), respectively. So, the \(\alpha =(1,0)\) final model in VARMA\(_E\) form, estimated by Gaussian ML, is:
very similar to (6). The standard errors are in parentheses. The residual diagnostics did not reveal any symptom of misspecification.
4 Nonstationary and seasonal processes
We will now discuss the extension of the previous analysis to nonstationary and seasonal variables. Instead of modeling the nonstationary process directly (as in Bauer 2005c), we follow the traditional strategy of identifying the integration order of each variable (or the cointegration rank and vectors) and then applying the corresponding transformation to induce stationarity.
4.1 Nonstationarity
Assume now that the system output, \(\textbf{y}_t \in {{\mathbb {R}}^m}\), is nonstationary. The methods described in Sect. 3 require stationarity, so we will first proceed to determine how to transform \(\textbf{y}_t\) into a stationary vector \(\textbf{z}_t\).
The literature provides many tools to this end, ranging from graphical assessment (Box et al. 2015), to univariate unit root testing (Dickey and Fuller 1981; Kwiatkowski et al. 1992) and cointegration testing (Johansen 1991). Subspace approaches have also been applied to determine the stationary transformation. In this case, the methods build on analyzing the canonical correlation coefficients (\(\sigma _j\), \(j=1,\ldots ,i\)) between the past and future information subspaces (\(\textbf{Z}_p\) and \(\textbf{Z}_f\), respectively, see Sect. 2.3). Bauer and Wagner (2002) and Bauer (2005c) extend the stationary theory to analyze I(1) and cointegrated variables within this framework. In a later work, Garcia-Hiernaux et al. (2010) discuss nonstationarity and cointegration analysis along the same lines. The main idea in these papers is that the estimated \(\sigma _j\) corresponding to unit roots converge to their true values (unity) much faster than the other canonical correlation coefficients. This property, known as superconsistency, allows one to distinguish both kinds of correlations.
Here we use the latter method because it is simple and can be implemented within the aforementioned NID procedure. Specifically, Garcia-Hiernaux et al. (2010) propose the following unit root criterion (URC) to identify the \(\sigma _j\) corresponding to unit roots:
$$\begin{aligned} 1-\hat{\sigma }_j^2 \le G(T,i,d). \end{aligned}$$(32)
This criterion concludes that \(\sigma _j=1\) when the distance between \(\hat{\sigma }_j^2\) and unity is smaller than (or equal to) the penalty function, G(T, i, d), which depends on the sample size, T, the row dimension of \(\textbf{Z}_p\), i, and the number of unit roots we wish to evaluate, d. This threshold is computed through Monte Carlo simulations.
In the univariate case, \(m=1\), we apply the URC to determine d, so that \(z_t=\nabla ^d y_t\) is stationary. We could then proceed as described in Sect. 3 with \(z_t\).
In the multivariate case, \(m>1\), the procedure becomes more complex, as potential cointegrating relations come into play. Particularly, when \(\textbf{y}_t\) is made up of m I(1) processes, the cointegrating rank is \(c = m-d\), where d is the number of unit roots of the multivariate process. On this basis, we determine the cointegrating rank using the following algorithm:

1.
Check that the series are I(1), applying the URC to every single series.

2.
Estimate the number of unit roots of the multivariate process, \(\hat{d}\), applying the URC to the multivariate process.

3.
Calculate the cointegrating rank as the difference between the system dimension and the number of unit roots estimated in the previous step, i.e., \(\hat{c} = m - \hat{d}\).
When the procedure above suggests a cointegrating relation (\(\hat{c}>0\)), Garcia-Hiernaux et al. (2010) show how to estimate the cointegrating matrix of \(\textbf{y}_t\). This matrix represents the linear combinations of \(\textbf{y}_t\) which are stationary and, after applying this transformation, we proceed as described in Sect. 3.
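In code, the algorithm above amounts to counting the canonical correlations that pass the URC and subtracting. In this sketch the threshold `G` stands in for the simulated penalty G(T, i, d); taking it as a fixed user-supplied number is a simplifying assumption, and `coint_rank` is a hypothetical helper name:

```python
def coint_rank(sigma, m, G):
    """Apply the URC rule 1 - sigma_j^2 <= G to the estimated canonical
    correlations (sorted in decreasing order) and return (d, c): the
    number of unit roots and the implied cointegrating rank c = m - d."""
    d = sum(1 for s in sigma if 1 - s ** 2 <= G)
    return d, m - d
```

For instance, with correlations (.999, .97, .60) and a tight threshold, only the first coefficient is flagged as a unit root, so a trivariate system would have cointegrating rank 2.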
4.2 Seasonality
The methodology in Sect. 3 can be extended to model the multiplicative seasonal structure of a process by using systems in cascade; see Chapter 7 in Casals et al. (2016). However, as in the previous section, the adequate transformation to induce stationarity must be decided first. We do not consider seasonal cointegration here, so we will address this problem in the univariate case.
Assume that a given series, \(y_t\), has a seasonal dynamic structure of period s, meaning that there are s observations per seasonal cycle. Following Box et al. (2015), we denote the regular difference operator by \(\nabla =1-B\), and the seasonal difference operator by \(\nabla _s=1-B^s\). In seasonal processes, we would first compute the seasonal sum of \(y_t\), defined as:
$$\begin{aligned} w_t=(1+B+\cdots +B^{s-1})\, y_t, \end{aligned}$$(33)
and then apply NID to detect unit roots at the zero frequency in \(w_t\), \(\nabla w_t\), \(\nabla ^2 w_t\)... If a single unit root in \(w_t\) is detected, the stationary transformation would be the seasonal difference:
$$\begin{aligned} z_t=\nabla w_t=\nabla _s\, y_t. \end{aligned}$$(34)
Similarly, detecting two unit roots in \(w_t\) would imply that the stationary transformation is:
$$\begin{aligned} z_t=\nabla ^2 w_t=\nabla \nabla _s\, y_t, \end{aligned}$$(35)
and so on. After the stationary transformation is identified and applied, one must run the NID procedure to determine, first, regular stationary structures of order \(n=0,1,2,\dots \), and then seasonal structures in the range \(n=0,s,2s,3s,\ldots \)
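As a quick numerical check of the identity \(\nabla w_t = \nabla _s y_t\) used above, the seasonal sum can be computed directly; `seasonal_sum` is a hypothetical helper written for this sketch, not an \(E^4\) function:

```python
import numpy as np

def seasonal_sum(y, s):
    """w_t = (1 + B + ... + B^{s-1}) y_t: the moving sum of s consecutive
    observations (the first s - 1 values of the series are lost)."""
    y = np.asarray(y, dtype=float)
    return np.array([y[t - s + 1:t + 1].sum() for t in range(s - 1, len(y))])
```

Differencing the seasonal sum once recovers the seasonal difference: `np.diff(seasonal_sum(y, s))` equals `y[s:] - y[:-s]`, i.e., \(\nabla w_t = \nabla _s y_t\).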
After fitting univariate models to each series in a dataset, one can cope with a vector of seasonal time series by: (a) filtering out the seasonal structure of each series using, e.g., the exact decomposition in Casals et al. (2002), and then (b) modeling the seasonally adjusted vector using the approach described in Sects. 3 and 4.1. After doing this, one could also estimate a final model combining the regular specification with a seasonal factor, given by seasonal AR, MA and transformation matrices whose main diagonals contain the corresponding factors of the univariate seasonal models.
Finally, as mentioned above, we do not address the case of seasonal cointegration. Recently, Bauer and Buschmeier (2021) went further in this direction, proving that CVA, see Sect. 2.3, provides consistent estimators for long-run and short-run dynamics, even in the presence of seasonal unit roots. They consider multivariate systems and, accordingly, propose new tests for the cointegrating rank at seasonal frequencies. Their simulations show that CVA performs better than some alternatives in specific cases (small samples, large system dimension).
This shows the flexibility of subspace methods to model vectors of time series, but a deeper discussion of this issue is beyond the scope of this paper.
5 Univariate modeling examples
We will now illustrate the performance of the methodology proposed in previous sections when applied to build univariate models. Section 5.1 presents a step-by-step analysis of the famous Series G of International Airline Passengers (Box et al. 2015), which is a popular benchmark for practical cases emphasizing nonstationarity and seasonality. Section 5.2 applies the modal automatic strategy described in Sect. 3.1 to nine benchmarks commonly used in the literature.
5.1 Airline passenger series
There is a consensus in the literature that the monthly (log) Airline Passenger series, \(z_t=\log (y_t)\), is adequately represented by an IMA(1,1) \(\times \) IMA(1,1)\(_{12}\) model. We will now show that our methodology leads to this model.
Following the steps described in Sect. 4.2, we first compute the seasonal sum, \(w_t=(1+B+...+B^{s-1}) z_t\), where \(s=12\), and apply to it the procedure for unit root detection, finding: (a) at least one unit root in \(w_t\) (Table 6), (b) at least one unit root in \(\nabla w_t\) (Table 7), and (c) zero unit roots in \(\nabla ^2 w_t\). These decisions seem quite reliable, as the first canonical correlation coefficients for \(w_t\) and \(\nabla w_t\) are .9925 and .9565, so the Unit Root Criterion presented in (32) clearly indicates the existence of a unit root in each case.
Therefore, an adequate stationary transformation is \(\nabla ^2 w_t=\nabla \nabla _{12} \, z_t\), which coincides with the standard result in the literature. Note that Table 7 also determines the order of the regular system, which is set at \(\hat{n}=1\).
The next step consists in determining the order of the seasonal subsystem. To this end, we apply NID to \(\nabla \nabla _{12} \, z_t\), setting the length of the seasonal cycle to \(s=12\). By doing so, we obtain the results shown in Table 8. Accordingly, the dynamic order of the seasonal subsystem is set to \(\hat{n}_s=1\) (see Garcia-Hiernaux et al. 2012 for the use of NID with seasonal series).
After this preliminary specification analysis, we apply SUBEST1 to the stationary transformed series, setting the orders of the regular and seasonal subsystems to the previously determined values: \(\hat{n}=1\) and \(\hat{n}_s=1\). This yields the balanced model:
where the figures in parentheses are the parameter standard errors, obtained by computing the corresponding information matrix. The estimate for the percent standard deviation of the error is \(\hat{\sigma }_{a_1}=3.68\%\).
To achieve the highest efficiency with Gaussian data and to exclude some nonsignificant parameters, one can compute ML estimates using the previous values as initial conditions for the iterative optimization.^{Footnote 13} In this case, this provides the following result:
which essentially coincides with the results in the literature.
Finally, one can check for autocorrelation by applying NID to the residuals of this tentative model. If we do so setting the frequencies at \(s=1\) and \(s=12\) to check for additional regular or seasonal dynamic structures, we obtain Tables 9 and 10, whose results show no evidence of misspecification.
5.2 Semiautomatic identification for real and simulated time series
Now we apply our methodology to nine benchmark time series taken from the literature, using the modal strategy described in Sect. 3.1. Table 11 describes these series, while Tables 12 and 13 summarize the main results. Note that:

1.
The stationary transformation was correctly determined in all the cases.

2.
When working with seasonal series (Series B, G and H), the final results coincide essentially with the models suggested by the literature, with two minor issues:

(a)
Seasonality in series G is weak, so it was not detected until we applied NID to the residuals of the initial regular model;

(b)
In the model for series H, the MLE of the seasonal MA parameter converged to unity, which implies a deterministic seasonality.


3.
All the models in the column “NID+SUBEST1+ML” in Table 13 coincide with the specifications proposed by the literature.
Regarding the performance of the modal strategy, there was perfect agreement among all the criteria employed except for Series D, where SBC suggests a first-order process, while the remaining procedures suggest the “consensus” second-order specification.
6 Multivariate modeling examples
Now we will show two examples of multivariate modeling, chosen because of the different cointegration properties of the final models. The first one is not cointegrated, while the second has one nonstationary common factor.
6.1 Flour price indices in three cities
The first exercise is based on three monthly series of flour price indices in Buffalo, Minneapolis and Kansas City, recorded from August 1972 to November 1980, which will be denoted by \(\textbf{y}_t\). This dataset is adequate to test the “Law of one price”, as flour is a quite homogeneous product and these cities are close enough to allow for product flows.
Figure 2 displays the profile of these series. Note that they show a high degree of comovement, so they could be cointegrated. Despite this graphical evidence, previous analyses reject the hypothesis of cointegration for this dataset; see Tiao and Tsay (1989) and Lütkepohl and Poskitt (1996). This makes it a good benchmark to test for false positives in cointegration testing.
Applying NID to each individual series indicates that all of them are I(1). Again, the first canonical correlation is close to unity in the three series, so this decision seems solid. To check for cointegration, we apply NID to the vector \(\log (\textbf{y}_t)\), detecting again three unit roots for the multivariate process. Here the first three canonical correlations are .99, .94 and .88, respectively. Although the unit root criterion presented in (32) indicates three unit roots, note that the last one is a borderline case. In any case, cointegration is rejected; see Table 14.
Table 14 shows no evidence of any dynamic structure beyond the unit roots. This could be due to the fact that nonstationary components often dominate other, weaker dynamic components, whose detection requires a previous stationarity-inducing transformation. For this reason, we apply the NID algorithm again, now to the stationary transformation \(\nabla \log (\textbf{y}_t)\), obtaining the results in Table 15. In this case there is a tie, as three criteria detect a first-order structure while the other three conclude that the order is zero. When this happens, we prefer choosing the higher dynamic order, because an overfitting error can be easily solved by pruning insignificant parameters. Therefore, we choose \(\hat{n}=1\).
After deciding the system order, SUBEST1 (with \(i=3\)) provides estimates for \(\alpha \) and its corresponding VARMA\(_E\). In this case, AIC, SBC and HQ coincide in choosing \(\hat{\alpha }=(1,0,0)\). Accordingly, the estimated model is:
with:
where the noise covariance estimates are presented in percentage terms. We then estimate the previous structure by ML, obtaining:
where the (1,3) element in \(\varvec{\hat{L}}\) was found to be nonsignificant and was therefore pruned.
Last, Table 16 summarizes the output of NID when applied to the residuals of model (40), which does not show any symptom of misspecification. The \(\textbf{Q}\) matrix of Ljung and Box (1978), computed with 7 degrees of freedom, confirms this conclusion:
The VARMA echelon model (40) implies that the three series share a first-order stationary AR component, and receive the effect of the innovations in Buffalo and Minneapolis at \(t-1\). This structure is represented by seven free parameters, so it is parsimonious in comparison with, e.g., the nine parameters of the VAR(1) specification proposed by Tiao and Tsay (1989).
On the other hand, note that the error covariance matrix in Equation (40) implies that the disturbances in this model are highly correlated. This fact would explain the comovement observed in Fig. 2 and suggests that, in this case, the “Law of one price” could be described, not by common trends, but by common shocks affecting prices in different locations.
6.2 US interest rates
We will now model four monthly short-term US interest rates, from August 1985 to January 2003, already analyzed by Martin-Manjon and Treadway (1997). These are the Fed’s Target Rate (\(TO_t\)), the effective rate (\(TE_t\)), and the secondary market yields for 3-month (\(T3m_t\)) and 6-month T-bills (\(T6m_t\)). Figure 3 shows the profile of these series.
We determine the integration order for each individual variable by applying NID. The unit root criterion clearly indicates that each individual series is I(1). The second step in the analysis consists in determining the dynamic order for the vector of variables by applying NID again. In this case, the procedure detects one unit root and, therefore, three cointegrating relationships, see Table 17.^{Footnote 14}
NID also provides estimates for the coefficients in the cointegrating equations, so the stationarity-inducing transformation for the data is given by:
Therefore, cointegration operates between pairs of variables and the stationary combinations are close to the spreads between each rate and the Fed’s target. Note also that the structure of Equation (42) has a clear interpretation in terms of economic controllability, as determining the value of the Target Rate \(TO_t\) implies choosing the long-run mean value of \(TE_t\), \(T3m_t\) and \(T6m_t\). Applying NID to the stationary-transformed vector, \(\textbf{z}_{t}\), detects four stationary dynamic components and no additional unit roots, see Table 18.
In this situation, the next steps in the analysis would be: (a) determining the KI vector \(\alpha \), (b) estimating the final model, and (c) performing diagnostic checks. We do not include these results here because they do not add much to what was shown in previous examples.
7 Discussion
The model-building approach proposed in this paper has substantial advantages in comparison with its main alternatives.
7.1 Comparison with the VAR methodology
The procedure to fit a Step 1 model described in Sect. 3.1 is similar to standard VAR model-building, see Lütkepohl (2005), as it relies on LS methods, explores a succession of linear models with increasing orders, and compares the resulting models both in terms of LR tests and ICs. However, there are three important differences.
First, in VAR modeling the exploration of possible dynamic orders is not systematic, because it skips some intermediate dynamic orders. To see this, consider a vector of three time series. A standard VAR analysis would consider vector autoregressions of orders 0, 1, 2..., whose corresponding dynamic orders would be 0, \(1 \times 3=3\), \(2 \times 3=6\)... In comparison, our method explores the whole sequence of positive integer orders (0, 1, 2,...) up to a user-specified upper limit. Because of this, standard VAR modeling is prone to over- or underspecifying the system order.
Second, the VAR approach considers only autoregressive structures, while Step 1 models have both AR and MA terms. Accordingly, if the DGP has some moving average structure, our approach should provide a more parsimonious representation, even if we do not proceed to Step 2.
Third, while a Step 1 model is comparable to a VAR in terms of the modeling decisions required (just choosing a scalar order), our approach provides a way to achieve further parsimony by enforcing a canonical structure (i.e., fitting a Step 2 model) without much additional complexity. The VAR methodology does not provide such an option.
7.2 Comparison with VARMA specification methods
In comparison with alternative VARMA specification procedures, such as those in Jenkins and Alavi (1981), Tiao and Box (1981) and Reinsel (1997), our methodology:

1.
...provides a unified modeling procedure for single and multiple time series because the same basic decisions are required in both cases,

2.
...is scalable, meaning that the analyst may choose between a quick nonparsimonious Step 1 model and the more efficient Step 2 and Step 3 models,

3.
...is easy to implement in automatic mode, and

4.
...offers a flexible approach to the parsimony principle, which can be enforced in the latter stages of the analysis by adopting an efficient canonical structure (Step2) and pruning nonsignificant parameters (Step3).
On the other hand, standard VARMA specification methods: (a) do not enforce coherence with univariate model-building, (b) are difficult to implement in automatic mode, (c) are not scalable, as they do not provide cost-efficient relaxations of the parsimony requirement and, in contradiction with this, (d) may yield an overparameterized model, as they do not ensure that the final model is canonical.
7.3 Comparison with other methods using the Kronecker indices
The closest benchmarks for our proposal are the alternative methods to determine the KIs and, hence, to specify a VARMA\(_E\). This literature is rich and includes, among many other works, those of Hannan and Kavalieris (1984b), Poskitt (1992), Lütkepohl and Poskitt (1996), Tsay (1989b), Li and Tsay (1998) and Poskitt (2016).
The common feature of all these methods is that they directly identify the KIs \(\alpha =(n_1,n_2,\ldots ,n_m)\), while we first determine the system order, n, and only then estimate and compare all the models with a sum of KIs equal to n. Our approach has several advantages:

1.
Determining the system order is useful in itself, as it allows one to specify a crude but fast Step 1 model which may be adequate in some cases, see Sect. 3.1, thus avoiding the need to determine \(\alpha \).

2.
The dynamic order can be obtained with both ICs and LR tests, because lower-order specifications are nested in higher-order ones. On the other hand, models with different \(\alpha \) are not nested in general, so they can only be compared with ICs.

3.
Limiting the search space to the KIs compatible with a previously estimated system order provides a faster and more robust performance than searching in an unbounded space.
Besides these advantages, our proposal adds extensions for nonstationary time series, cointegration and seasonality, not treated by other VARMA\(_E\) approaches.
7.4 Comparison with the SCM approach
Tiao and Tsay (1989) propose a structured method to specify a canonical model. They build on the so-called “Scalar Component” (SC), which is the basis of the family of “Scalar Component Models” (SCM). These concepts are analogous, but not equivalent, to those of KIs and VARMA\(_E\). In a nutshell, they build on a canonical correlation analysis to find a contemporaneous linear transformation of the vector process that: (a) reveals simplifying structures, (b) achieves parametric parsimony, and (c) identifies alternative “exchangeable” parameterizations.
Athanasopoulos et al. (2012) compare the SC and KI-based approaches comprehensively at the theoretical, experimental, and empirical levels. They conclude that:

1.
VARMA\(_E\) and SC models are connected in theory.

2.
The greatest advantage of echelon forms is that their construction can be automated, something that is impossible to achieve within the SC identification process.

3.
The experimental comparison is performed via Monte Carlo experiments. In this respect, they conclude that both procedures work well in identifying some common VARMA DGPs.

4.
SC VARMA models are more flexible because their maximum autoregressive order does not have to be the same as the order of the moving average component, while these orders are constrained to be equal for a VARMA\(_E\) form and, perhaps for this reason,

5.
...the empirical outofsample forecast evaluation shows that SC VARMA models provide better forecasts than VARMA\(_E\) models which, in turn, provide better forecasts than VAR models.
These conclusions are fully consistent with our own experience. In the framework of this paper, we add significance tests in Step 3 to relax the constraint on the autoregressive and moving average orders implied by the echelon form, so our Step 3 and SC models could potentially have the same forecasting performance. Confirming this possibility would require an in-depth comparison of forecasting power, which exceeds the scope of this paper.
On the other hand, the methodology proposed here is scalable and includes extensions for nonstationary time series, cointegration and seasonality, not treated by the SCM approach.
7.5 Software implementation
An important disadvantage of our method compared with some alternatives is that the required software is not commonly available. To address this shortcoming, we implemented all the computational procedures described in this paper in the \(E^4\) functions NID and SUBEST1.
\(E^4\) is a MATLAB toolbox for time series modeling, which can be downloaded at: www.ucm.es/e4. The source code for all the functions in the toolbox is freely provided under the terms of the GNU General Public License. This site also includes a complete user manual and other reference materials.
Notes
Note that \({{\textbf {F}}_{\textbf {0}}}\) is unimodular. For a general definition of outputequivalent VARMA systems using unimodular matrices, see Hannan and Deistler (1988), Theorem 2.2.1.
More precisely, by IF we refer to the innovations steady-state form, where \(\textbf{K}\) and \(\varvec{\Sigma _a}\) are invariant.
Throughout the paper we fix i by rounding \(\log (T)\) to the nearest integer. There is little literature on optimal values for p and f in short samples. For instance, in terms of forecasting, Garcia-Hiernaux (2011) shows that better results (in terms of RMSE) are obtained by combining the forecasts provided by models with different values of i rather than by using a single value. Also, as subspace methods are not iterative, one can estimate at a reasonable computational cost all the models for a given sequence of i, and then choose the value that generates the best in-sample fit.
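As an illustration, the rule described in this note can be sketched as follows. This is a generic sketch, not \(E^4\) code: the candidate range and the `in_sample_fit` callable are hypothetical placeholders for whatever fit measure the analyst prefers.

```python
import math

def default_i(T: int) -> int:
    """Default subspace horizon: log(T) rounded to the nearest integer."""
    return max(1, round(math.log(T)))

def choose_i(T, candidates=None, in_sample_fit=None):
    """Pick i from a sequence of candidates by best in-sample fit.

    `in_sample_fit` is a hypothetical callable returning a goodness-of-fit
    score (higher is better) for a model estimated with horizon i. Since
    subspace methods are not iterative, estimating a model for every
    candidate i is computationally cheap.
    """
    if candidates is None:
        base = default_i(T)
        candidates = range(max(1, base - 2), base + 3)
    if in_sample_fit is None:
        # No fit measure supplied: fall back to the log(T) rule.
        return default_i(T)
    return max(candidates, key=in_sample_fit)
```

For example, `default_i(100)` returns 5 and `default_i(1000)` returns 7, matching the log(T) rounding rule.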
One can alternatively solve the Riccati equation to obtain an estimate of \(\textbf{K}\).
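For reference, such an estimate can be obtained from the steady-state filtering Riccati equation with a standard solver. The sketch below is generic steady-state Kalman filtering under assumed (already estimated) state-space matrices \(\textbf{A}\), \(\textbf{C}\) and noise covariances \(\textbf{Q}\), \(\textbf{R}\), \(\textbf{S}\); it is not \(E^4\) code.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def steady_state_gain(A, C, Q, R, S):
    """Steady-state Kalman gain K via the discrete algebraic Riccati equation.

    Solves the filtering DARE
        P = A P A' + Q - (A P C' + S)(C P C' + R)^{-1}(A P C' + S)'
    by passing the dual problem (A', C') to SciPy's solver, then forms
        K = (A P C' + S)(C P C' + R)^{-1}.
    """
    P = solve_discrete_are(A.T, C.T, Q, R, s=S)
    K = (A @ P @ C.T + S) @ np.linalg.inv(C @ P @ C.T + R)
    return K, P

# Toy univariate example with assumed values: A = 0.5, C = 1, Q = R = 1, S = 0.
A = np.array([[0.5]]); C = np.array([[1.0]])
Q = np.array([[1.0]]); R = np.array([[1.0]]); S = np.array([[0.0]])
K, P = steady_state_gain(A, C, Q, R, S)
```

The returned P can be checked by substituting it back into the Riccati equation, which should hold up to numerical precision.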
In these two expressions we use the fact that \(\varvec{\mathcal {L}}^i_x\) tends to the null matrix for large i.
As in Bauer (2005b), pseudo means that the Gaussian likelihood is maximized even if the true noise distribution is unknown.
No equivalent result has been obtained for systems with exogenous variables.
Note, however, that there can be different canonical forms for the same system, each of them being exactly identified. In most cases, canonical forms have the minimum number of parameters required to realize the system output.
Luenberger (1967) proposed the first canonical form for multivariate processes.
Unless otherwise indicated, we generally fix \(i=p=f\) to the integer closest to \(\log (T)\).
The term “balanced” refers to the fact that the resulting VARMA model has the same autoregressive and moving average orders, both equal to the system order.
As previously mentioned, the state approach using the CVA weighting scheme in the stationary case leads to statistically efficient estimates. Obviously, this property holds if correct linear restrictions are imposed. SUBEST1 uses restrictions that are only asymptotically valid, and it is not easy to discuss analytically what happens in this case. Simulations suggest that asymptotic efficiency remains, although it seems that SUBEST1 does not return estimates as close to pseudo MLE as the CVA estimator does, in finite samples. We thank an anonymous referee for this observation.
By “the highest efficiency” we mean that, even if the simulations in Sect. 2.3.3 suggest that SUBEST1 is asymptotically efficient, they also show that ML estimates have a smaller variance in short samples.
The numerical result of applying URC in (32) to the first four canonical correlations is \(0.127, 0.055, 0.265, 0.649\), leading to the conclusion that there is only one unit root in the multivariate process.
References
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723
Arun K, Kung S (1990) Balanced approximation of stochastic systems. SIAM J Matrix Anal Appl 11(1):42–68
Athanasopoulos G, Poskitt D, Vahid F (2012) Two canonical VARMA forms: scalar component models vis-à-vis the echelon form. Econom Rev 31(1):60–83
Bauer D (2001) Order estimation for subspace methods. Automatica 37:1561–1573
Bauer D (2005) Asymptotic properties of subspace estimators. Automatica 41:359–376
Bauer D (2005) Comparing the CCA subspace method to pseudo maximum likelihood methods in the case of no exogenous inputs. J Time Ser Anal 26(5):631–668
Bauer D (2005) Estimating linear dynamical systems using subspace methods. Econom Theory 21:181–211
Bauer D, Buschmeier R (2021) Asymptotic properties of estimators for seasonally cointegrated state space models obtained using the CVA subspace method. Entropy 23:436
Bauer D, Deistler M (1999) Balanced canonical forms for system identification. IEEE Trans Autom Control 44(6):1118–1131
Bauer D, Jansson M (2000) Analysis of the asymptotic properties of the MOESP type of subspace algorithms. Automatica 36:497–509
Bauer D, Ljung L (2002) Some facts about the choice of the weighting matrices in Larimore type of subspace algorithms. Automatica 38:763–773
Bauer D, Wagner M (2002) Estimating cointegrated systems using subspace algorithms. J Econom 111:47–84
Box G, Jenkins G (1970) Time series analysis, forecasting and control. Holden-Day, San Francisco
Box G, Jenkins G, Reinsel G, Ljung G (2015) Time series analysis: forecasting and control. Wiley Series in Probability and Statistics. Wiley, New York
Burr I (1976) Statistical quality control methods. Marcel Dekker, New York
Casals J, Garcia-Hiernaux A, Jerez M (2012) From general State-Space to VARMAX models. Math Comput Simul 80(5):924–936
Casals J, Garcia-Hiernaux A, Jerez M, Sotoca S, Trindade A (2016) State-space methods for time series analysis: theory, applications and software. Chapman & Hall/CRC Press, New York
Casals J, Jerez M, Sotoca S (2002) An exact multivariate model-based structural decomposition. J Am Stat Assoc 97(458):553–564
Casals J, Sotoca S, Jerez M (1999) A fast and stable method to compute the likelihood of time invariant state space models. Econ Lett 65(3):329–337
Casals J, Sotoca S, Jerez M (2013) Single versus multiplesource error models for signal extraction. J Stat Comput Simul 85(5):1053–1069
Deistler M, Peternell K, Scherrer W (1995) Consistency and relative efficiency of subspace methods. Automatica 31(12):1865–1875
Dickey D, Fuller W (1981) Likelihood ratio statistics for autoregressive time series with a unit root. Econometrica 49(4):1057–1072
Favoreel W, De Moor B, Van Overschee P (2000) Subspace state space system identification for industrial processes. J Process Control 10:149–155
Garcia-Hiernaux A (2011) Forecasting linear dynamical systems using subspace methods. J Time Ser Anal 32(5):462–468
Garcia-Hiernaux A, Jerez M, Casals J (2009) Fast estimation methods for time series models in state-space form. J Stat Comput Simul 79(2):121–134
Garcia-Hiernaux A, Jerez M, Casals J (2010) Unit roots and cointegration modeling through a family of flexible information criteria. J Stat Comput Simul 80(2):173–189
Garcia-Hiernaux A, Jerez M, Casals J (2012) Estimating the system order by subspace methods. Comput Stat 27(3):411–425
Gevers MR (1985) ARMA models, their Kronecker indices and their McMillan degree. Int J Control 6(43):1745–1761
Golub G, Van Loan C (1996) Matrix computations. Johns Hopkins University Press, Baltimore
Hannan EJ, Deistler M (1988) The statistical theory of linear systems. Wiley, New York
Hannan EJ, Kavalieris L (1984) A method for autoregressive-moving average estimation. Biometrika 71:273–280
Hannan EJ, Kavalieris L (1984) Multivariate linear time series models. Adv Appl Probab 16:492–561
Hannan EJ, Quinn B (1979) The determination of the order of an autoregression. J R Stat Soc Ser B 41(2):713–723
Hannan EJ, Rissanen J (1982) Recursive estimation of mixed autoregressive-moving average order. Biometrika 69:81–94
Ho B, Kalman R (1966) Effective construction of linear state-variable models from input-output functions. Regelungstechnik 14:545–548
Jenkins G, Alavi A (1981) Some aspects of modelling and forecasting multivariate time series. J Time Ser Anal 1(2):1–47
Johansen S (1991) Estimation and hypothesis testing of cointegration vectors in Gaussian vector autoregressive models. Econometrica 59:1551–1580
Kwiatkowski D, Phillips P, Schmidt P, Shin Y (1992) Testing the null hypothesis of stationarity against the alternative of a unit root: how sure are we that economic time series have a unit root? J Econom 54(1–3):159–178
Larimore WE (1990) Canonical variate analysis in identification, filtering and adaptive control. In: Proceedings of the 29th conference on decision and control, Hawaii, pp. 596–604
Li H, Tsay RS (1998) A unified approach to identifying multivariate time series models. J Am Stat Assoc 93(442):770–782
Ljung G, Box G (1978) On a measure of lack of fit in time series models. Biometrika 65(2):297–303
Luenberger DG (1967) Canonical forms for linear multivariate systems. IEEE Trans Autom Control 12:290–293
Lütkepohl H (2005) New introduction to multiple time series analysis. Springer-Verlag, Berlin
Lütkepohl H, Poskitt DS (1996) Specification of echelon form VARMA models. J Bus Econ Stat 14(1):69–79
Makridakis SG, Wheelwright SC, McGee VE (1983) Forecasting: methods and applications. Wiley, New York
Martin-Manjon R, Treadway A (1997) The Fed controls only one of the two interest rates in the US economy. ICAE Working Paper 9716
Nsiri N, Roy R (1992) On the identification of ARMA echelon-form models. Can J Stat 20:369–386
O’Donovan TM (1983) Short-term forecasting. Wiley, New York
Peternell K (1995) Identification of linear dynamical systems by subspace and realizationbased algorithms (Unpublished doctoral dissertation). TU Wien
Peternell K, Scherrer W, Deistler M (1996) Statistical analysis of novel subspace identification methods. Signal Process 52:161–177
Poskitt DS (1992) Identification of echelon canonical forms for vector linear processes using least squares. Ann Stat 20:195–215
Poskitt DS (2016) Vector autoregressive moving average identification for macroeconomic modeling: a new methodology. J Econom 192:468–484
Qin SJ (2006) An overview of subspace identification. Comput Chem Eng 30:1502–1513
Quenouille MH (1957) The analysis of multiple time series. Griffin, London
Reinsel GC (1997) Elements of multivariate time series analysis, 2nd edn. Springer-Verlag, New York
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Tiao GC, Box GEP (1981) Modeling multiple time series with applications. J Am Stat Assoc 76:802–816
Tiao GC, Tsay RS (1989) Model specification in multivariate time series. J R Stat Soc Ser B 51(2):157–213
Tsay RS (1989) Identifying multivariate time series models. J Time Ser Anal 10(4):357–372
Tsay RS (1989) Parsimonious parametrization of vector autoregressive moving average models. J Bus Econ Stat 7(3):327–341
Van Overschee P, De Moor B (1994) N4SID: subspace algorithms for the identification of combined deterministic-stochastic systems. Automatica 30(1):75–93
Van Overschee P, De Moor B (1995) A unifying theorem for three subspace system identification algorithms. Automatica 31(12):1853–1864
Verhaegen M (1994) Identification of the deterministic part of MIMO state space models given in innovations form from input-output data. Automatica 30(1):61–74
Wei W (2006) Time series analysis: univariate and multivariate methods, 2nd edn. Addison-Wesley, New York
Wold HOA (1964) A study in the analysis of stationary time series. Almqvist and Wiksell, Uppsala
Acknowledgements
The authors thank two anonymous referees for their valuable suggestions, which significantly improved this work. The first draft of this work was written while Miguel Jerez was visiting UC Davis. Special thanks are due to the Department of Economics for providing ideal working conditions. We also received useful feedback from attendees of the research seminars at UC Davis, Universidad del Pais Vasco, Universidad Autonoma de Madrid, and the 36th Annual International Symposium on Forecasting. Last, but not least, support from Instituto Complutense de Analisis Economico (ICAE), Programa Becas Complutenses Del Amo and from Programa de ayudas a grupos de investigacion Santander-UCM (Econometrics in State Space research group, ref. 940223) is gratefully acknowledged.
Funding
Open Access funding provided thanks to the CRUECSIC agreement with Springer Nature.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Garcia-Hiernaux, A., Casals, J. & Jerez, M. Identification of canonical models for vectors of time series: a subspace approach. Stat Papers (2023). https://doi.org/10.1007/s00362-023-01451-y
DOI: https://doi.org/10.1007/s00362-023-01451-y
Keywords
 System identification
 Canonical models
 Kronecker indices
 Subspace methods
 Statespace models