Markov-switching generalized additive models

We consider Markov-switching regression models, i.e. models for time series regression analyses where the functional relationship between covariates and response is subject to regime switching controlled by an unobservable Markov chain. Building on the powerful hidden Markov model machinery and the methods for penalized B-splines routinely used in regression analyses, we develop a framework for nonparametrically estimating the functional form of the effect of the covariates in such a regression model, assuming an additive structure of the predictor. The resulting class of Markov-switching generalized additive models is immensely flexible, and contains as special cases the common parametric Markov-switching regression models and also generalized additive and generalized linear models. The feasibility of the suggested maximum penalized likelihood approach is demonstrated by simulation and further illustrated by modelling how the energy price in Spain depends on the Euro/Dollar exchange rate.


Introduction
In regression scenarios where the data have a time series structure, there is often parameter instability with respect to time (Kim et al., 2008). A popular strategy to account for such dynamic patterns is to allow for regime switching. The corresponding models are regression models with time-varying parameters, where the parameter vector takes on finitely many values and a not directly observable Markov chain controls which of the parameter vectors is active at any time. Such models are usually referred to as Markov-switching or regime-switching regression models, following the seminal papers by Goldfeld and Quandt (1973) and Hamilton (1989). A very basic Markov-switching regression model involves a time series {Y_t}_{t=1,...,T} and an associated sequence of covariates x_1, ..., x_T (including the possibility of x_t = y_{t-1}), with the relation between x_t and Y_t specified as

Y_t = β_0^{(s_t)} + β_1^{(s_t)} x_t + σ_{s_t} ε_t,   (1)

where typically ε_t ∼ N(0, 1), independently across t, and where s_t is the state an underlying, non-observable N-state Markov chain is in at time t. In other words, the functional form of the relation between x_t and Y_t, and also the variance of the error variable, changes over time according to state switches in the underlying Markov chain. The states correspond to different regimes, in the sense of different stochastic dynamics implied by them, and the Markov chain induces serial dependence, typically such that the states are persistent: regimes are active for longer periods of time, on average, than they would be if an independent mixture model was used to select among them. The classical example is an economic time series where, for example, in a recession (or slow-growth) period the effect of some explanatory variable x_t on some economic indicator Y_t might be very different compared to times of (high) economic growth (Hamilton, 2008).
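To make the data-generating mechanism in (1) concrete, the following sketch simulates from a two-state Markov-switching linear regression. This is our own illustration, not code from the paper; the function name and the parameter values (persistent regimes with different intercepts and slopes) are purely illustrative.

```python
import numpy as np

def simulate_ms_regression(T, Gamma, beta0, beta1, sigma, rng):
    # Simulate the states of an N-state Markov chain with t.p.m. Gamma
    N = Gamma.shape[0]
    s = np.empty(T, dtype=int)
    s[0] = rng.integers(N)  # arbitrary initial state, for illustration only
    for t in range(1, T):
        s[t] = rng.choice(N, p=Gamma[s[t - 1]])
    # Covariates and state-switching linear Gaussian response, as in (1)
    x = rng.uniform(-3.0, 3.0, size=T)
    y = beta0[s] + beta1[s] * x + sigma[s] * rng.normal(size=T)
    return x, y, s

# Illustrative parameters: persistent regimes with different slopes
rng = np.random.default_rng(1)
Gamma = np.array([[0.95, 0.05],
                  [0.10, 0.90]])
x, y, s = simulate_ms_regression(500, Gamma,
                                 beta0=np.array([0.0, 2.0]),
                                 beta1=np.array([1.0, -1.0]),
                                 sigma=np.array([0.5, 0.5]),
                                 rng=rng)
```

With diagonal entries of Γ close to one, state runs are long on average, mimicking the regime persistence described above.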
The simple model given in (1) can be (and has been) modified in various ways, for example allowing for multiple covariates or for general error distribution models from the generalized linear model (GLM) framework; one example is given by the Markov-switching Poisson regression models discussed in Wang and Puterman (2001). However, in the existing literature the relationship between the target variable and the covariates is commonly specified in parametric form and usually assumed to be linear, with little investigation, if any, into the absolute or relative goodness of fit. The aim of the present work is to provide effective and accessible methods for a nonparametric estimation of the functional form of the predictor. These build on a) the strengths of the hidden Markov model (HMM) machinery (Zucchini and MacDonald, 2009), in particular the forward algorithm, which allows for a simple and fast evaluation of the likelihood of a Markov-switching regression model, and b) the penalized B-spline methods routinely used in regression analyses (Eilers and Marx, 1996). Our approach is by no means limited to models of the form given in (1). In fact, the flexibility of the HMM machinery lends itself to allow for the consideration of models from a much bigger class, which we will call Markov-switching generalized additive models.
These are simply generalized additive models (GAMs) with an additional time component, where the predictor -including additive smooth functions of covariates, parametric terms and error terms -is subject to regime changes controlled by an underlying Markov chain, analogously to (1). While the methods do not necessitate a restriction to additive structures, we believe these to be most relevant in practice and hence have decided to put the focus on corresponding models in the present work. Our work is closely related to that of de Souza and Heckman (2013). Those authors, however, confine their consideration to the case of only one covariate and the identity link function.
The paper is structured as follows. In Section 2, we formulate general Markov-switching regression models, describe how to efficiently evaluate their likelihood, and develop the spline-based nonparametric estimation of the functional form of the predictor. The performance of the suggested approach is then investigated in two simulation experiments in Section 3. We conclude in Section 4.
2 Markov-switching generalized additive models

2.1 Markov-switching regression models

We begin by formulating a Markov-switching regression model with arbitrary form of the predictor, encompassing both parametric and nonparametric specifications. Let {Y_t}_{t=1,...,T} denote the target variable of interest (a time series), and let x_{p1}, ..., x_{pT} denote the associated values of the p-th covariate considered, p = 1, ..., P. We summarize the covariate values at time t in the vector x_·t = (x_{1t}, ..., x_{Pt}). Further let s_1, ..., s_T denote the states of an underlying unobservable N-state Markov chain {S_t}_{t=1,...,T}. Finally, we assume that, conditional on (s_t, x_·t), Y_t follows some distribution from the exponential family, and write

g(μ_t) = η^{(s_t)}(x_·t),   (2)

Preprint
where g is some link function, typically the canonical link function associated with the exponential family distribution considered, and where we use the shorthand μ_t = E(Y_t | s_t, x_·t). That is, the expectation of Y_t is linked to the covariate vector x_·t via the predictor function η^{(i)}, which maps the covariate vector to R, when the underlying Markov chain is in state i, i.e. S_t = i. Essentially, there is one regression model for each state i, i = 1, ..., N. To fully specify the conditional distribution of Y_t, additional parameters may be required, depending on the error distribution considered. For example, if Y_t is conditionally Poisson-distributed, then (2) fully specifies the state-dependent distribution (e.g. with g(μ) = log(μ)), whereas if Y_t is normally distributed (in which case g usually is the identity link), then the variance of the error needs to be specified, and would typically be assumed to also depend on the current state of the Markov chain. We use the notation φ^{(s_t)} to denote such additional state-dependent parameters (typically dispersion parameters). The simplest and probably most popular such model assumes a conditional normal distribution for Y_t, a linear form of the predictor and a state-dependent error variance, leading to the model

Y_t = β_0^{(s_t)} + β_1^{(s_t)} x_{1t} + . . . + β_P^{(s_t)} x_{Pt} + σ_{s_t} ε_t,   (3)

with ε_t ∼ N(0, 1). The Markov chain {S_t} is characterized by its transition probability matrix Γ = (γ_{ij}), with γ_{ij} = Pr(S_t = j | S_{t-1} = i), and its initial distribution δ. It is usually convenient to assume δ to be the stationary distribution, which, if it exists, is the solution to δΓ = δ subject to Σ_{i=1}^N δ_i = 1.
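As an aside, the stationary distribution can be computed numerically by replacing the singular system δ(I − Γ) = 0 with a regularized linear system that incorporates the normalization constraint. A minimal sketch (our own helper, not code from the paper):

```python
import numpy as np

def stationary_distribution(Gamma):
    # Solve delta @ Gamma = delta subject to sum(delta) = 1, via the
    # equivalent nonsingular system delta @ (I - Gamma + U) = 1,
    # where U is a matrix of ones (delta @ U = 1 under the constraint).
    N = Gamma.shape[0]
    A = np.eye(N) - Gamma + np.ones((N, N))
    return np.linalg.solve(A.T, np.ones(N))
```

For instance, for a 2-state chain with transition rows (0.95, 0.05) and (0.10, 0.90), this yields δ = (2/3, 1/3).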

Likelihood evaluation by forward recursion
A Markov-switching regression model, with conditional density p_Y(y_t, μ_t, φ^{(s_t)}) and underlying Markov chain characterized by (Γ, δ), can be regarded as an HMM with additional dependence structure (here in the form of covariate influence); see Zucchini and MacDonald (2009). This opens up the way for exploiting the efficient and flexible HMM machinery.
Most importantly, irrespective of the type of exponential family distribution considered, an extremely efficient recursion strategy can be used to evaluate the likelihood of a Markov-switching regression model, namely the so-called forward algorithm. To see this, consider the vectors of forward variables, defined as the row vectors

α_t = (α_t(1), ..., α_t(N)),   where α_t(j) = p(y_1, ..., y_t, S_t = j),   t = 1, ..., T.

Then the following recursive scheme can be applied:

α_1 = δ Q(y_1),   α_t = α_{t-1} Γ Q(y_t)   (t = 2, ..., T),   (4)

where Q(y_t) = diag(p_Y(y_t, μ_t^{(1)}, φ^{(1)}), ..., p_Y(y_t, μ_t^{(N)}, φ^{(N)})), with μ_t^{(i)} denoting the conditional mean of Y_t given S_t = i and x_·t. The recursion (4) follows immediately from

α_t(j) = Σ_{i=1}^N α_{t-1}(i) γ_{ij} p_Y(y_t, μ_t^{(j)}, φ^{(j)}),

which in turn can be derived in a straightforward manner using the model's dependence structure (for example analogously to the proof of Proposition 2 in Zucchini and MacDonald, 2009). The likelihood can thus be written as a matrix product:

L(θ) = α_T 1 = δ Q(y_1) Γ Q(y_2) · · · Γ Q(y_T) 1,   (5)

where 1 ∈ R^N is a column vector of ones, and where θ is a vector comprising all model parameters. The computational cost of evaluating (5) is linear in the number of observations, T, such that a numerical maximization of the likelihood is feasible in most cases, even for very large T and moderate numbers of states N.
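In practice the forward recursion is implemented with scaling of the forward vectors to avoid numerical underflow (cf. Zucchini and MacDonald, 2009). The following is our own minimal sketch for the Gaussian linear case of model (3) with one covariate; the function name and arguments are illustrative, not from the paper.

```python
import numpy as np

def ms_loglik_gaussian(y, x, Gamma, delta, beta0, beta1, sigma):
    # Scaled forward algorithm: phi holds the normalized forward vector,
    # ll accumulates the logs of the scaling constants, giving log L(theta).
    ll = 0.0
    phi = delta
    for t in range(len(y)):
        mu = beta0 + beta1 * x[t]  # state-dependent means mu_t^(i)
        dens = np.exp(-0.5 * ((y[t] - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
        v = (phi @ Gamma) * dens if t > 0 else phi * dens  # update as in (4)
        c = v.sum()  # scaling constant
        ll += np.log(c)
        phi = v / c
    return ll
```

With N = 1 the recursion collapses to an ordinary sum of log-densities, which provides a convenient sanity check; the same loop structure carries over to any exponential family density in place of the Gaussian one.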

Nonparametric modelling of the predictor
Notably, the likelihood form given in (5) applies irrespective of the form of the conditional density p_Y(y_t, μ_t, φ^{(s_t)}). In particular, it can be used to estimate simple Markov-switching regression models, e.g. with linear predictors, or in fact with any GLM-type structure within the states. Here we are concerned with a nonparametric estimation of the functional relationship between Y_t and x_·t. To achieve this, we consider a GAM-type framework (Wood, 2006), with the predictor comprising additive smooth state-dependent functions of the covariates:

η^{(i)}(x_·t) = β_0^{(i)} + f_1^{(i)}(x_{1t}) + . . . + f_P^{(i)}(x_{Pt}).

In essence, we simply have one GAM associated with each state of the Markov chain.
In order to achieve maximal flexibility in the estimation of the functional form, we express each of the functions f_p^{(i)}, i = 1, ..., N, p = 1, ..., P, as a finite linear combination of a large number of basis functions, B_1, ..., B_K:

f_p^{(i)}(x) = Σ_{k=1}^K γ_{ipk} B_k(x).   (6)

Note that different sets of basis functions can be applied to represent the different functions, but to keep the notation simple we here use a common set of basis functions, B_1, ..., B_K, for all f_p^{(i)}. A common choice for the basis is to use B-splines, which form a numerically stable, convenient basis for the space of polynomial splines, i.e. piecewise polynomials that are fused together smoothly at the interval boundaries; see de Boor (1978) and Eilers and Marx (1996) for more details. We use cubic B-splines, in ascending order in the basis used in (6). The number of B-splines considered, K, determines the flexibility of the functional form, as an increasing number of basis functions allows for an increasing curvature of the predictor. Instead of trying to select an optimal number of basis elements, we follow Eilers and Marx (1996) and modify the likelihood by including a difference penalty on the coefficients of adjacent B-splines. The number of basis B-splines, K, then simply needs to be sufficiently large to yield high flexibility of the functional estimates; once this threshold is reached, a further increase in the number of basis elements no longer changes the fit to the data, due to the impact of the penalty. We consider second-order differences, which yield an approximation of the integrated squared curvature of the function estimate (Eilers and Marx, 1996), resulting in the difference penalty 0.5 λ_{ip} Σ_{k=3}^K (Δ² γ_{ipk})², where λ_{ip} ≥ 0 are smoothing parameters and where Δ² γ_{ipk} = γ_{ipk} − 2γ_{ip,k−1} + γ_{ip,k−2}.
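A sketch of how such a B-spline design matrix and the second-order difference penalty can be set up, assuming equidistant knots spanning the observed covariate range (helper names and the knot-placement details are our own illustration):

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, K, degree=3):
    # Design matrix of K cubic B-splines on equidistant knots covering
    # the observed covariate range (in the style of Eilers and Marx, 1996).
    lo, hi = x.min(), x.max()
    inner = np.linspace(lo, hi, K - degree + 1)
    step = inner[1] - inner[0]
    knots = np.concatenate([lo - step * np.arange(degree, 0, -1),
                            inner,
                            hi + step * np.arange(1, degree + 1)])
    cols = [np.nan_to_num(BSpline.basis_element(knots[k:k + degree + 2],
                                                extrapolate=False)(x))
            for k in range(K)]
    return np.column_stack(cols)

def difference_penalty(gamma, lam):
    # 0.5 * lambda * sum_k (Delta^2 gamma_k)^2, cf. the penalty above
    return 0.5 * lam * np.sum(np.diff(gamma, n=2) ** 2)
```

Coefficient sequences lying on a straight line have vanishing second differences and hence incur zero penalty, which is exactly what makes linear predictors the limiting case as the smoothing parameter grows.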
We then modify the (log-)likelihood of the Markov-switching regression model, specified by state-dependent predictors of the form (6) and an underlying Markov chain characterized by (Γ, δ), by including the above difference penalty, one for each of the smooth functions appearing in the predictor:

l_p(θ) = log L(θ) − Σ_{i=1}^N Σ_{p=1}^P 0.5 λ_{ip} Σ_{k=3}^K (Δ² γ_{ipk})².   (7)

The maximum penalized likelihood estimate then reflects a compromise between goodness-of-fit and smoothness, where an increase in the smoothing parameters leads to an increased emphasis being put on smoothness. We discuss the choice of the smoothing parameters in more detail in the subsequent section. As λ_{ip} → ∞, the corresponding penalty dominates the log-likelihood, leading to a sequence of estimated coefficients γ_{ip1}, ..., γ_{ipK} that lie on a straight line. Thus, we obtain the common linear predictors, as given in (3), as a limiting case. Similarly, we can obtain parametric functions of arbitrary polynomial order q as limiting cases by considering (q + 1)-th order differences in the penalty. The common parametric regression models are thus essentially nested within the class of nonparametric models that we consider. One can of course obtain these nested special cases more directly, by simply specifying parametric rather than nonparametric forms of the predictor. On the other hand, it can clearly be advantageous not to constrain the functional form a priori, while still allowing for the possibility of obtaining constrained parametric cases as a result of a data-driven choice of the smoothing parameters. Standard GAMs and even GLMs are also nested in the considered class of models (N = 1), but this observation is clearly less relevant, since powerful software is already available for these special cases.

Estimation
For given smoothing parameters, all model parameters (including the parameters determining the Markov chain, the coefficients γ_{ipk} used in the linear combinations of B-splines and any other parameters required to specify the predictor) can be estimated simultaneously by numerically maximizing the penalized log-likelihood given in (7). For each function f_p^{(i)}, i = 1, ..., N, p = 1, ..., P, one of the coefficients has to be fixed to render the model identifiable (the intercept then controls the height of the predictor function).
We found that a convenient strategy to achieve this is to first standardize each sequence of covariates x_{p1}, ..., x_{pT}, p = 1, ..., P, shifting all values by the sequence's mean and dividing the shifted values by the sequence's standard deviation, then considering an odd number of B-spline basis functions K and fixing γ_{ip,(K+1)/2} = 0. The numerical maximization is then carried out subject to well-known technical issues arising in all optimization problems, including parameter constraints and local maxima of the likelihood. The latter can be either easy to deal with or a challenging problem, depending on the complexity of the model considered. Uncertainty quantification, on both the estimates of parametric parts of the model and on the function estimates, can be performed using a parametric bootstrap (Efron and Tibshirani, 1993). In particular, we can obtain pointwise confidence intervals for the estimated regression functions, at specific values of the covariate, as the corresponding quantiles obtained from the bootstrap replications.
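The pointwise bootstrap intervals amount to taking quantiles across the bootstrap replications at each point of a covariate grid; a minimal sketch (our own helper, assuming the refitted curves have already been collected):

```python
import numpy as np

def pointwise_ci(boot_curves, level=0.95):
    # boot_curves: array of shape (B, n_grid), one fitted regression curve
    # per parametric-bootstrap replication, evaluated on a covariate grid.
    # Returns the lower and upper pointwise quantile curves.
    a = (1.0 - level) / 2.0
    return np.quantile(boot_curves, [a, 1.0 - a], axis=0)
```

Note that such intervals are pointwise, not simultaneous, bands for the estimated function.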
We apply a (generalized) cross-validation approach to choose the smoothing parameters in a data-driven way. A leave-one-out cross-validation will typically be computationally infeasible. Instead, for a given time series to be analyzed, we suggest generating C random partitions such that in each partition a high percentage of the observations, e.g. 90%, form the calibration sample, while the remaining observations constitute the validation sample.
For each of the C partitions and any considered λ = (λ_{11}, ..., λ_{1P}, ..., λ_{N1}, ..., λ_{NP}), the model is then calibrated by estimating the parameters using only the calibration sample (treating the data points from the validation sample as missing data, which is straightforward using the HMM forward algorithm; see Zucchini and MacDonald, 2009). Subsequently, proper scoring rules (Gneiting and Raftery, 2007) can be used on the validation sample to assess the model for the given λ and the corresponding calibrated model. For computational convenience, we consider the log-likelihood of the validation sample, under the model fitted in the calibration stage, as the score of interest (now treating the data points from the calibration sample as missing data). From some pre-specified grid Λ ⊂ R_{≥0}^{N×P}, we then select the λ that yields the highest mean score over the C cross-validation samples. The number of samples C needs to be high enough to give meaningful scores (i.e., such that the scores give a clear pattern rather than noise only; from our experience, C should not be smaller than 10), but must not be so high that the approach becomes computationally infeasible.
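The partitioning step can be sketched as follows (function and parameter names are ours; the refitting and scoring for each candidate λ would use the penalized likelihood machinery described above, treating held-out points as missing data):

```python
import numpy as np

def cv_partitions(T, C, calib_frac=0.9, seed=0):
    # C random splits of the time indices 0..T-1 into a calibration set
    # (approximately calib_frac of the points) and a validation set.
    rng = np.random.default_rng(seed)
    n_cal = int(round(calib_frac * T))
    splits = []
    for _ in range(C):
        perm = rng.permutation(T)
        splits.append((np.sort(perm[:n_cal]), np.sort(perm[n_cal:])))
    return splits
```

For each λ on the grid, one would average the validation log-likelihood scores over the C splits and select the λ with the highest mean score.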

Simulation experiments
Experiment I. We first considered a relatively simple scenario, with a Poisson-distributed target variable, a 2-state Markov chain selecting the regimes and only one covariate:

Y_t | s_t, x_t ∼ Poisson(μ_t),   log(μ_t) = β_0^{(s_t)} + f^{(s_t)}(x_t).

The functional forms chosen for f^{(1)} and f^{(2)} are displayed by the dashed curves in Figure 1; both functions go through the origin. We further set β_0^{(1)} = β_0^{(2)} = 2. All covariate values were drawn independently from a uniform distribution on [−3, 3]. We ran 100 simulations, in each run generating 1000 observations from the model described.
The correctly specified model was then fitted via numerical maximum penalized likelihood estimation, as described in Section 2.4 above, using the optimizer nlm in R. We set K = 15, hence using 15 B-spline basis functions in the representation of each functional estimate.
The sample mean estimates of the transition probabilities γ_{11} and γ_{22} were obtained as 0.949 (Monte Carlo standard deviation of estimates: 0.011) and 0.847 (0.027), respectively.
The estimated functions f̂^{(1)} and f̂^{(2)} from all 100 simulation runs are visualized in Figure 1. Both have been shifted so that they go through the origin. The sample mean estimates of the predictor value for x_t = 0 were obtained as 1.998 (0.033) and 1.986 (0.076) for states 1 and 2, respectively. Thus, overall, the results are very good.

Experiment II. The second simulation experiment we conducted is slightly more involved, with a normally distributed target variable, a 2-state Markov chain and now two covariates:

Y_t = β_0^{(s_t)} + f_1^{(s_t)}(x_{1t}) + f_2^{(s_t)}(x_{2t}) + σ_{s_t} ε_t.

Compared to Experiment I, we note that the standard deviations of the estimates of the predictor value for x_{1t} = x_{2t} = 0 are substantially higher. On average, the deviations of the functional estimates from the true functions are also higher.

Concluding remarks
We have exploited the strengths of the HMM machinery and of penalized B-splines to develop a flexible new class of models, namely Markov-switching GAMs. A key strength of the approach is its ease of implementation, in particular the ease with which the code, once written for any Markov-switching GAM, can be modified to allow for various model formulations. This makes interactive searches for an optimal model among a suite of candidate formulations practically feasible. Model selection, although not explored in the current work, can be performed along the lines of Celeux and Durand (2008) using cross-validated likelihood. We note that our approach is similar in spirit to that proposed in Langrock et al. (2013), where the aim is to nonparametrically estimate the densities of the state-dependent distributions of an HMM. The two main challenges with the suggested approach are 1) the choice of adequate smoothing parameters and 2) local maxima of the likelihood. For smoothing parameter selection, instead of cross-validation one could consider a leverage-based, approximate leave-one-out cross-validation, which yields an AIC-type criterion. Regarding local maxima, estimation via the EM algorithm, as suggested in de Souza and Heckman (2013) for a smaller class of models, could potentially be more robust (cf. Bulla and Berzel, 2008), but is technically more challenging and hence less user-friendly (MacDonald, in press). Overall, our approach shows promise as a useful novel tool in time series regression analysis.
There are various ways to extend the approach, in a relatively straightforward manner, in order to enlarge the class of models that can be considered. First, it is of course straightforward to consider semiparametric versions of the model, where some of the functional effects are modeled nonparametrically and others parametrically. Especially for complex models, with high numbers of states and/or high numbers of covariates, this can improve numerical stability and decrease the computational burden associated with the cross-validation. Second, the likelihood-based approach allows for the consideration of more involved dependence structures. In particular, semi-Markov state processes, where the duration of a stay in any given state is explicitly modeled rather than implicitly assumed to be geometrically distributed as with a basic first-order Markov chain, can be incorporated along the lines suggested by Langrock and Zucchini (2011). The consideration of interaction terms in the predictor is possible via the use of tensor products of univariate basis functions. Finally, it is conceptually straightforward to use the suggested type of estimation algorithm also for Markov-switching generalized additive models for location, shape and scale (GAMLSS; Rigby and Stasinopoulos, 2005).