Markov-switching generalized additive models
Abstract
We consider Markov-switching regression models, i.e. models for time series regression analyses where the functional relationship between covariates and response is subject to regime switching controlled by an unobservable Markov chain. Building on the powerful hidden Markov model machinery and the methods for penalized B-splines routinely used in regression analyses, we develop a framework for nonparametrically estimating the functional form of the effect of the covariates in such a regression model, assuming an additive structure of the predictor. The resulting class of Markov-switching generalized additive models is immensely flexible, and contains as special cases the common parametric Markov-switching regression models and also generalized additive and generalized linear models. The feasibility of the suggested maximum penalized likelihood approach is demonstrated by simulation. We further illustrate the approach using two real data applications, modelling (i) how sales data depend on advertising spending and (ii) how the energy price in Spain depends on the Euro/Dollar exchange rate.
Keywords
P-splines · Hidden Markov model · Penalized likelihood · Time series regression

1 Introduction
The simple model given in (1) can be (and has been) modified in various ways, for example allowing for multiple covariates or for general error distributions from the generalized linear model (GLM) framework. An example of the latter is the Markov-switching Poisson regression model discussed in Wang and Puterman (2001). However, in the existing literature the relationship between the target variable and the covariates is commonly specified in parametric form and usually assumed to be linear, with little investigation, if any, into the absolute or relative goodness of fit. The aim of the present work is to provide effective and accessible methods for a nonparametric estimation of the functional form of the predictor. These build on (a) the strengths of the hidden Markov model (HMM) machinery (Zucchini and MacDonald 2009), in particular the forward algorithm, which allows for a simple and fast evaluation of the likelihood of a Markov-switching regression model (parametric or nonparametric), and (b) the general advantages of penalized B-splines, i.e. P-splines (Eilers and Marx 1996), which we employ to obtain almost arbitrarily flexible functional estimators of the relationship between target variable and covariate(s). Model fitting is done via numerical maximum penalized likelihood estimation, using either generalized cross-validation or an information criterion approach to select smoothing parameters that control the balance between goodness of fit and smoothness. Since parametric polynomial models are included as limiting cases for very large smoothing parameters, this procedure also allows the functional effects to be effectively reduced to their parametric limiting cases, so that conventional parametric Markov-switching regression models are effectively nested within our more flexible models as special cases.
Our approach is by no means limited to models of the form given in (1). In fact, the flexibility of the HMM machinery allows for the consideration of models from a much bigger class, which we term Markov-switching generalized additive models (MS-GAMs). These are simply generalized additive models (GAMs) with an additional time component, where the predictor—including additive smooth functions of covariates, parametric terms and error terms—is subject to regime changes controlled by an underlying Markov chain, analogously to (1). While the methods do not necessitate a restriction to additive structures, we believe these to be most relevant in practice and hence have decided to focus on these models in the present work. Our work is closely related to that of de Souza and Heckman (2014). Those authors, however, confine their consideration to the case of only one covariate and the identity link function. Furthermore, we note that our approach is similar in spirit to that proposed in Langrock et al. (2015), where the aim is to nonparametrically estimate the densities of the state-dependent distributions of an HMM.
The paper is structured as follows. In Sect. 2, we formulate general Markov-switching regression models, describe how to efficiently evaluate their likelihood, and develop the spline-based nonparametric estimation of the functional form of the predictor. The performance of the suggested approach is then investigated in three simulation experiments in Sect. 3. In Sect. 4, we demonstrate the feasibility and the potential of the approach by applying it (i) to advertising data and (ii) to Spanish energy price data. We conclude in Sect. 5.
2 Markov-switching generalized additive models
2.1 Markov-switching regression models
Assuming homogeneity of the Markov chain—which can easily be relaxed if desired—we summarize the probabilities of transitions between the different states in the \(N \times N\) transition probability matrix (t.p.m.) \(\varvec{\Gamma }=\left( \gamma _{ij} \right) \), where \(\gamma _{ij}=\Pr \bigl (S_{t+1}=j\vert S_t=i \bigr )\), \(i,j=1,\ldots ,N\). The initial state probabilities are summarized in the row vector \(\varvec{\delta }\), where \(\delta _{i} = \Pr (S_1=i)\), \(i=1,\ldots ,N\). It is usually convenient to assume \(\varvec{\delta }\) to be the stationary distribution, which, if it exists, is the solution to \(\varvec{\delta }\varvec{\Gamma }=\varvec{\delta }\) subject to \(\sum _{i=1}^N \delta _i=1\).
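To make the stationarity condition concrete, the following sketch (our own illustration, not code from the paper; the transition probabilities are invented) computes \(\varvec{\delta }\) by solving the equivalent linear system \(\varvec{\delta }(\varvec{I}_N - \varvec{\Gamma } + \varvec{U}) = \varvec{1}\), where \(\varvec{U}\) is an \(N \times N\) matrix of ones — a standard device that folds the sum-to-one constraint into the eigenvalue equation:

```python
import numpy as np

def stationary_distribution(Gamma):
    """Solve delta @ Gamma = delta subject to sum(delta) = 1.

    Uses the standard trick of solving delta @ (I - Gamma + U) = 1,
    where U is an N x N matrix of ones, so the normalization is
    built into the linear system.
    """
    N = Gamma.shape[0]
    A = np.eye(N) - Gamma + np.ones((N, N))
    # delta @ A = 1  <=>  A.T @ delta = 1
    return np.linalg.solve(A.T, np.ones(N))

# Hypothetical 2-state t.p.m. for illustration
Gamma = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
delta = stationary_distribution(Gamma)
```

For this example the chain spends twice as much time in state 1 as in state 2, consistent with the higher persistence of that state.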
2.2 Likelihood evaluation by forward recursion
2.3 Nonparametric modelling of the predictor
2.4 Inference
For given smoothing parameters and a given number of states, all model parameters—including the parameters determining the Markov chain, any dispersion parameters, the coefficients \(\gamma _{ipk}\) used in the linear combinations of B-splines and any other parameters required to specify the predictor—can be estimated simultaneously by numerically maximizing the penalized log-likelihood given in (7). For each function \(f_p^{(i)}\), \(i=1,\ldots ,N\), \(p=1,\ldots ,P\), one of the coefficients needs to be fixed to render the model identifiable, such that the intercept controls the height of the predictor function. A convenient strategy to achieve this is, first, to standardize each sequence of covariates \(x_{p1},\ldots ,x_{pT}\), \(p=1,\ldots ,P\), shifting all values by the sequence’s mean and dividing the shifted values by the sequence’s standard deviation, and, second, to consider an odd number K of B-spline basis functions with \(\gamma _{ip,(K+1)/2}=0\) fixed.
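This identification strategy can be sketched as follows (our own illustration, not the authors' implementation; the clamped equidistant knot placement and all names are assumptions). The covariate is standardized, a design matrix of K cubic B-spline basis functions is built, and the middle column is dropped, which amounts to fixing \(\gamma _{ip,(K+1)/2}=0\):

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, K, degree=3):
    """T x K design matrix of clamped B-spline basis functions
    with equidistant interior knots on the range of x."""
    inner = np.linspace(x.min(), x.max(), K - degree + 1)
    knots = np.r_[[inner[0]] * degree, inner, [inner[-1]] * degree]
    # evaluate each basis function by setting one coefficient to 1
    return np.column_stack([
        BSpline(knots, np.eye(K)[k], degree)(x) for k in range(K)
    ])

rng = np.random.default_rng(1)
x_raw = rng.gamma(2.0, 1.5, size=200)          # hypothetical covariate
x = (x_raw - x_raw.mean()) / x_raw.std()       # standardize, as in the text
K = 15                                         # odd number of basis functions
B = bspline_design(x, K)
# fixing the middle coefficient to zero = dropping the middle column
B_constrained = np.delete(B, (K - 1) // 2, axis=1)
```

Because the basis is clamped, the rows of the full design matrix sum to one (partition of unity), which is why one coefficient must be pinned down to separate the function's level from the intercept.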
The numerical maximization is subject to the well-known technical issues arising in all optimization problems, including parameter constraints and local maxima of the likelihood. The latter can be either easy to deal with or a challenging problem, depending on the complexity of the model considered. Numerical underflow (or overflow), which would typically arise for large T if the likelihood itself were considered, is prevented by working with the log-likelihood. Since the likelihood is a product of matrices, this requires the implementation of a scaling algorithm (for details, see, e.g., Zucchini and MacDonald 2009). Any suitable optimization routine can be applied to perform the likelihood maximization. In this work, we used R and the optimizer nlm, a nonlinear minimizer based on a Newton-type optimization routine. For more details on the algorithm, see Schnabel et al. (1985).
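The scaled forward recursion can be sketched as follows (a generic Python illustration rather than the authors' R code; the two-state Gaussian regression with linear state-dependent predictors and all parameter names are our own). The forward vector is renormalized at every step and the logs of the scale factors are accumulated, which prevents underflow for large T:

```python
import numpy as np
from scipy.stats import norm

def msr_loglik(y, x, Gamma, delta, beta, sigma):
    """Log-likelihood of an N-state Markov-switching Gaussian
    regression via the scaled forward algorithm.

    beta is an N x 2 array of (intercept, slope) per state; sigma
    holds the state-dependent error standard deviations.
    """
    mu = beta[:, 0] + np.outer(x, beta[:, 1])        # T x N state means
    P = norm.pdf(y[:, None], loc=mu, scale=sigma)    # T x N densities
    phi = delta * P[0]
    llk = np.log(phi.sum())
    phi = phi / phi.sum()                            # rescale to sum to 1
    for t in range(1, len(y)):
        phi = (phi @ Gamma) * P[t]                   # forward step
        s = phi.sum()
        llk += np.log(s)                             # accumulate log scale
        phi = phi / s
    return llk
```

For a single state (N = 1) this reduces to the ordinary Gaussian regression log-likelihood, which provides a convenient correctness check.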
Uncertainty quantification, both for the estimates of parametric parts of the model and for the function estimates, can be performed based on the approximate covariance matrix available as the inverse of the observed Fisher information, or alternatively using a parametric bootstrap (Efron and Tibshirani 1993). The latter avoids relying on asymptotics, which is particularly problematic when the number of B-spline basis functions increases with the sample size. From the bootstrap samples, we can obtain pointwise as well as simultaneous confidence intervals for the estimated regression functions. Pointwise confidence intervals are simply given by appropriate quantiles obtained from the bootstrap replications. Simultaneous confidence bands are obtained by scaling the pointwise confidence intervals until they completely contain a pre-specified fraction of all bootstrapped curves (Krivobokova et al. 2010).
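The band construction can be sketched as follows (a minimal illustration of the scaling idea on an array of bootstrap curves, not the authors' implementation; the inflation grid and the 2.5/97.5 % pointwise quantiles are assumptions):

```python
import numpy as np

def simultaneous_band(curves, level=0.95):
    """Inflate pointwise quantile intervals until they fully contain
    a fraction `level` of the bootstrap curves.

    curves: (B, G) array of B bootstrapped curves on a grid of G points.
    Returns (lower, upper) arrays of length G.
    """
    center = curves.mean(axis=0)
    half_lo = center - np.quantile(curves, 0.025, axis=0)  # pointwise
    half_hi = np.quantile(curves, 0.975, axis=0) - center  # half-widths
    for c in np.arange(1.0, 5.0, 0.01):   # inflate until coverage reached
        lower = center - c * half_lo
        upper = center + c * half_hi
        inside = np.all((curves >= lower) & (curves <= upper), axis=1)
        if inside.mean() >= level:
            return lower, upper
    return lower, upper
```

Note that a curve counts as covered only if it lies inside the band at every grid point, which is exactly what distinguishes the simultaneous band from the (narrower) pointwise intervals.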
For the closely related class of nonparametric HMMs, identifiability holds under fairly weak conditions, which in practice will usually be satisfied, namely that the t.p.m. of the unobserved Markov chain has full rank and that the state-specific distributions are distinct (Gassiat et al. in press). This result transfers to the more general class of MS-GAMs if, additionally, the state-specific GAMs are identifiable. Conditions for the latter are simply the same as in any standard GAM. In particular, the nonparametric functions have to be centered around zero. Furthermore, in order to guarantee estimability of a flexible smooth function on a given domain, it is necessary that the covariate values cover that domain sufficiently well. In practice, i.e. when dealing with finite sample sizes, parameter estimation will be difficult if the level of correlation induced by the unobserved Markov chain is low, and also if the state-specific GAMs are similar. The stronger the correlation in the state process, the clearer the pattern becomes, and hence the easier it is for the model to allocate observations to states. Similarly, the estimation performance, in terms of numerical stability, will be best if the state-specific GAMs are clearly distinct. (See also the simulation experiments in Sect. 3 below.)
2.5 Choice of the smoothing parameters
In Sect. 2.4, we described how to fit an MS-GAM to data for a given smoothing parameter vector. To choose adequate smoothing parameters in a data-driven way, generalized cross-validation can be applied. A leave-one-out cross-validation will typically be computationally infeasible. Instead, for a given time series to be analyzed, we generate C random partitions such that in each partition a high percentage of the observations, e.g. 90 %, form the calibration sample, while the remaining observations constitute the validation sample. For each of the C partitions and any \(\varvec{\lambda }=(\lambda _{11},\ldots ,\lambda _{1P}, \ldots ,\lambda _{N1},\ldots ,\lambda _{NP})\), the model is then calibrated by estimating the parameters using only the calibration sample (treating the data points from the validation sample as missing data, which is straightforward using the HMM forward algorithm; see Zucchini and MacDonald 2009). Subsequently, proper scoring rules (Gneiting and Raftery 2007) can be used on the validation sample to assess the model for the given \(\varvec{\lambda }\) and the corresponding calibrated model. For computational convenience, we consider the log-likelihood of the validation sample, under the model fitted in the calibration stage, as the score of interest (now treating the data points from the calibration sample as missing data). From some pre-specified grid \(\varvec{\Lambda } \subset {\mathbb {R}}_{\ge 0}^{N\times P}\), we then select the \(\varvec{\lambda }\) that yields the highest mean score over the C cross-validation samples. The number of samples C needs to be high enough to give meaningful scores (i.e. such that the scores show a clear pattern rather than noise only; from our experience, C should not be smaller than 10), but must not be so high that the approach becomes computationally infeasible.
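The skeleton of this selection procedure can be sketched as follows (a schematic illustration; `score_fn`, which would calibrate the MS-GAM on the calibration indices and return the validation log-likelihood, is a hypothetical placeholder for the actual penalized-likelihood fit):

```python
import numpy as np

def select_lambda(T, lambda_grid, score_fn, C=10, calib_frac=0.9, seed=42):
    """Choose the smoothing parameter with the highest mean
    out-of-sample score over C random calibration/validation splits.

    score_fn(lam, calib_idx, valid_idx) is a placeholder callback that
    is assumed to fit the model on calib_idx (treating the remaining
    points as missing) and return the validation log-likelihood.
    """
    rng = np.random.default_rng(seed)
    n_calib = int(calib_frac * T)
    splits = []
    for _ in range(C):
        perm = rng.permutation(T)
        splits.append((np.sort(perm[:n_calib]), np.sort(perm[n_calib:])))
    means = [np.mean([score_fn(lam, cal, val) for cal, val in splits])
             for lam in lambda_grid]
    return lambda_grid[int(np.argmax(means))]
```

In the multi-covariate case the same loop runs over the full grid \(\varvec{\Lambda }\) of smoothing parameter vectors rather than over scalars; only the bookkeeping changes.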
2.6 Choice of the number of states

- N is too small;
- the distribution of the response variable is inadequate (e.g. due to overdispersion);
- the functional form of the predictor is not flexible enough.
3 Simulation experiments
3.1 Scenario I
The sample mean estimates of the transition probabilities \(\gamma _{11}\) and \(\gamma _{22}\) were obtained as 0.894 (Monte Carlo standard deviation of estimates: 0.029) and 0.896 (0.032), respectively. The estimated functions \(\hat{f}^{(1)}\) and \(\hat{f}^{(2)}\) from all 200 simulation runs are visualized in Fig. 1. The functions have been shifted so that they go through the origin. All fits are fairly reasonable. The sample mean estimates of the predictor value for \(x_{t}=0\) were obtained as 2.002 (0.094) and 1.966 (0.095) for states 1 and 2, respectively.
3.2 Scenario II
For the choice of the smoothing parameter vector, we considered the grid \(\varvec{\Lambda }= \Lambda _1 \times \Lambda _2 \times \Lambda _3 \times \Lambda _4\), where \(\Lambda _1=\Lambda _2=\Lambda _3=\Lambda _4=\{0.25,4,64,1024,16384\}\). The AIC-based smoothing parameter selection led to MISE estimates that overall were marginally lower than their counterparts obtained when using cross-validation (0.555 compared to 0.565, averaged over all four functions being estimated), so again in the following we report the results obtained based on the AIC-type criterion. The (true) function \(f_1^{(2)}\) is in fact a straight line, and, notably, the associated smoothing parameter was chosen as 16384, hence as the maximum possible value from the grid considered, in 129 out of the 200 cases, whereas for example for the function \(f_2^{(2)}\), which has a moderate curvature, the value 16384 was not chosen even once as the smoothing parameter.
In this experiment, the sample mean estimates of the transition probabilities \(\gamma _{11}\) and \(\gamma _{22}\) were obtained as 0.950 (Monte Carlo standard deviation of estimates: 0.011) and 0.948 (0.012), respectively. The estimated functions \(\hat{f}_1^{(1)}\), \(\hat{f}_1^{(2)}\), \(\hat{f}_2^{(1)}\) and \(\hat{f}_2^{(2)}\) from all 200 simulation runs are displayed in Fig. 2. Again all have been shifted so that they go through the origin. The sample mean estimates of the predictor value for \(x_{1t}=x_{2t}=0\) were 0.989 (0.369) and \(-0.940\) (0.261) for states 1 and 2, respectively. The sample mean estimates of the state-dependent error standard deviations, \(\sigma _{1}\) and \(\sigma _{2}\), were obtained as 2.961 (0.107) and 1.980 (0.078), respectively. Again the results are very encouraging, with not a single simulation run leading to a complete failure in terms of capturing the overall pattern.
3.3 Scenario III
4 Real data examples
4.1 Advertising data
We first consider a classic data set on Lydia Pinkham’s annual sales and advertising expenditures during the period 1907–1960. The data set and its background are described in detail in Palda (1965). It comprises the sales in year t, \(y_t\), and the annual advertising expenditures, \(x_t\), of the company. Both figures are given in millions of U.S. dollars. The time series of annual sales displays two distinct peaks, in 1925 and 1945, respectively (see Fig. 1 in Palda 1965). Statistical analyses of such data can aid managers in determining the effectiveness of advertising (Smith et al. 2006).
4.2 Spanish energy prices
Next we analyze data on the daily price of energy in Spain between 2002 and 2008. The data, 1784 observations in total, are available in the R package MSwM (Sanchez-Espigares et al. 2014). We consider the relationship over time between the price of energy, \(y_t\), and the Euro/Dollar exchange rate, \(x_t\). The commonly observed stochastic volatility of financial time series renders it unlikely that the relationship between these two variables is constant over time, and a possible, computationally efficient way to account for this is to consider a Markov-switching model. It is also probable that the two variables’ unknown relationship within a regime has a nonlinear functional form. As in the previous example, in the following we illustrate potential advantages of considering Markov-switching models with flexible nonparametric predictor functions, i.e. MS-GAMs, rather than GAMs or parametric Markov-switching models when analyzing time series regression data.
To this end, we consider four different models for the energy price data. As benchmark models, we considered two parametric models with state-dependent linear predictor \(\beta _0^{(s_t)} + \beta _1^{(s_t)} x_t\), with one (LIN) and two states (MS-LIN), respectively, assuming the response variable \(y_t\) to be normally distributed with state-dependent variance. Additionally, we considered two nonparametric models as introduced in Sect. 2.3, with one state (hence a basic GAM) and two states (MS-GAM), respectively. In these two models, we assumed \(y_t\) to be gamma-distributed, applying the log link function to meet the range restriction for the (positive) mean.
Models were also formally compared using an out-of-sample one-step-ahead forecast evaluation, by means of the sum of the log-likelihoods of observations \(y_u\) under the models fitted to all preceding observations, \(y_1,\ldots ,y_{u-1}\), considering \(u=501,\ldots ,1784\) (such that models are fitted to a reasonable number of observations). We obtained the following log-likelihood scores for each model: \(-2314\) for LIN, \(-2191\) for GAM, \(-2069\) for MS-LIN and \(-1703\) for MS-GAM. Thus, in terms of out-of-sample forecasts, the MS-GAM performed much better than any other model considered. Both two-state models performed much better than the single-state models; however, the inflexibility of the MS-LIN model resulted in a poorer performance than that of its nonparametric counterpart, as clear nonlinear features in the regression data are ignored.
While this second example is simplistic—for example, other explanatory covariates such as the oil price will also heavily affect the energy price—it nevertheless does illustrate the substantially increased flexibility, and hence increased potential to fit the data at hand, of MS-GAMs compared to their simpler parametric counterparts. At the very least, these models can prove useful as exploratory tools to identify key features in time series data with regime-switching patterns, without making any restrictive assumptions on the functional relationships a priori.
5 Concluding remarks
We have exploited the strengths of the HMM machinery and of penalized B-splines to develop a flexible new class of models, MS-GAMs, which show promise as a useful tool in time series regression analysis. A key strength of the inferential approach is its ease of implementation, in particular the ease with which the code, once written for any MS-GAM, can be modified to allow for various model formulations. This makes interactive searches for an optimal model among a suite of candidate formulations practically feasible. Model selection, although not explored in detail in the current work, can be performed along the lines of Celeux and Durand (2008) using cross-validated likelihood, or can be based on AIC-type criteria such as the one we considered for smoothing parameter selection. For more complex model formulations, local maxima of the likelihood can become a challenging problem. In this regard, estimation via the EM algorithm, as suggested in de Souza and Heckman (2014) for a smaller class of models, could potentially be more robust (cf. Bulla and Berzel 2008), but is technically more challenging, not as straightforward to generalize and hence less user-friendly (MacDonald 2014).
In the first example application, to advertising data, we demonstrated that the additional flexibility offered by MS-GAMs can make an important difference regarding the exact quantification of the effect of some covariate (here: advertising expenditure) on some target variable (here: sales), in particular allowing advertising wear-out effects to be quantified accurately. In the second example application, to energy price data, the MS-GAM clearly outperformed the competing models in an out-of-sample comparison. This improvement is due to its accommodation of both the need for regime switches over time and the need to capture nonlinear relationships within a regime. However, even the very flexible MS-GAM exhibited some shortcomings in this example. In particular, it is apparent from the plots, but also from the estimates of the transition probabilities, which indicated a very high persistence of regimes, that the regime-switching model addresses long-term dynamics, but fails to capture the short-term (day-to-day) variations within each regime. In this regard, it would be interesting to explore models that incorporate regime switching (for capturing long-term dynamics induced by persistent market states) but, for example, also autoregressive error terms within states (for capturing short-term fluctuations). Furthermore, the plots motivate a distributional regression approach, where not only the mean but also the variance and potentially other parameters are modeled as functions of the covariates considered. In particular, it is conceptually straightforward to use the suggested type of estimation algorithm also for MS-GAMs for location, scale and shape (GAMLSS; Rigby and Stasinopoulos 2005).
There are various other ways to modify or extend the approach, in a relatively straightforward manner, in order to enlarge the class of models that can be considered. First, as already seen in the application to advertising data, it is of course straightforward to consider semiparametric versions of the model, where some of the functional effects are modeled nonparametrically and others parametrically. Especially for complex models, with high numbers of states and/or high numbers of covariates considered, this can improve numerical stability and decrease the computational burden associated with the smoothing parameter selection. Second, the consideration of interaction terms in the predictor is possible via the use of tensor products of univariate basis functions. Third, the likelihood-based approach also allows for the consideration of more involved dependence structures (e.g. semi-Markov state processes; Langrock and Zucchini 2011). In particular, in the current model formulation we assume that a single univariate state process determines the GAM, such that changes in the state process affect all GAM parameters simultaneously. Conceptually there is no difficulty in devising models where different parts of the GAM are driven by different Markov state processes. However, with such models the dimensionality of the state process, and hence the computational burden, will increase rapidly. Finally, in the case of multiple time series, random effects can be incorporated into a joint MS-GAM formulation.
References
Applegate, E.: The Rise of Advertising in the United States: A History of Innovation to 1960. Scarecrow Press, Lanham (2012)
Bass, F.M., Bruce, N., Majumdar, S., Murthi, B.P.S.: Wearout effects of different advertising themes: a dynamic Bayesian model of the advertising–sales relationship. Mark. Sci. 26, 179–195 (2007)
Bulla, J., Berzel, A.: Computational issues in parameter estimation for stationary hidden Markov models. Comput. Stat. 13, 1–18 (2008)
Celeux, G., Durand, J.P.: Selecting hidden Markov model state number with cross-validated likelihood. Comput. Stat. 23, 541–564 (2008)
Corkindale, D., Newall, J.: Advertising thresholds and wearout. Eur. J. Mark. 12, 329–378 (1978)
de Boor, C.: A Practical Guide to Splines. Springer, Berlin (1978)
de Souza, C.P.E., Heckman, N.E.: Switching nonparametric regression models. J. Nonparametric Stat. 26, 617–637 (2014)
Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. Chapman & Hall/CRC, New York (1993)
Eilers, P.H.C., Marx, B.D.: Flexible smoothing with \(B\)-splines and penalties. Stat. Sci. 11, 89–121 (1996)
Fahrmeir, L., Kneib, T., Lang, S., Marx, B.: Regression: Models, Methods and Applications. Springer, Berlin (2013)
Frühwirth-Schnatter, S.: Finite Mixture and Markov Switching Models. Springer, New York (2006)
Gassiat, E., Cleynen, A., Robin, S.: Inference in finite state space non parametric hidden Markov models and applications. Stat. Comput. (2015). doi:10.1007/s11222-014-9523-8
Gneiting, T., Raftery, A.E.: Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc. 102, 359–378 (2007)
Goldfeld, S.M., Quandt, R.E.: A Markov model for switching regressions. J. Econom. 1, 3–16 (1973)
Gray, R.J.: Flexible methods for analyzing survival data using splines, with application to breast cancer prognosis. J. Am. Stat. Assoc. 87, 942–951 (1992)
Hamilton, J.D.: A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica 57, 357–384 (1989)
Hamilton, J.D.: Regime-switching models. In: Durlauf, S.N., Blume, L.E. (eds.) The New Palgrave Dictionary of Economics, 2nd edn. Palgrave Macmillan, New York (2008)
Kim, C.J., Piger, J., Startz, R.: Estimation of Markov regime-switching regression models with endogenous switching. J. Econom. 143, 263–273 (2008)
Krivobokova, T., Kneib, T., Claeskens, G.: Simultaneous confidence bands for penalized spline estimators. J. Am. Stat. Assoc. 105, 852–863 (2010)
Langrock, R., Zucchini, W.: Hidden Markov models with arbitrary state dwell-time distributions. Comput. Stat. Data Anal. 55, 715–724 (2011)
Langrock, R., Kneib, T., Sohn, A., DeRuiter, S.L.: Nonparametric inference in hidden Markov models using P-splines. Biometrics 71, 520–528 (2015)
MacDonald, I.L.: Numerical maximisation of likelihood: a neglected alternative to EM? Int. Stat. Rev. 82, 296–308 (2014)
Palda, K.S.: The measurement of cumulative advertising effects. J. Bus. 38, 162–179 (1965)
Psaradakis, Z., Spagnolo, F.: On the determination of the number of regimes in Markov-switching autoregressive models. J. Time Ser. Anal. 24, 237–252 (2003)
Rigby, R.A., Stasinopoulos, D.M.: Generalized additive models for location, scale and shape. J. R. Stat. Soc. Ser. C 54, 507–554 (2005)
Sanchez-Espigares, J.A., Lopez-Moreno, A.: MSwM: Fitting Markov-Switching Models. R package version 1.2. http://CRAN.R-project.org/package=MSwM (2014)
Schnabel, R.B., Koontz, J.E., Weiss, B.E.: A modular system of algorithms for unconstrained minimization. ACM Trans. Math. Softw. 11, 419–440 (1985)
Smith, A., Naik, P.A., Tsai, C.H.: Markov-switching model selection using Kullback–Leibler divergence. J. Econom. 134, 553–577 (2006)
Wang, P., Puterman, M.L.: Markov Poisson regression models for discrete time series. Part 1: Methodology. J. Appl. Stat. 26, 855–869 (2001)
Wood, S.: Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC, Boca Raton (2006)
Zucchini, W., MacDonald, I.L.: Hidden Markov Models for Time Series: An Introduction Using R. Chapman & Hall/CRC, Boca Raton (2009)