Goodness-of-fit testing for the marginal distribution of regime-switching models with an application to electricity spot prices

This paper complements a recently published study (Janczura and Weron in AStA-Adv Stat Anal 96(3):385–407, 2012) on efficient estimation of Markov regime-switching models. Here, we propose a new goodness-of-fit testing scheme for the marginal distribution of such models. We consider models with an observable (like threshold autoregressions) as well as a latent state process (like Markov regime-switching). The test is based on the Kolmogorov–Smirnov supremum-distance statistic and the concept of the weighted empirical distribution function. The motivation for this research comes from a recent stream of literature in energy economics concerning electricity spot price models. While the existence of distinct regimes in such data is generally unquestionable (due to the supply stack structure), the actual goodness-of-fit of the models requires statistical validation. We illustrate the proposed scheme by testing whether commonly used Markov regime-switching models fit deseasonalized electricity prices from the NEPOOL (US) day-ahead market.


Introduction
Regime-switching models have attracted a lot of attention in the recent years. A flexible specification allowing for abrupt changes in model dynamics has led to its popularity not only in econometrics (Choi 2009;Hamilton 1996b;Lux and Morales-Arias 2010), but also in other fields as diverse as traffic modeling (Cetin and Comert 2006), population dynamics (Luo and Mao 2007), river flow analysis (Vasas et al. 2007) or earthquake counts (Bulla and Berzel 2008). This paper is motivated by yet another stream of literature: electricity spot price models in energy economics (Bierbrauer et al. 2007;De Jong 2006;Erlwein et al. 2010;Huisman and de Jong 2003;Weron 2010, 2012;Bunn 2008, 2010;Mari 2008;Misiorek et al. 2006;Weron 2009). Regime-switching models have seen extensive use in this area due to their relative parsimony (a prerequisite in derivatives pricing) and the ability to capture the unique characteristics of electricity prices (in particular, the spiky and non-linear price behavior). While the existence of distinct regimes in electricity prices is generally unquestionable (being a consequence of the non-linear, heterogeneous supply stack structure in the power markets, see e.g., Eydeland and Wolyniec 2012;Weron 2006), the actual goodness-of-fit of the models requires statistical validation.
However, recent work concerning the statistical fit of regime-switching models has been mainly devoted to testing parameter stability versus the regime-switching hypothesis. Several tests have been constructed for the verification of the number of regimes. Most of them exploit the likelihood ratio technique (Cho and White 2007;Garcia 1998), but there are also approaches related to recurrence times (Sen and Hsieh 2009), likelihood criteria (Celeux and Durand 2008) or the information matrix (Hu and Shin 2008). Specification tests to detect autocorrelation and ARCH effects were proposed by Hamilton (1996a, based on the score function technique) and more recently by Smith (2008, utilizing the Rosenblatt transformation; see also Sect. 3.3). Smith found that the performance of the Ljung-Box test improved when used on the normally distributed Rosenblatt transformation. However, the considered Markov regime-switching models were relatively simple and had two states differing only in mean. Interestingly, the Rosenblatt transformation was used earlier for evaluating density forecasts of regime-switching (Berkowitz 2001;Diebold et al. 1998;Haas et al. 2004) and stochastic volatility models (Kim et al. 1998), typically in a risk management context.
On the other hand, to our best knowledge, procedures for goodness-of-fit testing of the marginal distribution of regime-switching models have not been derived to date (with the exception of Janczura and Weron 2009, where an ewedf-type test was introduced in the context of electricity spot price models, see Sect. 3.2.1 for details). With this paper we want to fill the gap and propose empirical distribution function (edf) based testing procedures built on the Kolmogorov-Smirnov test, that are dedicated to regime-switching models with observable as well as latent state processes. In contrast to the approaches based on the Rosenblatt transformation, the techniques proposed in this paper allow for testing the fit in the individual regimes as well as of the whole model. This can be advantageous in many situations as we additionally obtain information on which regimes are correctly and which are incorrectly specified. The derivation of the tests is not straightforward and in the case of a latent state process requires an application of the concept of the weighted empirical distribution function (wedf). Finally, we should also note, that the term "marginal distribution" does not mean that the proposed tests do not account for the dynamic regime structure. On the contrary, the dynamical structure is considered when constructing residuals used in the testing procedures.
The paper is structured as follows: in Sect. 2 we describe the structure of the analyzed regime-switching models and briefly explain the estimation process (for details we refer to an article recently published in AStA; . In Sect. 3, we introduce goodness-of-fit testing procedures appropriate for regime-switching models both with observable and latent state processes. Next, in Sect. 4 we provide a simulation study and check the performance of the proposed techniques. Since the motivation for this paper comes from the energy economics literature, in Sect. 5 we show how the presented testing procedure can be applied to verify the fit of Markov regime-switching models to electricity spot prices. Finally, in Sect. 6 we conclude and outline future work.

Model definition
Assume that the observed process X t may be in one of L states (regimes) at time t, dependent on the state process R t : Possible specifications of the process R t may be divided into two classes: those where the current state of the process is observable (like threshold models, e.g., TAR, SETAR) and those where it is latent. Probably, the most prominent representatives of the second group are the hidden Markov models (HMM; for a review see, e.g., Cappe et al. 2005) and their generalizations allowing for temporal dependence within the regimes-the Markov regime-switching models (MRS). Like in HMMs, in MRS models R t is assumed to be a Markov chain governed by the transition matrix P containing the probabilities p i j of switching from regime i at time t to regime j at time t + 1, for i, j = {1, 2, . . . , L}: The current state R t at time t depends on the past only through the most recent value R t−1 . The probability of being in regime j at time t + m starting from regime i at time t is given by where P denotes the transpose of P and e i is the ith column of the identity matrix.
In general, L regime models can be considered. However, for clarity of exposition we limit the discussion in this paper to two regime models only. Note, that this is not a very restrictive limitation-at least in the context of modeling electricity spot prices-since typically two or three regimes are enough to adequately model the dynamics (Janczura and Weron 2010;Karakatsani and Bunn 2010). Nonetheless, all presented results are also valid for L > 2. The definitions of the individual regimes can be arbitrarily chosen depending on the modeling needs. Again for the sake of clarity, in this paper we focus only on two commonly used in the energy economics literature specifications of MRS models (Ethier and Mount 1998;De Jong 2006;Hirsch 2009;Huisman and de Jong 2003;Janczura and Weron 2010;Mari 2008). The first one (denoted by I or type I) assumes that the process X t is driven by two independent regimes: (1) a mean-reverting AR(1) process: with 0 < β < 1 and σ > 0, where the residuals t s are independent, F 1 -distributed (in the following we assume that F 1 is the standard Gaussian cdf) and (2) an i.i.d. sample from a specified continuous, strictly monotone distribution F 2 : Observe that in such a model specification the values of the first regime X t,1 become latent when the process is in the second state and they do not depend on the realization (trajectory) of the second regime. Such a specification, though computationally challenging, is useful for modeling processes with radically different dynamics in the individual regimes. For instance, in wholesale power markets the price spikes or price drops (i.e., negative price spikes) are typically driven by unexpected changes of market conditions, like a generation outage or a severe heat wave (in case of price spikes) or favorable wind conditions combined with low consumption (in case of price drops). Price spikes occur due to lack of storage capabilities and limited flexibility to respond to sudden changes in supply and/or demand for electricity. When the reason for the spike is over the prices move back to the normal (or base) level, usually irrespective of the magnitude of the extreme prices a few hours or days earlier (Eydeland and Wolyniec 2012;Huisman 2009;Weron 2006). For an example of such a behavior see Fig. 1 in Sect. 5. In the second specification (denoted by II or type II) X t is described by an AR(1) process having different parameters in each regime, namely: where the residuals t s are independent, N (0, 1)-distributed random variables. Again, we assume that 0 < β i < 1 and σ i > 0.

Estimation
Estimation of regime-switching models with an observable state process boils down to the problem of independently estimating parameters in each regime. In case of MRS models, though, the estimation process is not straightforward, since the state process is latent and not directly observable. We have to infer the parameters and state process values at the same time. In this paper, we use a variant of the Expectation-Maximization (EM) algorithm that was first applied to MRS models by Hamilton (1990) and later refined by Kim (1994). It is a two-step iterative procedure, reaching a local maximum of the likelihood function: • Step 1 Denote the observation vector by x T = (x 1 , x 2 , . . . , x T ). For a parameter vector θ (n) compute the conditional probabilities P(R t = i|x T ; θ (n) )-the so called 'smoothed inferences'-for the process being in regime i at time t. • Step 2 Calculate new and more exact maximum likelihood estimates θ (n+1) using the log-likelihood function, weighted with the smoothed inferences from Step 1, i.e., where f i (x t |x t−1 ; θ (n+1) ) is the conditional density of the ith regime, and update the transition probabilities: .
For a detailed description of the estimation procedure see the original paper of Kim (1994) or a recent article of , where an efficient algorithm for MRS models of type I is presented.

Goodness-of-fit testing
In this section, we introduce a goodness-of-fit testing technique, that can be applied to evaluate the fit of regime-switching models. It is based on the Kolmogorov-Smirnov (K-S) goodness-of-fit test and verifies whether the null hypothesis H 0 that observations come from the distribution implied by the model specification cannot be rejected. The procedure can be easily adapted to other empirical distribution function (edf) type tests, like the Anderson-Darling test.
3.1 Testing in case of an observable state process

Specification I
In this case the hypothesis H 0 states that the sample (x 1 , x 2 , . . . , x T ) is generated from a regime-switching model with two independent regimes defined as: an AR(1) process (first regime) and i.i.d. F 2 -distributed random variables (second regime). Provided that the values of the state process R t are known, observations can be split into separate subsamples related to each of the regimes. Namely, subsample i consists of all values X t satisfying R t = i. The regimes are independent from each other, but the i.i.d. condition must be satisfied within the subsamples themselves. Therefore, the meanreverting regime observations are substituted by their respective residuals. Precisely, the following transformation is applied to each pair of consecutive AR(1) observations in regime R t = 1: where (k − 1) is the number of latent observations from the mean-reverting regime (or equivalently the number of observations from the second regime that occurred between two consecutive AR(1) observations) and α, β and σ are the model parameters, see (4). It is straightforward to see that if H 0 is true, transformation h(x t+k,1 , x t,1 , k) applied to consecutive observations from the mean-reverting AR(1) regime leads to a sample (y 1 1 , y 1 2 , . . . , y 1 n 1 ) of independent and conditionally N (0, 1)-distributed random variables. Note, that from now on we use the following notation. The original observed sample is denoted by (x 1 , x 2 , . . . , x T ). The i.i.d. (or conditionally i.i.d. in Sect. 3.2) samples in each of the regimes are denoted by (y 1 1 , y 1 2 , . . . , y 1 n 1 ) and (y 2 1 , y 2 2 , . . . , y 2 n 2 ), with n 1 + n 2 = T − 1. Note, that for the mean-reverting regime these samples are obtained by applying transformation (7).
Further, observe that transformation h(X t+k,1 , X t,1 , k) is based on subtracting the conditional mean from X t+k,1 and standardizing it with the conditional variance.
is the conditional expected value of X t+k,1 given (X 1,1 , X 2,1 , . . . , X t,1 ) and σ 2 1−(1−β) 2k 1−(1−β) 2 is the respective conditional variance. Transformation (7) ensures that the subsample containing observations from the mean-reverting regime is i.i.d. Since the second regime is i.i.d. by definition, standard goodness-of-fit tests based on the empirical distribution function (like the Kolmogorov-Smirnov or Anderson-Darling tests, see e.g., D'Agostino and Stevens 1986) can be applied to each of the subsamples. Recall that the Kolmogorov-Smirnov test statistic is given by: where n is the sample size, F n is the empirical distribution function (edf) and F is the corresponding theoretical cumulative distribution function (cdf). Hence, having an i.i.d. sample (y 1 , y 2 , . . . , y n ), the test statistic can be calculated as where I is the indicator function.
The goodness-of-fit of the marginal distribution of the individual regimes can be formally tested. For the mean-reverting regime, F is the standard Gaussian cdf and (y 1 , y 2 , . . . , y n 1 ) is the subsample of the standardized residuals obtained by applying transformation (7), while for the second regime, F is the model-specified cdf (i.e., F 2 ) and (y 1 , y 2 , . . . , y n 2 ) is the subsample of respective observations. Observe that the 'whole model' goodness-of-fit can be also verified, using the fact that for Indeed, a sample (y 1 1 , y 1 2 , . . . , y 1 n 1 , y 2 1 , y 2 2 , . . . , y 2 n 2 ), where y 1 t s are the standardized residuals of the mean-reverting regime, while y 2 t 's are the transformed variables corresponding to the second regime, i.e., N (0, 1)-distributed and, hence, the testing procedure is applicable.

Specification II
The H 0 hypothesis now states that the sample (x 1 , x 2 , . . . , x T ) is driven by a regimeswitching model defined by Eq. (6) with R t ∈ {1, 2}. Similarly, as in the independent regimes case, the testing procedure is based on extracting the residuals of the meanreverting process. Indeed, observe that under the H 0 hypothesis the transformation h(x t , x t−1 , 1), defined in (7), with parameters α R t , β R t and σ R t corresponding to the current value of the state process R t , yields an i.i.d. N (0, 1) distributed sample. Thus, the Kolmogorov-Smirnov test can be applied. The test statistic d n , see (9), is calculated with the standard Gaussian cdf and the sample (y 1 , y 2 , . . . , y T ) of the standardized residuals, i.e., y t = h(x t , x t−1 , 1).

Critical values
Note, that the described above testing procedure is valid only if the parameters of the hypothesized distribution are known. Unfortunately in typical applications the parameters have to be estimated beforehand. If this is the case, then the critical values for the test must be reduced (Čižek et al. 2011). In other words, if the value of the test statistics d n is d, then the p value is overestimated by P(d n ≥ d). Hence, if this probability is small, then the p value will be even smaller and the hypothesis will be rejected. However, if it is large then we have to obtain a more accurate estimate of the p value.
To cope with this problem, Ross (2002) recommends using Monte Carlo simulations. In our case the procedure reduces to the following steps. First, the parameter vectorθ is estimated from the dataset and the test statistic d n is calculated according to formula (9). Next,θ is used as a parameter vector for N simulated samples from the assumed model. For each sample the new parameter vectorθ i is estimated and the new test statistic d i n is calculated using formula (9). Finally, the p value is obtained as the proportion of simulated samples with the test statistic values higher or equal to d n , i.e., p value = 1 N #{i : d i n ≥ d n }.
3.2 Testing in case of a latent state process

The ewedf approach
Now, assume that the sample (x 1 , x 2 , . . . , x T ) is driven by an MRS model. The regimes are not directly observable and, hence, the standard edf approach can be used only if an identification of the state process is performed first. Recall that, as a result of the estimation procedure described in Sect. 2.2, the so called 'smoothed inferences' about the state process are derived. The smoothed inferences are the probabilities that the tth observation comes from a certain regime given the whole available information P(R t = i|x 1 , x 2 , . . . , x T ). Hence, a natural choice is to relate each observation with the most probable regime by letting R t = i if P(R t = i|x 1 , x 2 , . . . , x T ) > 0.5. Then, the testing procedure described in Sect. 3.1 is applicable. However, we have to mention, that the hypothesis H 0 now states that (x 1 , x 2 , . . . , x T ) is driven by a regime-switching model with known state process values. We call this approach 'ewedf', which stands for 'equally-weighted empirical distribution function'. It was introduced by Janczura and Weron (2009) in the context of electricity spot price MRS models.

The weighted empirical distribution function (wedf)
In the standard goodness-of-fit testing approach based on the edf each observation is taken into account with weight 1 n (i.e., inversely proportional to the size of the sample). However, in MRS models the state process is latent. The estimation procedure (the EM algorithm) only yields the probabilities that a certain observation comes from a given regime. Moreover, in the resulting marginal distribution of the MRS model each observation is, in fact, weighted with the corresponding probability. Therefore, a similar approach should be used in the testing procedure.
For this reason, we introduce here the concept of the weighted empirical distribution function (wedf): where (y 1 , y 2 , . . . , y n ) is a sample of observations and (w 1 , . . . , w n ) are the corresponding weights, such that 0 ≤ w t ≤ M, ∀ t=1,...,n . It is interesting to note, that the notion of the weighted empirical distribution function appears in the literature in different contexts. Maiboroda (1996Maiboroda ( , 2000 applied it to the problem of estimation and testing for homogeneity of components of mixtures with varying coefficients. Withers and Nadarajah (2010) investigated properties of distributions of smooth functionals of F n (x). In both approaches the weights were assumed to fulfill the condition n t=1 w t = n. A different choice of weights was used by Huang and Brill (2004). They proposed the level-crossing method to find weights improving efficiency of the edf in the distribution tails. Yet another approach employing the weighted distribution is the generalized (weighted) bootstrap technique, see e.g., Haeusler et al. (1991), where specified random weights are used to improve the resampling method.
However, to our best knowledge, none of the applications of wedf is related to goodness-of-fit testing of Markov regime-switching models. Here, we use the wedf concept to deal with the case when observations cannot be unambiguously classified to one of the regimes and, hence, a natural choice of weights of wedf seems to be

The wedf approach for specification II
First, let us focus on the parameter-switching specification. The H 0 hypothesis states that the sample x T = (x 1 , x 2 , . . . , x T ) is driven by the MRS model defined by equation (6). Assume that H 0 is true and the model parameters are known. Like in the observable state process case, the test cannot be applied directly to the observed sample. Let y i t s be the transformed variables corresponding to the ith regime, i.e., y i t s are obtained as , then y i t becomes the residual of the ith regime and, hence, has the standard Gaussian distribution. The weighted empirical distribution function (wedf) is then given by: where n is the size of the sample (here n = T − 1). Let be the σ -algebra generated by the state process {R t } t=1,2,...,T , i.e., the state process history up to time T . Observe that the elements of the sum in (11) are conditionally independent given . Indeed, if for a given t, R t = i then the tth component of the sum becomes I {y i t <x} and y i t 's given R t = i form an i.i.d. N (0, 1)-distributed sample. Moreover, the following lemma ensures that the true cdf of the residuals can be approximated by the wedf.

Lemma 1 If H 0 is true, then F n given by (11) is an unbiased, consistent estimator of the distribution of the residuals (in this case Gaussian).
Note, that proofs of all lemmas and theorems formulated in this section can be found in the Appendix.
The following theorem yields a version of the K-S test applicable to parameterswitching MRS model (6). Note, that if the state process was observable, it would boil down to the standard K-S test (Lehmann and Romano 2005, p. 584).
Theorem 1 Let F n be given by (11) and F be the standard Gaussian cdf. If H 0 is true and the model parameters are known, then the statistic

converges (weakly) to the Kolmogorov-Smirnov distribution K S as n → ∞.
If hypothesis H 0 is true then, by Theorem 1, the statistic D n asymptotically has the Kolmogorov-Smirnov distribution. Therefore, if n is large enough, the following approximation holds where κ ∼ K S and c is the critical value. (Recall that is the test statistic. Note that, for a given value of d n , P(κ > d n ) is the standard Kolmogorov-Smirnov test p value, so that the K-S test tables can be easily applied in the wedf approach. The above procedure is applicable to testing the distribution of the residuals of the (whole) model. A similar approach can be used for testing the distributions of the residuals of the individual regimes. Let the wedf for the ith regime be defined as: where again y i t s are the transformed variables corresponding to the ith regime, i.e., An analogue of Theorem 1 can be derived.
Theorem 2 Let F n be given by (15) and assume that R t is an ergodic Markov chain. If H 0 is true and the model parameters are known, then the statistic where

. , n} converges (weakly) to the Kolmogorov-Smirnov distribution K S as n → ∞.
Observe that √ w n can be approximated by . Hence, for a sample of (y i 1 , y i 2 , . . . , y i n ) the test statistic is given by and the standard testing procedure can be applied.

The wedf approach for specification I
Now, assume that the sample (x 1 , x 2 , . . . , x T ) is driven by the MRS model with independent regimes. The results of Theorems 1 and 2 can be applied, however, slight modifications of the tested sample(s) are required. First, observe that the values of the mean-reverting regime become latent, when the process is in the second state. As a consequence, the calculation of the conditional mean and variance, required for the derivation of the residuals, is not straightforward. We have: is the vector of preceding observations. Therefore, the standardized residuals are given by the transformation: where E(X t−1,1 |x t−1 ) and Var(X t−1,1 |x t−1 ) can be calculated using the following equalities: The latter formula is a consequence of the law of iterated expectation and basic properties of conditional expected values. Finally, the values P(R t = 1|x t ) are calculated from the Bayes rule during the EM estimation procedure (see e.g., Kim 1994). Note that the transformed variables (y 1 1 , y 1 2 , . . . , y 1 Now, to test the fit of the mean-reverting regime, it is enough to calculate d i n according to formula (18) with the standard Gaussian cdf and y 1 t = g(x t , x t−1 ). Observe, that the observations from the second regime are i.i.d. by definition, so the testing procedure is straightforward with F 2 cdf and sample (x 1 , x 2 , . . . , x T ). Moreover, the 'whole model' goodness-of-fit can be also verified. Theorem 1 is directly applicable, if the distributions of the samples corresponding to both regimes are the same F = F 1 = F 2 . Observe that, even if F 1 = F 2 , the test still can be applied using the fact that for The test statistic d n is calculated as in (14) with F 1 cdf (here Gaussian) and the sam- are the transformed variables of the mean-reverting regime, i.e., y 1 t = g(x t,1 , x t−1 ), while (y 2 1 , y 2 2 , . . . , y 2 T ) are the variables corresponding to the second regime, i.e., . Note, that as in the case of an observable state process, in the wedf approach we face the problem of estimating values that are later used to compute the test statistic. Again, this problem can be circumvented with the help of Monte Carlo simulations.
The p values can be computed as the proportion of simulated MRS model trajectories with the test statistic d n , see formulas (14) and (18), higher or equal to the value of d n obtained from the dataset.

The Rosenblatt transformation
The new tests proposed in this section are compared with an approach utilizing the Rosenblatt (1952) transformation. The latter is based on the fact that if a sample (x 1 , x 2 , . . . , x T ) is driven by a multivariate distribution F, then the transformed variables y t = F t (x t |x t−1 ), where F t is the corresponding conditional cdf, are independent and uniformly distributed. For the MRS models considered in this paper the transformation is given by where F i t is the conditional distribution of regime i. For specification I, F 1 t is the normal cdf with mean and variance given by (19), while F 2 t is the model defined, see (5), cdf of the second regime. For specification II, F i t is the normal cdf with mean sample under H 0 , the standard edf-type tests-like the Kolmogorov-Smirnov-can be applied. In order to test for the same distribution in all testing approaches, we apply one more transformation. Namely, like Berkowitz (2001), Haas et al. (2004) or Smith (2008), we calculate −1 (y t ) with being the standard normal cdf and obtain an independent The Rosenblatt transformation is a very useful and general tool, however, it can be used to test the goodness-of-fit of the whole model only. In contrast, the ewedf and the wedf approaches allow for testing the fit in the individual regimes. This can be advantageous in many situations as we additionally obtain the information which regimes are correctly and which are incorrectly specified. Moreover, the ewedf and the wedf approaches yield estimators of the regime and model cdfs, providing a readily available tool for further testing and model building. On the other hand, in case of the Rosenblatt transformation an empirical distribution function can only be constructed for the transformed variables making it hard to be interpreted.

Simulations
We now check the performance of the testing procedures introduced in Sect. 3. Due to space limitations, we focus on the more challenging case of a latent state process and consider four 2-regime MRS models defined in Table 1. The parameters of models Sim #1 and Sim #2 are chosen arbitrarily, while those of models Sim #3 and Sim #4 are estimated from the NEPOOL log-prices studied in Sect. 5. Furthermore, models Sim #1 and Sim #3 follow specification I, i.e., the first regime is driven by an AR(1) process, while the second regime is described by an i.i.d. sample of log-normally distributed random variables with parameters α 2 and σ 2 2 , i.e., L N (α 2 , σ 2 2 ). Recall, that a random variable X is log-normally distributed, X ∼ L N (α 2 , σ 2 2 ), if log(X ) ∼ N(α 2 , σ 2 2 ), for X > 0. In order to apply the tests to the 'whole model' (and not only to the individual regimes) we transform the second regime values On the other hand, models Sim #2 and Sim #4 are simulated from the parameter-switching AR(1) model, i.e., follow specification II, see formula (6). Finally note that since the regimes of the considered models are not directly observable, the standard edf-based goodness-of-fit tests cannot be used.

Known model parameters
We generate 10,000 trajectories of each of the four 2-regime MRS models defined in Table 1. The length of each trajectory is 2,000 observations, which corresponds to 5.5 years of daily data (note, that markets for electricity operate 365 day per year). We apply the ewedf, the wedf and the Rosenblatt transformation-based goodness-of-fit tests to each simulated trajectory and then calculate the percentage of rejected hypotheses H 0 at the 5 % significance level. We assume that the model parameters are known. The computation of E(X t,1 |x t ) in the wedf approach requires backward recursion until the previous observation from the mean-reverting regime is found, see (21). However, as the number of observations is limited, the condition P(R t = 1|x t ) = 1 might not be fulfilled at all. The estimation scheme requires some approximation or an additional assumption. Here, we assume that for each simulated trajectory the first observation comes from the mean-reverting regime.
In the ewedf approach the tested hypothesis says that the state process is known (and coincides with the proposed classification of the observations to the regimes). As a consequence, once the regimes are identified, it is equivalent to the standard edf approach. To test how it performs for an MRS model with a latent state process, we apply it to the simulated trajectories (we first identify the regimes, then test whether the sample is generated from the assumed MRS model).
The results reported in Table 2 indicate that only the wedf and Rosenblatt transformation-based tests yield correct percentages of rejected hypotheses. The values obtained for the ewedf-based test are far from the expected level of 5 %. The ewedf approach is more restrictive, probably due to the less flexible classification of the regimes in which the probabilities P(R t = i|x t ) can only take values of 0 or 1. This simple example clearly shows that in case of MRS models the ewedf approach is less reliable.
Finally, to measure the quality of regime classification, we use the regime classification measure (RCM) of Ang and Bekaert (2002), see the last column in Table 2. Since the true regime is a Bernoulli random variable, the RCM statistic is essentially a sample estimate of its variance. The RCM statistic is rescaled so that a value of 0 means perfect regime classification and a value of 100 implies that no information about the regimes is revealed. In our case, the regime classification is very good or good for specification I models used for modeling electricity prices in Sect. 5 (Sim #1 and Sim #3) and good or moderately good for specification II models (Sim #2 and Sim #4). The lower RCM values for type I models also imply that in these models the The results of the K-S test within the ewedf and the wedf approaches are reported in columns 2-4 and 5-7, respectively, independently for the two regimes (First, Second) and for the whole model (Model). Results obtained using the Rosenblatt transformation (Rtrans) are reported in column 8. Finally, the RCM statistic of Ang and Bekaert (2002) averaged over the 10,000 simulated trajectories is reported for each model in the last column regimes are better separated than in the respective type II models. Finally, in models whose parameters are estimated from the NEPOOL log-prices (Sim #3 and Sim #4) the regimes are less separated than in the other two models.

Unknown model parameters
The simulation results presented so far were obtained with the assumption that model parameters are known. Unfortunately, in typical applications the parameters have to be estimated before the testing procedure is performed. This may result in overestimated p values. To cope with this problem, we use Monte Carlo simulations (for details, see e.g., Čižek et al. 2011Čižek et al. , Ross 2002. For each of the 500 trajectories (of 2,000 observations) simulated from each of the four 2-regime MRS models defined in Table 1 the procedure is as follows: 1. Estimate the parameter vector (θ) and calculate the test statistic (d n ) according to formula (9). 2. For 'K-S table'-type ('K-S tab.') estimation calculate the p value using K-S test tables and assuming that the sample comes from a model with parameter vectorθ. 3. For 'MC simulation'-type ('MC sim.') estimation: (a) simulate N = 500 trajectories with parameter vectorθ (these trajectories will be used to compute the estimate of the p value), (b) for each trajectory i = 1, . . . , N estimate the parameter vector (θ i ) and calculate the test statistic (d i n ), (c) calculate the p value as the proportion of simulated trajectories with the test statistic values higher or equal to d n , i.e., 1 N #{i : d i n ≥ d n }.
Note, that in contrast to the simulation study of Sect. 4.1, where we used 10,000 simulated trajectories to obtain the percentages given in Table 2, now we use 500 simulated trajectories for the 'K-S table'-based percentages and 500×500 = 250,000 simulated trajectories for the 'MC simulation'-based percentages in Table 3. Despite the substantially increased computational burden the accuracy of the percentages of rejected hypotheses has decreased as now it is based on only 500 simulations. Looking at the test results based on the K-S test tables ('K-S tab.' in Table 3), for the ewedf approach the rejection percentages deviate significantly from the 5 % level. On the other hand, for the wedf and the Rosenblatt transformation-based approaches the p values are overestimated, what results in rejection percentages much lower than the 5 % significance level. Observe that for most of the models none of the tests were rejected. Therefore, if p values obtained with the wedf or the Rosenblatt transformation-based approaches are close to the significance level, the test may fail to reject a false H 0 hypothesis. This is not the case for the testing approach utilizing Monte Carlo simulations ('MC sim.' in Table 3) as the obtained rejection percentages are close to the 5 % significance level. This example clearly shows that the wedf and the Rosenblatt transformation test based on the K-S test tables can only be used if it returns a p value below the significance level (i.e., if it rejects the H 0 hypothesis) or well above the significance level. However, if the obtained p value is close to the significance level, Monte Carlo simulations should be performed. The results of the tests based on the ewedf as well as the wedf approaches are reported independently for the two regimes (First, Second) and the whole model (Model). Results obtained using the Rosenblatt transformation (Rtrans) are reported in the last column. The tests utilize K-S test tables (K-S tab.) or Monte Carlo simulations (MC sim.) with N = 500 repetitions. Note, that the rejection rates are approximations based on only 500 trajectories, not 10,000 trajectories as in Table 2 4

.3 Power of the tests
In this section we investigate the power of the tests. To this end, for a given MRS model we simulate 500 trajectories of 100, 500 or 2,000 observations each. Next, for each trajectory we calibrate an MRS model with an alternative specification of the regimes and perform goodness-of-fit tests to verify if the simulated trajectory can be driven by the alternative model. Finally, we calculate the percentages of the rejected hypotheses.
• AR-E vs AR-LN The trajectories are simulated from an MRS model following specification I, see (4) and (5), with an exponential distribution in the second regime, i.e., F 2 ∼ Exp(λ). The model is denoted here by AR-E and its parameters are given by: α = 10, β = 0.6, σ 2 = 10, λ = 30, p 11 = 0.6 and p 22 = 0.5. We test whether the simulated trajectories can be driven by a model following specification I with a log-normal distribution in the second regime (i.e., AR-LN). • CIR-LN vs AR-G The trajectories are simulated from an MRS model defined as: where α 1 = 1, β 1 = 0.8, σ 2 1 = 0.5, α 2 = 2, σ 2 2 = 0.5, p 11 = 0.6 and p 22 = 0.5, i.e., the first regime is a discrete time version of the square root process, also known as the CIR process (Cox et al. 1985), and the second is a log-normal random variable. Hence, the name CIR-LN. We test whether the simulated trajectories can be driven by a model following specification I with a Gaussian distribution in the second regime.
The test results are summarized in Table 4. The values obtained for the individual regimes are also provided, however, as the simulated and estimated models differ, these rejection rates are highly dependent on the classification of observations to the regimes during estimation. Therefore, in the discussion that follows we focus on the test results for the whole models. Comparing the power of the Monte Carlo approach with the one using K-S test tables, we observe that in most cases the latter method yields lower (or worse) rejection percentages. This is in compliance with the results obtained in Sect. 4.2. The only significant deviations from this pattern can be observed for the AR-E vs. AR-LN test scenario, when the true model (i.e., AR-E)-and hence the simulated trajectories-exhibit a low degree of regime separation. The latter is manifested by relatively high RCM values of 38.23-41.69 (averaged over 500 simulated trajectories for each of the three sample sizes), compared to RCM values of 8.07-8.28 and 13.68-15.81 for the first and third scenarios, respectively. Note, that the reported RCM values were computed for the true models fitted to the 500 simulated trajectories from these models.
Looking at the MC simulation results obtained for the largest samples of T = 2,000 observations, we can see that in almost all cases the false hypothesis was rejected. The lowest rejection rate for the ewedf approach was 0.7580, for the wedf approach-0.9620 and for the Rosenblatt transformation-0.9900. All three lowest rates were obtained for the challenging AR-E vs. AR-LN test scenario. However, if the samples are smaller, the power of the tests apparently decreases. The sample size of T = 100 observations does not seem to be enough, especially if the degree of regime separation is low, like in the AR-E vs. AR-LN scenario. This is not the case if the definitions of both regimes are significantly different, as for the CIR-LN vs. AR-G scenario, for which the power is satisfactory even if T = 100. Comparing the ewedf and wedf approaches, we can observe that the latter yields higher on average rejection rates. A comparison of the wedf and the Rosenblatt transformation-based approach does not yield such a clear picture. For the AR-ARG1 vs. AR-AR scenario the power of the Rosenblatt transformation-based approach is higher than of the wedf approach, for the AR-E vs. AR-LN scenario it is only slightly higher, while for the CIR-LN Table 4 Percentages of rejected hypotheses H 0 at the 5% significance level for the alternative models with parameters estimated for each of the 500 simulated trajectories of T = 100, 500 or 2000 observations vs. AR-G scenario it is the wedf approach which produces higher rejection rates (for small samples). Next, in Table 5 we consider three analogous scenarios, but this time the parameters of the simulated trajectories are estimated from the NEPOOL log-prices studied in Sect. 5. Like in Table 4, the 'K-S tab.' method generally yields lower (or worse) rejection percentages than the Monte Carlo approach. Comparing all three approaches (ewedf, wedf, Rosenblatt) we find that this time they have roughly the same 'whole model' rejection rates, but-except for the AR-E vs. AR-LN scenario-the power of the tests has decreased. To a large extent this can be explained by regime classification. Now the AR-E vs. AR-LN scenario exhibits much lower RCM values (0.59-0.92; again averaged over 500 simulated trajectories for each of the three sample sizes) than the first (23.76-30.91) and the third (7.08-9.31) scenarios. However, despite almost twice lower RCM values for the third scenario (NEPOOL estimated vs. arbitrary Table 5 Percentages of rejected hypotheses H 0 at the 5 % significance level for the alternative models with parameters estimated for each of the 500 simulated trajectories of T = 100, 500 or 2,000 observations parameters) the rejection percentages for this scenario have substantially decreased. Now they also do not increase with sample size, which may indicate problems with reaching the global maximum within the estimation procedure for small samples. This is not unexpected when conducting Monte Carlo studies of Markov regime-switching models as the likelihood surface can be highly irregular and contain several local maxima (Smith 2008).

Application to electricity spot prices
Now, we are ready to apply the new goodness-of-fit technique to electricity spot price models. We analyze the mean daily (baseload) day-ahead spot prices from the New England Power Pool SEMASS area (NEPOOL; US). The sample totals 1,827 daily observations (or 261 full weeks) and covers the 5-year period January 2, 2006-January 2, 2011, see Fig. 1. It is well known that electricity spot prices exhibit several characteristic features (Benth et al. 2008;Eydeland and Wolyniec 2012;Huisman 2009;Weron 2006), which have to be taken into account when modeling such processes. These include seasonality on the annual, weekly and daily level, mean reversion and price spikes. To cope with the seasonality we use the standard time series decomposition approach and let the electricity spot price P t be represented by a sum of two independent parts: a predictable (seasonal) component f t and a stochastic component X t , i.e., P t = f t + X t . Further, to address the mean-reverting and spiky behavior we let the log-prices, i.e., Y t = log(X t ), be driven by: • a 2-regime MRS model with mean-reverting, see (4), base regime (R t = 1) and i.i.d. shifted log-normally distributed spikes (R t = 2) or • a 3-regime MRS model with mean-reverting, see (4), base regime (R t = 1), i.i.d. shifted log-normally distributed spikes (R t = 2) and i.i.d. drops (R t = 3) distributed according to the inverted shifted log-normal law.
Note that we use here the independent regime specification (i.e., type I), since in wholesale power markets the price spikes ('Second regime') or price drops (i.e., negative price spikes; 'Third regime') are driven by unexpected but transient changes of market conditions. A generation outage or a severe heat wave typically do not last longer than a few hours or a few days; when it is over the prices move back to the normal level ('Base regime'), usually irrespective of the magnitude of the extreme prices a few hours or days earlier (De Jong 2006;Mari 2008).
Furthermore recall, that X follows the shifted log-normal law or inverted shifted log-normal law if log(X − q), respectively log(q − X ), has a Gaussian distribution. The cutoff level q can be different for the spike and the drop regime, however, heremotivated by the results of Janczura and Weron (2013)-we set it to the first and the third quartile of the dataset for drops and spikes, respectively. Using shifted log-normal distributions allows to increase the degree of separation as measured by RCM and, hence, increase the power of the tests (see Sect. 4). For instance, the 2-regime model now yields an RCM of 4.79, compared to 12.81 for Sim #3 in Table 2. It also leads to more fundamentally justified models since electricity price spikes are generally connected with scheduling units with higher marginal costs (like gas turbines, see e.g., Eydeland and Wolyniec 2012;Weron 2006).
Finally note, that such simple one-factor models may not be complex enough to address all features of electricity prices. In particular, the electricity forward prices implied by these spot price models exhibit the so-called Samuelson effect (i.e., a decrease in volatility with increasing time to maturity; for the considered models the volatility scales as e −β(T −t) ), but the rate of decrease is completely determined by the speed of mean-reversion β . However, the rate of decrease should be large only for maturities up to a year (Kiesel et al. 2009). Perhaps, incorporating another stochastic factor would lead to a more realistic forward price curve.
Following Weron (2009) the deseasonalization is conducted in three steps; for a thorough study of modeling seasonal components in electricity spot prices we refer to . First, the long-term seasonal component (LTSC) T t is estimated from daily spot prices P t using a wavelet filter-smoother of order 6. A single non-parametric LTSC is used here to represent the long-term non-periodic fuel price levels, the changing climate/consumption conditions throughout the years and strategic bidding practices. As shown by Janczura and Weron (2010), the wavelet-estimated LTSC pretty well reflects the 'average' fuel price level, understood as a combination of natural gas, crude oil and coal prices; see also Eydeland and Wolyniec (2012) and Karakatsani and Bunn (2010) for a treatment of fundamental and behavioral drivers of electricity prices. On the other hand, as discussed recently in , the use of the wavelet-based LTSC is somewhat controversial. Predicting it beyond the next few weeks is a difficult task, because individual wavelet functions are quite localized in time or (more generally) in space. Preliminary research suggests, however, that despite this feature the wavelet-based LTSC can be extrapolated into the future yielding a better on-average prediction of the level of future spot prices than an extrapolation of a sinusoidal LTSC (Nowotarski et al. 2011).
The price series without the LTSC is obtained by subtracting the T t approximation from P t . Next, the weekly periodicity s t is removed by subtracting the 'average week' calculated as the mean of prices corresponding to each day of the week (the US national holidays are treated as the 8-day of the week). Finally, the deseasonalized prices, i.e., X t = P t − T t − s t , are shifted so that the minimum of the new process X t is the same as the minimum of P t . The resulting deseasonalized time series can be seen in Figs. 2, 3. The estimated model parameters are presented in Table 6.
For both analyzed MRS models, tests based on the ewedf, the wedf and the Rosenblatt transformation are performed. The p values in Table 7 are reported both for the standard approach utilizing the K-S test tables (which generally leads to overestimated p values) and the much slower but more accurate Monte Carlo setup. The testing procedure in the Monte Carlo case is analogous to the one used in the simulation study, see Sect. 4.2 for a detailed description. Again, in order to verify the 'whole model' goodness-of-fit, we transform the spike and drop regime observations so that both samples are N (0, 1)-distributed.
For the 2-regime model the p values obtained from the K-S test tables indicate that the model cannot be rejected at the 5 % significance level. However, the base regime and the model p values are still quite low-especially for the wedf approach-so the

Fig. 3
Estimation results for the 3-regime MRS model with a mean-reverting base regime and independent log-normally distributed 'spikes' and 'drops' fitted to NEPOOL log-prices. Observations with P(R t = 2|x T ) > 0.5 or P(R t = 3|x T ) > 0.5, i.e., the 'spikes' or 'drops', are denoted by dots or 'x' in the upper panel. The lower panels display the probability P(R t = 2|x T ) or P(R t = 2|x T ) of being in the 'spike' or 'drop' regime, respectively conclusions of the test should be verified with the Monte Carlo simulations. Indeed, for the Monte Carlo based test only the spike regime yields a satisfactory fit, as the p value is well above the 5 % significance level. The base regime, as well as the whole model distribution, can be rejected at any reasonable level for the wedf and at the 5 % level for the Rosenblatt transformation. Apparently, the base regime process cannot model the sudden drops in the NEPOOL log-prices. However, if a third regime (modeling price drops) is introduced, the MRS model yields a satisfactory fit. In the 3-regime case none of the tests can be rejected at the 5 % significance level. Regime classification for the 2-and 3-regime models can be observed in Figs. 2 and 3, respectively.

Conclusions
While most of the electricity spot price models proposed in the literature are elegant, their fit to empirical data has either been not examined thoroughly or the signs of a bad fit ignored. As the empirical study of Sect. 5 has shown, even reasonably looking and popular models should be carefully tested before they are put to use in trading or risk management departments. The goodness-of-fit wedf-based test introduced in Sect. 3.2.2 provides an efficient tool for accepting or rejecting a given Markov regimeswitching (MRS) model for a particular data set. While its performance (including power; see Sect. 4.3) is similar to that of the Rosenblatt transformation-based approach, it provides an edge over the latter by yielding p values for the individual regimes. For instance, this allows to observe that in the 3-regime model the worst fit is obtained for the base regime. Perhaps the simple AR(1) structure is not enough to model the complex dynamics of electricity spot prices in the relatively calm, non-spiky periods. However, in this paper we have not restricted ourselves to MRS models but pursued a more general goal. Namely, we have proposed a goodness-of-fit testing scheme for the marginal distribution of regime-switching models, including variants with an observable and with a latent state process. For both specifications we have described the testing procedure. The models with a latent state process (i.e., MRS models) required the introduction of the concept of the weighted empirical distribution function (wedf) and a generalization of the Kolmogorov-Smirnov test to yield an efficient testing tool.
We have focused on two commonly used specifications of regime-switching models in the energy economics literature-one with dependent autoregressive states and a second one with independent autoregressive and i.i.d. regimes. Nonetheless, the proposed approach can be easily applied to other specifications of regime-switching models (for instance, to 3-regime models with heteroscedastic base regime dynamics; see Janczura and Weron 2010). Very likely it can be also extended to other goodnessof-fit edf-type tests, like the Anderson-Darling. As the latter puts more weight to the observations in the tails of the distribution than the Kolmogorov-Smirnov test, it might be more discriminatory and provide a better testing tool for extremely spiky data. Future work will be conducted in this direction.
Finally note, that a good in-sample fit does not necessarily imply a good forecast behavior. Although Kosater and Mosler (2006) found for German electricity price data that for long run point forecasts (30-80 days ahead) an MRS model with regimes driven by two AR(1) processes was slightly more accurate than a simple AR(1) model, for shorter time horizons both model classes performed alike. It remains an open question how do the MRS models fitted to NEPOOL log-prices in Sect. 5 perform in terms of forecasting. The adequacy of MRS models for forecasting in general has been questioned by Bessec and Bouabdallah (2005). However, as Weron and Misiorek (2008) have shown, regime-switching models may behave better than their linear competitors in volatile periods. They might also have an edge in density forecasts, but this has to be verified yet.
where F(x) is the distribution of the residuals and is the σ -algebra generated by the state process values. Similarly, Next, from the conditional Kolmogorov inequality (for details see Prakasa Rao 2009), for any δ > 0 we have (a.s.): As a consequence, and F n (x) converges in probability to F(x) as n → ∞.
. Therefore, in the following we will limit ourselves to the case 0 < F(x) < 1. Let where Y i t are the transformed variables of the ith regime, i.e., n I {Y i t <x} and Y i t becomes the residual of the ith regime. Therefore, Y 1 , Y 2 , . . . , Y T are -independent, where is the σ -algebra generated by the state process {R t } t=1,2,...,T , and have the same conditional distribution. Hence, from the conditional version of the Central Limit Theorem (for details see Grzenda and Zieba 2008), we have: Next, note that E(Y t | ) = 1 n F(x), see equation (24), and Var( (25). Hence, (30) yields: The latter is equivalent to where W B y is a Brownian bridge, i.e., W B y ∼ N (0, y(1 − y)), see e.g., Lehmann and Romano (2005, p. 585). Let y = F(x) and observe that I {y i t <x} = I {y i t <F −1 (y)} = I {F(y i t )<y} . Moreover, if y i t are F-distributed, then F(y i t ) are driven by the uniform distribution on [0, 1]. Formula (32) ensures that Z n (y) = √ n[F n (F −1 (y)) − y] converges pointwise to a Brownian bridge. In order to prove the convergence of Z n (y) in D([0, 1]), i.e., in the space of right continuous functions that have left-hand limits, it is enough to show: (1) the weak convergence of the finite dimensional distributions of Z n and that (2) E |Z n (y) − Z n (y 1 )| γ |Z n (y 2 ) − Z n (y)| γ ≤ [g(y 2 ) − g(y 1 )] 2α (33) for y 1 < y < y 2 and n ≥ 1, where γ ≥ 0, α > 1/2 and g is a non-decreasing, continuous function on [0,1] (see Theorem 15.6 in Billingsley 1968) (1) Let 0 < y 1 < y 2 < 1. We will show that (Z n (y 1 ), Z n (y 2 ) − Z n (y 1 )) converges weakly to (W B y 1 , W B y 2 −W B y 1 ). First, observe that (Z n (y 1 ), Z n (y 2 )− Z n (y 1 )) conditional on is multinomially distributed with variances (y 1 (1 − y 1 ), (y 2 − y 1 )[1 − (y 2 − y 1 )]) and covariance −y 1 (y 2 − y 1 ). This can be calculated using the same arguments as in the proof of Lemma 1. Hence, by the central limit theorem for multinomial trials, as n → ∞, for any s ∈ R 2 we have ϕ (Z n (y 1 ),Z n (y 2 )−Z n (y 1 )) (s) → ϕ ( where ϕ Z is the characteristic function of Z and ϕ Z is the conditional (on ) characteristic function of Z . Finally, by the dominated convergence theorem, (Z n (y 1 ), Z n (y 2 )− Z n (y 2 )) converges weakly to (W y 1 , W y 2 −W y 1 ). The convergence of finite dimensional distributions for any 0 < y 1 < y 2 < ... < y m < 1 follows with the same arguments.
where is the σ -algebra generated by the state process values. ( p i j ) n × n 0 1 n + n − 1 1 1 n − 1 + · · · + n n − 1 1 for q = max i, j=1,2 ( p i j ) n . If q < 1, then F n (x) converges in probability to F(x) for n → ∞. is given by:

Proof [Theorem 2] Let
where is the σ -algebra generated by the state process values.
Moreover, from (38) and (39)   Finally, from (44) and (45) we have and the proof is completed with the same arguments as in the proof of Theorem 1.