Goodness-of-fit testing for the marginal distribution of regime-switching models

In this paper we propose a new goodness-of-fit testing scheme for the marginal distribution of regime-switching models. We consider models with an observable (like threshold autoregressions), as well as, a latent state process (like Markov regime-switching). The test is based on the Kolmogorov-Smirnov supremum-distance statistic and the concept of the weighted empirical distribution function. The motivation for this research comes from a recent stream of literature in energy economics concerning electricity spot price models. While the existence of distinct regimes in such data is generally unquestionable (due to the supply stack structure), the actual goodness-of-fit of the models requires statistical validation. We illustrate the proposed scheme by testing whether a commonly used Markov regime-switching model fits deseasonalized electricity prices from the NEPOOL (U.S.) day-ahead market.


2007)
, river flow analysis (Vasas et al., 2007) or earthquake counts (Bulla and Berzel, 2008). This paper is motivated by yet another stream of literature: electricity spot price models in energy economics (Bierbrauer et al., 2007;De Jong, 2006;Huisman and de Jong, 2003;Janczura and Weron, 2010;Bunn, 2008, 2010;Mari, 2008;Misiorek et al., 2006;Weron, 2009). Regime-switching models have seen extensive use in this area due to their relative parsimony (a prerequisite in derivatives pricing) and the ability to capture the unique characteristics of electricity prices (in particular, the spiky and non-linear price behavior). While the existence of distinct regimes in electricity prices is generally unquestionable (being a consequence of the non-linear, heterogeneous supply stack structure in the power markets, see e.g. Eydeland and Wolyniec, 2003;Weron, 2006), the actual goodness-of-fit of the models requires statistical validation.
However, recent work concerning the statistical fit of regime-switching models has been mainly devoted to testing parameter stability versus the regime-switching hypothesis. Several tests have been constructed for the verification of the number of regimes. Most of them exploit the likelihood ratio technique (Cho and White, 2007;Garcia, 1998), but there are also approaches related to recurrence times (Sen and Hsieh, 2009), likelihood criteria (Celeux and Durand, 2008) or the information matrix (Hu and Shin, 2008). Specification tests, like tests for omitted autocorrelation or omitted explanatory variables based on the score function technique, were proposed earlier by . On the other hand, to our best knowledge, procedures for goodnessof-fit testing of the marginal distribution of regime-switching models have not been derived to date (with the exception of Janczura and Weron, 2009, where an ewedf-type test, see Section 3.2.1, was introduced in the context of electricity spot price models). With this paper we want to fill the gap. We propose an empirical distribution function (edf) based testing technique build on the Kolmogorov-Smirnov test. The procedure is developed for regime-switching models with an observable, as well as, a latent state process. The latter involves the concept of the weighted empirical distribution function (wedf).
The paper is structured as follows. In Section 2 we describe the structure of the analyzed regime-switching models and briefly explain the estimation process. In Section 3 we introduce goodness-of-fit testing procedures appropriate for regime-switching models both with observable and latent state processes. Next, in Section 4 we provide a simulation study and check the performance of the proposed technique. Since the motivation for this paper comes from the energy economics literature, in Section 5 we show how the presented testing procedure can be applied to verify the fit of Markov regime-switching models to electricity spot prices. Finally, in Section 6 we conclude.

Model definition
Assume that the observed process X t may be in one of L states (regimes) at time t, driven by a state process R t : (1) The possible specifications of the process R t may be divided into two classes: those where the current state of the process is observable (like threshold models, e.g. TAR, SETAR) and those where it is latent. Probably the most prominent representatives of the second group are the hidden Markov models (HMM; for a review see e.g. Cappe et al., 2005) and their generalizations allowing for temporal dependence within the regimes -the Markov regimeswitching models (MRS). Like in HMMs, in MRS models R t is assumed to be a Markov chain governed by the transition matrix P containing the probabilities p ij of switching from regime i at time t to regime j at time t + 1, for i, j = {1, 2, ..., L}: The current state R t at time t depends on the past only through the most recent value R t−1 . The probability of being in regime j at time t + m starting from regime i at time t is given by where P ′ denotes the transpose of P and e i denotes the ith column of the identity matrix. In general L regime models can be considered. However, for clarity of exposition we limit the discussion in this paper to two regime models only. Note, that this is not a very restrictive limitation -at least in the context of modeling electricity spot prices -since typically two or three regimes are enough to adequately model the dynamics (Janczura and Weron, 2010;Karakatsani and Bunn, 2010). Nonetheless, all presented results are valid for L > 2. The definitions of the individual regimes can be arbitrarily chosen depending on the modeling needs. Again for the sake of clarity, in this paper we focus only on two commonly used in the energy economics literature specifications of MRS models (Ethier and Mount, 1998;De Jong, 2006;Hirsch, 2009;Huisman and de Jong, 2003;Janczura and Weron, 2010;Mari, 2008). The first one (I) assumes that the process X t is driven by two independent regimes: (i) a mean-reverting AR(1) process: where ǫ t ∼ N (0, 1) and (ii) an i.i.d. sample from a specified distribution F 2 : In the second specification (II) X t is described by an AR(1) process having different parameters in each regime, namely:

Calibration
Calibration of regime-switching models with an observable state process boils down to the problem of independently estimating parameters in each regime.
In case of MRS models, though, the calibration process is not straightforward, since the state process is latent and not directly observable. We have to infer the parameters and state process values at the same time. In this paper we use the Expectation-Maximization (EM) algorithm that was first applied to MRS models by Hamilton (1990) and later refined by Kim (1994). It is a two-step iterative procedure, reaching a local maximum of the likelihood function: -Step 1 For a parameter vector θ compute the conditional probabilities P (R t = j|x 1 , ..., x T ; θ) -the so called 'smoothed inferences' -for the process being in regime j at time t. -Step 2 Calculate new and more exact maximum likelihood estimates of θ using the likelihood function, weighted with the smoothed inferences from step 1.
Note, that the introduction of independent regimes -like in specification (I) -results in a significantly increased computational burden. See Janczura and Weron (2011) for an efficient modification of the algorithm to overcome this problem.
3 Goodness-of-fit testing In this Section we introduce a goodness-of-fit testing technique, that can be applied to evaluate the fit of regime-switching models. It is based on the Kolmogorov-Smirnov (K-S) goodness-of-fit test and verifies whether the null hypothesis H 0 that i.i.d. observations come from the distribution specified by the model cannot be rejected. The procedure can be easily adapted to other empirical distribution function (edf) type tests, e.g. Anderson-Darling.

Testing in case of an observable state process
First, we focus on the independent regimes specification (I). The hypothesis H 0 states that the sample (x 1 , x 2 , ..., x T ) is generated from a regimeswitching model with two independent regimes defined as: an AR(1) process (first regime) and i.i.d. F 2 -distributed random variables (second regime). Provided that the values of the state process R t are known, observations can be split into separate subsamples related to each of the regimes. Namely, subsample i consists of all values X t satisfying R t = i. The regimes are independent from each other, but the i.i.d. condition must be satisfied within the subsamples themselves. Therefore the mean-reverting regime observations are substituted by their respective residuals. Precisely, the following transformation is applied to each pair of consecutive AR(1) observations in regime R t = 1: where (k − 1) is the number of latent observations from the mean reverting regime and α, β and σ are the model parameters, see (4). It is straightforward to see that if H 0 is true, transformation h(x t+k,1 , x t,1 , k) applied to consecutive observations from the mean-reverting AR(1) regime leads to a sample of independent and N (0, 1) distributed random variables. Further, observe that transformation h(X t+k,1 , X t,1 , k) is based on subtracting the conditional mean from X t+k,1 and standardizing it with the con- is the conditional expected value of X t+k,1 given (X 1,1 , X 2,1 , ..., X t,1 ) and σ 2 1−(1−β) 2k 1−(1−β) 2 is the respective conditional variance.
A similar transformation can be applied to a more general class of processes given by the stochastic differential equation (SDE) of the form where µ(X t ) and σ(X t ) are deterministic functions of the current (price) level X t and W t is a Wiener process. Processes of this form have been recently used to build regime-switching models for interest rates (Choi, 2009) and electricity prices (Janczura and Weron, 2009). Also note, that the AR(1) process defined by (4) is a discrete time version of the SDE (8) with µ(X t ) = α − βX t and σ(X t ) = σ, which is known as the Vasiček model in the fixed income literature. Using the Euler scheme and rearranging terms of formula (8), we get that has the standard Gaussian distribution. However, since the Euler scheme is an approximation of a continuous process, (9) is valid only for small ∆t (for details on errors of the Euler scheme see e.g. Bally and Talay, 1996). In contrast, transformation (7) is exact. Transformation (7) ensures that the subsample containing observations from the mean-reverting regime is i.i.d. Since the second regime is i.i.d. by definition, standard goodness-of-fit tests based on the empirical distribution function (like the Kolmogorov-Smirnov or Anderson-Darling tests, see e.g. D' Agostino and Stevens, 1986) can be applied to each of the subsamples. Recall that the Kolmogorov-Smirnov test statistic is given by: where F n is the empirical distribution function (edf) and F is the corresponding theoretical cumulative distribution function (cdf). Hence, having an i.i.d. sample (y 1 , y 2 , ..., y n ), the test statistic can be calculated as where I is the indicator function.
The goodness-of-fit of the marginal distribution of the individual regimes, as well as, of the whole model can be formally tested. For the mean-reverting regime F is the standard Gaussian cdf and (y 1 , y 2 , ..., y n ) is the subsample of the standardized residuals obtained by applying transformation (7), while for the second regime F is the model specified cdf (i.e. F 2 ) and (y 1 , y 2 , ..., y n ) is the subsample of respective observations. The 'whole model' testing technique is now straightforward. Indeed, observe that combining both subsamples yields an i.i.d. sample coming from a distribution being a mixture of normal and model-specified (F 2 ) laws. The cdf is then given by where P (R = i) is the probability of the process being in regime i and F i is the cdf related to regime i. Recall, that for the mean-reverting AR(1) regime F 1 is the standard Gaussian cdf. Hence, statistic (11) can be calculated with F given by (12) and the sample (y 1 1 , y 1 2 , ..., y 1 n1 , y 2 1 , y 2 2 , ..., y 2 n2 ), where (y 1 1 , y 1 2 , ..., y 1 n1 ) is the subsample of the standardized residuals given by (7), (y 2 1 , y 2 2 , ..., y 2 n2 ) are the observed values and n 1 + n 2 = T − 1. Now, we focus on the case when the model dynamics is described by an AR(1) process with different parameters in each regime (specification II). The H 0 hypothesis states that the sample (x 1 , x 2 , ..., x T ) is driven by a regimeswitching model defined by equation (6) with R t ∈ {1, 2}. Similarly as in the independent regimes case, the testing procedure is based on extracting the residuals of the mean-reverting process. Indeed, observe that under the H 0 hypothesis the transformation h(x t , x t−1 , 1), defined in (7), with parameters α Rt , β Rt and σ Rt corresponding to the current value of the state process R t , yields an i.i.d. N (0, 1) distributed sample. Thus, the Kolmogorov-Smirnov test can now be applied. The test statistic d n , see (11), is calculated with the standard Gaussian cdf and the sample (y 1 , y 2 , .., y T ) of the standardized residuals, i.e. y t = h(x t , x t−1 , 1).
Note, however, that the described testing procedure is valid only if the parameters of the hypothesized distribution are known. Unfortunately in typical applications the parameters have to be estimated beforehand. If this is the case, then the critical values for the test must be reduced (Čižek et al., 2011). In other words, if the value of the test statistics d n is d, then the p-value is overestimated by P (d n ≥ d). Hence, if this probability is small, then the pvalue will be even smaller and the hypothesis will be rejected. However, if it is large then we have to obtain a more accurate estimate of the p-value.
To cope with this problem, Ross (2002) recommends to use Monte Carlo simulations. In our case the procedure boils down to the following steps. First, the parameter vectorθ is estimated from the dataset and the test statistic d n is calculated according to formula (11). Next,θ is used as a parameter vector for N simulated samples from the assumed process. For each sample the new parameter vectorθ i is estimated and the new test statistic d i n is calculated (like in formula (11)). Finally, the p-value is obtained as the proportion of simulated samples with the test statistic values higher or equal to d n , i.e.
3.2 Testing in case of a latent state process

The ewedf approach
Now, assume that the sample (x 1 , x 2 , ..., x T ) is driven by a MRS model. The regimes are not directly observable and, hence, the standard edf approach can be used only if an identification of the state process is performed first.
Recall that, as a result of the estimation procedure described in Section 2.2, the so called 'smoothed inferences' about the state process are derived. The smoothed inferences are the probabilities that the t-th observation comes from a certain regime given the whole available informationP (R t = i) = P (R t = i|x 1 , x 2 , ..., x T ). Hence, a natural choice is to relate each observation with the most probable regime by letting R t = i ifP (R t = i) > 0.5. Then, the testing procedure described in the previous Section is applicable. However, we have to mention, that the hypothesis H 0 now states that (x 1 , x 2 , ..., x T ) is driven by a regime-switching model with known state process values. We call this approach 'ewedf', which stands for 'equally-weighted empirical distribution function'. It was introduced by Janczura and Weron (2009) in the context of electricity spot price MRS models.

The wedf approach
In the standard goodness-of-fit testing approach based on the edf each observation is taken into account with weight 1 n (i.e. inversely proportional to the size of the sample). However, in MRS models the state process is latent. The estimation procedure (the EM algorithm) only yields the probabilities that a certain observation comes from a given regime. Moreover, in the resulting marginal distribution of the MRS model each observation is, in fact, weighted with the corresponding probability. Therefore, a similar approach should be used in the testing procedure.
For this reason we introduce here the concept of the weighted empirical distribution function (wedf): where y 1 , y 2 , ..., y n is a sample of observations and w 1 , ..., w n are the corresponding weights, such that 0 ≤ w t ≤ M , ∀ t=1,...,n . It is interesting to note, that the notion of the weighted empirical distribution function appears in the literature in different contexts. Maiboroda (1996Maiboroda ( , 2000 applied it to the problem of estimation and testing for homogeneity of components of mixtures with varying coefficients. Withers and Nadarajah (2010) investigated properties of distributions of smooth functionals of F n (x). In both approaches weights were assumed to fulfill the condition n t=1 w t = n. Different choice of weights was given by Huang and Brill (2004). They proposed the level-crossing method to find weights improving efficiency of the edf in the distribution tails. Yet another approach employing weighted distribution is the generalized (weighted) bootstrap technique, see e.g. Haeusler et al. (1991), where specified random weights are used to improve the resampling method.
However, to our best knowledge, none of the applications of wedf is related to goodness-of-fit testing of Markov regime-switching models. Here we use the wedf concept to deal with the case when observations cannot be unambiguously classified to one of the regimes. The only restrictions imposed on the choice of the weights are the ones guarantying that F n (x) is an unbiased and consistent estimator of F (x), namely ∀ t∈N 0 ≤ w t ≤ M and lim n→∞ n t=1 w t = ∞. To see this, note that from the Chebyshev's inequality (Billingsley, 1986, p. 65), for any δ > 0 we have Hence, since E[F n (x)] = F (x), F n (x) converges in probability to F (x).
The following lemma yields a generalization of the K-S test to the case of the wedf (for a proof see the Appendix). Note, that if each w t ≡ 1, it simplifies to the result for the standard Kolmogorov-Smirnov test (Lehmann and Romano, 2005, p. 584).
Lemma 1 If Y 1 , Y 2 , ... are independent, generated from a continuous and strictly monotone distribution F , ∀ t∈N V ar(Y t ) < ∞, 0 ≤ w t ≤ M and lim n→∞ n t=1 w 2 t = ∞ then the statistic converges (weakly) to the Kolmogorov-Smirnov distribution KS as n → ∞.
If hypothesis H 0 is true then, by Lemma 1, the statistic D n asymptotically has the Kolmogorov-Smirnov distribution. Therefore if n is large enough, the following approximation holds where κ ∼ KS and c is the critical value. Hence, the p-value for the analyzed sample (y 1 , y 2 , ..., y n ) can be approximated by P is the test statistic. Note that, for a given value of d n , P (κ > d n ) is the standard Kolmogorov-Smirnov test p-value, so the K-S test tables can be easily applied in the wedf approach. Lemma 1 can be used in case of MRS models. Indeed, if the state process R t is a Markov chain with no transient states and w t = P (R t = j), the assumptions of Lemma 1 are satisfied. Goodness-of-fit of the individual regimes, as well as, of the whole model can be verified. Again, to ensure the i.i.d. condition for the tested sample, the mean-reverting regime observations are substituted with the corresponding residuals.
First, let us focus on the parameter-switching specification (II). The H 0 hypothesis states that the sample (x 1 , x 2 , ..., x T ) is driven by the MRS model defined by equation (6). Observe, that the residuals of the ith regime are then obtained from the transformation h(x t , x t−1 , 1), see (7), with parameters α i , β i and σ i . Hence, the ith regime test statistic d n , see (16), is calculated with the standard Gaussian cdf, the sample (y 1 , y 2 , ..., y T −1 ), where y t = h(x t+1 , x t , 1), and the weights w t =P (R t = i). The goodness-of-fit of the whole model can be verified based on the sample consisting of residuals from both regimes. Precisely, the test statistic d n is derived with the standard Gaussian cdf, the sample (y 1 1 , y 1 2 , ..., y 1 T −1 , y 2 1 , y 2 2 , ..., y 2 T −1 ), where (y i 1 , y i 2 , ..., y i T −1 ) are the ith regime residuals, and the corresponding weights (w 1 1 , w 1 2 , ..., w 1 T −1 , w 2 1 , w 2 2 , ..., w 2 Now, assume that the sample (x 1 , x 2 , ..., x T ) is driven by the MRS model with independent regimes (specification I). First, observe that the values of the mean-reverting regime become latent, when the process is in the second state. As a consequence, the calculation of the conditional mean and variance, required for the derivation of the residuals, is not straightforward. We have: where x t−1 = (x 1 , x 2 , ..., x t−1 ) is the vector of preceding observations. Therefore, the standardized residuals are given by the transformation: where E(X t−1,1 |x t−1 ) and V ar(X t−1,1 |x t−1 ) can be calculated using the following equalities: The latter formula is a consequence of the law of iterated expectation and basic properties of conditional expected values. Finally, the values P (R t = 1|x t ) are calculated from the Bayes rule during the EM estimation procedure (see e.g. Kim, 1994). Note, that if k is such a number that P (R t−1 = 1|x t−1 ) = ... = P (R t−k+1 = 1|x t−k+1 ) = 0 and P (R t−k = 1|x t−k ) = 1, then g leads to transformation (7), i.e. g(X t,1 , x t−1 ) = h(X t,1 , x t−k,1 , k). Now, to test the fit of the mean-reverting regime, it is enough to calculate d n according to formula (16) with the standard Gaussian cdf, the sample (y 1 , y 2 , ..., y T −1 ), where y t = g(x t,1 , x t−1 ) and the weights w t =P (R t = 1). Observe that, the observations from the second regime are i.i.d. by definition, so the testing procedure is straightforward. The test statistic is derived with the F 2 cdf, the sample of non-transformed observations (x 1 , x 2 , ..., x T ) and the corresponding weights w t =P (R t = 2). Again, to verify the goodness-of-fit of the whole model the subsamples corresponding to each of the regimes are combined. If H 0 is true, then the resulting distribution should be a mixture of Gaussian and F 2 laws. The test statistic d n is, hence, calculated as in (16) with the mixture cdf (12) and the sample (y 1 1 , y 1 2 , ..., y 1 T −1 , y 2 1 , y 2 2 , ..., y 2 T −1 ), where (y 1 1 , y 1 2 , ..., y 1 T −1 ) are the residuals, i.e. y 1 t = g(x t,1 , x t−1 ), while (y 2 1 , y 2 2 , ..., y 2 T −1 ) are the non-transformed observed variables, i.e. y 2 t = x t . The corresponding weights are then (w 1 1 , w 1 2 , ..., w 1 T −1 , w 2 1 , w 2 2 ), where w i t =P (R t = i). Note, that like in the case of an observable state process, in the wedf approach we face the problem of estimating values that are later used to compute the test statistic. Since the state process is latent, the weights w t have to be estimated from the dataset. However, Lemma 1 is valid for known weights. Again, this problem can be circumvented with the help of Monte Carlo simulations. The p-values can be computed as the proportion of simulated MRS model trajectories with the test statistic d n , see formula (16), higher or equal to the test statistic value obtained from the dataset.

Simulations
In this Section we check the performance of the procedures introduced in Section 3.2. To this end, we generate 10000 trajectories of two MRS models, Sim #1 and Sim #2, defined in Table 1. The first model follows specification (I), i.e. the first regime is driven by an AR(1) process, while the second regime is described by an i.i.d. sample of log-normally distributed random variables (Sim #1; with parameters α 2 and σ 2 2 , i.e. LN (α 2 , σ 2 2 )). Recall, that a random variable X is log-normally distributed, LN (α 2 , σ 2 2 ), if log(X) ∼ N(α 2 , σ 2 2 ), for X > 0. Sim #2 is simulated from the parameter-switching AR(1) model, i.e. it follows specification (II), see formula (6). The length of each trajectory is 2000 observations. Note, that the regimes of MRS models are not directly observable and, hence, the standard edf approach cannot be used.
We apply the ewedf, as well as, the wedf-based goodness-of-fit test to each simulated trajectory and then calculate the percentage of rejected hypotheses H 0 at the 5% significance level. We assume that the model parameters are known. The computation of E(X t,1 |x t ) in the wedf approach requires backward recursion until the previous observation from the mean-reverting regime is found, see (19). However, as the number of observations is limited, the condition P (R t = 1|x t ) = 1 might not be fulfilled at all. The calibration scheme requires some approximation or an additional assumption. Here we assume that for each simulated trajectory the first observation comes from the meanreverting regime.
In the ewedf approach the tested hypothesis says that the state process is known (and coincides with the proposed classification of the observations to the regimes). As a consequence, once the regimes are identified, it is equivalent to the standard edf approach. To test how it performs for a MRS model with a latent state process we apply it to the simulated trajectories (we first identify the regimes, then test whether the sample is generated from the assumed MRS model).
The results reported in Table 2 indicate that only the wedf-based test yields correct percent of rejected hypotheses. The values obtained for the ewedf-based test are far from the expected level of 5%. This simple example clearly shows that in case of MRS models the wedf approach is more reliable.
Further evidence is provided in Figure 1 where we illustrate the different types of empirical distribution functions. The wedf and ewedf functions are compared with the true edf. Note, that the edf can be calculated only when the simulated state process is known. Naturally, when dealing with real data, Table 1 Parameters of two MRS models analyzed in the simulation study of Section 4. The first model (Sim #1) follows specification (I), i.e. the first regime is driven by an AR(1) model, while the second regime is described by an i.i.d. sample of log-normally distributed random variables. Sim #2 is simulated from the parameter-switching AR(1) model, i.e. it follows specification (II).

Parameters
Probabilities    Table 1 for parameter values. The distribution functions for the model Sim #1 are plotted for the i.i.d. log-normal regime. The distribution functions for the model Sim #2 are plotted for the model residuals.
the state process is latent and, hence, the standard edf cannot be computed. The distribution functions are calculated separately for the log-normal regime of a sample trajectory of the MRS model Sim #1 and for the model residuals of a sample trajectory of the MRS model Sim #2, see Table 1 for parameter values. Observe that, while the wedf function replicates the true edf quite well, the ewedf approximation is not that good. This is in compliance with the rejection percentage given in Table 2. Next, we compare the test results of the wedf approach based on the K-S test tables and Monte Carlo simulations. We calculate the p-values for sample Table 3 Means and standard deviations of the differences between p-values obtained by the wedf approach utilizing K-S test tables and Monte Carlo (MC) simulations. The calculations are based on 1000 simulated trajectories of the MRS model (for parameter details see Table  1). The number of Monte Carlo repetitions is equal to 1000.  Table  1). The number of repetitions in the Monte Carlo approach is 1000. The obtained results are given in Table 3. Observe that the means of the differences between p-values calculated by the two methods are close to zero and the standard deviations do not exceed 0.032. Hence, errors resulting from estimation of weights are quite low. However, if a p-value obtained from the K-S test tables is close to the significance level, the Monte Carlo approach should be used. The simulation results presented so far were obtained with an assumption that the model parameters are known. Unfortunately in typical applications the parameters have to be estimated before the testing procedure is performed. This may result in overestimated p-values. To cope with this problem, as recommended by Ross (2002), we use Monte Carlo simulations. We simulate 500 sample trajectories of the MRS model Sim #1 or Sim #2, estimate the model parameters and apply goodness-of-fit tests. We report the rejection percentage of the K-S test based on the Monte Carlo approach with 500 repetitions, as well as, the K-S test tables. For the Monte Carlo approach the procedure is as follows:

Regime
estimate the parameter vectorθ and calculate the test statistic d n according to formula (11), -simulate N trajectories withθ, -for each trajectory estimate the new parameter vectorθ i and calculate the new test statistic d i n , -calculate p-value as the proportion of simulated trajectories with the test statistic values higher or equal to d n , i.e. 1 N #{i : d i n ≥ d n }. The results obtained for the 5% significance level are given in Table 4.
Looking at the test results based on the K-S test tables, for the ewedf approach the rejection percentage is again too high. On the other hand, for the wedf approach the p-values are overestimated, what results in the rejection percentage much lower than the 5% significance level. Observe that for the model Sim #2 none of the tests were rejected. Therefore, if p-values obtained with the wedf approach are close to the significance level, the test may wrongly not reject a false H 0 hypothesis. This is not the case for the wedf approach with Monte Carlo simulations as the obtained rejection percentage is close to the 5% significance level. This example clearly shows that the results of the Table 4 Percentage of rejected hypotheses H 0 at the 5% significance level calculated from 500 simulated trajectories of the models defined in Table 1 with parameters estimated from the sample. The results of the K-S test based on the ewedf, as well as, the wedf approach are reported independently for the two regimes (First, Second)   wedf test based on the K-S test tables can only be used if the test returns a p-value below the significance level (i.e. if it rejects the H 0 hypothesis) or well above the significance level. However, if the obtained p-value is close to the significance level, Monte Carlo simulations should be performed.

Application to electricity spot prices
Now, we are ready to apply the new goodness-of-fit technique to electricity spot price models. We analyze the mean daily (baseload) day-ahead spot prices from the New England Power Pool SEMASS area (NEPOOL; U.S.). The sample totals 1827 daily observations (or 261 full weeks) and covers the 5-year period January 2, 2006 -January 2, 2011, see Figure 2.
It is well known that electricity spot prices exhibit several characteristic features (Eydeland and Wolyniec, 2003;Weron, 2006), which have to be taken into account when modeling such processes. These include seasonality on the annual, weekly and daily level, mean reversion and abrupt short-lived price changes called spikes. To cope with the seasonality we use the standard time series decomposition approach and let the electricity spot price P t be repre-  sented by a sum of two independent parts: a predictable (seasonal) component f t and a stochastic component X t , i.e. P t = f t + X t . Further, to address the mean reverting and spiky behavior we let the log-prices, i.e. Y t = log(X t ), be driven by: a 2-regime MRS model with mean-reverting, see (4), base regime (R t = 1) and i.i.d. shifted log-normally distributed spikes (R t = 2) -a 3-regime MRS model with mean-reverting, see (4), base regime (R t = 1), i.i.d. shifted log-normally distributed spikes (R t = 2) and i.i.d. drops (R t = 2) distributed according to the inverted shifted log-normal law.
Recall, that X follows the shifted log-normal law (inverted shifted log-normal law) if log(X − q) (log(q − X)) has the Gaussian distribution. Note that q can be chosen arbitrarily, however, here we set it to the median of the dataset. Following Weron (2009) the deseasonalization is conducted in three steps. First, the long term seasonal component (LTSC) T t is estimated from daily spot prices P t using a wavelet filter-smoother of order 6 (for details see Trück et al., 2007). A single non-parametric LTSC is used here to represent the long-term non-periodic fuel price levels, the changing climate/consumption conditions throughout the years and strategic bidding practices. As shown by Janczura and Weron (2010), the wavelet-estimated LTSC pretty well reflects the 'average' fuel price level, understood as a combination of natural gas, crude oil and coal prices.
The price series without the LTSC is obtained by subtracting the T t approximation from P t . Next, the weekly periodicity s t is removed by subtracting the 'average week' calculated as the mean of prices corresponding to each day of the week (U.S. national holidays are treated as the eight day of the week). Finally, the deseasonalized prices, i.e. X t = P t − T t − s t , are shifted so that the minimum of the new process X t is the same as the minimum of P t . The Table 5 Parameters of the MRS model with mean reverting base regime and independent log-normally distributed spikes (and inverted log-normal drops in the 3-regime case) fitted to NEPOOL deseasonalized log-prices.

Model
Base resulting deseasonalized time series can be seen in Figure 2. The estimated model parameters are presented in Table 5. For both analyzed models the K-S test based on the ewedf, as well as, the wedf approach is performed. Moreover, since the standard approach based on the K-S test tables might produce overestimated p-values, the Monte Carlo results are also provided. The obtained p-values are reported in Table 6. For the 2-regime model the p-values obtained from the K-S test tables indicate that the model cannot be rejected at the 5% significance level. However, the base regime and the model p-values are still quite low, so the conclusions of the test should be verified with the Monte Carlo simulations. Indeed, for the Monte Carlo based test only the spike regime yields a satisfactory fit, as the p-value is well above the 5% significance level. The base regime, as well as, the whole model distribution can be rejected at any reasonable level. Apparently, the base regime process cannot model the sudden drops in the NEPOOL logprices. However, if a third regime (modeling price drops) is introduced, the MRS model yields a satisfactory fit. Observe that in the 3-regime case none of the tests can be rejected at the 5% significance level.

Conclusions
While most of the electricity spot price models proposed in the literature are elegant, their fit to empirical data has either been not examined thoroughly or the signs of a bad fit ignored. As the empirical study of Section 5 has shown, even reasonably looking and popular models should be carefully tested before they are put to use in trading or risk management departments. The goodnessof-fit wedf-based test introduced in Section 3.2.2 provides an efficient tool for accepting or rejecting a given Markov regime-switching (MRS) model for a particular data set.
However, in this paper we have not restricted ourselves to MRS models but pursued a more general goal. Namely, we have proposed a goodness-offit testing scheme for the marginal distribution of regime-switching models, including variants with an observable and with a latent state process. For both specifications we have described the testing procedure. The models with a latent state process (i.e. MRS models) required the introduction of the concept of the weighted empirical distribution function (wedf) and a generalization of the Kolmogorov-Smirnov test to yield an efficient testing tool.
We have focused on two commonly used specifications of regime-switching models in the energy economics literature -one with dependent autoregressive states and a second one with independent autoregressive and i.i.d. regimes. Nonetheless, the proposed approach can be easily applied to other specifications of regime-switching models (for instance, to 3-regime models with heteroscedastic base regime dynamics; see Janczura and Weron, 2010). Very likely it can be also extended to other goodness-of-fit edf-type tests, like the Anderson-Darling. As the latter puts more weight to the observations in the tails of the distribution than the Kolmogorov-Smirnov test, it might be more discriminatory and provide a better testing tool for extremely spiky data. Future work will be carried on in this direction.