Bayesian analysis of mixture autoregressive models covering the complete parameter space

Mixture autoregressive (MAR) models provide a flexible way to model time series with predictive distributions which depend on the recent history of the process and are able to accommodate asymmetry and multimodality. Bayesian inference for such models offers the additional advantage of incorporating the uncertainty in the estimated models into the predictions. We introduce a new way of sampling from the posterior distribution of the parameters of MAR models which allows for covering the complete parameter space of the models, unlike previous approaches. We also propose a relabelling algorithm to deal a posteriori with label switching. We apply our new method to simulated and real datasets, discuss the accuracy and performance of our new method, as well as its advantages over previous studies. The idea of density forecasting using MCMC output is also introduced.


Introduction
Mixture autoregressive (MAR) models (Wong and Li 2000) provide a flexible way to model time series with predictive distributions which depend on the recent history of the process. Not only do the predictive distributions change over time, they are also different for different horizons for predictions made at a fixed time point. As a consequence, they inherently accommodate asymmetry, multimodality and heteroskedasticity. For this reason, mixture autoregressive models have been considered a valuable alternative to other models for financial time series, such as the SETAR model (Tong 1990), the Gaussian transition mixture distribution model (Le et al. 1996), or the widely used class of GARCH models (Nelson 1991). Wong and Li (2000) considered estimation of MAR models based on the EM algorithm (Dempster et al. 1977). That method is particularly well suited for mixture-type models and works well. On the other hand, a Bayesian approach can offer the advantage of incorporating the uncertainty in the estimated models into the predictions. Sampietro (2006) presented the first Bayesian analysis of MAR models. In his work, reversible jump MCMC (Green 1995) is used to select the autoregressive orders of the components in the mixture, and models with different number of components are compared using methods by Chib (1995) and Chib and Jeliazkov (2001), which exploit the marginal likelihood identity. In addition, he derives analytically posterior distributions for all parameters in the selected model.
The Bayesian updates of the autoregressive parameters are problematic, because the parameters need to be kept in the stationarity region, which is very complex, and so cannot really be updated independently of each other. In the case of autoregressive (AR) models, it is routine to use parametrisation in terms of partial autocorrelations (Jones 1987), which are subject only to the restriction to be in the interval (−1, 1). Sampietro (2006) adapted this neatly to MAR models by parameterising the autoregressive parameters of each component of the MAR model with the partial autocorrelations of an AR model with those parameters.
A major drawback of Sampietro's sampling algorithm for the autoregressive parameters, is that it restricts the parameters of each component to be in the stationarity region of an autoregressive model. While this guarantees that the MAR model is stationary, it excludes from consideration considerable part of the stationarity region of the MAR model (Wong and Li 2000, p. 98;Boshnakov 2011). Depending on the mixture probabilities, the excluded part can be substantial. For example, most examples in Wong and Li (2000, p. 98) cannot be handled by Sampietro's approach, see also the examples in Section 4. Hossain (2012) developed a full analysis (model selection and sampling), which reduced the constraints of Sampietro's analysis. Using Metropolis-Hastings algorithm and a truncated Gaussian proposal distribution for the moves, he directly simulated the autoregressive parameters from their posterior distribution. This method still imposes a constraint on the autoregressive parameters through the choice of boundaries for the truncated Gaussian proposal. While the truncation is used to keep the parameters in the stationarity region, the choice of boundaries is arbitrary and can leave out a substantial part of the stationarity region of the model. In addition, his reversible jump move for the autoregressive order seems conservative, as it uses functions which always prefer jumps towards low autoregressive orders (this will be seen in Section 3.5).
A common problem associated with mixtures is label switching (see for instance Celeux 2000), which derives from symmetry in the likelihood function. If no prior information is available to distinguish components in the mixture, then the posterior distribution will also be symmetric. It is essential that label switching is detected and handled properly in order to obtain meaningful results. A common way to deal with this, also used by Sampietro (2006) and Hossain (2012), is to impose identifiability constraints. However, it is well known that such constraints may lead to bias and other problems. In the case of MAR models, Hossain (2012) showed that these constraints may affect convergence to the posterior distribution.
We develop a new procedure which resolves the above problems. We propose an alternative Metropolis-Hastings move to sample directly from the posterior distribution of the autoregressive components. Our method covers the complete parameter space. We also propose a way of selecting optimal autoregressive orders using reversible jump MCMC for choosing the autoregressive order of each component in the mixture, which is less conservative than that of Hossain. We propose the use of a relabelling algorithm to deal a posteriori with label switching.
We apply the new method to both simulated and real datasets, and discuss the accuracy and performance of our algorithm, as well as its advantages over previous studies. Finally, we briefly introduce the idea of density forecasting using MCMC output.
The structure of the paper is as follows. In Section 2 we introduce the mixture autoregressive model and the notation we need. In Section 3 we give detailed description of our method for Bayesian analysis of MAR models, including model selection, full description of the sampling algorithm, and the relabelling algorithm to deal with label switching. Section 4 shows results from application of our method to simulated and real dataset. Section 5 introduces the idea of density forecast using MCMC output.
2 The mixture autoregressive model A process {y t } is said to follow a Mixture autoregressive (MAR) process if its distribution function, conditional on past information, can be written as where • F t−1 is the sigma field generated by the process up to (and including) t − 1. Informally, F t−1 denotes all the available information at time t − 1, the most immediate past.
• g is the total number of autoregressive components.
• F k is the distribution function (CDF) of a standardised distribution with location parameter zero and scale parameter one. The corresponding density function will be denoted by f k .
• φ k = (φ k0 , φ k1 , . . . , φ kp k ) is the vector of autoregressive parameters for the k th component, with φ k0 being the shift. Here, p k is the autoregressive order of component k and we define p = max(p k ) to be the largest order among the components. A useful convention is to set φ kj = 0, for p k + 1 ≤ j ≤ p.
• σ k > 0 is the scale parameter for the k th component. We denote by σ = (σ 1 , . . . σ g ) the vector of scale parameters. Furthermore, we define the precision, τ k , of the k th component by τ k = 1/σ 2 k .
• If the process starts at t = 1, then Equation (1) holds for t > p.
We will refer to the model defined by Equation (1) as MAR(g; p 1 , . . . , p g ) model.
The following notation will also be needed. Let The error term associated with the kth component at time t is defined by A useful alternative expression for ν tk is the following mean corrected form: Comparing the two representations we get ( A nice feature of this model is that the one-step predictive distributions are given directly by the specification of the model with Equation (1). The h-steps ahead predictive distributions of y t+h at time t can be obtained by simulation (Wong and Li 2000) or, in the case of Gaussian and α-stable components, analytically (Boshnakov 2009).
We focus here on mixtures of Gaussian components. In this case, using the standard notations Φ and φ for the CDF and PDF of the standard Normal distribution, we have F k ≡ Φ and f k ≡ φ, for k = 1, . . . , g. The model in Equation (1) can hence be written as or, alternatively, in terms of the conditional pdf The correlation structure of a MAR process with maximum order p is similar to that of an AR(p) process. At lag h we have:

Stability of the MAR model
Stationarity conditions for MAR time series have some similarity to those for autoregressions with some notable differences. Below we give the results we need, see Boshnakov (2011) and the references therein for further details. A matrix is stable if and only if all of its eigenvalues have moduli smaller than one (equivalently, lie inside the unit circle). Consider the companion matrices We say that the MAR model is stable if and only if the matrix.
is stable (⊗ is the Kronecker product). If a MAR model is stable, then it can be used as a model for stationary time series. The stability condition is sometimes called stationarity condition. If g = 1, the MAR model reduces to an AR model and the above condition states that the model is stable if and only if A 1 ⊗A 1 is stable, which is equivalent to the same requirement for A 1 . For g > 1, it is still true that if all matrices A k , . . . , A k , k = 1, . . . , g, are stable, then A is also stable. However the inverse is no longer true, i.e. A may be stable even if one or more of the matrices A k are not stable.
What the above means is that the parameters of some of the components of a MAR model may not correspond to stationary AR models. It is convenient to refer to such components as "non-stationary".
Partial autocorrelations are often used as parameters of autoregressive models because they transform the stationarity region of the autoregressive parameters to a hyper-cube with sides (−1, 1). The above discussion shows that the partial autocorrelations corresponding to the components of a MAR model cannot be used as parameters if coverage of the entire stationary region of the MAR model is desired.
3 Bayesian analysis of mixture autoregressive models 3.1 Likelihood function and missing data formulation Given data y 1 , . . . , y n , the likelihood function for the MAR model in the case of Gaussian mixture components takes the form in (see Equation (5)) The likelihood function is not very tractable and a standard approach is to recur to a missing data formulation (Dempster et al. 1977). Let Z t = (Z t1 , . . . , Z tg ) be a latent allocation random variable, where Z t is a g-dimensional vector with entry k equal to 1 if y t comes from the k th component of the mixture, and 0 otherwise. We assume that the Z t s are discrete random variables, independently drawn from the discrete distribution: This setup, widely exploited in the literature (see, for instance Dempster et al. 1977, Diebolt andRobert 1994) allows to rewrite the likelihood function in a much more tractable way as follows: In practice, the Z t s are not available. We adopt a Bayesian approach to deal with this. We set suitable prior distributions on the latent variables and the parameters of the model and develop a methodology for obtaining posterior distributions of the parameters and dealing with other issues arising in the model building process.

Priors setup and choice of hyperparameters
The setup of prior distributions is based on Sampietro (2006) and Hossain (2012). In the absence of any relevant prior information it is natural to assume a priori that each data point is equally likely to be generated from any component, i.e. π 1 = · · · = π g = 1/g. This is a discrete uniform distribution, which is a particular case of the multinomial distribution. The conjugate prior of the latter is the Dirichlet distribution. We therefore set the prior for the mixing weigths vector, π, to π ∼ D (w 1 , . . . , w g ) , w 1 = · · · = w g = 1.
The prior distribution on the component means is a normal distribution with common fixed hyperparameters ζ for the mean and κ for the precision, i.e.
For the component precisions, τ k , a hierarchical approach is adopted, as suggested in Richardson and Green (1997). Here, for a generic k th component the prior is a Gamma distribution with hyperparameters c (fixed) and λ, which itself follows a gamma distribution with fixed hyperparameters a and b. We have therefore The main difference between our approach and that of Sampietro (2006) and Hossain (2012) is in the treatment of the autoregressive parameters. Sampietro (2006) exploits the one-to-one relationship between partial autocorrelations and autoregressive parameters for autoregressive models descirbed in Jones (1987). Namely, he parameterises each MAR component with partial autocorrelations, draws samples from the posterior distribution of the partial autocorrelations via Gibbs-type moves and converts them to autoregressive parameters using the functional relationship between partial autocorrelations and autoregressive parameters. Of course, the term "partial autocorrellations" doesn't refer to the actual partial autocorrellations of the MAR process, they are simply transformed parameters. The advantage of this procedure is that the stability region for the partial autocorrelation parameters is just a hyper-cube with marginals in the interval [−1, 1], while for the AR parameters it is a body whose boundary involves non-linear relationships between the parameters.
A drawback of the partial autocorrelations approach in the MAR case is that it covers only a subset of the stability region of the model. Depending on the other parameters, the loss may be substantial.
Hossain (2012) overcomes the above drawbacks by simulating the AR parameters directly. He uses Random Walk Metropolis, while applying a constraint to the proposal distribution (a truncated Normal). The truncation is chosen as a compromise that ensures that most of the stability region is covered, while keeping a reasonable acceptance rate. Although effective with "well behaved" data, there are scenarios, especially concerning financial examples, in which the loss of information due to a pre-set truncation becomes significant, as will be shown later on. In this paper, we choose Random Walk Metropolis for simulation from the posterior distribution of autoregressive parameters, while exploiting the stability condition to avoid restraining the parameter space a priori.
With the above considerations, for the autoregressive parameters we choose a multivariate uniform distribution with range in the stability region of the model, and independence between parameters is assumed. Hence, for a generic φ k prior distribution is such that: where I denotes the indicator function assuming value 1 if the condition is satisfied and 0 otherwise. We prefer this to a Normal prior as it better allows to explore the parameter space, and detect the presence of multimodality.
Choice of hyperparameters. Here we discuss the settings for the hyperparameters ζ, κ, a, b, and c. We have already discussed that the hyperparameters for the Dirichlet prior distribution on the mixing weights (all equal to 1). Also, λ is a hyperparameter but it is a random variable with distribution which will be fully specified once a and b are.
Following Richardson and Green (1997), let R y = max(y) − min(y) be the length of the interval variation of the dataset. Also fix the two hyperparameters a = 0.2 and c = 2. The remaining hyperparameters are set as follows:

Posterior distributions and acceptance probability for RWM
Following Sampietro (2006) and Hossain (2012), posterior distributions for all but the autoregressive parameters are as follows: where, for k = 1, . . . , g, All these parameters are updated via a Gibbs-type move. Similarly, Z t s are simulated from a multinomial distribution with associated posterior probabilities.
To update autoregressive parameters, let φ k , k = 1, . . . , g, be the set of current states of the autoregressive parameters, i.e. a set of observations from the posterior distribution of φ k . We can simulate φ Here γ k , k = 1, . . . , g is a tuning parameter, chosen in such way that the acceptance rate of RWM is optimal (20 − 25%) for component k. We allow γ k to change between components, but to be constant within the same component. Notice the difference between our proposal and the two-step approach by Sampietro (2006), or the truncated Normal proposal chosen by Hossain (2012). The probability of accepting a move to the proposed φ * k is where q (φ k , φ * k ) = q (φ * k , φ k ), due to the symmetry in the Normal proposal. Therefore, the acceptance probability will only depend on the likelihood ratio of the new set of parameters over the current set of parameters, i.e. where This means that the likelihood ratio for the k th component is independent of current values of parameters for the remaining components. This enables to calculate likelihood ratios separately for each component. The procedure described builds a candidate model with updated mixing weights, shift, scale and autoregressive parameters. However, because stability of such model does not only depend on the autoregressive parameters, we must ensure that the stability condition of Section 2.1 is satisfied. If this is not the case, the candidate model and all its parameters are rejected, and the current state of the chain is set to be the same as at the previous iteration.

Dealing with label switching
Once the samples have been drawn, label switching is dealt with using a kmeans clustering algorithm proposed by Celeux (2000). It is natural to use the identifiability constraint π 1 > π 2 > · · · > π g but it is well known that it is problematic. Examples are given in the discussion to the paper by Richardson and Green (1997). It was shown in fact by Hossain (2012) that applying an identifiability constraint such as π 1 > π 2 > · · · > π g may in some cases affect convergence of the chain. With our approach instead, we do not interfere with the chain during the simulation, and hence convergence is not affected.
Our algorithm works by first choosing the first m simulated values of the output after convergence. The value m shall be chosen small enough for labels switch to not have occurred yet, and large enough to be able to calculate reliable initial values of cluster centres and their respective variances.
Let θ = (θ 1 , . . . , θ g ) be a subset of model parameters of size q, and N the size of the converged sample. For any centre coordinate θ i , i = 1, . . . , q we calculate the mean and variance, based on the first m simulated values, respectively as: We set this to be the "true" permutation of the components, i.e. we now have an initial centerθ (0) with variancess (0) 2 i , i = 1, . . . , q. The remaining g!−1 permutations can be obtained by simply permuting these centres.
From these initial estimates, the r th iteration (r = 1, . . . , N − m) of the procedure consists of two steps: • the parameter vector θ (m+r) is assigned to the cluster such that the normalised squared distance is minimised, whereθ (m+r−1) i is the i th centre coordinate and s (m+r−1) i its standard deviation, at the latest update m + r − 1.
• Centre coordinates and their variances are respectively updated as follows: and (s for i = 1, . . . , q.
For the mixture autoregressive case, it is not always clear which subset of the parameters should be used. In fact, group separation might seem clearer in the mixing weights at times, as well as in the scale or shift parameters. Therefore this method requires graphical assistance, i.e. checking the raw output looking for clear group separation. However, it is advisable not to use the autoregressive parameters, especially when the orders are different.
Once the selected subset has been relabelled, labes for the remaining parameters can be switched accordingly.

Reversible Jump MCMC for choosing autoregressive orders
For this step, we use Reversible Jump MCMC (Green 1995). At each iteration, one component k is randomly chosen from the model. Let p k be the current autoregressive order of this component, and set p max to be the largest possible value p k may assume. For the selected component, we propose to increase or decrease its autoregressive order by 1 with probabilities where b(p k ) = 1 − d(p k ), and such that d(1) = 0 and b(p max ) = 0. Notice that d(p k ) (or equivalently b(p k )) may be any function defined in the interval [0, 1] satisfying such condition. For instance, Hossain (2012) introduced two parametric functions for this step. However, in absence of relevant prior information, we choose b(p k ) = d(p k ) = 0.5 in our analysis, while presenting the method in the general case. Finally, it is necessary to point out that in both scenarios we have a 1-1 mapping between current and proposed model, so that the resulting Jacobian is always equal to 1.
Given a proposed move, we proceed as follows: • If the proposal is to move from p k to p * k = p k − 1, we simply drop φ kp k , and calculate the acceptance probability by multiplying the likelihood ratio and the proposal ratio, i.e.
where φ φ kp k − φ kp k 1/ √ γ k is the density of the parameter dropped out of the model, according to its proposal distribution.
If the candidate model is not stable, then it is automatically rejected, i.e. α M p k , M p * k = 0.
• If the proposed move is from p k to p * k = p k + 1, we proceed by simulating the additional parameter from a suitable distribution. In absence of relevant prior information, the choice is to simulate a value from a uniform distribution centred in 0 and with appropriate range, so that values both close and far apart from 0, both positive and negative, are taken into consideration.
These considerations lead to draw φ kp * k ∼ U (−1.5, 1.5) The acceptance probability is in this case where 3 is the inverse ofthe U (−1.5, 1.5) density.
Once again, if the candidate model is not stable, α M p k , M p * k = 0 and the current model is retained.

Choosing the number of components
To select the appropriate number of autoregressive components in the mixture, we apply the methods proposed by Chib (1995) and Chib and Jeliazkov (2001), respectively, for use of output from Gibbs and Metropolis-Hastings sampling. Both make use of the marginal likelihood identity.
From Bayes theorem, we know that where p(g) is the prior distribution on g, and f (y | g) is the marginal likelihood function, defined as with θ = (φ, π, µ, τ ) being the parameter vector of the model. For any values θ * , p * , number of components g and observed data y, we can use the marginal likelihood identity to decompose the marginal likelihood into parts that are know or can be estimated Notice that the only quantity not readily available in the above equation is p (θ * | p * , y, g). However, this can be estimated by running reduced MCMC simulations for fixed p * (which can be obtained by the RJMCMC method described in Section 5.1), as follows: Once these quantities are estimated (see 25, 26, 27, 28), plug them in Equation (22), together with the other known quantities, to obtain the marginal likelihood for the model with fixed number of components g. For higher accuracy of results, it is suggested to compare marginal likelihood with different g at points of high density in the posterior distribution of θ * . We will use the estimated highest posterior density values.
First, produce a reduced chain of length N j to obtain φ * k , the highest density value for φ k , using the sampling algorithm in Section 4.3, applied to the nonfixed set of parameters only. Define Ψ k * , the set of known (fixed) parameters with the addition of φ * k . From a second reduced chain of length k ) denote acceptance probabilities respectively of the first and second chain. We can finally estimate the value of the posterior density at φ * k aŝ Repeat this procedure for all k = 1, . . . , g and multiply the single densities to obtainp Note that there are no requirements on what N i and N j should be, granted the first chain is long enough to have reached the stationary distribution.

Application
For comparative and demonstrative purposes, we show applications of our method using two simulated datasets from (A) F (y t |F t−1 ) = 0.5Φ y t + 0.5y t−1 1 + 0.5Φ y t − y t−1 2 and (B) respectively with 300 and 600 observations. Process (A) is similar to the one considered by Hossain (2012) and Wong and Li (2000), while (B) was chosen to illustrate in practice how labels switch is dealt with. The issue of labels switch for (B) can be seen in Figure 3, where we show the raw MCMC output with signs of label switch between components 2 and 3 (green and red lines), and the relabelled output after applying the algorithm. The algorithm then proceeds as described in Algorithm 1 below: Algorithm 1 1: for g ← 2, . . . , g max do 2: RJMCMC and determine p * 1 , . . . , p * k 3: Calculate f (y | g) 4: Select g * = max f (y | g), g = 2, . . . , g max 5: Simulate f (θ | y, g * , p * )  Table 1: Results from simulation studies. "Preference" is the proportion of times the model was retained against all models with same number of components.
As we can see from Tables, 1, 2 and 3, and Figures 2 and 4, the "true" model is chosen in both cases, as it has the largest marginal log-likelihood. In addition, true values of the parameters are found in high density regions of their respective posterior distributions.

The IBM common stock closing prices
The IBM common stock closing prices (Box and Jenkins 1976) is a financial time series widely explored several times in the literature (see, for instance Wong and Li 2000). It contains 369 observations from May 17 th 1961 to November 2 nd 1962.
Following previous studies, we consider the series of first order differences. To allow direct comparison with Wong and Li (2000) and Hossain (2012), we set φ k0 = 0, k = 1, 2, 3.
With the procedure outlined in Algorithm 1 our method chooses a MAR(3; 4, 1, 1) to best fit the data, amongst all 2, 3, and 4 component models of maximum order p k = 5, k = 1, . . . , g, with a marginal log-likelihood of −1245.51. We immediately notice that this is different from the selected model in Wong and Li (2000). Such difference may occur as the frequentist approach fails to capture the multimodality in the distribution of certain parameters, which we can clearly see from Figure 6. In fact, by attempting to fit a MAR(3; 4, 1, 1) model by EM-Algorithm from several different starting points, we concluded that this would actually provide a better fit than the MAR(3; 1, 1, 1) chosen by Wong and Li.

The Canadian lynx data
Another dataset widely explored in time series literature, and in our interest by Wong and Li (2000), is the annual record of Canadian lynx trapped in the Mackenzie River district in Canada between 1821 and 1934. This dataset, listed by Elton and Nicholson (1942), includes 111 observations. Following previous studies, we consider the natural logarithm of the data, which presents a typical autoregressive correlation structure with 10 years cycles. We notice the presence of multimodality in the log-data, with two local maxima (see Figure 7). This suggest that the series may be in fact generated by a mixture of two components.
In their analysis, Wong and Li (2000) choose a MAR(2; 2, 2) as best model to fit the data. However, their choice was based on the minimum BIC criterion, which has been acknowledged for not always being reliable for MAR models, particularly with small datasets.
Aiming to have a better insight about the data, we apply our Bayesian method. The selected model is in this case a MAR(2; 1, 2), preferred over a MAR(2; 2, 2) by the algorithm, and to all 2, 3 and 4 component models with autoregressive order p = 1, 2, 3, 4. The marginal log-likelihood of the model is −131.0381. We generated a sample of size 100000 from the posterior distribution of the parameters of the selected MAR(2; 1, 2) model. It is noticed that, for most paramters, the 90% credibility region includes the MLEs obtained by Wong and Li (2000). The only exception stands for the scale parameters, which seem to be slightly larger than such MLEs. However, this may be due to our model containing one fewer AR parameter. On the other hand, these results are in line with the estimates obtained by fitting a MAR(2; 1, 2) using the EM algorithm, since all estimates are well within the corresponding 90% highest posterior density region.   Wong and Li (2000). Sample size is 100000, after 50000 burn-in iterations.

Bayesian density forecasts with mixture autoregressive models
Once a sample from the posterior is obtained, it is useful to use these to make predictions on future (or off-set) observations. Wong and Li (2000) and Boshnakov (2009) respectively introduced a simulation based and an analytical method for for density forecasts assuming a MAR model. The first method relies on Monte Carlo simulations, while the second derives exact h-step ahead predictive distributions of a given observation.
On one hand, we could estimate density forecasts using the highest posterior density values (i.e. the peak of the posterior distribution). However, it is better in this case to exploit the entire simulated sample as follows: 1. Label each simulation from 1 to N , e.g. θ (i) , i = 1, . . . , N .

Estimate the density forecast
In this way, we obtain a sample from the h-steps ahead density forecast of an observation of interest. We estimate the 1-step and 2-steps predictive distributions of the IBM data at t = 258 using the analytical method by Boshnakov (2009), and compare them to the ones obtained by EM algorithm. (see Figure 9). The solid red lines represent the density obtained by Boshnakov (2009) using EM estimates and the exact method. Results of our method are represented by the solid black lines, with the dashed lines as 90% credibility region. The figure also shows how quickly the uncertainty on the predictions grows as we move further in the future, with the 2-step predictive density looking much flatter.
We can see that there are no substantial differences in the shape of these predictive distributions. However, we notice that, particularly for the 2-steps predictor, averaging seems to "stabilise" the density line.
We notice from the plots that, clearly for the 1-step predictor and slighlty for the 2-step predictor, the density obtained by MCMC attaches higher density the observations of interest y 259 and y 260 .

Conclusion
We presented an innovative fully Bayesian analysis of mixture autoregressive models with Gaussian components, in particular a new method for simulation from the posterior distribution of the autoregressive parameters, which covers the whole stationarity region, compared to previous approaches that constrained it in one way or another. Our approach allowed us to better capture presence of multimodality in the posterior distribution of model parameters. We also introduced a way of dealing with label switching that does not interfere with convergence to the posterior distribution of the model parameters. This consisted in using a relabelling algorithm a posteriori. Simulations indicate that the method works well. We presented results for two simulated data sets. In both cases the "true" model was selected, and posterior distributions showed high densities regions around the "true" values of the parameters.
The ability of our method to explore the complete stationarity region of the autoregressive parameters allows it to capture better multimodality of distributions. This was illustrated with the IBM and the Canadian Lynx datasets. In the former ( Figure 6) we saw how multimodality in the posterior distribution of autoregressive parameters was captured, aspects which were missed in the analyses of Hossain (2012), (see for example Figures 3.10 and 3.11). For this example, it was also noticed that modes of posterior distributions of the autoregressive parameters roughly correspond to point estimates obtained by EM estimation. In the latter (Figure 8), we found the mode of φ 21 to be quite distant from 0, with values close to 2 lying in the credibility interval. In this case, the risk with Hossain's method would be to truncate the Normal proposal at points such that a significant part of the stationarity region of the model is not covered. Sampietro's method would have failed to detect such a mode, since it is outside the interval [−1, 1].
In conclusion, we may say that our algorithm provides accurate and informative estimation, and therefore may result in more accurate predictions.
Further work could be done to improve the efficiency of our method. Possible improvements to the method include a different algorithm for sampling of autoregressive parameters.
In particular, acceptance rates for the Random Walk Metropolis moves used for sampling the autoregressive parameters can be rather low for mixtures of large number of components or for components with large autoregressive orders, making the algorithm slow at times, with the added risk of it not being able to explore the complete parameter space efficiently. A different procedure, such as the Metropolis Adjusted Langevin Algorithm (MALA), may be considered to improve the efficiency.
Gaussian mixtures are very flexible but alternatives are worth considering. In particular, components with standardised t-distribution could allow modelling heavier tails with small number of components.
1. For each dataset, we simulate a sample of size 100000 from the posterior distribution of the parameters, after allowing 10000 iterations as burn-in period.
2. For each parameter, we find the overall minimum and maximum over the 400 samples, say l and u. From here, we identify a grid of 512 equally spaced values in the range [l, u], and evaluate the density of such points under each posterior.
3. Finally, we average for each of the points to obtain a unique average density.
The figure below summarises results of applying this procedure. As we can see, the densities are well in line with the true values of the parameters.