1 Introduction

Air pollution is a common hazard to which inhabitants of large and medium sized cities around the world are exposed. Among the many pollutants present in those cities are ozone (O\(_{3}\)) and particulate matter with a diameter smaller than 10 microns (PM\(_{10}\)). The health hazard caused by these two pollutants are well known. For instance, high levels of ozone may make a susceptible part of the population staying in that environment for a long period of time to experience upper respiratory system problems as well as eyes irritation (see, for example, Bell et al. 2004, 2005; Cifuentes et al. 2001; Dockery et al. 1992; Galizia and Kinney 1999; O’Neill et al. 2004; Gauderman et al. 2004; Gouveia and Fletcher 2000; Loomis et al. 1996; Martins et al. 2002; WHO 2006). In the case of PM\(_{10}\), they may also irritate the respiratory system, and there is evidence that long-term exposure to this pollutant may increase the risk of cardiovascular disease and lung cancer (Gauderman et al. 2004; Mauderly 1997; Mauderly and Oberdörster 1997; Thurston 1996; Itô and Thurston 1996). Additionally, if pregnant women are exposed to these two pollutants, adverse effects may be present in the newborn. Therefore, it is very important to study their behaviours.

A first step to understanding the behaviours of these, and other pollutants in general, is to monitor their levels. Hence, environmental authorities throughout the world have assembled monitoring networks in problematic and not-so-problematic cities. Also, environmental standards have been implemented along with many other actions aimed to decreasing air pollution levels and avoid population exposure.

Mexico is among the many countries adopting measures to control air pollution levels. Since the late 1980s, environmental authorities at the federal level and, in particular, in Mexico City, have been implementing measures such as restrictions on the number of vehicles circulating in the metropolitan area, regulation of more than 300 industries, restrictions on the circulation of some types of vehicles on Saturdays, a decrease of the country’s environmental thresholds to declare the occurrence of an environmental alert (NOM 2014a, b). For a timeline of the implementation of the many measures, see, for instance, Achcar et al. (2011).

Nowadays, emergency alerts in Mexico City are based on exceedances of ozone and/or PM\(_{10}\) with a lately implemented protocol for particulate matter with a diameter smaller that 2.5 microns (PM\(_{2.5}\)). In the present work one interest is in estimating the probability of having one of these type of exceedances in a given day, taking into account the present and past information about these pollutants. We are also interested in estimating the probability that in a given day in the year we have either no exceedances, exceedance of one of the pollutants or exceedances of both. In order to do that, we use a bivariate non-homogeneous Markov model of order \(K \ge 0\). The parameters of the model will be estimated using the Bayesian point of view.

The main aim of the present work is to understand the historical behaviour of the data. As a secondary aim, we are interested in providing a way of warning the possible occurrences of exceedances of the Mexican air quality thresholds for ozone and PM\(_{10}\). Hence, if the historical data suggest a high probability of an exceedance occurring, then close attention should be paid to the present behaviour of the data, and if this behaviour suggests the continuity of the past behaviour, then some measures could be taken in advance to prevent population exposure to high levels of pollution. Note that due to the non stationarity of the data, the estimated probability of occurrence of an exceedance may be higher than it really is; however, that is not a bad thing since one of the aim is just to alert the population and be on the safe side.

Non-homogeneous Markov models have already been used to study problems in air pollution. For instance, in Paroli et al. (2005) we have a non-homogeneous Markov mixture of periodic auto-regressive models to study air pollution data from the lagoon of Venice, Italy. A Gibbs sampling algorithm estimates the parameters within the Bayesian framework. In Lagona et al. (2011) non-homogeneous hidden Markov models are used to study multi-pollutants exceedances problems. In that work an E-M type algorithm estimates the parameters involved in the model. Also, Rodrigues et al. (2015) consider a non-homogenous version of the model used in Álvarez et al. (2005) to study the sequence recording the ozone exceedances in Mexico City. The maximum a posteriori method is used to estimate the order and the transition probabilities of the chain. In Rodrigues et al. (2019), a non-homogeneous Markov chain with transition probabilities having a seasonal behaviour rules the sequence recording whether or not an ozone exceedance has occurred in a given day. A similar approach was also taken by Paroli et al. (2005).

Other environmental areas also use non-homogeneous Markov chain models. For instance, Rajagopalan et al. (1996) and Hughes et al. (1999) use non-homogeneous Markov models to study the occurrence of precipitation. In Drton et al. (2003), these models are used to study occurrence of tornado activity.

Although in the present work we also consider the non-homogeneous Markov chain approach and the Bayesian point of view to estimate the parameters present in the model, the novelty here is that two or more pollutants may be studied via a multivariate non-homogeneous Markov chain of order \(K \ge 0\) model. We assume that the order of the chain as well as its initial and transition probabilities are unknown and need to be estimated. The chosen estimation methods are the maximum a posteriori and the leave-one-out cross validation (see, for instance, Chang 2019) for the order of the chain and, having this information, a sample of the initial and transition probabilities generated using their posterior distributions are used to estimate them.

This work is organised as follows. In Sect. 2, we describe the mathematical model. Section 3 gives the Bayesian formulation of the model. In Sect. 4 an application to the case of Mexico City ozone and PM\(_{10}\) data is presented. In Sect. 5, a discussion of the results as well as some additional comments are given. An appendix, given before the list of references, presents some additional results as well as a simulation study to corroborate de choice of the order of the chain using the methods considered here. This work also has a Supplementary Material file presenting the tables with the values of the estimated probabilities of interest in the case of the selected model.

2 A multivariate non-homogeneous Markov chain model

Let \(N > 0\) and \(T_{l} > 0\) be given natural numbers representing, respectively, the number of years and the number of days in the lth year where pollutants’ measurements were taken, \(l = 1, 2, \ldots , N\). Hence, depending on whether or not we have a leap year, either \(T_{l} = 366\) or \(T_{l} = 365\), \(l = 1, 2, \ldots , N\). Following Drton et al. (2003), we take \(T_{l} = T = 366\), \(l = 1, 2, \ldots , N\), with the convention that for non-leap year a measurement is set to zero when \(t = 366\). Hence, from now on, we drop the index “l” from the notation of the observational period.

Remark

We could also leave each year with its original length. In Rodrigues et al. (2019) that was performed in the model used there. The aim was to verify whether or not significant changes in the results could occur. The findings were that it does not matter if we take \(T_{l} = T = 366\) for all \(l = 1, 2, \ldots , N\), or if we consider the original number of days for each year. Hence, in order to simplify the model and the computer programmes, we keep the convention that \(T_{l} = T = 366\), \(l = 1, 2, \ldots , N\), assigning value zero to the measurement if we have \(t = 366\) and a non-leap year.

Assume that there are \(d \ge 1\) pollutants we are interested in studying. Let \(Z_{i, t}\) be the measurement of the ith pollutant on the tth day, \(t = 1, 2, \ldots , T\), \(i = 1, 2, \ldots , d\). Denote by \(\textbf{Z}_{t} = (Z_{1, t}, Z_{2, t}, \ldots , Z_{d, t})\) the vector recording the pollutants’ measurements at time t in a given year, \(t = 1, 2, \ldots , T\). Indicate by \(L_{i} > 0\) the environmental threshold for the ith pollutant, \(i = 1, 2, \ldots , d\). Consider \(\textbf{E} = \{ \textbf{E}_{t} = (E_{1, t}, E_{2, t}, \ldots , E_{d, t}) \,: \, t \ge 0 \}\) the process defined as follows,

$$\begin{aligned} E_{i, t} = \left\{ \begin{array}{ll} 0, \text{ if } Z_{i, t} < L_{i}, \\ 1, \text{ if } Z_{i, t} \ge L_{i}, \end{array} \right. \end{aligned}$$
(1)

\(i = 1, 2, \ldots , d\), \(t \ge 0\). Hence, \(E_{i, t}\) indicates whether or not on the tth day of a given year the ith pollutant’s measurement exceeds its corresponding environmental threshold, \(i = 1, 2, \ldots , d\), \(t = 1, 2, \ldots , T\).

By definition of the vector \(\textbf{E}_{t}\), it is possible to see that it assumes values on the set \(S_{1} = \{ (x_{1}, x_{2}, \ldots , x_{d}) \in \{ 0, 1 \}^{d} \}\). This could be viewed as the binary decomposition of a number in \(S_{2} = \{ 0, 1, \ldots , 2^{d} - 1 \}\) since we can associate a vector in \(S_{1}\) to a number in \(S_{2}\) using the transformation \(f: S_{1} \rightarrow S_{2}\) given by \(f(x_{1}, x_{2}, \ldots , x_{d}) = \sum _{i=0}^{d-1} x_{i+1} \, 2^{i}\). Hence, the process \(\textbf{E}\), which assume values on a d-dimensional space, may be viewed as a unidimensional process assuming values in \(S_{2}\). Let \(\textbf{W} = \{ W_{t} \,: \, t \ge 0 \}\) indicate that process. Therefore, if we denote by \(\tilde{w}\) an element of \(S_{2}\), let \(\textbf{x} = (x_{1}, x_{2}, \ldots , x_{d}) \leftrightarrow \tilde{w}\) indicate that \(\tilde{w} \in S_{2}\) corresponds to the state \(\textbf{x} \in S_{1}\) via the transformation \(f(\cdot )\).

Remarks

1. On occasions the “tilde” used to indicate an element in \(S_{2}\), will be dropped. That will be made when numbers, instead of letter, are used to indicate the state in \(S_{2}\).

2. Since the aim here is to analyse whether or not exceedances of the the thresholds occurred, the processes \(\textbf{E}\) and \(\textbf{W}\) can be used without problems. Note that we are not interested in studying the possible dependence between the pollutants’ measurements and the way the concentration of one of them affects the other. That is studied elsewhere.

3. Up to now we have three sequences. The sequence \(\textbf{Z} = \{ \textbf{Z}_{t} \,: \, t \ge 0 \}\) which records the actual pollutants’ measurements, the sequence \(\textbf{E} = \{ \textbf{E}_{t} : \, t \ge 0 \}\) which is the transformed sequence \(\textbf{Z}\) and records whether or not the pollutants’ measurements are above the associated thresholds and the sequence \(\textbf{W} = \{ W_{t} \,: \, t \ge 0 \}\) which records the univariate values associated with \(\textbf{E}\) through the function \(f(\cdot )\).

As in Álvarez et al. (2005) and Rodrigues et al. (2015), we assume that \(\textbf{W}\) is ruled by a non-homogeneous Markov chain of order \(K \ge 0\). We also assume that K is a random variable with state space \(\mathcal{S} = \{ 0, 1, 2, \ldots , M \}\), where \(M \ge 0\) is a given fixed integer number. Denote this chain by \(\textbf{X}^{(K)} = \{ \textbf{X}^{(K)}_{t} \,: \, t = 1, 2, \ldots T \}\). Therefore, for each year the sequence \(\textbf{W}\) is considered a realisation of the non-homogeneous Markov chain \(\textbf{X}^{(K)}\). Note that each element \(\textbf{X}^{(K)}_{t}\) of \(\textbf{X}^{(K)}\) has as its coordinates values of the sequence \(\textbf{W}\), i.e., \(\textbf{X}^{(K)}_{t} = (\tilde{y}_{t}, \tilde{y}_{t+1}, \ldots , \tilde{y}_{t+K-1})\) means that \(W_{t} = \tilde{y}_{t}, W_{t+1} = \tilde{y}_{t+1}, \ldots , W_{t+K-1} = \tilde{y}_{t+K-1}\).

Thus, \(\textbf{X}^{(K)}\) has as its state space the set \(S_{3}^{(K)} = \{ { \textbf{y} = (\tilde{y}_{1}, \tilde{y}_{2}, \ldots , \tilde{y}_{K})} \in \left( S_{2} \right) ^{K} \}\). We may also see an element \(S_{3}^{(K)}\) as an element of \(S_{4}^{(K)} = \{ 0, 1, \ldots , \left| S_{2} \right| ^{K} - 1 \}\) (where, for A a set, we use the notation |A| to indicate the cardinality of A). This may be achieved by using a transformation \(g(\cdot )\) similar to \(f(\cdot )\), but now defined on \(S_{3}^{(K)}\) and with image in \(S_{4}^{(K)}\). This function \(g: S_{3}^{(K)} \rightarrow S_{4}^{(K)}\) is given by \(g({ \tilde{y}_{1}, \tilde{y}_{2}, \ldots , \tilde{y}_{K}}) = \sum _{i=0}^{K - 1}{ \tilde{y}_{i+1} }\, \left| S_{2} \right| ^{i}\). For \(\textbf{y} = ({ \tilde{y}_{1}, \tilde{y}_{2}, \ldots , \tilde{y}_{K}}) \in S_{3}^{(K)}\) and \(\overline{m} \in S_{4}^{(K)}\) we use the notation \(\textbf{y} \leftrightarrow \overline{m}\) to indicate their correspondence. Therefore, each K-dimensional state \(\textbf{y}\) of the Markov chain \(\textbf{X}^{(K)}\) may be associated with a unidimensional value \(\overline{m} \in S_{4}^{(K)}\) via \(g(\cdot )\).

From Rodrigues et al. (2015) we also have that, for \(K \ge 2\), if the set of observed value is \({ (\tilde{z}_{1}, \tilde{z}_{2}, \ldots , \tilde{z}_{T})} \in (S_{2})^{T}\), then the transition probabilities of \(\textbf{X}^{(K)}\) from \({ \textbf{X}^{(K)}_{t} = (\tilde{z}_{t}, \tilde{z}_{t+1}, \ldots ,}\) \({ \tilde{z}_{t+K-1})} \in S_{3}^{(K)}\) are different of zero, if and only if, the next state is \({ \textbf{X}^{(K)}_{t+1} = (\tilde{z}_{t+1}, \tilde{z}_{t+2}, \ldots ,}{ \tilde{z}_{t+K})}\), i.e., if the observation following \({ \tilde{z}_{t}, \tilde{z}_{t+1}, \ldots ,} { \tilde{z}_{t+K-1}}\) is \(\tilde{z}_{t+K}\). Hence, as in Álvarez et al. (2005) and Boys and Henderson (2002), instead of considering the transition from \({ \textbf{X}^{(K)}_{t} = (\tilde{z}_{t}, \tilde{z}_{t+1}, \ldots , \tilde{z}_{t+K-1})}\) to \({ \textbf{X}^{(K)}_{t+1} = (\tilde{z}_{t+1}, \tilde{z}_{t+2}, \ldots , \tilde{z}_{t+K})}\), we may consider the transformed state spaces \(S_{4}^{(K)}\) and \(S_{2}\), and the transition probabilities of \(\textbf{X}^{(K)}\) may be written as (see, for instance, Álvarez et al. 2005),

$$\begin{aligned} P^{(K)} _{\overline{m} j}(t) = P(W_{t+K} = \tilde{j} \, | \, \textbf{X}_{t}^{(K)} = (\tilde{z}_{t}, \tilde{z}_{t+1}, \ldots , \tilde{z}_{t+K-1}) \leftrightarrow \overline{m}), \end{aligned}$$
(2)

where \(\overline{m} \in S_{4}^{(K)}\), \(\tilde{j} \in S_{2}\), and \(1 \le t \le T - K\).

As in Rodrigues et al. (2015), indicate by \(Q_{\overline{m}}^{(K)}(t)\), \(\overline{m} \in S_{4}^{(K)}\), the probability \(P(\textbf{X}_{t}^{(K)} = \overline{m})\), \(t = 1, 2, \ldots , T\). Hence, when \(t = 1\), we have that \(Q_{\overline{m}}^{(K)}(1)\), \(\overline{m} \in S_{4}^{(K)}\), is the initial distribution of \(\textbf{X}^{(K)}\).

Remarks

1. When \(K = 0\), we have \(S_{4}^{(K)} = S_{2}\) and the transition probabilities (2) are equal to \(P^{(0)} _{\overline{m} \tilde{j}}(t) = P^{(0)} _{\tilde{i} \tilde{j}}(t) = P(W_{t} = \tilde{j}) = Q_{\tilde{j}}^{(0)}(t)\), \(t = 1, 2, \ldots , T\), \(\tilde{i}, \tilde{j} \in S_{2}\). In the case of \(K =1\), we also have \(S_{4}^{(K)} = S_{2}\) and the transition probabilities are such that \(P^{(1)} _{\overline{m} \tilde{j}}(t) = P^{(1)} _{\tilde{i} \tilde{j}}(t) = P(W_{t+1} = \tilde{j} \, | \, W_{t} = \tilde{i})\), \(t = 1, 2, \ldots , T-1\), \(\tilde{i}, \tilde{j} \in S_{2}\).

2. Unless otherwise stated, from now on, we are going to use the state space \(S_{4}^{(K)}\) and the corresponding transition probabilities (2).

In addition to estimating the order K of the Markov chain, we will also estimate its transition probabilities \(P^{(K)} _{\overline{m} \tilde{j}}(t)\), for each t, as well as the initial probabilities \(Q^{(K)} _{\overline{m}}(1)\), \(\overline{m} \in S_{4}^{(K)}\), \(\tilde{j} \in S_{2}\). We indicate by \(P^{(K)}(t) = \left( P^{(K)} _{\overline{m} \tilde{j}}(t) \right) _{\tilde{j} \in S_{2}, \overline{m} \in S_{4}^{(K)}}\), the transition matrix at time t.

As an additional information provided by the model, we are able to estimate \(P(W_{t} = \tilde{i})\), \( \tilde{i} \in S_{2}\). These values report the probability of whether or not we have exceedances of one, more than one or none of the pollutants. They are obtained by taking advantage of the Markov property and using a recursive form. A way of obtaining these probabilities is given at the end of the next section.

Remark

Even though at the end we are dealing with a univariate chain, it reflects the behaviour of a multi-dimensional problem since we are studying the behaviours of several pollutants at once. The transformed univariate chain is used in order to facilitate the analysis.

3 A Bayesian estimation of the parameters of the model

The order of a Markov chain could be estimated using the auto-correlation function associated to the chain. An alternative method to estimate the order and consequently the transition probabilities is to use the so-called reversible jump Markov chain Monte Carlo algorithm. That was used in Álvarez and Rodrigues (2008). However, the line followed here (using maximum a posteriori and leave-one-out) is simpler and also produces suitable results. In the case of the initial and transition probabilities we have, for instance, the maximum likelihood method (see, for instance, Fleming and Harrington 1978) and the empirical estimators (see, for instance, Aalen and Johansen 1978). In the present work, we use the Bayesian approach (see, for instance, Robert 2007 and Carlin and Louis 2000) to estimate the order and the transition probabilities. Inference is performed using the information provided by the so-called posterior distributions of the parameters. The posterior distribution of a vector of parameters \(\varvec{\theta }\) given the observed data \(\textbf{D}\), indicated by \(P(\varvec{\theta } \, | \, \textbf{D})\), is such that \(P(\varvec{\theta } \, | \, \textbf{D}) \propto L( \textbf{D} \, | \, \varvec{\theta }) \, P(\varvec{\theta })\), where \(L( \textbf{D} \, | \, \varvec{\theta })\) is the likelihood function of the model, and \(P(\varvec{\theta })\) is the prior distribution of the vector \(\varvec{\theta }\). These elements are given as follows.

3.1 The likelihood function and the prior and posterior distributions

In the present case, the vector of parameter is \(\varvec{\theta } = (K, Q^{(K)}(1), P^{(K)}(t), t = 1, 2, \ldots , T - K)\) which belongs to the following sample space

$$\begin{aligned} \Theta = \bigcup _{K=0}^{M} \left\{ \{ K \} \times \Delta _{|S_{4}^{(K)} |} \times \Delta _{|S_{2} |}^{| S_{4}^{(K)} | \, (T - K)} \right\} \end{aligned}$$

where \(\Delta _{l} = \{ (x_{1}, x_{2}, \ldots , x_{l}) \in \mathbb {R}^{l} \,: \, x_{i} \ge 0, \, i = 1, 2, \ldots , l; \, \sum _{l=1}^{l} x_{i} = 1 \}\) is the (\(l-1\))-dimensional simplex. (If \(K=0\), then the parametric space reduces to \(\Theta = \Delta _{|S_{2} |}^{T}\).)

Let \(\textbf{Y} = \{ \textbf{Y}_{t} : \, t = 1, 2, \ldots , T \}\) indicate our observed data. Instead of using the measurements \(\textbf{Z}_{t}\) as our observed values, we will use the values in \(S_{2}\) associated with \(\textbf{E}_{t}\) which in turn is associated with both the measurements \(\textbf{Z}_{t}\), \(t = 1, 2, \ldots , T\), and the transformed sequence \(\textbf{W}\). Hence, we take \(\textbf{Y} = \textbf{W}\).

Let \((\tilde{x}_{1}, \tilde{x}_{2}, \ldots , \tilde{x}_{K})\) be such that the observation at time t is \(\tilde{x}_{1}\). For a order \(K \ge 1\), indicate by \(n_{\overline{m} \tilde{i}}^{(K)}(t)\) the number of years in which the vector \((\tilde{x}_{1}, \tilde{x}_{2}, \ldots , \tilde{x}_{K})\) corresponding to a state \(\overline{m} \in S_{4}^{(K)}\) is followed by the observation \(\tilde{i} \in S_{2}\). Also, define \(n_{\overline{m}}^{(0)}(t)\), \(\overline{m} \in S_{4}^{(0)} = S_{2}\), as the number of years in which we have the observation \(\overline{m}\) at time t, \(t = 1, 2, \ldots , T\). Additionally, let \(n_{\overline{m}}^{(K)}\) indicate the number of years in which the state corresponding to the initial \(K \ge 1\) days (or the first day if \(K = 0\)) is equivalent to the value \(\overline{m} \in S_{4}^{(K)}\). Note that, \(n_{\overline{m}}^{(0)} = n_{\overline{m}}^{(1)}\), \(\overline{m} \in S_{4}^{(0)} =S_{4}^{(1)} = S_{2}\).

Since a Markov model is assumed, the likelihood function is given by (see, for instance, Drton et al. 2003; Fleming and Harrington 1978; Boys and Henderson 2002; Fan and Tsai 1999)

$$\begin{aligned} L( \textbf{Y} \, | \, \varvec{\theta }) \propto \prod _{\overline{m} \in S_{4}^{(K)}} \left\{ \left[ Q_{\overline{m}}^{(K)}(1) \right] ^{n_{\overline{m}}^{(K)}} \, \left( \prod _{t=1}^{T-K} \, \left[ \prod _{ \tilde{i} \in S_{2}} \left( P_{\overline{m} \tilde{i}}^{(K)}(t) \right) ^{n_{\overline{m} \tilde{i}}^{(K)}(t)} \right] \right) \right\} . \end{aligned}$$
(3)

When \(K=0\), the expression (3) simplifies to

$$\begin{aligned} L( \textbf{Y} \, | \, \varvec{\theta }) \propto \prod _{t=1}^{T} \prod _{\overline{m} \in S_{4}^{(0)}} \left[ Q_{\overline{m}}^{(0)}(t) \right] ^{n_{\overline{m}}^{(0)}(t)}, \end{aligned}$$
(4)

where \(Q_{\overline{m}}^{(0)}(t) = P(X_{t}^{(0)} = \overline{m}) = P(W_{t} = \overline{m})\), \(t = 1, 2, \ldots , T\), \(\overline{m} \in S_{4}^{(0)} = S_{2}\).

The prior distribution of the vector of parameters is given as follows. We assume a prior independence of \(P^{(K)}(t)\) as functions of t. Also, since the forms of \(P^{(K)}(t)\) and \(Q^{(K)}(1)\) depend on the value of K, we have, for \(\varvec{\theta } = (K, Q^{(K)}(1), P^{(K)}(t), t = 1, 2, \ldots , T - K)\), that

$$\begin{aligned} P(\varvec{\theta }) = P(Q^{(K)}(1) \, | \, K) \, \left[ \prod _{t = 1}^{T - K} P \left( P^{(K)}(t) \, | \, K \right) \right] \, P(K), \end{aligned}$$

where \(P(Q^{(K)}(1) \, | \, K)\) and \(P \left( P^{(K)}(t) \, | \, K \right) \) are the prior distributions of the initial distribution \(Q^{(K)}(1)\) and of the transition matrix \(P^{(K)}(t)\) given the order of the chain, respectively, and P(K) is the prior distribution of the order K.

Remark

When we have \(K = 0\), the vector of parameters is \(\varvec{\theta }' = (Q_{\overline{m}}^{(0)}(t); \, t = 1, 2, \ldots , T)\), whose prior distribution simplifies to \(P(\varvec{\theta }' \, | \, K = 0) = \left[ \prod _{t = 1}^{T} P(Q^{(0)}(t)) \right] \), where \(P(Q^{(0)}(t))\) is the prior distribution of the probability vector \(Q^{(0)}(t) = (Q_{\overline{m}}^{(0)}(t), \, \overline{m} \in S_{4}^{(0)})\), \(t = 1, 2, \ldots , T\).

Given the nature of transition matrices, we assume that rows are independent and that, given the order K of the chain, each row of the transition matrix \(P^{(K)}(t)\) will have as prior distribution a Dirichlet distribution with appropriate hyperparameters. Therefore, given that \(K = k\), row \((P_{\overline{m} 0}^{(k)}(t), P_{\overline{m} 1}^{(k)}(t), \ldots , P_{\overline{m} (2^{d} - 1)}^{(k)}(t))\) has as prior distribution a Dirichlet with hyperparameters \((\alpha _{\overline{m} 0}^{(k)}(t), \alpha _{\overline{m} 1}^{(k)}(t), \ldots , \alpha _{\overline{m} (2^{d} - 1)}^{(k)}(t))\), \(t = 1, 2, \ldots , T\), \( \overline{m} \in S_{4}^{(K)}\); i.e.,

$$\begin{aligned} P \left( P^{(K)}(t) \, | \, K \right) = \prod _{\overline{m} \in S_{4}^{(K)}} \left\{ \frac{ \Gamma \left[ \sum _{\tilde{i} \in S_{2}}\alpha _{\overline{m} \tilde{i}}^{(K)}(t)\right] }{\prod _{\tilde{i} \in S_{2}} \Gamma \left( \alpha _{\overline{m} \tilde{i}}^{(K)}(t) \right) } \, \left( \prod _{ \tilde{i} \in S_{2}} \left[ P_{\overline{m} \tilde{i}}^{(K)}(t) \right] ^{\alpha _{\overline{m} \tilde{i}}^{(K)}(t) -1} \right) \right\} \end{aligned}$$

for t in the appropriate range. The initial distribution \(Q^{(K)}(1)\), will also have a Dirichlet prior distribution, but now with hyperparameters \((\alpha _{\overline{m}}^{(K)}; \overline{m} \in S_{4}^{(K)})\). Therefore,

$$\begin{aligned} P \left( Q^{(K)}(1) \, | \, K \right) = \frac{\Gamma \left( \sum _{\overline{m} \in S_{4}^{(K)}} \alpha _{\overline{m}}^{(K)}\right) }{\prod _{\overline{m} \in S_{4}^{(K)}} \Gamma (\alpha _{\overline{m}}^{(K)})} \, \prod _{\overline{m} \in S_{4}^{(K)}} \left[ Q_{\overline{m}}^{(K)}(1) \right] ^{\alpha _{\overline{m}}^{(K)} -1}. \end{aligned}$$

If \(K = 0\), then \(Q^{(0)}(t)\) has as prior distribution a Dirichlet distribution with hyperparameters \((\alpha _{\overline{m}}^{(0)}(t); \overline{m} \in S_{4}^{(0)} = S_{2})\), \( t = 1, 2, \ldots , T\). Therefore, for any given t,

$$\begin{aligned} P \left( Q^{(0)}(t) \, | \, K = 0 \right) = \frac{\Gamma \left( \sum _{\overline{m} \in S_{4}^{(K)}} \alpha _{\overline{m}}^{(0)}(t) \right) }{\prod _{\overline{m} \in S_{4}^{(K)}} \Gamma (\alpha _{\overline{m}}^{(0)}(t))} \, \prod _{\overline{m} \in S_{4}^{(K)}} \left[ Q_{\overline{m}}^{(0)}(t) \right] ^{\alpha _{\overline{m}}^{(0)}(t) -1}. \end{aligned}$$

We assume that K has as prior distribution a truncated Poisson distribution with rate \(\lambda > 0\), defined on the set \(\mathcal{S}\); i.e.,

$$\begin{aligned} P(K) \propto \frac{ \lambda ^{K}}{K!} \, I_{\mathcal{S}}(K), \end{aligned}$$
(5)

where \(I_{A}(x) = 1\), if \(x \in A\) and is zero otherwise.

Therefore, from Boys and Henderson (2002); Fan and Tsai (1999); Álvarez et al. (2005); and Rodrigues et al. (2015), the conditional posterior distribution of \(P^{(K)}(t)\) given K, is proportional to the product of Dirichlet distributions with hyperparameters \((n_{\overline{m} 0}^{(K)}(t) + \alpha _{\overline{m} 0}^{(K)}(t), n_{\overline{m} 1}^{(K)}(t) + \alpha _{\overline{m} 1}^{(K)}(t), \ldots , n_{\overline{m} (2^{d} - 1)}^{(K)}(t) + \alpha _{\overline{m} (2^{d} - 1)}^{(K)}(t))\). Additionally, the posterior distribution of the initial distribution \(Q^{(K)}(1)\) given K is also proportional to a Dirichlet distribution, but now with hyperparameters \((\alpha _{\overline{m}}^{(K)} + n_{\overline{m}}^{(K)}; \overline{m} \in S_{4}^{(K)})\). When \(K = 0\), we have that the posterior distribution \(P(Q^{(0)}(t) \, | \, K = 0, \textbf{Y})\) is proportional to the product of Dirichlet distributions with hyperparameters \((n_{\overline{m}}^{(0)}(t) + \alpha _{\overline{m}}^{(0)}(t); \overline{m} \in S_{2})\). Furthermore, from Álvarez et al. (2005),

$$\begin{aligned} L(\textbf{Y} \, | \, K)\propto & {} \frac{\Gamma \left( \sum _{\overline{m} \in S_{4}^{(K)}} \alpha _{\overline{m}}^{(K)} \right) }{\Gamma \left( \sum _{\overline{m} \in S_{4}^{(K)}} \left[ \alpha _{\overline{m}}^{(K)} + n_{\overline{m}}^{(K)} \right] \right) } \, \left( \prod _{\overline{m} \in S_{4}^{(K)}} \frac{ \Gamma \left( n_{\overline{m}}^{(K)} + \alpha _{\overline{m}}^{(K)} \right) }{\Gamma \left( \alpha _{\overline{m}}^{(K)} \right) } \right) \\{} & {} \prod _{t=1}^{T-K} \left\{ \prod _{\overline{m} \in S_{4}^{(K)}} \left[ \frac{ \Gamma \left( \sum _{\tilde{j} \in S_{2}} \alpha _{\overline{m} \tilde{j}}^{(K)}(t) \right) }{\Gamma \left( \sum _{\tilde{j} \in S_{2}} [n_{\overline{m} \tilde{j}}^{(K)}(t) + \alpha _{\overline{m} \tilde{j}}^{(K)}(t)] \right) } \, \prod _{\tilde{j} \in S_{2}} \frac{\Gamma ( n_{\overline{m} \tilde{j}}^{(K)}(t) + \alpha _{\overline{m} \tilde{j}}^{(K)}(t))}{\Gamma (\alpha _{\overline{m} \tilde{j}}^{(K)}(t))} \right] \right\} , \end{aligned}$$

with the appropriate adaptation for the case of \(K = 0\). Hence, the posterior distribution of the order K is

$$\begin{aligned} P(K | \, \textbf{Y}) = \frac{1}{c} \, L(\textbf{Y} \, | \, K) \, \frac{\lambda ^{K}}{K!} \end{aligned}$$
(6)

where \(c = \sum _{k \in \mathcal{S}} L(\textbf{Y} \, | \, K = k) \, \left( \lambda ^{k}/k! \right) \) is the normalising constant.

Hence, we will use (6) to estimate the value of K that maximises the posterior distribution \(P(K | \, \textbf{Y})\). We will also use the leave-one-out cross-validation method (see, for instance, Chang 2019). Once that K is estimated, samples of the posterior transition and initial distributions associated to that K are obtained. The generated values are used to estimate the corresponding distributions via the law of large numbers.

Remark

The hyperparameters of the prior distributions will be considered known and will be specified later when the model is applied to the data.

3.2 Estimation of the probabilities of exceedances

Once the posterior distributions are obtained and samples of size \(M' > 0\) of the corresponding initial and transitions probabilities are produced, we may estimate the probabilities \(P(W_{t} = \tilde{i})\), \(\tilde{i} \in S_{2}\), \(t = 1, 2, \ldots , T\), using the law of the large numbers in the following way. For each element l of the sample of the initial and transition probabilities it is possible to calculate the a value of \(P(W_{t} = \tilde{i})\) which we indicate by \(P^{(l)}(W_{t} = \tilde{i})\). Then, the estimated probability \(\hat{P}(W_{t} = \tilde{i})\) is,

$$\begin{aligned} \hat{P}(W_{t} = \tilde{i}) = \frac{1}{M'} \, \sum _{l=1}^{M'} P^{(l)}(W_{t} = \tilde{i}), \end{aligned}$$
(7)

where each \(P^{(l)}(W_{t} = \tilde{i})\) is obtained as follows. (We consider separately the cases where \(K = 0\), \(K = 1\) and \(K \ge 2\).)

Case 1.:

When \(\textbf{X}^{(K)}\) has order \(K = 0\), then \(P(W_{t} = \tilde{i})\) corresponds to the probabilities \(Q_{\tilde{i}}^{(0)}(t)\), \(\tilde{i} \in S_{2}\), \(t = 1,2, \ldots , T\). Hence, generate a sample of size \(M'\) of \(Q_{\tilde{i}}^{(0)}(t)\) using its posterior distribution, \(\tilde{i} \in S_{2}\). Denote the lth generated value by \(Q_{\tilde{i}}^{(0,l)}(t)\) and assign \(P^{(l)}(W_{t} = \tilde{i}) = Q_{\tilde{i}}^{(0,l)}(t)\), \(\tilde{i} \in S_{2}\), \(t = 1, 2, \ldots , T\), \(l = 1, 2, \ldots , M'\).

Case 2.:

If the order of \(\textbf{X}^{(K)}\) is \(K = 1\), then we may take advantage of the Markov property as well as the estimated initial distributions. That is made as follows. Note that, for \(t \ge 2\) and \(\tilde{j} \in S_{2}\), we may write \(P(W_{t} = \tilde{j})\) recursively as (Morales-Morillón 2018),

$$\begin{aligned} P(W_{t} = \tilde{j})= & {} \sum _{ \tilde{i} \in S_{2}} P(W_{t} = \tilde{j} \, | \, W_{t-1} = \tilde{i}) \, P(W_{t-1} = \tilde{i}) \nonumber \\= & {} \sum _{ \tilde{i} \in S_{2}} P_{\tilde{i}\tilde{j}}^{(1)}(t-1) \, P(W_{t-1} = \tilde{i}), \end{aligned}$$
(8)

where \(P(W_{1} = \tilde{j}) = Q^{(1)}_{\tilde{j}}(1)\). In order to estimate the values of \(P^{(l)}(W_{t} = \tilde{j})\), \(\tilde{j} \in S_{2}\), we proceed as follows.

Step 1:

Generate a sample of size \(M'>0\) of \(Q^{(1)}_{\tilde{i}}(1)\) and let \(Q^{(1,l)}_{\tilde{i}}(1)\) indicate its lth element, \(\tilde{i} \in S_{2}\), \(l = 1, 2, \ldots , M'\). Assign \(P^{(l)}(W_{1} = \tilde{i}) = Q_{\tilde{i}}^{(1,l)}(1)\), \(\tilde{i} \in S_{2}\), \(l = 1, 2, \ldots , M'\).

Step 2:

For \(t \ge 2\), take the generated sample of size \(M'>0\) of \(Q^{(1)}_{\tilde{i}}(1)\) (obtained in Step 1) and generate a sample also of size \(M'\) of \(P_{\tilde{i}\tilde{j}}^{(1)}(t)\) using its posterior distribution, \(\tilde{i}, \tilde{j} \in S_{2}\), \(t = 1, 2, \ldots , T-1\). Let \(Q^{(1,l)}_{\tilde{i}}(1)\) and \(P_{\tilde{i}\tilde{j}}^{(1,l)}(t)\) indicate the lth element of corresponding samples, \(\tilde{i}, \tilde{j} \in S_{2}\), \(t = 1, 2, \ldots , T-1\), \(l = 1, 2, \ldots , M'\). Then, for each \(l = 1, 2, \ldots , M'\), use (8) to recursively obtain \(P^{(l)}(W_{t} = \tilde{j})\). Hence,

$$\begin{aligned} P^{(l)}(W_{t} = \tilde{j}) = \sum _{ \tilde{i} \in S_{2}} P_{\tilde{i}\tilde{j}}^{(1,l)}(t-1) \, P^{(l)}(W_{t-1} = \tilde{i}), \end{aligned}$$

where \(P^{(l)}(W_{1} = \tilde{i}) = Q_{\tilde{i}}^{(1,l)}(1)\), \(\tilde{i}, \tilde{j} \in S_{2}\), \(t = 2, 3, \ldots , T\), \(l = 1, 2, \ldots , M'\).

Case 3.:

When the selected order of the chain \(\textbf{X}^{(K)}\) is \(K \ge 2\), then we proceed as follows.

Step 1:

For \(t = 1, 2, \ldots , K\). Generate a sample of size \(M'>0\) of the initial distribution \(Q^{(K)}_{\overline{m}}(1)\) and let \(Q^{(K,l)}_{\overline{m}}(1)\) indicate its lth element, \(\overline{m} \in S_{4}^{(K)}\), \(l = 1, 2, \ldots , M'\). Let \(\overline{m} \leftrightarrow (\tilde{x}_{1}, \tilde{x}_{2}, \ldots , \tilde{x}_{K})\). Taking into account that \(Q^{(K,l)}_{\overline{m}}(1) = P^{(l)}(\textbf{X}_{1}^{(K)} = \overline{m} \leftrightarrow \overline{m} \leftrightarrow (\tilde{x}_{1}, \tilde{x}_{2}, \ldots , \tilde{x}_{K}))\), we have

$$\begin{aligned} P^{(l)}(W_{t} = \tilde{x}_{t}) = \sum _{s \ne t} \sum _{\tilde{x}_{s} \in S_{2}} P^{(l)}(\textbf{X}_{1}^{(K)} = (\tilde{x}_{1}, \ldots , \tilde{x}_{K})), \end{aligned}$$

where \(\tilde{x}_{t} \in S_{2}\), \(l = 1, 2, \ldots , M'\).

Step 2:

For \(t \ge K + 1\) and \(\overline{m} \leftrightarrow (\tilde{x}_{1}, \tilde{x}_{2}, \ldots , \tilde{x}_{K})\), consider the generated sample of size \(M'>0\) of \(Q^{(K)}_{\overline{m}}(1) = Q^{(K)}_{(\tilde{x}_{1}, \tilde{x}_{2}, \ldots , \tilde{x}_{K})}(1)\) (obtained in Step 1) and additionally generate a sample also of size \(M'\) of the transition probabilities \(P_{\overline{m x}}^{(K)}(t) = P_{(\tilde{x}_{1}, \tilde{x}_{2}, \ldots , \tilde{x}_{K}) \tilde{x}}^{(K)}(t)\) using their posterior distributions, \(\overline{m} \in S_{4}^{(K)}\), \(\tilde{x} \in S_{2}\). Let \(Q^{(K,l)}_{\overline{m}}(1)\) and \(P_{\overline{m} \tilde{x}}^{(K,l)}(t)\) indicate the lth element of the corresponding samples, \(\overline{m} \in S_{4}^{(K)}\), \(\tilde{x} \in S_{2}\), \(t = K+1, K+2, \ldots , T-K\), \(l = 1, 2, \ldots , M'\). In order to obtain \(P^{(l)}(W_{t} = \tilde{i})\), \(\tilde{i} \in S_{2}\), \(t = K+1, K+2, \ldots , T\), note that, using the Markov property, we may write the following recursive formula,

$$\begin{aligned} P^{(l)}(W_{t} = \tilde{i})= & {} \sum _{s=1}^{K} \sum _{\tilde{x}_{s} \in S_{2}} P^{(l)}(W_{t} = \tilde{i} \, | \, \textbf{X}_{t-K}^{(K)} = (\tilde{x}_{1}, \tilde{x}_{2}, \ldots , \tilde{x}_{K})) \nonumber \\{} & {} P^{(l)}(\textbf{X}_{t-K}^{(K)} = (\tilde{x}_{1}, \tilde{x}_{2}, \ldots , \tilde{x}_{K})) \nonumber \\= & {} \sum _{s=1}^{K} \sum _{\tilde{x}_{s} \in S_{2}} P_{(\tilde{x}_{1}, \tilde{x}_{2}, \ldots , \tilde{x}_{K}) \tilde{i}}^{(K,l)}(t - K) \, P^{(l)}(\textbf{X}_{t-K}^{(K)} = (\tilde{x}_{1}, \tilde{x}_{2}, \ldots , \tilde{x}_{K})),\nonumber \\ \end{aligned}$$
(9)

where \(P(\textbf{X}_{s}^{(K)} = (\tilde{x}_{1}, \tilde{x}_{2}, \ldots , \tilde{x}_{K}) \leftrightarrow \overline{m})\) may be obtained recursively as

$$\begin{aligned}{} & {} P(\textbf{X}_{s}^{(K)} = (\tilde{x}_{1}, \tilde{x}_{2}, \ldots , \tilde{x}_{K}) \leftrightarrow \overline{m})\nonumber \\{} & {} \quad = \left\{ \begin{array}{ll} Q^{(K)}_{\overline{m}}(1) &{}, \text{ for } s = 1,\\ \sum _{\tilde{x} \in S_{2}} P^{(K)}_{ (\tilde{x}, \tilde{x}_{1}, \ldots , \tilde{x}_{K-1}) \tilde{x}_{K}}(s-1) &{} \, \\ \qquad \qquad P(\textbf{X}_{s-1}^{(K)} = (\tilde{x}, \tilde{x}_{1}, \ldots , \tilde{x}_{K-1})) &{}, \text{ for } s \ge 2 \end{array} \right. \end{aligned}$$
(10)

Remark

As mentioned earlier, besides using the maximum a posteriori to select the order of the chain, we also use a modification to the non-homogeneous case of the leave-one-out cross validation method (LOO) – see, for instance, Chang (2019) and references therein. Once this order is estimated, if a lower order is selected, we may also use, for instance, the well known sum of the absolute values of the differences (SAD) using the estimated and observed initial and transition probabilities to see how well the estimated probability values describe the observed. These methods are described in Section A of the Appendix.

4 Application to ozone and PM\(_{10}\) data from the monitoring network of Mexico City

In this section we consider the model presented in Sect. 2, in the particular case of \(d\! = \!2\). An application will be made to measurements of two pollutants collected at Mexico City’s monitoring network. This network comprises of several monitoring stations placed throughout the metropolitan area. The pollutants considered are ozone (given in parts per million - ppm) and PM\(_{10}\) (given in microgram per cubic meter - \(\mu g/m^{3}\)). Even though ozone in Mexico City has been measured since late 1980s, systematic PM\(_{10}\) monitoring started only in 1995. Hence, we have decided to use measurements from that year on for both pollutants. Therefore, we have considered measurements obtained from 01 January 1995 to 31 December 2019. In the analysis, we will use the daily maximum measurements for those pollutants. Measurements are taken minute by minute in each station and the hourly averaged result is reported every hour. In a given station the ozone daily maximum measurement is the maximum of the 24-hour averaged results. Ozone daily maximum measurement for the metropolitan area is the maximum of all daily maxima recorded in all stations of the monitoring network. PM\(_{10}\) daily measurement in a given station is the average results over the 24-hour period for that station. The PM\(_{10}\) daily maximum measurement for the metropolitan area is the maximum of the daily averaged results for all stations in the network. We will consider the overall measurements for the entire metropolitan area instead of splitting it into regions as it was made in the past when only ozone measurements were taken into account.

The thresholds considered are those specified by NOM (2014a, 2014b), i.e., 0.095 ppm for ozone and 75 \(\mu g/m^{3}\) for PM\(_{10}\).

4.1 Data description

In Fig. 1 we have the figure of the Mexico City’s ozone and PM\(_{10}\) measurements during the observational period considered here. In that figure we also have two horizontal lines representing the valid environmental threshold for each pollutant.

Fig. 1
figure 1

PM\(_{10}\) and ozone data from Mexico City monitoring network for the period ranging from 01 January 1995 to 31 December 2019. Horizontal lines correspond to the thresholds for each pollutant, i.e., \(L = 75 \) \(\mu g/m^{3}\) for the PM\(_{10}\) data and \(L = 0.095\) ppm for the ozone

Looking at Fig. 1 there are several episodes where high peaks have occurred in the PM\(_{10}\) data, especially in the beginning of the observational period. We may also observe, as we move towards the end, that even though peaks occur, they are not as high as in the beginning. In the case of ozone, we also see higher concentrations in the beginning of the observational period. However, towards the end, we may see that the values vary around the line 0.095. Also, from the plots given in Fig. 1, we may see that perhaps the series of measures implemented by Mexico and, in particular, Mexico City since the late 1980s have contributed to the decrease of ozone and PM\(_{10}\) levels, including the height of their peaks, in the metropolitan area.

If we analyse the maximum, minimum and mean values throughout the years, we have that the overall mean values were 86.92 \(\mu g/m^{3}\) and 0.128 ppm in the cases of PM\(_{10}\) and ozone, respectively. The corresponding minima were 14 \(\mu g/m^{3}\) and 0.02 ppm. In the case of maximum values, we have 379 \(\mu g/m^{3}\) and 0.35 ppm in the case of PM\(_{10}\) and ozone, respectively. In Fig. 2 we may see the behaviours of these quantities. In this figure (top two plots), we have the yearly behaviours of the mean (thick continuous lines), maximum (thick dashed lines) and minimum (thin dashed lines) of the measurements.

Fig. 2
figure 2

Top two plots: mean (thick continuous lines), maximum (thick dashed lines) and minimum (thin dashed lines) ozone and PM\(_{10}\) measurements for each year of the observational period. Bottom two plots: the number of exceedances of the corresponding environmental thresholds for ozone and PM\(_{10}\) for each year of the observation period

Looking at Fig. 2 (top two plots) we see that rapid year by year decreases occur on the mean and maximum ozone measurements from 1995 to about the year 2008. That might be a consequence of the several measures implemented by the environmental authorities in order to decrease those levels. Most of those measures were implemented aiming to decrease ozone levels since this pollutant was the one still causing many of the environmental alerts in the city. Nevertheless, some of those measures also contributed to decrease PM\(_{10}\) levels since they impact industries, cars, busses, taxis and trucks emissions. Hence, the decrease was also seen in the mean of the yearly PM\(_{10}\) measurements. However, it seems that some peaks of emission of PM\(_{10}\) still occur since the yearly maximum values do not decrease as much as the mean’s.

The number of yearly exceedances of the respective thresholds varies from 124 to 312 days in the case of PM\(_{10}\) and from 178 to 341 in the case of ozone. The highest number of exceedances occurred in the years 1997 and 1996 in the cases of PM\(_{10}\) and ozone, respectively, and the corresponding smallest values were in the years 2014 and 2012. In total we have 6659 and 5104 ozone and PM\(_{10}\) exceedances of the corresponding thresholds. These values account for approximately 72.93% and 55.9% of the observed days, respectively.

In Fig. 2 we also have the plots (bottom two plots) of the number of exceedances year by year for both pollutants. Looking at these plots we may see that in the year 2000 we also had a high number of ozone exceedances. That, however, is smaller than the number in 1996 if only by one. Looking at the two bottom plots in Fig. 2, we see that the number of ozone as well as PM\(_{10}\) exceedances also decrease rapidly until around 2008 and then vary around a lower value from that year on. That is also a reflex of the decrease in the measurements (maximum and mean) throughout the years shown in Fig. 1 and top two plots in Fig. 2. Hence, we see an improvement in the air quality related to ozone and PM\(_{10}\) in later years of the observational period. However, that does not mean the air quality is ideal since environmental alerts still occur in the city.

4.2 Programming details

In the model described in Sect. 2 we have the following specifications. The whole observational period has 9131 days for the total of \(N = 25\) years of which 6 are leap and 19 are non-leap. The dimension of our vector is \(d = 2\) corresponding to the two pollutants we are interested in studying. Hence, \(\textbf{Z}_{t} = (Z_{1, t}, Z_{2, t})\), \(t = 1, 2, \ldots , T = 366\), where \(Z_{1, t}\) and \(Z_{2, t}\) correspond to the ozone and PM\(_{10}\) measurements at day t, respectively, \(t = 1, 2, \ldots , T\). Hence, we have \(S_{1} = \{ (x_{1}, x_{2}) \in \{0, 1\}^{2} \}\), \(S_{2} = \{ 0, 1, 2, 3 \}\), \(S_{3}^{(K)} = \{ (\tilde{x}_{1}, \tilde{x}_{2}, \ldots , \tilde{x}_{K}) \in \{ 0, 1, 2, 3 \}^{K} \}\) and \(S_{4}^{(K)} = \{ 0, 1, 2, \ldots , 4^{K} - 1 \}\). The thresholds considered are \(L_{1} = 0.095\) ppm and \(L_{2} = 75\) \(\mu g/m^{3}\) in the case of ozone and PM\(_{10}\), respectively. In order to account for the possible dependence of few past days of maximum measurements, we are using \(M = 8\). Therefore, the state space of the order K of the Markov chain \(\textbf{X}^{(K)}\) is \(\mathcal{S} = \{ 0, 1, 2, \ldots , 8 \}\) and we have \(\lambda = 1\) in (5).

The hyperparameters of the Dirichlet prior distributions are assigned in a similar manner as in Morales-Morillón (2018). Therefore, the values of \(\alpha _{\overline{m} \tilde{i}}^{(K)}(t)\), \(\alpha _{\overline{m}}^{(0)}(t)\), and \(\alpha _{\overline{m}}^{(K)}\) are all equal to 1/16, for all \(\overline{m} \in S_{4}^{(K)}\), \(\tilde{i} \in S_{2}\), and time t.

Remarks

1. The choice of the hyperparameter \(\lambda = 1\) was based on previous works (see, for instance, Álvarez et al. 2005 and Rodrigues et al. 2019). As it is possible to see in those works, occurrences and selection of orders from 0 to 17 were allowed depending on the data and threshold used. Therefore, even though the value of \(\lambda \) is one, it allows the occurrences of order larger than one whenever the data ask for it.

2. The hyperparameters of the Dirichlet prior distributions are chosen to reflect the occurrences of the events to which they are associated. For instance, in the case where \(K = 1\), events that have not been observed, but are part of the space of possible events, will have a small probability assigned to it, whereas those that have been observed will have probabilities that absorb the contribution of both observed data and prior distributions.

3. Another option to assign the hyperparameters of the Dirichlet prior distributions is to proceed as in Álvarez et al. (2015), Álvarez and Rodrigues (2008), Rodrigues et al. (2015) where small values of the hyperparameters were assigned to rare events and larger values to those occurring more frequently. That, however, was not followed here because we wanted to see if not so specific prior distributions would produce good results as in Morales-Morillón (2018) where the model was used to study problems in genetics. Additionally, using different hyperparameters in the present case could imply in having to consider a large number of vectors of hyperparameters since we would have to take into account every possible transition at each time t.

4.3 Results

The selection of the order K of the Markov chain is made using the maximum a posteriori as well as the leave-one-out cross validation (see Appendix). Hence, in the case of the maximum a posteriori method we select the value of K that maximise its posterior distribution. The posterior probability \(P(K \, | \, \textbf{Y})\), given by (6), is proportional to \(L(\textbf{Y} \, | \, K) \, \lambda ^{K}/K!\) and may be analysed using the value of \(\log \left[ L(\textbf{Y} \, | \, K \right) \, \lambda ^{K}/K!] = \log [L(\textbf{Y} \, | \, K)] + K \, \log (\lambda ) - \log (K!)\), \(K \in \mathcal{S}\). These values are shown in Table 1 where we also have the values of LOO for each K (first two columns).

It is possible to see by looking at Table 1 that, in the case of MAP, the highest probability is achieved when \(K = 1\) followed by \(K = 2\). Hence, the selected order is \(K = 1\). When we use the LOO method to select the order of the chain, the order with the smallest value is selected. In the present case, that value is given by the case \(K = 0\) closely followed by \(K = 1\). For purpose of comparison, we have also obtained the estimated values of K using MAP and LOO for two additional vectors of hyperparameters of the Dirichlet prior distributions. In one of them all values of \(\alpha \) appearing in those prior distributions are equal to one and in the other they are all equal to 1/2. These values are also given in Table 1. Looking at Table 1 we see that in all cases there is a consistency in the selection of the order using MAP (the selected order is always \(K = 1\)). When we use LOO, the chosen order also is \(K = 1\) for values of \(\alpha = 1/2, 1\) in agreement with the choice made using MAP. Hence, the choice of \(K =1\) for the order of the chain.

Table 1 Values of MAP and LOO for each order in \({\mathcal {S}}\) and values of \(\alpha \)

Once the order is selected, we generate a sample of the corresponding initial and transition probabilities using their associated posterior distributions. We may also obtain the probabilities \(P(W_{t} = \tilde{i})\), \(t = 1, 2, \ldots , T\), \(\tilde{i} \in S_{2}\), applying (7) and using the generated samples and the recursive formula (8) since the chosen order is \(K = 1\).

Remark

Recall that because of the transformation considered we may write \((0, 0) \leftrightarrow \tilde{0} = 0\), \((1, 0) \leftrightarrow \tilde{1}= 1\), \((0, 1) \leftrightarrow \tilde{2} = 2\) and \((1, 1) \leftrightarrow \tilde{3} = 3\).

In Fig. 3 we have the estimated (blue lines) and the observed (dashed black lines) transition probabilities \(P_{ij}^{(1)}(t)\) as well as the 95% credible interval (green and maroon lines). Estimation of the probabilities was made using a sample of size 1000. (The values of these probabilities are given in the Supplementary Material file.)

Fig. 3
figure 3

Estimated (blue lines) and observed (dashed black lines) posterior transition probabilities \(P_{\overline{m} \tilde{i}}^{(1)}(t)\), \(\overline{m} \in S_{2}\), \(\tilde{i} \in S_{2}\), \(t = 1, 2, \ldots , T-1\), as well as the 95% credible interval (green and maroon lines), when we have \(K = 1\)

When we look at Fig. 3 it is possible to see that when the observed transition has value zero, i.e., either the transition was not actually observed or it was observed only a number of times that was negligible when compared to the number of observed years, the estimated posterior distribution gives a non-zero value to those transitions, oscillating between small and very small values. Hence, even though some transition events may not have been observed during the timeframe considered here, their posterior transition probabilities allow the possibility that they occur with small probabilities. That is useful because if a given transition has not happened during the observational period, it does not mean it will not happen in the future. Additionally, we may see that the largest value given to non-observable events is around 0.25. In the cases where we do have observed transitions, the estimated transition probabilities represent well the observed ones. We may also see from Fig. 3 that transitions from state 0 (i.e., no exceedances of either pollutant) to state 0 are higher when we consider the time from around the end of June to the beginning of December. This period corresponds to part of summer/autumn seasons. Summer season has rainy days most of the time, hence, we may have a low probability of both pollutants exceeding their corresponding threshold. Also, we see a high probability of going from a 0 state to a 1 state (corresponding to a (1, 0) state, i.e., an exceedance of ozone only) during the same period when high values for \(P_{0 0}^{(1)}(t)\) occur. Hence, if an exceedance occurs in that period is more likely to be caused by high levels of ozone than PM\(_{10}\)’s. That may be corroborated when we look at the transition probabilities from 0 to 2 (corresponding to the state (0, 1) indicating an exceedance of PM\(_{10}\) only). We may also see that if we are in the state 0, then there also exists a large probability of staying at that state in the beginning of the year and also during the summer/autumn seasons.

Other transitions that call attention because of their relatively higher values are from \(1 \leftrightarrow (1, 0)\) to 1 and from \(3 \leftrightarrow (1, 1)\) to 3. The high values of the former occur in a similar period of those from 0 to 1 only with higher values. The latter occurs practically the whole year. Hence, once we have exceedances of PM\(_{10}\) during spring/summer/autumn, there is a large probability of continuing to have it. Similarly, once both pollutants exceeded their corresponding thresholds, they are bound to continue to exceed them.

Another interest here is in estimating the probability of occurrence of one of the states in \(S_{2}\) throughout the year. The estimated values of \(P(W_{t} = \tilde{j})\), \(\tilde{j} \in S_{2}\), are given in the Supplementary Material files. In Fig. 4 we have the plots of the estimated (blue lines) and empirical/observed (black dashed lines) probabilities \(P(W_{t} = \tilde{i})\), \(\tilde{i} \in S_{2}\), \(t = 1, 2, \ldots , T\), with the 95% credible interval (green and maroon lines) when we have the selected order of the chain \(K =1\).

Fig. 4
figure 4

Estimated (blue lines) posterior and observed (black dashed lines) \(P(W_{t} = \tilde{i})\), \(\tilde{i} \in S_{2}\), \(t = 1, 2, \ldots , T\), with the 95% credible interval (green and maroon lines) when \(K = 1\)

Looking at Fig. 4, it is possible to see that the fit of the estimated and observed probabilities is very good. Additionally, non-zero probabilities with small values were assigned to events that had observed probability zero or approximately zero as in the case of \(P(W_{t} = 2)\) (left bottom plots) for the period ranging from mid April to mid October, and a small number of days in the case of \(P(W_{t} = 0)\) (left top plots). In that way, even if events have not been observed up to now, the corresponding estimated probabilities have positive and small values and they are allowed to happen, even if with a small probability.

Remark

In addition to the results for the selected model (\(K = 1\)), we have decided to include in the appendix an analysis when \(K = 2\) in order to illustrate how to proceed if the selected order of the chain is \(K \ge 2\) (see Section C of the Appendix).

5 Discussion and comments

In the present work a multivariate non-homogeneous Markov chain of order \(K \ge 0\) was used to jointly study occurrences of pollutants’ exceedances. The model was applied, in its two-dimensional version, to ozone and PM\(_{10}\) data from Mexico City. Estimation of the order K and the associated initial and transition probabilities is performed using the Bayesian point of view.

The approach used to study this multi-dimensional problem was to transform the vector recording of whether or not exceedances have occurred into a unidimensional indicator. These will constitute the values of each element of the observed data. On these data the state spaces \(S_{3}^{(K)}\) and \(S_{4}^{(K)}\) associated to the Markov chain \(\textbf{X}^{(K)}\) are defined and, also, the corresponding initial and transition probabilities.

We would like to call attention to the fact that, even though the series of measurements are not stationary, the non-homogeneous Markov chain model is applied to the transformed series whose states depend on whether or not threshold exceedances have occurred. Hence, the non-stationarity is diluted a little. That can be seen in Fig. 5.

Fig. 5
figure 5

States of the non-homogeneous Markov chain occurring during the period ranging from 01 January 1995 to 31 December 2019

Note that some non stationarity is still present. However, as mentioned earlier, even though this may produce higher exceedance probabilities, this increasing effect is only used in order to alert about the possibility that higher levels of pollution may occur.

Estimation of the order of the chain was performed using both the maximum a posteriori (MAP) and leave-one-out cross-validation (LOO) methods. The selected order was \(K = 1\) in the case of MAP and \(K=0\) closely followed by \(K=1\) in the case of LOO. Hence, we take \(K = 1\) as the selected order. The associated initial and transition probabilities were obtained by averaging the generated values obtained from the respective posterior distributions.

In order to see how well MAP and LOO estimate the order of the chain, we have performed a simulation study. Using the estimated initial and transition probabilities associated with the order \(K = 1\), we have generated 100 realisations of the non-homogeneous chain. We have considered the cases where \(N = 25, 50, 75, 100\) and used MAP and LOO to estimate the order of the chain. Results are given in Table 2 in Section B of the Appendix. As we see both methods agree on the selection of the order K giving higher/lower values to \(K = 1\) followed by \(K = 2\). This supports the selected order in the application considered in the present work.

Using the estimated initial and transition probabilities, the values of the probabilities of having exceedances of one, both or none of the pollutants were obtained. That was made taking advantage of the Markov property of the model as well as a recursive formula using the generated initial and transition probabilities. Results show a good fit of the estimated values to the observed. In order to certify that, in Fig. 6 we have the plots of \(|P(W_{t} = \tilde{i}) - \hat{P}(W_{t} = \tilde{i})|\) for \(t = 1, 2, \ldots , T\), \(\tilde{i} \in S_{2}\), where \(P(W_{t} = \tilde{i})\) and \(\hat{P}(W_{t} = \tilde{i})\) indicate the observed (empirical) and the estimated probabilities using the generated values of the initial and transition probabilities obtained from their respective posterior distribution. We present the cases \(K = 0\) (blue lines), \(K = 1\) (black lines) and \(K = 2\) (red lines) since the latter is the order with second largest probability, and the first case is the one showing, in general, the small differences.

Fig. 6
figure 6

Absolute values of the differences between the observed and estimated \(P(W_{t} = \tilde{i})\), \(t = 1, 2, \ldots , T\), \(\tilde{i} \in S_{2}\), in the cases \(K = 0\) (blue lines), \(K = 1\) (black lines) and \(K = 2\) (red lines)

Looking at Fig. 6 we may see that, when compared to the case \(K = 2\), a better fit is achieved when we have \(K = 1\). Even though the values of the differences in the case \(K = 2\) are also small, we may see that when compared to the differences when \(K = 1\), the latter presents a better overall fit. In the case of \(K = 0\), the differences are even smaller than those in the case of \(K = 1\). However, it is worth calling attention to the fact that when \(K = 0\), the estimated posterior probabilities receive more weight from the data which are what rule the empirical (observed) probabilities, hence, the best adjustment. We may also see that in the case \(K = 2\), the differences throughout the year are not large either. Larger misfits occurred for the periods ranging from June to October in the case of \(P(W_{t} = 2)\), and from January to June and November to the end of the year in the case of \(P(W_{t} = 3)\). However, even in these cases the differences are small.

If we take into account the information provided by the SAD, we have that the values, for K equal to zero, one and two are 3.75, 263.58 and 3393.404, respectively. Again, we may see that the case \(K =1\) outperforms the case \(K = 2\). As before, the case \(K = 0\) has the smallest SAD, but as already argued that is due to the form the empirical and estimated values of the corresponding probabilities are obtained.

Going back to Fig. 4, where the plots of the estimated \(P(W_{t} = \tilde{i})\), \(\tilde{i} \in S_{2}\), \(t = 1, 2, \ldots , T\), are presented, we may see that the probability of occurrence of states 0, 1 and 2 are mostly below 0.75 with values below 0.5 in the cases of states 0 and 2. Additionally, in the first half of the year the probabilities of occurrence of state 3 are larger than 0.25 and most of the time are in the interval (0.5; 0.75) which, compared to the probabilities of occurrences of the remaining states, are very high. Hence, the model and the data corroborate the fact that during winter and spring there are larger probabilities of both ozone and PM\(_{10}\) exceedances due to the dry season and clear skies. Additionally, in March/April there are large numbers of forest fires in the south of Mexico, with smoke travelling to Mexico City. Hence, it is likely to have occurrences of PM\(_{10}\) exceedances during that period. We may also see that exceedances of both ozone and PM\(_{10}\), i.e., the occurrence of state 3, are fewer during the rainy season and especially during the period ranging from the beginning of July to the end of October. Very low estimated and practically zero observed probabilities are also noticed when we consider state 2 (meaning an exceedance of PM\(_{10}\) only) during the period ranging from the beginning of April until mid-October. Additionally, in the case of the occurrences of state 1 (exceedances of ozone only) the probabilities are above 0.5 for the period ranging from mid-June to the beginning of September. In contrast, the probabilities of having exceedances of both pollutants (state 3) are low (around 0.25) for a similar period. Additionally, we may see that for the period ranging from 01 January to approximately the beginning of June, while the probabilities of having no exceedances, exceedances of either only ozone or PM\(_{10}\) are below 0.3 for most of the time, in the case of exceedances of both pollutants these probabilities are always above 0.5. Hence, is more likely to have exceedances of both pollutants at the same time than having either only of one of them or none at all.

The results presented here may be used to obtain information about the occurrences of exceedances of none, one or more pollutants in given days of the year. That is made using the estimated \(P(W_{t} = \tilde{i})\), \(\tilde{i} \in S_{2}\), \(t = 1, 2, \ldots , T\). We may do that as follows. If we want to know the probability of having both ozone and PM\(_{10}\) exceedances at a given time t, \(t = 1, 2, \ldots , T\), we just need to look at the values of \(P(W_{t} = 3)\). For instance, in the case of \(t = 90\), which would correspond to the 31st of March in a non-leap year and the 30th of March in a leap year, we have that this probability is approximately 0.777 whereas the probability of having an exceedance of ozone only is \(P(W_{90} = 1) \approx 0.127\), and of PM\(_{10}\) only is \(P(W_{90} = 2) \approx 0.009\) (see tables in the Supplementary Material file). We also have that \(P(W_{90} = 0) \approx 0.086\).

Remark

Note that the estimated initial and transition probabilities reflect averages behaviours. Also, note that even though there might be differences in the values of these probabilities if the day, say 30 March, falls in a weekday or in a weekend, it will fall on a weekend only at intervals of about six years, hence, the expected value of the corresponding probability will appear just 1/6 of the times with little effect on the total average.

Another quantity that may be obtained using the model is, as in Rodrigues et al. (2015), the probability of having the occurrence of exceedances of either none, one or both pollutants in a given time when we have information of a few days earlier. For instance, assume we want to know the probability that at time \(t = 100\) (a day at the beginning of April) we have exceedances of both pollutants, i.e., an occurrence of state 3, and that in the three previous days we have the following sequence of states: 2, 1, 2 (i.e., exceedances of PM\(_{10}\) only, ozone only and, again, PM\(_{10}\) only). Therefore, we want to know the probability of the following sequence of states,

$$\begin{aligned} W_{97} = 2, W_{98} = 1, W_{99} = 2, W_{100} = 3. \end{aligned}$$

Hence, we have,

$$\begin{aligned}{} & {} P(W_{97} = 2, W_{98} = 1, W_{99} = 2, W_{100} = 3)\\{} & {} \quad = P(W_{100} = 3 \, | \, W_{99} = 2) \times P(W_{99} = 2 \, | \, W_{98} = 1) \\{} & {} \qquad \times P(W_{98} = 1 \, | \, W_{97} = 2) \times P(W_{97} = 2) \\{} & {} \qquad \approx 0.917 \times 0.026 \times 0.045 \times 0.048 \approx 5.1499E-05. \end{aligned}$$

(Again, the values of the transition and absolute probabilities are given in the Supplementary Material files.)

If we compare to the other probabilities, i.e., having either state 0, 1, or 2 (no exceedances, ozone exceedances only, PM\(_{10}\) exceedances only) we have

$$\begin{aligned}{} & {} P(W_{97} = 2, W_{98} = 1, W_{99} = 2, W_{100} = i)\\{} & {} \quad \approx \left\{ \begin{array}{lll} 0.027 \times 0.026 \times 0.045 \times 0.048 \approx 1.516E-06, &{} \text{ for } i = 0 \\ 0.03 \times 0.026 \times 0.045 \times 0.048 \approx 1.685E-06, &{} \text{ for } i = 1 \\ 0.025 \times 0.026 \times 0.045 \times 0.048 \approx 1.404E-06, &{} \text{ for } i = 2 \\ \end{array} \right. \end{aligned}$$

Hence, the probability of having either no exceedances or exceedances of only one of the pollutants is of order ten times lower than having exceedances of both pollutants. Recall that March/April is spring and there are many forest fires and also clear skies.

If we want to know the probability of having an ozone exceedance on the day \(t = 250\), then we just need to use the values of \(P(W_{250} = 1)\) and \(P(W_{250} = 3)\) since the event \(\{ \) an ozone exceedance \( \}\) may be written as the union of the events \(\{ \) an exceedance of ozone only \(\}\) and \(\{ \) exceedances of both ozone and PM\(_{10}\) \(\}\), i.e., \(\{ \) an ozone exceedance \( \} = \{ \) occurrence of (1, 0) \(\} \cup \{\) occurrence of \( (1, 1) \}\). Therefore,

$$\begin{aligned} P(\{ \text{ an } \text{ ozone } \text{ exceedance } \text{ at } t = 250 \})= & {} P(W_{250} = 1) + P(W_{250} = 3) \\\approx & {} 0.509 + 0.126 = 0.635. \end{aligned}$$

(Again, the particular values of the absolute probabilities may be found in the Supplementary Material file.)

Using the model described here we may also jointly study more pollutants. For instance, we could also include PM\(_{2.5}\), nitrogen dioxide, carbon monoxide, among others.

As in Álvarez et al. (2005) and Rodrigues et al. (2015) we could also consider for each pollutant several threshold values and with that estimate the probability that their levels belong to a particular concentration interval. For instance, if we take the values to declare the several levels of emergency alerts in Mexico City related to ozone and PM\(_{10}\), i.e., if \(L_{i}^{(O)}\) and \(L_{i}^{(PM)}\) indicate the thresholds for ozone and PM\(_{10}\) in order to declare Phase i, \(i = I, II\), of the emergency alert, then we have \(L_{I}^{(O)} = 0.154\) ppm, \(L_{I}^{(PM)} = 214\) \(\mu g/m^{3}\), \(L_{II}^{(O)} = 0.204\) ppm and \(L_{II}^{(PM)} = 354\) \(\mu g/m^{3}\), which are well above the Mexican environmental thresholds for these pollutants. Hence, we could calculate the probability of having the ozone concentration between 0.095 and 0.154 while having the PM\(_{10}\) concentration in the interval (214, 354), i.e., the probability of having the ozone concentration above the Mexican environmental threshold, but below the ozone threshold used to declare Phase I of the environmental alert with the PM\(_{10}\)’s indicating that a Phase I could be declared. This would provide information on the possibility of having an environmental alert in the city. Based on this information, environmental authorities could take actions in order to avoid the occurrence of this alert.

One fact that is clear is that the estimated probabilities are not smooth functions of time. However, it is also clear that the empirical (observed) probabilities which are being represented are not smooth either. This could change if more years were considered in the analysis. This would increase the chance that under represented events occur. Just for the sake of comparison we have decided to check if with the 100 simulated chains of our simulated study, smother function would be obtained, for instance, for \(P(W_{t} = \tilde{i})\), \(\tilde{i} \in S_{2}\), \(t = 1, 2, \ldots , T\). However, that did not happen. Nevertheless, in spite of having only twenty five years of observations the estimated results represent well the observed. If smoothness of the functions is of interest, another option to produce smoother curves to model the observed probabilities and hence the probabilities \(P(W_{t} = \tilde{i})\), \(\tilde{i} \in S_{2}\), \(t = 1, 2, \ldots , T\) is to use cubic splines.

One may also argue what the gain is in using this type of model when it seems that the empirical results give all the information we need. One positive side is that in the cases where transition events which did not happen during the observational period, but are part of the sample space, would be allowed to be generated in case one wanted to simulate string of events and also if the interest resides in informing the possibility of having such transitions occurring (even though the probabilities might be overestimated).