Modeling air quality level with a flexible categorical autoregression

To study urban air quality, this paper proposes a novel categorical time series model based on a linear combination of a bounded Poisson distribution and a discrete distribution, which describe the dynamic and systemic features of air quality, respectively. Daily air quality level data of three major cities in China, namely Beijing, Shanghai and Guangzhou, are analyzed. It is concluded that the air quality in Beijing is the worst among the three cities but is gradually improving, and that its dynamics are also the most pronounced. Theoretically, the design of our model increases the flexibility of the probabilistic structure while ensuring a dynamic feedback mechanism without high computational cost. We estimate the parameters through an adaptive Bayesian Markov chain Monte Carlo sampling scheme and show the satisfactory finite sample performance of the model through simulation studies.


Introduction
Air quality has become a common concern, both for health-sensitive individuals and for researchers. The reason for concern is that air pollution is a major cause of death and disease, and it further poses a threat to economic development and inclusive prosperity. Exposure to air pollution is the fourth leading fatal health risk worldwide behind metabolic risks, dietary risks, and tobacco smoke, while in low- and middle-income countries it is the third, behind metabolic risks and dietary risks (World Bank and IHME 2016). In 2016, the global health cost of mortality and morbidity caused by exposure to ambient PM2.5 air pollution was $5.7 trillion, equivalent to 4.8% of global gross domestic product (GDP). By region, the cost in China and India is equivalent to 7.5-8% of GDP (World Bank 2020). Andrée (2020) showed that PM2.5 was a very important predictor of confirmed COVID-19 cases and associated hospital admissions. Because air pollution leads to the loss of productive labor, it is also an economic burden. In addition, air pollution disproportionately affects the poorest populations, which hinders the achievement of shared and inclusive prosperity.
Over the past two decades, China has been actively working to reduce average urban ambient air pollutant concentrations. However, challenges remain in managing air pollution according to either the World Health Organization guidelines or China's own grade I limit value. Therefore, it is necessary and valuable to conduct research on the air quality of major cities in China. In this paper, we focus on the air quality in recent years in three of China's most developed first-tier cities: Beijing, Shanghai and Guangzhou. The air quality level is quantized into six categories in China: (1) excellent; (2) good; (3) slightly polluted; (4) moderately polluted; (5) heavily polluted; (6) severely polluted. Daily data on the air quality level of each city naturally form a categorical time series. This paper is about the analysis of such categorical time series $X_1, \ldots, X_n$ with the ordered categorical range $\{m_1, \ldots, m_M\}$, where $m_1 < \cdots < m_M$. Such data can also be viewed as an ordinal time series; see Weiß (2020) for details, where the dissimilarity of ordinal categories is expressed through a distance metric. This paper proposes an observation-driven model to study air quality level data, in which the observations are assumed to follow a novel distribution based on a linear combination of a bounded Poisson distribution and a discrete distribution. On the one hand, the dynamic structure relies on the intensity parameter of the bounded Poisson distribution, which depends on past information in autoregressive form. On the other hand, the discrete distribution characterizes systemic features in the observations that do not vary with time. This design increases the flexibility of the probabilistic structure of the model while ensuring the presence of a dynamic feedback mechanism. It is meaningful to distinguish between systemic and dynamic changes in air quality research.
Specifically, in China, suspended dust, coal combustion, industrial dust, vehicle emissions, biomass burning and secondary particulate matter are the sources contributing to urban pollution (World Bank 2012). Among them, suspended dust, coal combustion, biomass burning and secondary particulate matter are seasonal factors, but a considerable part of them is non-seasonal because of the almost fixed climate and topography and the high maturity of the industrial structure in each city. Therefore, this part is treated as systemic, while industrial dust, vehicle emissions and the non-systemic part of seasonal factors are considered dynamic. According to data released by the Ministry of Public Security of China, 27.53 million new motor vehicles were registered in the first three quarters of 2021, an increase of 4.363 million units or 18.83% year-on-year, which lends some support to treating vehicle emissions as dynamic.
The proposed model is well suited to studying air quality data. Its first advantage is simplicity and practicality, mainly because it does not require the conversion of observations into vector form and has fewer parameters to be estimated. Some categorical time series models treat observations as an $(M - 1)$-dimensional vector, so that the conditional distribution given the past is naturally multivariate. Many other categorical time series models, such as Markov chain models, generalized choice models, and spectral envelope models, have also been studied in the literature. For earlier works, see, for example, Stoffer et al. (1993), Fokianos and Kedem (2003) and the references therein. For more recent ones, see Kauppi and Saikkonen (2008), Moysiadis and Fokianos (2014), Fokianos and Moysiadis (2017) and Fokianos and Truquet (2019). The theory of modeling categorical time series in vector form is gradually being refined, but there is usually a large number of parameters to be estimated, resulting in high computational costs. Therefore, this type of model is difficult to apply in practice. In contrast, our proposed model guarantees a time-varying feedback mechanism without high computational cost.
The second contribution of the proposed model is that it relaxes, to some degree, the restriction to a pre-defined distribution in models of bounded count time series or categorical time series. A large number of studies have been devoted to modeling bounded count time series with binomial distributions. For example, Weiß (2009), Cui and Lund (2010), Weiß and Pollett (2012) and Weiß and Kim (2013) modeled count time series with a finite range based on the binomial thinning operator introduced by Steutel and van Harn (1979). The binomial AR(1) model defined in Al-Osh and Alzaid (1991), based on the hypergeometric thinning operator, is also available. Moreover, integer-valued GARCH models with binomial marginals are an implementable approach; see Weiß and Pollett (2014) and Chen et al. (2020). Liu et al. (2022) proposed the zero-one-inflated bounded Poisson autoregressive model, focusing on the normalcy-dominant phenomenon in air quality level data, and obtained a ranking of the air quality of major cities in China. However, the true probability structure of real data is complex and hard to determine, so reliance on a distribution type in the modeling process makes the risk of model mis-specification unavoidable, which in turn may invalidate the estimation. In this paper, the introduction of the discrete distribution, which describes systemic features in the observations that do not vary with time, weakens the restrictions on the distribution to a certain extent. Compared with a fixed binomial marginal distribution or a bounded Poisson distribution, the linear combination of a bounded Poisson distribution and a discrete distribution makes the probability structure more adaptive to the data in practice.
For estimation and inference, we develop Bayesian inference procedures via Markov chain Monte Carlo (MCMC) methods for the proposed model. Related Bayesian works can be found in Chen et al. (2016), Xu et al. (2020) and Gorgi (2020). The daily air quality level data for three major cities in China, namely Beijing, Shanghai and Guangzhou, are analyzed. For this data set, we have two main concerns. The first is the overall difference in air quality among the three cities from 2016 to 2020. The second is the year-to-year change in air quality for each city. We draw conclusions for each concern separately.
The organization of this paper is as follows. Section 2 introduces the so-called Novel Category distribution and the resulting model. Section 3 investigates Bayesian inference procedures. Simulations are provided in Sect. 4. Applications to the air quality level data are provided in Sect. 5.

Categorical time series models combining dynamic and systemic information
We start by considering a novel distribution on $\{0, 1, \ldots, K\}$, named the Novel Category (NC) distribution. In its probability mass function, $\delta$ is the proportion parameter satisfying $\delta \in [0, 1)$ and, for some $i^* \in \{0, \ldots, K\}$, $\alpha_{i^*}$ is set to zero to ensure the identifiability of the model, i.e., at most $K$ entries of $\alpha = (\alpha_0, \ldots, \alpha_K)$ are non-zero. The distribution is denoted by $\mathrm{NC}(\lambda, \alpha, \delta, K)$. It can be observed that the NC distribution is a linear combination of the bounded Poisson distribution $\mathrm{BP}(\lambda, K)$ and a discrete distribution on $\{0, 1, \ldots, K\}$ with probabilities $\alpha_0, \ldots, \alpha_K$. The NC distribution is suitable for fitting categorical data. One reason is its finite state space $\{0, 1, \ldots, K\}$; the other is that the parameters $\delta$ and $\alpha_i$ relax the restrictions of existing probability structures, so that the distribution is no longer limited to the Poisson form or to any other fixed form. The flexible probability structure improves the credibility of fitting the varied categorical data encountered in real life. Meanwhile, the presence of $\lambda$ facilitates the introduction of dynamic information, which will be explained in detail later.
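The following Python sketch illustrates how such a probability mass function can be evaluated. Two forms are assumptions made for illustration and are not taken from the displayed formulas: the bounded Poisson distribution is taken to be a Poisson($\lambda$) law truncated to $\{0, \ldots, K\}$ and renormalized, and the combination is taken to be $\delta \cdot \mathrm{BP}(\lambda, K) + (1 - \delta)\,\alpha$, so that $\delta$ weights the part driven by $\lambda$. The function names are hypothetical.

```python
import math


def bounded_poisson_pmf(lam, K):
    """Assumed BP(lam, K): Poisson weights on {0, ..., K}, renormalized to sum to 1."""
    weights = [lam ** k / math.factorial(k) for k in range(K + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def nc_pmf(lam, alpha, delta, K):
    """Assumed NC(lam, alpha, delta, K) pmf: delta * BP(lam, K) + (1 - delta) * alpha."""
    if len(alpha) != K + 1:
        raise ValueError("alpha must have K + 1 entries")
    if min(alpha) < 0 or abs(sum(alpha) - 1.0) > 1e-9:
        raise ValueError("alpha must be a probability vector")
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must lie in [0, 1)")
    bp = bounded_poisson_pmf(lam, K)
    return [delta * b + (1.0 - delta) * a for b, a in zip(bp, alpha)]
```

For instance, `nc_pmf(1.0, [0.1, 0.0, 0.55, 0.15, 0.1, 0.1], 0.6, 5)` returns six probabilities that sum to one; the entry for state 1 stays positive even though $\alpha_1 = 0$, because the bounded Poisson part still puts mass there.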
Consider a categorical time series $\{Y_t\}_{t=1}^{n}$ that is conditionally NC distributed with time-varying $\lambda_t$, that is, $Y_t \mid \mathcal{F}_{t-1} \sim \mathrm{NC}(\lambda_t, \alpha, \delta, K)$ (2.1), where $\mathcal{F}_{t-1}$ is the $\sigma$-field generated by $\{Y_{t-1}, Y_{t-2}, \ldots\}$. Under this specification, the conditional mean $E[Y_t \mid \mathcal{F}_{t-1}]$, given in (2.2), is composed of two parts: a time-varying term and a constant term involving $\sum_{i=0}^{K} i\alpha_i$, which represent dynamic and systemic information, respectively. Further, we assume the autoregressive structure (2.3) for $\{\lambda_t\}_{t=1}^{n}$, with parameters $\omega > 0$, $\psi \ge 0$ and $\phi \ge 0$. For ease of discussion, only the first-order autoregressive structure for $\{\lambda_t\}$ is investigated; the generalization to higher-order autoregression is possible using similar arguments. Note that the structure of (2.3) has been used by Chen et al. (2018) to examine the causal relationship between ambient fine particles and human influenza in Taiwan. Equations (2.2) and (2.3) imply that the current state $Y_t$ in (2.1) is determined by two parts: the time-varying part affected by past observations and the inherent part that does not change with time. The closer the proportion parameter $\delta$ is to zero, the more stable the probability structure and the smaller the share of the part that changes over time. Conversely, $\delta$ reflects the flexibility of the probability structure over time, with values closer to 1 indicating higher flexibility. Next, we give an explicit definition.
Definition 2.1 A categorical time series $\{Y_t\}_{t=1}^{n}$ is said to follow the flexible categorical autoregressive (FCAR) model if $\{Y_t\}_{t=1}^{n}$ satisfies (2.1) and (2.3).
The FCAR model introduces an autoregressive feedback mechanism into the linear combination of bounded Poisson and discrete distributions, which lays the foundation for analyzing the dynamic and systemic features of air quality. The subsequent section is concerned with the estimation of the parameters in the FCAR model.
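To make the feedback mechanism concrete, the following Python sketch simulates a path from a model of this type. It rests on assumptions that should be checked against (2.1)-(2.3): the NC probabilities are taken as $\delta \cdot \mathrm{BP}(\lambda_t, K) + (1 - \delta)\,\alpha$ with a truncated, renormalized Poisson as the bounded Poisson part, and the first-order recursion is taken to be $\lambda_t = \omega + \psi Y_{t-1} + \phi \lambda_{t-1}$. The assignment of $\psi$ and $\phi$ to the two lagged terms, the starting values `lam0` and `y0`, and all function names are hypothetical.

```python
import math
import random


def nc_pmf(lam, alpha, delta, K):
    """Assumed NC pmf: delta * truncated Poisson(lam) on {0..K} + (1 - delta) * alpha."""
    weights = [lam ** k / math.factorial(k) for k in range(K + 1)]
    total = sum(weights)
    return [delta * w / total + (1.0 - delta) * a for w, a in zip(weights, alpha)]


def conditional_mean(lam, alpha, delta, K):
    """E[Y_t | F_{t-1}] computed directly from the assumed conditional pmf."""
    return sum(k * p for k, p in enumerate(nc_pmf(lam, alpha, delta, K)))


def simulate_fcar(n, omega, psi, phi, alpha, delta, K, lam0=1.0, y0=0, seed=0):
    """Simulate n observations under the assumed recursion
    lam_t = omega + psi * Y_{t-1} + phi * lam_{t-1}."""
    rng = random.Random(seed)
    lam, y_prev = lam0, y0
    ys, lams = [], []
    for _ in range(n):
        lam = omega + psi * y_prev + phi * lam        # dynamic feedback
        pmf = nc_pmf(lam, alpha, delta, K)            # conditional distribution of Y_t
        y_prev = rng.choices(range(K + 1), weights=pmf)[0]
        ys.append(y_prev)
        lams.append(lam)
    return ys, lams
```

With $\delta = 0$ the conditional mean reduces to $\sum_{i=0}^{K} i\alpha_i$ and no longer depends on $\lambda_t$, which is the purely systemic case in this sketch.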

Bayesian inference
Before proceeding formally with parameter estimation, it is necessary to specify the dimensionality of the parameters to be estimated. Because of the restriction $\sum_{i=0}^{K} \alpha_i = 1$ and the identifiability condition $\alpha_{i^*} = 0$ for some $i^* \in \{0, \ldots, K\}$, only $K - 1$ parameters of $\alpha$ need to be estimated. For simplicity of exposition, we rearrange the parameter vector $\theta$ into three parts $\theta_m$, $m = 1, 2, 3$, where $\theta_{-m}$ is the vector of all unknown parameters except $\theta_m$, and $P(\theta_m)$ is the prior density.
The choice of priors is not unique, but noninformative ones are usually appropriate; see Chen et al. (2016) and Xu et al. (2020). Specifically, for $m = 1, 2$ and $3$, we use indicator functions $I\{\theta_m \in \Omega_m\}$ as uniform priors for $\theta_m$, where $\Omega_m$ denotes the admissible region of $\theta_m$.
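As a rough illustration of posterior sampling under uniform (indicator) priors, the sketch below implements a plain random-walk Metropolis sampler in Python. It is not the adaptive scheme used in this paper, which combines a random-walk stage with an independent-kernel stage. The following are assumptions made for illustration: the conditional pmf is $\delta \cdot \mathrm{BP}(\lambda_t, K) + (1 - \delta)\,\alpha$ with a truncated Poisson as the bounded Poisson part, the recursion is $\lambda_t = \omega + \psi Y_{t-1} + \phi \lambda_{t-1}$, the admissible region additionally imposes $\phi < 1$, all parameters are updated in a single block, and the names `unpack`, `log_posterior` and `rw_metropolis` are hypothetical.

```python
import math
import random


def unpack(theta, K, i_star):
    """theta = (omega, psi, phi, delta, K - 1 free alphas); rebuild the full alpha."""
    omega, psi, phi, delta = theta[:4]
    free = list(theta[4:])
    open_idx = [i for i in range(K + 1) if i != i_star]
    alpha = [0.0] * (K + 1)                       # alpha[i_star] stays at zero
    for i, a in zip(open_idx[:-1], free):
        alpha[i] = a
    alpha[open_idx[-1]] = 1.0 - sum(free)         # last entry fixed by sum-to-one
    return omega, psi, phi, delta, alpha


def log_posterior(theta, ys, K, i_star, lam0=1.0):
    """Log-likelihood plus log of the indicator prior (-inf outside the region)."""
    omega, psi, phi, delta, alpha = unpack(theta, K, i_star)
    if omega <= 0 or psi < 0 or not 0 <= phi < 1 or not 0 <= delta < 1:
        return -math.inf
    if min(alpha) < 0:
        return -math.inf
    lam, y_prev, ll = lam0, ys[0], 0.0            # condition on the first observation
    for y in ys[1:]:
        lam = omega + psi * y_prev + phi * lam
        w = [lam ** k / math.factorial(k) for k in range(K + 1)]
        p = delta * w[y] / sum(w) + (1.0 - delta) * alpha[y]
        if p <= 0:
            return -math.inf
        ll += math.log(p)
        y_prev = y
    return ll


def rw_metropolis(ys, theta0, K, i_star, n_iter, step=0.02, seed=0):
    """Random-walk Metropolis with a Gaussian proposal on all coordinates."""
    rng = random.Random(seed)
    theta = list(theta0)
    lp = log_posterior(theta, ys, K, i_star)
    if lp == -math.inf:
        raise ValueError("theta0 lies outside the admissible region")
    draws = []
    for _ in range(n_iter):
        prop = [t + rng.gauss(0.0, step) for t in theta]
        lp_prop = log_posterior(prop, ys, K, i_star)
        # accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = prop, lp_prop
        draws.append(list(theta))
    return draws
```

Proposals that leave the admissible region receive log-posterior $-\infty$ and are always rejected, which is how the indicator prior enters the sampler.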

Simulations
To examine the effectiveness of the proposed MCMC methods, we investigate the finite sample performance by Monte Carlo simulations in this section. The following three data generating processes (DGPs) with sample sizes $T = 300, 500, 1000$ are considered:
• DGP 1: $Y_t$ follows the FCAR model with $K = 5$, $\alpha_1 = 0$ and $(\omega, \psi, \phi, \alpha_0, \alpha_2, \alpha_3, \alpha_4, \delta) = (0.3, 0.35, 0.2, 0.1, 0.55, 0.15, 0.1, 0.6)$;
• DGP 2: $Y_t$ follows the FCAR model with $K = 5$, $\alpha_2 = 0$ and $(\omega, \psi, \phi, \alpha_0, \alpha_1, \alpha_3, \alpha_4, \delta) = (0.3, 0.35, 0.2, 0.1, 0.55, 0.15, 0.1, 0.6)$;
• DGP 3: $Y_t$ follows the FCAR model with $K = 5$, $\alpha_1 = 0$ and $(\omega, \psi, \phi, \alpha_0, \alpha_2, \alpha_3, \alpha_4, \delta) = (0.3, 0.35, 0.2, 0.1, 0.55, 0.15, 0.1, 0.7)$.
We simulate 500 replications from each of the three DGPs. The number of iterations of the random-walk Metropolis-Hastings stage is set to $M = 10{,}000$ and the total number of iterations to $N = 30{,}000$. Only the $N - M$ iterations of the independent-kernel Metropolis-Hastings stage in each run are used for inference. The simulation results of DGPs 1, 2 and 3 are reported in Tables 1, 2 and 3, respectively. The columns of the tables report, from left to right, the true value, the average posterior mean, the median, the standard deviation (Std.), and the posterior 2.5 and 97.5 percentiles; the last two items constitute a 95% credible interval (CI). For all three DGPs and all three sample sizes, the biases of the posterior means relative to the true values are reasonably small, as are the biases of the posterior medians. This implies that both the posterior mean and the posterior median are applicable as estimators in the FCAR model. The standard deviations of the intercept parameter $\omega$ and of $\delta$ are larger than those of the other parameters, but remain acceptable. As expected, all standard deviations decrease as the sample size increases.
Moreover, all true values are covered by the 95% CIs, and both the posterior 2.5 and 97.5 percentiles move closer to the true values as the sample size increases. All the above results indicate that the Bayesian method is effective for estimating the unknown parameters in the FCAR model.
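The summary statistics reported in the tables can be computed from a chain of posterior draws as in the following Python sketch. The linear-interpolation rule for percentiles and the function names are choices made for illustration; other percentile conventions give slightly different values.

```python
import statistics


def percentile(sorted_values, q):
    """q-th percentile (0 <= q <= 100) by linear interpolation between order statistics."""
    n = len(sorted_values)
    pos = q * (n - 1) / 100.0
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def summarize(draws, burn_in):
    """Posterior mean, median, Std., and 95% credible interval after discarding burn_in draws."""
    kept = sorted(draws[burn_in:])
    return {
        "mean": statistics.mean(kept),
        "median": statistics.median(kept),
        "std": statistics.stdev(kept),
        "p2.5": percentile(kept, 2.5),
        "p97.5": percentile(kept, 97.5),
    }


def covers(summary, true_value):
    """Check whether the 95% credible interval contains the true value."""
    return summary["p2.5"] <= true_value <= summary["p97.5"]
```

In the setting of this section, `burn_in` would correspond to $M$ and `len(draws)` to $N$, with one call per parameter and replication.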

Empirical analysis
In this section, we study the daily air quality level data for three major cities in China: Beijing, Shanghai and Guangzhou. The air quality level is quantized by the Chinese government into six categories: '0' stands for 'excellent'; '1' stands for 'good'; '2' stands for 'slightly polluted'; '3' stands for 'moderately polluted'; '4' stands for 'heavily polluted' and '5' stands for 'severely polluted'. Naturally, the air quality level of each city forms a categorical time series.
Each data set we consider covers 5 years of data, from January 1, 2016 to December 31, 2020, with a total of 1827 observations. We have two concerns about this data set. The first is the overall difference in the air quality of the three cities from 2016 to 2020. The second is how the air quality of each city changes year by year. The analysis focuses on the distinction between systemic and dynamic features. The systemic features we consider are the systemic part of the seasonal factors, including suspended dust, coal combustion, biomass burning and secondary particulate matter, while industrial dust, vehicle emissions and the non-systemic part of the seasonal factors are considered to be dynamic.
As a preliminary step, we report in Fig. 1 the plots of the ordinal Cohen's $\kappa(h)$ for the three cities, which is a measure of serial dependence in categorical time series defined in Weiß (2020). The slow decay with increasing time lag $h$ implies that the data are consistent with the autoregressive structure (2.3).
It must be emphasized that, when facing real data, we cannot determine in advance which $i^*$ satisfies $\alpha_{i^*} = 0$; fixing it arbitrarily would weaken the rationality of the model. Therefore, in the empirical analysis, for each data set we set $\alpha_{i^*} = 0$ in turn for $i^* = 0, \ldots, 5$, which generates 6 candidate models, and the candidate with the largest likelihood is selected as the final model for that data set.
To study the first concern, we use the FCAR model to fit the 1827 observations of each city separately and summarize the results in Table 4, including the posterior mean, the standard deviation, and the posterior 2.5 and 97.5 percentiles. Based on these results, we draw the following three conclusions:
(1) The ordering of the proportion parameter $\delta$ in the three models is Beijing (0.7182) > Guangzhou (0.5468) > Shanghai (0.5088). This implies that the probability structure of Beijing's daily air quality level changes the most over time and is relatively unstable, whereas the probability structure of Shanghai is the most stable among the three cities. Beijing's air quality is the most heavily influenced by the dynamic factors, including industrial dust, vehicle emissions and the non-systemic part of seasonal factors.
(2) From the comparison of the values of the three sets of parameters $\alpha_i$ ($i = 1, 2, 3, 4, 5$), it can be seen that the overall air quality of Shanghai is better than that of Beijing, but not as good as that of Guangzhou. The underlying principle is that the more probability is concentrated on $\alpha_i$ with smaller $i$, the better the air quality in the city. It is well documented that soil dust and road dust contribute the most to PM10 in the urban atmosphere. In northern cities, the contribution of soil dust and road dust to PM10 concentration is higher than that in southern cities.
[...] (2022) and a special FCAR model with $\alpha_i = 0$ for all $i$, named the bounded Poisson autoregressive (BPAR) model. In Table 5, it can be seen that the FCAR model performs best in terms of the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), which implies that the inclusion of systemic factors makes sense.
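The selection among the six candidate models and the AIC/BIC comparison can be sketched in Python as follows. The formulas $\mathrm{AIC} = -2\ell + 2p$ and $\mathrm{BIC} = -2\ell + p \log n$ are the standard definitions, with $\ell$ the maximized log-likelihood, $p$ the number of free parameters and $n$ the number of observations. The parameter count $p = 4 + (K - 1)$ for the FCAR model ($\omega$, $\psi$, $\phi$, $\delta$ and the $K - 1$ free entries of $\alpha$) follows from the discussion of dimensionality in the Bayesian inference section; the function names are hypothetical.

```python
import math


def n_params_fcar(K):
    """omega, psi, phi, delta plus the K - 1 free entries of alpha."""
    return 4 + (K - 1)


def aic(loglik, n_params):
    """Akaike information criterion: smaller is better."""
    return -2.0 * loglik + 2.0 * n_params


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: smaller is better."""
    return -2.0 * loglik + n_params * math.log(n_obs)


def select_i_star(loglik_by_i_star):
    """Given {i_star: log-likelihood of the candidate with alpha[i_star] = 0},
    return the i_star whose candidate has the largest log-likelihood."""
    return max(loglik_by_i_star, key=loglik_by_i_star.get)
```

Because all six candidates have the same number of free parameters, choosing the largest likelihood among them is equivalent to choosing the smallest AIC or BIC; the criteria only matter when comparing models of different size, such as FCAR against BPAR.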
To check the adequacy of the specified model, we calculate the estimated standardized Pearson residuals $e_t = (Y_t - E[Y_t \mid \mathcal{F}_{t-1}]) / \sqrt{\mathrm{Var}[Y_t \mid \mathcal{F}_{t-1}]}$ and report the ACF plots of the residuals in Fig. 2. Moreover, Ljung-Box tests are applied to check whether the residuals appear to be white noise, and the corresponding p-values are also shown in Fig. 2. The results in Fig. 2 demonstrate that the fitted FCAR models are adequate.
Next, we fit the data of each city for each whole year (2016-2020) by the FCAR model to characterize the annual changes in air quality over the last 5 years. For clarity, we show the results in Figs. 3, 4, 5 and 6, and obtain the following conclusions:
(1) Figure 3 shows the values of $\delta$ for the three cities over the past 5 years. It can be seen that $\delta$ in Beijing shows a decreasing trend, which implies that the influence of systemic factors on air quality in Beijing is deepening. In Guangzhou, $\delta$ increases after 2017, implying that the air quality in Guangzhou is becoming more vulnerable to dynamic factors. The change of $\delta$ in Shanghai is relatively mild.
(2) It is obvious from Fig. 4 that the value of $\alpha_3 + \alpha_4 + \alpha_5$ is decreasing year by year, which shows that the air quality of Beijing is showing signs of improvement. Based on $\alpha_3 + \alpha_4 + \alpha_5$ in Figs. 5 and 6, we find that the air quality in Shanghai and Guangzhou has always been better than that in Beijing. From 2016 to 2019, $\alpha_0 + \alpha_1$ of Shanghai is rising, while $\alpha_2$ is declining. This indicates a further improvement of Shanghai's air quality, but with a slight rebound in 2020.
(3) Table 6 reports the posterior means and 95% CIs of $\hat{\omega}$, $\hat{\phi}$ and $\hat{\psi}$. The year-on-year changes of $\hat{\omega}$ once again imply that, although the air quality of Beijing is the worst among the three cities, it is gradually improving. The slight changes of $\hat{\omega}$, $\hat{\phi}$ and $\hat{\psi}$ over the 5 years indicate that the internal structure of the dynamic factors affecting air quality in Guangzhou is stable. This reflects the efficiency of Guangzhou's pollution control policy, which adapts well to changes in dynamic factors such as vehicle emissions and industrial upgrading.
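The adequacy check described earlier in this section, based on standardized Pearson residuals and the Ljung-Box test, can be sketched in Python as follows. The code takes the fitted conditional probability mass functions as given (one per time point) and does not depend on how they were obtained; the function names are hypothetical. Only the Ljung-Box statistic $Q = n(n + 2) \sum_{h=1}^{H} r_h^2 / (n - h)$ is computed, where $r_h$ is the lag-$h$ sample autocorrelation of the residuals; a p-value would follow from comparing $Q$ with a chi-square reference distribution.

```python
import math


def conditional_moments(pmf):
    """Mean and variance of a pmf on {0, ..., K} given as a list of probabilities."""
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, var


def pearson_residuals(ys, pmfs):
    """e_t = (Y_t - E[Y_t | F_{t-1}]) / sqrt(Var[Y_t | F_{t-1}]),
    where pmfs[t] is the fitted conditional pmf of ys[t]."""
    residuals = []
    for y, pmf in zip(ys, pmfs):
        mean, var = conditional_moments(pmf)
        residuals.append((y - mean) / math.sqrt(var))
    return residuals


def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic Q = n (n + 2) * sum_{h=1}^{max_lag} r_h^2 / (n - h)."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [e - mean for e in residuals]
    c0 = sum(c * c for c in centered)
    total = 0.0
    for h in range(1, max_lag + 1):
        r_h = sum(centered[t] * centered[t - h] for t in range(h, n)) / c0
        total += r_h * r_h / (n - h)
    return n * (n + 2) * total
```

Small values of $Q$ relative to the chi-square reference are consistent with residuals that behave like white noise, which is the pattern reported for the fitted FCAR models in Fig. 2.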