Dynamic Non-parametric Monitoring of Air-Pollution

Air pollution poses a major problem in modern cities, as it has a significant effect in poor quality of life of the general population. Many recent studies link excess levels of major air pollutants with health-related incidents, in particular respiratory-related diseases. This introduces the need for city pollution on-line monitoring to enable quick identification of deviations from “normal” pollution levels, and providing useful information to public authorities for public protection. This article considers dynamic monitoring of pollution data (output of multivariate processes) using Kalman filters and multivariate statistical process control techniques. A state space model is used to define the in-control process dynamics, involving trend and seasonality. Distribution-free monitoring of the residuals of that model is proposed, based on binomial-type and generalised binomial-type statistics as well as on rank statistics. We discuss the general problem of detecting a change in pollutant levels that affects either the entire city (globally) or specific sub-areas (locally). The proposed methodology is illustrated using data, consisting of ozone, nitrogen oxides and sulfur dioxide collected over the air-quality monitoring network of Athens.


Introduction
In recent years, Statistical Process Control (SPC) has been proposed in environmental related monitoring problems (Pan and Chen 2008;Paroissin et al. 2016). Such problems typically involve processes in the presence of autocorrelation, and relevant SPC techniques used in conjunction with appropriate time-series models (Pan and Chen 2008;Triantafyllopoulos and Bersimis 2016). Over the recent years there has been a significant development of non-parametric monitoring methodology in SPC, as evidenced in Chakraborti et al. (2004), Qiu and Li (2011), Qiu (2018) and in references therein. Distribution-free statistical procedures for process-monitoring are favoured to parametric-based methods, as they relax the distributional assumption of the observed data. However, such non-parametric procedures usually focus on univariate i.i.d. processes. As it is well known in non-industrial processes several variables are often observed and exhibit significant autocorrelation, i.e. the observed process is a multivariate time series. Examples of such process include, but are not limited to, environmental and financial processes.
Air pollution consists of the introduction of chemicals, particulate matter and biological materials into the atmosphere, causing severe damage to the environment (Christodoulakis et al. 2017). The major air pollutants are the sulfur dioxide (SO 2 ), the nitric oxide (NO), also known as nitrogen monoxide, the nitrogen dioxide (NO 2 ), the carbon monoxide (CO) and the ozone (O 3 ). Recent research related to air pollution is focused on public health (Jiang et al. 2016;Raaschou-Nielsen et al. 2016). In the last twenty years, many epidemiological and medical studies, by showing the relation of air pollution to a series of serious and many times emergency health problems, have concluded that public health (especially in urban areas and large cities) is threatened (Mudway and Kelly 2000;Atkinson et al. 2001). As a result a continually monitoring mechanism is needed, to monitor jointly the levels of all major pollutants in a city, taking into account the contribution of climate related variables. This need is stressed by the United States Environmental Protection Agency, which launched in 2009 the Environmental Tracking Network in order to monitor air-quality in many locations in the United States. Furthermore, this mechanism should be able, to account for local (temporal) differences or even spatial variations of the levels of all major pollutants, realising in such a way an evidence-based monitoring system. The value of such a monitoring system lies in its capability of giving early alarms before high levels of pollution are realised and thus having a predictive nature. Multivariate statistical process (MSPC) methods is an ideal vehicle, in order to develop and implement such an automatic monitoring system. Over the last decade MSPC methods have become very popular for monitoring multivariate processes, see e.g. the review of Bersimis et al. (2007).
In this article we develop a statistical framework for automatic monitoring of airpollution levels, which enable real-time detection of aberrant pollution and issuing warnings, in order to permit the Environmental Public Administration to activate safeguarding mechanisms early in time. We propose a dynamic linear model (Prado and West 2010; Petris et al. 2010) that defines the "normal" or "expected" evolution of the process and deviations from this is measured using residuals. An unusual positive residual indicates aberrant pollution levels beyond the expected, taking into account climate changes,as we are interested in pollution which exceeds the expected levels, extreme negative residuals do not count as aberrant. We then propose distribution-free monitoring of such residuals, based on generalised binomial-type statistics as well as on rank statistics. The novelty of the proposed methodology is the real-time (at daily frequency) automatic monitoring of pollution levels and the detection of sudden shifts, which are linked to respiratory conditions and deaths (Rosenlund et al. 2008;O'Neill and Ebi 2009;Jiang et al. 2016). Indeed, there is an established link between such sudden shifts of pollution, which are not captured by the recommended thresholds issued by the environmental agencies and are based on long-term pollution dynamics rather than short-term temporal variation, which is considered in this paper.
The rest of the paper is organised as follows. Section 2 describes the data and provides some related background. Section 3 discusses the proposed methodology, consisting of inference of a multivariate time series model as well as of monitoring procedures applied to the residuals of that model. The framework proposed in this paper is applied in the available real data in Section 4. Finally, in Section 5, some concluding comments are given, while technical arguments are included in the Appendix.

Data Description
Athens is a city of almost 4 million people (census of March 2001) located in an oblong basin of approximately 450 km 2 area. It is surrounded by mountains and the Saronic Gulf. The data consists of 2922 mean measurements recorded in daily frequency, from January 1, 2001 to December 31, 2008 in Athens, using the air-quality monitoring network (AQMN) of the city, consisting of 17 active stations (locations) placed in urban and peri-urban sites around the city. In this study we used 13 out of the 17 active stations of Athens AQMN (4 stations were not active during the study period, depicted in Fig. 1 by the unnumbered squares). Each station consists of a number of sensors, dedicated to the acquisition of the main pollutants: carbon monoxide CO, nitrogen oxides NO and NO 2 , sulfur dioxide SO 2 , and ozone O 3 . Figure 2 plots daily mean measurements of O 3 , NO 2 and NO, for Stations 4 and 10 (this numbering refers to the map of Fig. 1). We observe that O 3 and NO exhibit annual seasonality with slight linear trend, while NO 2 has a more irregular variation, with negligible seasonality and no trend. There does not seem to be different effects of the dynamics of each of the three variables between the two stations. In addition to the above measurements, the stations record daily meteorological information, such as temperature, humidity and wind speed.
The original data are high frequency values, however, we use daily averages calculated over 24 hours in order to limit micro-spatial, micro-temporal sampling uncertainties as well as measurement error. In each day the collected data consist of hourly measurements from midday 12:00 until midday 12:00 hours of the next day; these measurements are averaged to produce a single measurement for each day. Daily averaging reinforces also the general hypothesis of symmetry of the data in hand (which will be assumed later). For the stations used in the analysis, missing data at a given site and time have been estimated by using standard imputation methods, basically obtained as weighted averages of neighbouring available data, so that to keep the annual seasonality in the original data; for a review of imputation methods in air quality data see Junninen et al. (2004).

Monitoring
Let y (i) j,t denote the daily mean measurement of the j th climate variable (j = 1, 2, . . . , p i ) at time t = 1, 2, . . . , N and location i = 1, 2, . . . , L. Then, at a given location i, we work with t ) ) , for t = 1, 2, . . . , N. The r-dimensional vector y t , which contains the original measurements for all locations, forms a multivariate time series, which is autocorrelated and cross-correlated (contains a part of spatial effect for correlations between station variables). The proposed framework combines the use of time series forecasting, multivariate SPC and non-parametric methods for monitoring and early diagnosing beyond the expected air pollution levels. In particular, the following two analysis components are highlighted: • Process modelling. We set-up a time series model which describes the "normal" long term behaviour of the time series y t . This is achieved at Phase I by using historical data as well as by allowing for some small temporal variation during Phase II (implementing a self-adaptive time series model). • Process monitoring. We establish a real-time monitoring procedure on the residual vector e t of the time series model by defining appropriate control procedures for identifying (i) an overall "aberrant" behaviour of the time series y t , (ii) an "aberrant" behaviour of the time series y t in one or more neighbouring locations, and (iii) an "aberrant" behaviour of the time series y t in specific components related to a certain air-quality variable.

Process Modelling
We consider a state space multivariate model for describing the temporal data variation in Phase I. In the sequel we discuss this model specification as well as inference. Let t be a time-varying d × r unobserved state matrix (for some positive integer d and r = L i=1 p i mentioned above), which drives the dynamics of y t in the following state space model The state matrix t is assumed to follow a Markov process, i.e. t = H t−1 + ζ t , where ζ t is an innovation matrix with zero mean matrix and left covariance matrix Z t and right covariance matrix , written as ζ t ∼ (0, Z t , ). The innovation term t is assumed to have zero mean and covariance matrix , written as t ∼ (0, ). It is noted that with vec(·) the column staking operator of a matrix, the covariance matrix of vec(ζ t ) is ⊗ Z t , where ⊗ denotes the Kronecker product of two matrices. Hence the covariance matrix of vec(ζ t ) is proportional to the observation covariance matrix . This is commonly adopted in matrixvariate dynamic models when we wish to estimate ; see Triantafyllopoulos (2008) and Petris et al. (2010). According to Petris et al. (2010) there is no loss of generality specifying such a covariance structure for ζ t , because the covariance matrix Z t will compensate any discrepancies of between ζ t and t . Furthermore, the innovation terms t and ζ t are assumed to be individually and mutually independent as well as independent of the initial state 0 . It is also assumed that the system is initialized at state 0 , with 0 ∼ (m 0 , P 0 , ), for some known matrix mean m 0 and covariance matrix P 0 . In the above model specification, the distributions of t , ζ t and 0 are left unspecified; only means and variances of these quantities are specified. Based on this partial specification of t and ζ t , the likelihood function is not available, however, approximate Bayesian inference is achieved by Bayes linear methods, which approximate the posterior means and variances by minimising the expected posterior risk (Petris et al. 2010). The components of the vector y t are likely to have the same distribution, albeit not a prespecified distribution such as the multivariate Gaussian distribution. Our study benefits by relaxing the distribution assumption and allowing a wider class of distributions, such as approximate Gaussian and Student t distributions; for a related discussion the reader is referred to Triantafyllopoulos and Harrison (2008).
Conditionally on the availability of model components F t , H, and Z t , we can use the celebrated Kalman filter, in order to obtain estimates of the state matrix and for forecasting. For the estimation of the covariance matrix , we adopt the procedure of Triantafyllopoulos (2007), while the rest of the components are specified by the user (see Section 4.1). This procedure enables us to compute the standardized residuals e t , which are used in the monitoring stage of Section 3.2), and exhibit the following properties: (a) they are serially independent, i.e. e i is independent of e j , for i = j ; (b) for a fixed t, e t = (e 1,t , e 2,t , . . . , e r,t ) is a cross-independent random vector, i.e. e j,t is independent of e k,t , for any j = k; (c) ] their distribution is approximately symmetric, since the limiting distribution of the residual vectors is Gaussian.

Process Monitoring
In this section the monitoring problem in Phase II is considered. The monitoring procedures, for the p i air quality measurements as well as for the L distinct data recording stations, which are developed in this section, are based on the residual vectors e t , and on assumptions (a)-(c) of the previous section. The main idea is that if we assume that the sequence of the vectors y t , is fitted well in an appropriate time series model (like the one presented in the previous section), describing the usual in-control state of the process, the residual vectors e t , hold information about the deviation from "normal" pollution level (expected levels or incontrol state of the process) to the actual values of the measurement vectors y t . In that way, large positive values for the components of e t , are implying deviation from the expected, or "normal", while small absolute values are implying a bond to the expected. In the sequel three monitoring procedures are discussed, each of which aiming at the monitoring of the overall area, as well as for identifying a possible cluster of sub-areas or a specific air quality variable that exhibit aberrant behaviour. We propose binomial/generalised binomial type statistics and rank-based statistics in order to monitor the entire area under surveillance, a sub-section of that area (sub-area) of interest, or a particular set of variables of interest.
The binomial-type statistics deployed in each of the above three procedures use the vector s t , with elements s i,t , defined as for t = 1, 2, . . . N and i = 1, 2, . . . , r. Basically, this construction dichotomises the individual residuals e i,t , into two classes: one for positive residuals, which correspond to potentially high levels of pollution, and the other with negative residuals, which correspond to low levels of pollution. In the sequel we propose particular binomial-based tests, exploiting the above dichotomy.

Overall Monitoring of the Area Under Surveillance
For the overall control of the area of interest we may use the statistic T B,1 which represents the number of the components e (i) k,t , k = 1, 2, . . . , p i ; t = 1, 2, . . . , N; i = 1, 2, . . . , L of e t , t = 1, 2, . . . , N that are greater than zero. Specifically, statistic T B,1 is defined as This statistic is based on the idea that if the process is in-control (which means that for all t, the vector y t is fitted well in the appropriate time series model) the components e (i) k,t of e t must be randomly distributed above and below 0 (point of symmetry of the distribution). In contrast, if the vector e t contains an extreme number of either positive or negative values we have evidence of a systematic out-of-control condition in most of the variables of interest and as well as in the whole area under inspection. This statistic under the null hypothesis that the process is in-control follows a binomial distribution with probability of success equal to 0.5 and a number of trials equal to the length of s t (r = L i=1 p i ). Statistic T B,1 is suitable for the overall control, since it is not sensitive to random or occasional outliers, as the summation in T B,1 is responsible of issuing out-of-control signals only in the case that enough components of e t , t = 1, 2, . . . , N being systematically positive. We remark that since the procedure is based on the number of positive components of e t , t = 1, 2, . . . , N it is independent of the ordering of the components e (i) k,t of e t . For implementation purposes and effective application, we propose to use the normal approximation to the binomial distribution (since r is large enough, here 54), and thus we define the statistic T B,1 = (2T B,1 − r)/ √ r, which follows the standard normal distribution N(0, 1). Using this statistic, a classical rule is to issue an alarm when the observed value of T B,1 is larger than a one sided 3 sigma control limit (UCL = 3).
In general, we can implement additional control procedures, in order to improve the ability of T B,1 to be responsive and to adapt quickly to changes, but keeping the procedure as simple as possible. Such a procedure can be implemented by applying multiple limits, for the positive values of T B,1 . We adopt three zones, each of which correspond to different statistical control decisions, i.e. Using these three zones, we can perform control at a weekly basis, or at another chosen period of time. Thus, for each week, we can track air pollution levels.
We propose that an alarm should be signalled, if either of the following rules apply: • Rule 1: a plotted value in interval I 3 (the appearance of an extreme value, with probability 0.001350). • Rule 2: 4 are plotted in interval I 2 in the last week (increasing trend for high deviation observations, with probability 0.000365). This is the probability to observe a particular event of the rule 4/7, e.g. I 2 , I 2 , I 1 , I 1 , I 2 , I 1 , I 2 , or 0, 157305 4 × 0, 841345 3 = 0, 000365; there are 35 such events, with total probability 0.012763227.
Rule 2 can be modified, depending on the sampling frequency, e.g. if data are collected every 6 hours, then Rule 2 could be amended to 3 out of 4 observations of T B,1 . The study of such rule can be carried out in a similar way as that of 4 out of 7 procedure of Rule 2, which is studied in Appendix A, using the Markov chain embedding technique; for a related discussion of Markov chain embedding methodology the reader is referred to Balakrishnan and Koutras (2002). The mean and the standard deviation of the run length (RL) distribution are 147.22 and 143.29, respectively. This is in agreement with Frisen (2008), who states that, considering annual data, the usual ARL of the control procedures is set in the interval 120-240 containing the one third and the half of the year. We note that instead of using a Shewhart-type control chart supplemented with runs rules, a CUSUM or EWMA control chart may be proposed as an alternative. However, the choice of Shewhart type control chart and specifically the choice of a Shewhart control chart based only on the signs of the residuals, while supplemented with runs rules, is somehow straightforward. In particular, three key advantages are pointed out: (a) this procedure is easy to implement, which is necessary since the proposed framework is aimed at monitoring pollution and should be accessible to environmental scientists; (b) it is very robust to outliers (even conservative), which is critical since the nature of the application demands a very low level of false signals (just consider the case of a monitoring system providing a large number of false announcements of a hypothesised problem); and (c) this procedure is easily interpretable by the practitioner (see e.g. Koutras et al. 2007).

(ii) Monitoring Statistics Based on Ranks
For the overall control of the area of interest we may use the statistic T R,1 which represents the sum of the ranks of the positive values e i,t , R + (e it ), of the vector e t . In that way, we define a Wilcoxon-type monitoring statistic which is where R + (·) represents the rank of an element. This statistic is approximated by normal distribution for fairly large values of r (see Gibbons and Chakraborti 2010) using the following standardisation T R,1 = R + (|e it |) − n(n + 1)/4 √ n(n + 1)(2n + 1)/24 , t = 1, 2, . . . ; i = 1, 2, . . . , r Under the hypothesis that the process is under control we may calculate the appropriate 3 sigma control limit (here UCL = 3, for α ≈ 0.00135).

Sub-area Control (i) Binomial and Generalised Binomial Type Monitoring Statistics
As already mentioned, the control of the whole area is not sufficient since many times the problem discussed appears in a small number of sub-areas (in general in a subset of sub-areas). Thus, the control chart proposed above, is inappropriate since it cannot detect clustering of sub-areas having extreme values. For example, this control procedure treats the same the case of a number of random components of e t being positive and the case that any such number of positive components of e t being observed at the same or nearby areas. Thus, we need an appropriate procedure that identifies clusters of extreme cases in the sequence of e (i) k,t , k = 1, 2, . . . , p i ; t = 1, 2, . . . , N; i = 1, 2, . . . , L. For this reason the elements of e t are ordered according to their spatial neighbouring. The ordering of the components of e t , can be roughly succeeded using the matrix of distances of the recording stations and one of the available techniques of multivariate ordering such as the minimal spanning tree. In other words, the components e it are ordered according to their smallest distance (areas which are classified by the minimal spanning tree as "close" place their corresponding residuals one after the other).
In this case, we may use a runs-based statistic, say T B,2 , similar to the one introduced by Antzoulakos et al. (2003), defined as where w p is a constant positive integer depending on p (usually w p is set to be slightly smaller than p) and for the r components of s t and s 0 = s r+1 = 0 by convention. This statistic is the sum of the run-lengths of length w p and more. If this statistic takes large values, this means that there are some sub-areas that experience high levels of pollution to all or to the majority of their variables. On the contrary if T B,2 takes small values then the 1s are randomly assigned to the vector, which means that no specific area has a major problem. The T B,2 sums the lengths of runs each having at least length w p or more w p . We do not count the number of runs of a specific length or the number of runs of length above a threshold. Instead, we sum the lengths of runs of length w p or more which enables the identification of clusters within a single sub-area or between two consecutive sub-areas. This is useful since, due to spatial correlation of neighbouring areas, pollution levels are expected to attain similar exposure. According to Antzoulakos et al. (2003), the performance of this statistic as a randomness test is shown to be significantly more powerful than competitors when the type I error must be kept low, e.g. α = 0.01, which is exactly our case. The control chart based on this statistic has also only an upper control limit since the clustering of areas with extreme values, naturally drives the T B,2 statistic to high values. The control limit for the chart based on the distribution of T B,2 , is calculated using the formula for x ≥ 0, n ≥ w p + 2, which is the result of appropriately adopting Theorem 3.2 provided by Antzoulakos et al. (2003). Table 1 shows upper control limits (UCL) for two values of α and for k = 4, 5, 6, 7. For r < w p + 2 the initial conditions of the same theorem can be explored appropriately. In this case the ARL of this control chart can be easily calculated using the geometric distribution with probability of success α. We note that the control chart based on T B,2 may be used independently of the control chart based on T B,1 . This means that the control chart based on T B,2 is not supplementary to the control chart based on T B,1 , since it can be used without using necessarily the T B,1 control chart. However, it should be noted that T B,1 and T B,2 are not independent.
(ii) Control Statistics Based on Ranks As we already mentioned the control of the whole area is not sufficient since many times the problem appears in only one or two sub areas (in general in a subset of subareas). Using the residual vector e t we may use the ranks of the components of the vector e t and test if one or more areas have extreme values. For example, we may find that in one area the sum of the ranks is too high while in another area the sum of the ranks is too low. Using this idea, we can define a Kruskall-Wallis type control statistic (e.g. Gibbons and Chakraborti 2010) which is given by where R(·) represents the sum of the ranks of the residuals related to the corresponding location. This statistic is approximated by a Chi-Square distribution with L − 1 degrees of freedom for fairly large values of r.

Variable Control (i) Binomial and Generalised Binomial Type Statistics
Reordering the components of e t in such a way that the same air quality variable from all the recording stations are placed together in blocks using the following format for e t = (e 1,t , e 2,t , e 3,t , e 4,t , e 5,t Air-Quality Variable 1 , e 6,t , e 7,t , e 8,t , e 9,t , e 10,t Air-Quality Variable 2 , . . . , e r,t ) , we can immediately define a control procedure which aims in identifying if a specific air quality characteristic is out-of-control in all of the areas. Specifically, we establish a similar procedure to that of sub-area monitoring, based on T B,3 , defined as where w L is a constant (similar to w p described in the previous section) and for the r components of s t and s 0 = s r+1 = 0 by convention. If the number of occurrences is high, then we assume that certain variables have an aberrant behaviour. The control chart based on this statistic has only an upper control limit since the same variable in many areas has extreme values, naturally drives the T B,3 statistic to high values.

(ii) Control Statistics Based on Ranks
Regrouping the components of e t in such a way that the same air quality variable from all the recording stations are placed together and establishing a similar in nature to the rule of Section 3.2.2(ii), we develop immediately a control procedure which aims in identifying if a specific air quality characteristic is out of control.

Joint Effects of the Control Procedure Based on the Binomial-Type Statistics
In this section we study some of the joint effects of the performance of the control procedures described above. We start by looking at the probability that T B,2 (sub-area control) or T B,3 (variable control) give an out of control signal, provided that T B,1 has signalled an out of control point. In Appendix B we show that the joint tail probability of {T B,2 > c 2 } and where a = j − i 1 + j 2 − w p (j 1 + j 2 ) and b = i − w p − i 1 − a, for some constants c 1 and c 2 . Therefore, the conditional probability of {T B,2 > c 2 } given {T B,1 > c 1 } is provided by the definition of the conditional distribution Equation 3 provides the probability that both T B,1 and T B,2 or T B,1 and T B,3 signal an outof-control point, while appropriate use of the conditional distribution can help us to control the overall probability of type I error (i.e. falsely declare the process as out-of-control while the process is actually in-control).
If the process is in-control and T B,1 does not provide a signal it is hoped that the probability of T B,2 or T B,3 not signalling (since the process is assumed to be in control) should be high. For specified values of k and UCL, Table 2 shows the probability the statistic T B,2 not to issue an out-of-control signal, provided that T B,1 failed to issue a signal. We observe that under the null hypothesis (in-control state) T B,2 has small probability to wrongly issue an out of control signal. Table 3 gives the probability that T B,2 or T B,3 signal an out-of-control point, given that T B,1 has signalled an out-of-control point. We remark that as the observed value of T B,1 increases, the probability that T B,2 or T B,3 issue and out-of-control signal is Below is a summary of the proposed process modelling and monitoring procedures, which are put into practice in Section 4.

Phase II (process monitoring)
. Continue to fit model (1) using the optimised parameters from Phase I. Obtain sequentially standardised residuals e t , for each t in Phase II. Chart using binomial-type statistics or rank-based statistics, for the three specific cases: • Overall area control. Use the statistics T B,1 or R t,1 of Section 3.2.1.
• Sub-area control. Use the statistics T B,2 or R t,2 of Section 3.2.2.
• Variable control. Use the statistics T B,3 or R t,3 of Section 3.2.3.

Modelling
In this section we consider the pollution data described in Section 2. In the first stage we use a state space model for modelling the temporal data variation in the time period January 1, 2001 to 4 December 2008 (Phase I); this corresponds to 2895 observations in Phase I. The proposed model is a time-varying regression model which comprises trend and seasonal components in order to take into account temporal variation. Specifically, three covariates (humidity, temperature and wind speed), denoted by x 1,t , x 2,t , x 3,t , respectively, enter in the time series model of y t as time-varying regressor variables, with corresponding regression parameters following a random walk evolution. The state space model adopted for y t is so that y t comprises three dynamic regression components R i,t = i,t x i,t , a trend component T t = 4 [1, 0] , a seasonal component S t = 5 [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] and a random innovation term t , where t = [ 1,t , 2,t , 3,t , 4,t , 5,t ]. The state matrix and the rest of the state space model is as defined in Section 3.1. A weakly informative prior setting is adopted whereby P 0 = 1000I (nearly zero precision P −1 0 ≈ O) and m 0 = 0. The observation covariance matrix is estimated as discussed in Section 3.1, while Z t is specified using a single discount factor δ (Petris et al. 2010).
It is noted that model (4) is implied by adopting the following setting F t = x 1,t , x 2,t , x 3,t , 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0 T , where "block diag" indicates a block diagonal matrix with components I for the identity matrix, J 1 for a Jordan block with a unit eigenvalue, responsible for the trend variation, and J 2 (φ) for the harmonic component, responsible for the seasonal variation, i.e.
The above state space model is known as a reduced form trend-seasonal model (Petris et al. 2010), for the seasonal variation of which, a reduced form state space representation of the full Fourier expansion is used. For this data the cycle is c = 365 (daily data with annual seasonality) with frequency φ = 2π/c; here for computational efficiency, we use only the first 5 harmonics out of a total of (c − 1)/2 = 182, hence the reduced form. Assumptions (a)-(c) of Section 3.1 were validated in Phase I, after the first 100 observations were considered as training data and excluded from the tests. Assumptions (a)-(b) were validated by performing standard white noise tests (Petris et al. 2010). The symmetry of the residuals (assumption (c)) was established by first grouping positive and negative residuals and then performing a Kolmogorov-Smirnov test to ensure the two groups have the same distribution. The p-value of the test was 0.1979 and so assumption (c) was validated. We used a standard Chi-square test applied on the residuals to assess the goodness of fit (Prado and West 2010; Petris et al. 2010). We concluded there was overwhelming evidence in favour of the model fit, with the corresponding p-value being almost equal to 1.

Monitoring
Considering Phase I analysis, in the period 11 April 2001 to 4 December 2008 (the first 100 observations are used as attaining set), we report on Rules 1 and 2, described above; Fig. 3 gives the Phase I control chart for the last 60 days of Phase I (6 October 2008 to 4 December 2008). Considering the entire Phase I period, there are 3 points beyond the 3σ control limit (Rule 1) while 3.9 were expected. Additionally, by applying the second rule (Rule 2), 15 more signals are given by the proposed control chart. In total 18 signals were found in Phase I, while 19 were expected, hence it is consistent with the in-control ARL = 147. One of the signals corresponds to clusters of time where the deviation from the model is consistent. As this analysis reveals, the period in which the deviation from the model is consistently large, corresponds to a period that the measurements are systematically quite large in relation to expected pollution levels, but still not beyond the limits set by environmental agencies. In this period, as the study revealed, an extreme heat wave hit Athens, Greece's capital. at each time t we apply the Kalman filter and the related time series algorithm described above, in order to obtain sequentially residuals, which measure deviations from the normal process. We observe that no value of T B,1 exceeds the 3 control limit and so there is no signal issued under Rule 1. Considering Rule 2 (4 or more values of T B,1 exceeding 1 in last week) we observe that within the first 10 days in Phase II (5-14 December 2008) there are four days in a window of less than or equal to 7 with T B,1 exceeding 1 (observations of 9, 10, 11 and 14 December 2008). Thus, following Rule 2, on 14 December there is a signal issued (indicated by red point in Fig. 5), which highlights a sequence of large pollution levels, in this time interval. This suggests that in this week, large deviations from normal (average) pollution were present. This conclusion is supported by the data itself, since in that period a high pollution peak is present to most of the variables.
For example we observe that since e 1,t = 1.40 ≥ 0, we have s 1,t = 1; likewise as e 2,t = −0.47 < 0, we have s 2,t = 0 and so forth. T B,2 gives a signal if a large number of runs of length at least 4 are observed; in the above s t such consecutive units are indicated. Thus, for this point of time (10 December 2008), there is an out of control signal issued, since T B,2 = 23. The control chart, applies this control procedure for each time t in Phase II and it is shown in Fig. 6. We observe that on 10 December a signal is issued, as having high levels of pollution when sub-area ordering is taken into account. A careful study of the data in this period reveals that stations in the centre of Athens recorded high values across the pollutants.
In order to explore the performance of T B,2 as opposed that of T B,1 , we observe that the points t = 2900 and t = 2921 (corresponding to 9 December and 30 December 2008, respectively) yield the same value of T B,1 , i.e. T B,1 = 34 (T B,1 = 2), while T B,2 give respective values of 18 and 8. Thus, while the number of extreme values is the same for both points (same value of T B,1 , with same number of units in s t vector), the values of T B,2 have a notable difference (more than 100%), which could possibly translate to different control decisions (out-of-control and in-control). This simple example shows that the related correlation of T B,1 and T B,2 (both depend on the number of variables p), is not dominant, with regards to respective control decisions. Furthermore, T B,2 describes a local control procedure (defined by the station proximity), while T B,1 describes a global control procedure.
Here we can see the difference between s t of sub-area control and variable control. The control is similar as in T B,2 , but here we use w L = 5, i.e. 5 or more consecutive units account for a signal at time t. The control chart using procedure T B,3 is depicted in Fig. 7. We observe that dates 10, 14 and 30-31 December give out of control signals. We note that using the T B,3 statistic we may sum runs of excess length that belong to consecutive blocks of variables. The choice of w p and w L in the statistics T B,2 and T B,3 , respectively, is indicative. As suggested in Antzoulakos et al. (2003), it is usual practice to use lower values of these parameters (3 or 4), since we do not expect all variables to be positive in an area with high pollution levels.
Having discussed the binomial-type monitoring statistics it is noted that the application of rank statistics had a similar performance in both phases. However, their performance appear to be slightly deteriorated in the case of Athens considered in this paper. In particular, this is observed when comparing the performance of T B,2 based control chart with the one that is Phase II control chart for variable control based on Kruskall-Wallis test statistic (T R,2 ). This is due to the deterioration of the Kruskal-Wallis type statistic as the number of the different areas increases. The rules based on T B,1 supplemented with runs appeared to be better even from standard CUSUM-type procedure. For this reason we propose that for large number of variables and sub-areas the binomial and the generalised binomial control procedures are preferred than the rules based on ranks.

Discussion
By comparing the three Phase II control charts (overall control, sub-area control and variable control) we see that on the period 10-14 December all charts indicate high levels of pollution, although the overall control occurs a delay by only detecting this abnormal behaviour on the 14 December, while T B,3 chart is faster. The overall control does not signal any other aberrant behaviour (the same is true for T B,2 ), while the variable control issues a further signal on 30 December. It appears that when T B,1 signals an out-of-control point, this is the result either or both of out-of-control signals in T B,2 or/and T B,3 . To expand on this point, T B,1 fails to issue a signal on 10 December, while T B,2 and T B,3 captures this on 10 December. Thus, T B,2 and T B,3 signal locally the out-of-control signal, which then becomes global and it is captured by T B,1 later on 14 December. On 30 December T B,3 issues a clear outof-control signal, while T B,1 shows some tendency towards a signal, but it is not clearly issued. It seems that this implies that a local concentration of high pollution levels tends to increase the global pollution levels. Since the values of T B,2 are far from the limits for 30 December, it appears that the problem is captured well and on-time by T B,3 which means that the problem is associated with clustering of high values to specific pollutants. In conclusion in the period 10-14 December there is a certain deviation from the expected pollution levels. This deviation starts by a local problem (which is clearly captured with an out-of-control signals by T B,2 ) and becomes a global problem as some pollution levels are systematically high (T B,3 gives two out-of-control signals in these 5 consecutive dates). In the last two time points (30 and 31 December) T B,3 gives a clear out-of-control signal. As the analysis revealed, in Phase I we identified periods of time with extreme deviations from expected while in Phase II the proposed monitoring scheme captured extreme pollution deviation, both locally and globally.

Conclusions
This paper develops a unified framework for monitoring the mean effects of several airpollution variables, over a network of stations. This framework combines (a) the use of time series modelling, for the temporal description of mean pollution levels and estimation of cross-correlation of pollution variables, and (b) multivariate control chart methods, for the detection of deviations from the expected pollution levels. We develop a multivariate statistical process control approach, which allows the monitoring of the pollution levels in the overall area and in sub-areas, determined by rules of neighbouring similarity. Air-pollution data from the city of Athens are used to put in practice the proposed monitoring approach.
The application to Athens data reveals that the proposed methodology can provide remarkable evidence about the source of the pollutant temporal variation. In particular, it is able to identify whether temporal exceedance come from a specific station (location) or it is attributed to a specific pollutant. It is able to identify short periods of large variations that are related to excess numbers of hospital admissions even when the formal thresholds (issued by the environmental protection agencies) are not exceeded. This paper makes a contribution on multivariate SPC for non-industrial processes, which is currently in demand (see relevant discussion in the introduction), and is an area which is likely to receive even more attention in the near future.
The number of states in are 7 4 + 1, while the four coordinates may be interpreted as follows: • i 0 records the number of points fall into interval I 2 in a window of length 7 (at most) as the process evolves in time, and • i 1 , i 2 and i 3 specify the positions of the last three points fall into interval I 2 in the window of length 7 (at most) as the process evolves in time.