Outlier detection is based upon checking whether an observed concentration value falls within a given confidence interval, set by
$$ \mu \pm z\times \sigma $$
(1)
where μ is the mean NO2 concentration level in μg m−3, σ is the standard deviation, and z determines the width of the confidence interval. We consider Eq. (1) for grouped NO2 concentration observations within temporal, spatial, and spatio-temporal neighborhoods. Assuming independence and normality, the value of z is set at 1.96 for a 95% confidence interval (Kracht et al. 2014) or at 2.97 for a 99.7% confidence interval, depending on the required strictness of the outlier detection. We used z = 2.97, which in related studies has been rounded to z = 3 (Martínez Torres et al. 2011; Shamsipour et al. 2014).
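As an illustration of Eq. (1), the following minimal sketch derives z from a two-sided confidence level under the normality assumption and checks whether a value falls inside the interval. The function names `z_for_confidence` and `within_interval` are hypothetical and not part of the original analysis.

```python
from statistics import NormalDist


def z_for_confidence(level):
    """Two-sided z such that mu +/- z*sigma covers `level` of a normal distribution."""
    return NormalDist().inv_cdf(0.5 + level / 2.0)


def within_interval(value, mu, sigma, z):
    """True if `value` lies inside the interval of Eq. (1)."""
    return mu - z * sigma <= value <= mu + z * sigma


z95 = z_for_confidence(0.95)    # approximately 1.96
z997 = z_for_confidence(0.997)  # approximately 2.97
```

Rounded to two decimals, the two levels reproduce the values 1.96 and 2.97 quoted in the text.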
NO2 concentrations in an urban setting, however, depend strongly on the proximity of busy roads. Concentrations within a spatial neighborhood are therefore too variable to detect values that are abnormally high given their location. Similarly, air pollutant concentrations within temporal neighborhoods vary strongly with the time of day.
We propose to overcome this by classifying the locations and time periods into 16 spatio-temporal categories distinguished by different levels of air pollution. To do so, we divided the measurement locations into two categories: urban traffic and urban background locations. These take into account the positions of the airboxes near specific land use types, the presence of traffic, and distance from the center. We distinguish four time intervals: traffic hours (6:01–9:00 and 16:01–20:00 UTC), off-peak hours (9:01–16:00 and 20:01–22:00 UTC), transition periods (22:01–1:00 and 5:01–6:00 UTC), and night hours (1:01–5:00 UTC).
Days of the week were divided into two classes: weekdays (Monday to Friday) and weekend days (Saturday and Sunday). Combining the eight temporal classes (four time intervals × two day types) with the two spatial classes resulted in 16 classes. For each spatio-temporal class K, the three steps described below are taken to detect outliers.
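The classification described above can be sketched as follows. This is a minimal illustration, assuming that timestamps are in UTC, that seconds are ignored, and that an observation at or shortly after midnight takes the day type of its own calendar date. The names `PERIOD_ENDS`, `PERIOD_NAMES`, and `spatio_temporal_class` are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime

# Inclusive upper bounds (minutes after midnight) of consecutive periods of the day.
PERIOD_ENDS = [60, 300, 360, 540, 960, 1200, 1320, 1440]
PERIOD_NAMES = ["transition", "night", "transition", "traffic",
                "off-peak", "traffic", "off-peak", "transition"]


def spatio_temporal_class(timestamp_utc, site_type):
    """Return (site type, day type, period of day) for one observation."""
    if site_type not in ("urban traffic", "urban background"):
        raise ValueError(f"unknown site type: {site_type!r}")
    minutes = timestamp_utc.hour * 60 + timestamp_utc.minute
    # First period whose upper bound is >= the minute of the day.
    period = PERIOD_NAMES[bisect_left(PERIOD_ENDS, minutes)]
    # Monday is 0, so values 0 to 4 are weekdays.
    day_type = "weekday" if timestamp_utc.weekday() < 5 else "weekend"
    return site_type, day_type, period


example = spatio_temporal_class(datetime(2024, 1, 1, 8, 0), "urban traffic")
```

Enumerating every hour of a full week for both site types yields exactly 16 distinct class labels.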
1. We transformed the NO2 concentrations using the square root transformation to obtain approximately normally distributed values (Fig. 2), i.e., to justify the use of Eq. (1).
Before transforming the NO2 concentration values, in line with Kracht et al. (2013), we added a value of (1 − minimum value of all observations) to all observations to prevent values < 1 μg m−3 from increasing during square root transformation while values > 1 μg m−3 decrease:
$$ {x}_c=\sqrt{NO{2}_c+\left(1-\min \left( NO{2}_c\right)\right)} $$
(2)
where \( NO{2}_c \) is an observation and \( {x}_c \) is the transformed observation in spatio-temporal class K, where \( K={\bigcup}_{c\in C}\left({x}_c\right) \) and c is an observation index in \( C=\left\{1\dots {N}_C\right\} \), with \( {N}_C \) the total number of observations in class K. Note that \( {x}_c \) has coordinates in space and time.
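A minimal sketch of the shift and square root transformation of Eq. (2) is given below. The function name `sqrt_transform` and the example concentrations are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def sqrt_transform(no2):
    """Shift the observations so that the minimum becomes 1, then take square roots (Eq. 2)."""
    shift = 1.0 - min(no2)
    transformed = [math.sqrt(value + shift) for value in no2]
    return transformed, shift


# Hypothetical concentrations in ug m-3, including a negative sensor reading.
x, shift = sqrt_transform([-3.0, 0.0, 5.0, 21.0])
# shift is 4.0 and x is [1.0, 2.0, 3.0, 5.0]
```

The shift is returned alongside the transformed values because it is needed again to invert the transformation.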
2. As a result of the transformation in Eq. (2), the distribution of the transformed NO2 concentrations is truncated on the left at 1, the transformed value of the lowest observation. The resulting distribution thus resembled a truncated normal distribution (Fig. 3).
For each square-root-transformed NO2 observation \( {x}_{c,i} \), we temporarily excluded the ith observation from the NO2 concentration dataset to prevent this observation, a potential outlier, from influencing the mean and standard deviation. We then obtained the mean and standard deviation of the remainder of the dataset as
$$ {m}_K^{-i}=\frac{\sum_c\left({x}_c\right)-{x}_{c,i}}{\left({N}_C-1\right)} $$
(3)
$$ {s}_K^{-i}=\sqrt{\frac{\sum_c{\left({x}_c-{m}_K^{-i}\right)}^2-{\left({x}_{c,i}-{m}_K^{-i}\right)}^2}{\left({N}_C-2\right)}} $$
(4)
where the summation extends over all hourly NO2 observations \( {x}_c \) in one spatio-temporal class K, and \( {m}_K^{-i} \) and \( {s}_K^{-i} \) are the mean and the standard deviation of all hourly NO2 observations excluding the ith observation \( {x}_{c,i} \), respectively. Note that c, i ∈ C and \( {N}_C \) is the total number of observations in class K.
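The leave-one-out statistics of Eqs. (3) and (4) can be computed for all observations in one pass by reusing the class totals. The sketch below is illustrative only; the function name `leave_one_out_stats` is hypothetical, and the sum-of-squares shortcut is an implementation choice that is algebraically equivalent to Eq. (4).

```python
import math


def leave_one_out_stats(x):
    """For each i, return (mean, standard deviation) of x without its ith value."""
    n = len(x)
    if n < 3:
        raise ValueError("need at least three observations")
    total = sum(x)
    total_sq = sum(value * value for value in x)
    stats = []
    for xi in x:
        mean = (total - xi) / (n - 1)                          # Eq. (3)
        # Sum of squared deviations of the remaining n - 1 values around their mean.
        ss = (total_sq - xi * xi) - (n - 1) * mean * mean
        sd = math.sqrt(max(ss, 0.0) / (n - 2))                 # Eq. (4)
        stats.append((mean, sd))
    return stats


# Hypothetical transformed values; leaving out 10.0 gives mean 2.0 and sd 1.0.
stats = leave_one_out_stats([1.0, 2.0, 3.0, 10.0])
```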
Equations (3) and (4) provided both the mean and the standard deviation of the truncated normal distribution of NO2 concentrations, referred to as \( {m}_K^{-i} \) and \( {s}_K^{-i} \). Equation (1) requires a normal distribution, and therefore, we are more interested in the mean and standard deviation of the underlying normal distribution, referred to as \( {n}_K^{-i} \) and \( {t}_K^{-i} \), respectively, than in the mean and standard deviation of the truncated normal distribution. We use a maximum likelihood estimator to obtain estimated values of \( {n}_K^{-i} \) and \( {t}_K^{-i} \). The log likelihood function is given as
$$ {\sum}_c\ln \left(f\left({x}_c|\theta \right)\right) $$
(5)
where \( f\left({x}_c|\theta \right) \) is the probability density function of the truncated normal distribution of NO2 concentrations, returning the probability density of observing \( {x}_c \) given a set of parameters \( \theta =\left({n}_K^{-i},{t}_K^{-i},a,b\right) \), for \( a\le {x}_c\le b \). In our case of left truncation, we have a = 1 and b = ∞. Then, the probability density function is given as
$$ f\left({x}_c|\theta \right)=\frac{\phi \left(\frac{x_c-{n}_K^{-i}}{t_K^{-i}}\right)}{t_K^{-i}\left(1-\Phi \left(\frac{a-{n}_K^{-i}}{t_K^{-i}}\right)\right)} $$
(6)
Inserting Eq. (6) into the log likelihood function and taking \( {\theta}_1=\left({n}_K^{-i},{t}_K^{-i}\right) \) gives
$$ L\left({\theta}_1\right)={\sum}_c\left(\ln \left(\phi \left(\frac{x_c-{n}_K^{-i}}{t_K^{-i}}\right)\right)-\ln \left({t}_K^{-i}\left(1-\Phi \left(\frac{a-{n}_K^{-i}}{t_K^{-i}}\right)\right)\right)\right) $$
(7)
where ϕ(∙) is the probability density function of the standard normal distribution and Φ(∙) is the corresponding cumulative distribution function. Maximizing the log likelihood function in Eq. (7) with the simplex method of Nelder and Mead (1965) gives maximum likelihood values for \( {n}_K^{-i} \) and \( {t}_K^{-i} \). We used the parameters \( {m}_K^{-i} \) and \( {s}_K^{-i} \) as starting values.
For each observation \( {x}_{c,i} \) removed from the dataset, \( {n}_K^{-i} \) and \( {t}_K^{-i} \) are computed on the remainder of the spatio-temporal class dataset as described above.
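The maximum likelihood step can be sketched as follows, assuming NumPy and SciPy are available. This is not the original implementation: the function names `neg_log_likelihood` and `fit_truncated_normal` are hypothetical, and the sketch minimizes the negative of Eq. (7) with SciPy's Nelder-Mead option, starting from the sample mean and standard deviation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def neg_log_likelihood(theta, x, a=1.0):
    """Negative of Eq. (7) for a normal(n, t) distribution left-truncated at a."""
    n, t = theta
    if t <= 0:
        return np.inf  # reject invalid standard deviations
    z = (x - n) / t
    alpha = (a - n) / t
    # ln(phi(z)) - ln(t) - ln(1 - Phi(alpha)), summed over the observations.
    return -np.sum(norm.logpdf(z) - np.log(t) - norm.logsf(alpha))


def fit_truncated_normal(x, a=1.0):
    """Estimate (n, t) of the underlying normal from left-truncated data."""
    x = np.asarray(x, dtype=float)
    start = np.array([x.mean(), x.std(ddof=1)])  # sample moments as starting values
    result = minimize(neg_log_likelihood, start, args=(x, a), method="Nelder-Mead")
    return result.x[0], result.x[1]


# Synthetic check: draw from normal(2, 1) and keep only values >= 1.
rng = np.random.default_rng(0)
sample = rng.normal(2.0, 1.0, 20000)
sample = sample[sample >= 1.0]
n_hat, t_hat = fit_truncated_normal(sample)
```

In the leave-one-out setting described above, `fit_truncated_normal` would be called on the class data with the ith observation removed, once for every i.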
3. Next, Eq. (1) is adapted to find the lower and upper thresholds beyond which values are considered outliers:
$$ {n}_K^{-i}\pm z\times {t}_K^{-i} $$
(8)
which is computed for each individual observation. If the ith observation \( {x}_{c,i} \) falls outside this interval, it is considered to be an outlier. The observations of spatio-temporal class K are back-transformed after the outlier detection:
$$ NO{2}_c={\left({x}_c\right)}^2-\left(1-\min \left(NO{2}_c\right)\right) $$
(9)
returning the NO2 concentrations in μg m−3. Depending upon the purpose of the outlier detection, the outlying observations can then be removed or further investigated.
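The threshold test of Eq. (8) and the back-transformation can be sketched as below. The function names are hypothetical, and the back-transformation is written as the inverse of Eq. (2), assuming the shift is the value (1 − minimum of the untransformed observations) that was added before taking square roots.

```python
def thresholds(n, t, z=2.97):
    """Lower and upper outlier thresholds on the transformed scale (Eq. 8)."""
    return n - z * t, n + z * t


def is_outlier(x_i, n, t, z=2.97):
    """True if the transformed observation x_i falls outside the interval of Eq. (8)."""
    lower, upper = thresholds(n, t, z)
    return x_i < lower or x_i > upper


def back_transform(x, shift):
    """Invert Eq. (2): square the value and subtract the shift added beforehand."""
    return x * x - shift


# Hypothetical values: n = 3.0, t = 1.0, z = 3.0 give the interval (0.0, 6.0).
flag = is_outlier(6.5, 3.0, 1.0, z=3.0)   # True
original = back_transform(3.0, 4.0)       # 5.0
```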
We further computed the thresholds for the entire dataset, without removal of observation \( {x}_{c,i} \) in Eqs. (3) and (4). The mean and standard deviation of the underlying normal distribution are then expressed by \( {n}_K \) and \( {t}_K \), respectively, which results in the following thresholds:
$$ {n}_K\pm z\times {t}_K $$
(10)
which are also back-transformed using Eq. (9). These thresholds are not used for actual outlier detection, but as an approximation of the thresholds for each spatio-temporal class. This allowed us to compare the thresholds of the 16 spatio-temporal classes. Given the large number of observations in each class, the thresholds are not highly affected by removing one of the observations.
For comparison with conventional monitors, the same analysis was repeated with data from the two NO2 monitors in Eindhoven that are part of the national air quality monitoring network. Both conventional monitors are located at urban traffic locations and were therefore assigned to the same spatial class. We used a temporal classification similar to the one used in the analysis of the airbox data.