Introduction

The French law on air quality and the rational use of energy, dated December 30th, 1996, specifies that the State must ensure air quality monitoring, with the support of local authorities and companies. To this end, France entrusts the AASQA (French approved air quality monitoring associations) with a monitoring and information mission on atmospheric pollution. These are the regional air quality observatories, one for each region. In the Provence-Alpes-Côte d’Azur region (PACA region), the air quality monitoring network is managed by AtmoSud. It informs the public of any increase in pollution levels and provides the data used to activate prefectural information-recommendation and alert procedures for the population. To do this, AtmoSud develops modeling tools adapted to these missions.

AZUR is a modelling platform that creates daily high-resolution concentration maps for pollutants such as PM10, PM2.5 and NO2 over millions of grid cells in a reduced calculation CPU time compared with a deterministic model. It produces maps up to day + 2 at 25 m resolution, integrating measurements and forecasts. AZUR has been operating for 3 years at AtmoSud for daily forecasts and has been adapted for hourly forecast (available at www.atmosud.org). The AZUR platform is used for the monitoring missions entrusted to AtmoSud, as well as for studies requiring high-resolution data (Allouche et al. 2022).

Usually, to produce daily air quality maps, chemistry-transport models such as Chimere (Menut et al. 2021) are applied at regional and continental scales. These deterministic models simulate the temporal evolution of 3D concentration fields of gaseous or particulate species. They take into account the complete atmospheric cycle of each species, as well as the chemical transformation of pollutants along their transport path. These models are fed, among other things, by an inventory of ground-level emissions. It is often difficult to use them at resolutions finer than one kilometer. At finer scales, dispersion models such as ADMS (Carruthers et al. 1997, Seaton et al. 2022, Tognet 2015, Tognet 2016) or SIRANE (Soulhac et al. 2011) are used to produce daily maps. For large areas, they require significant computing resources and regular updates of emissions inventories. The AZUR platform uses a statistical approach that is less costly in terms of IT resources. The constraint on input data is also lower, since the main data required is an annual concentration map of the pollutant studied, updated each year.

To operate, the AZUR platform needs two types of data. The first is spatial information provided by maps of annual concentrations from the ADMS-Urban model (Carruthers et al. 1997, Seaton et al. 2022, Tognet 2015, Tognet 2016) combined with a geostatistical method (Malherbe and Cárdenas 2005, Lichternstern 2013). ADMS-Urban's ability to reproduce strong NO2 concentration gradients in the vicinity of major roads at a resolution of 25 m is a major advantage. The ADMS-Urban model is used in preference to a more complex CFD model or to models using a Street-in-Grid approach, because of the computational time and machine resources available relative to domain size and grid resolution (Kadaverugu et al. 2019; Silveira et al. 2019; Lugon et al. 2020). The second is the temporal variation provided by point measurements or by forecasts simulated by the Eulerian chemistry-transport model CHIMERE (Menut et al. 2021).

This work concerns the spatial part of the modeling platform and focuses on the pollutant NO2 over the whole PACA region. In the first part of the paper, we study the relationships between daily and annual values. Based on these results, we propose a spatial statistical model capable of estimating daily concentrations from annual values. In the second part, we demonstrate that the AZUR model is an exact interpolator. In the third part, we compare its estimation quality with a standard kriging method.

Annual maps

Every year, AtmoSud produces annual high-resolution maps, providing an overview of air pollution concentrations across the region at a final resolution of 25 m. They concern the regulatory pollutants nitrogen dioxide (NO2) and fine particles (PM10 and PM2.5). These maps are used to feed the AZUR day and AZUR hour air quality forecasting platforms.

Annual mapping is carried out using the ADMS-Urban dispersion model developed by CERC [Cambridge Environmental Research Consultants] (Carruthers et al. 1997, Seaton et al. 2022, Tognet 2015, Tognet 2016). It reproduces the dispersion of pollutants emitted into the atmosphere by different types of sources (industrial, road, residential, etc.) as a function of meteorological conditions. Its Gaussian formulation is suited to studies conducted at fine spatial resolutions, allowing considerable freedom in the positioning of calculation points. These points can then be placed at greater or lesser distances from the emission sources, to reproduce the variations in concentration in the areas of interest as faithfully as possible.

The raw ADMS output is then corrected by data assimilation. In order to correct annual spatial variations, we use a large number of temporary campaigns using passive tubes in addition to the 22 fixed stations in the network. The first step is to use a linear regression method, which eliminates the overall bias. A second step consists of kriging with external drift which corrects the mappings locally (Beauchamp et al. 2017 and Gressent et al. 2020).
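The first correction step can be illustrated with a minimal sketch. It assumes that observations are regressed on the raw model values extracted at the measurement points, and that the fitted line is then applied to the whole map; the function names and the numbers are hypothetical, and the second step (kriging with external drift) is not covered here.

```python
import numpy as np

def fit_linear_correction(model_at_stations, observed):
    """Least-squares line observed ~ slope * model + intercept (assumed direction)."""
    slope, intercept = np.polyfit(model_at_stations, observed, deg=1)
    return float(slope), float(intercept)

def apply_correction(model_map, slope, intercept):
    """Apply the fitted line to every cell of the raw model map."""
    return slope * np.asarray(model_map, dtype=float) + intercept

# Hypothetical annual values in µg/m3 at four measurement points.
model_at_stations = np.array([10.0, 20.0, 30.0, 40.0])
observed = np.array([14.0, 26.0, 38.0, 50.0])  # built as 1.2 * model + 2

slope, intercept = fit_linear_correction(model_at_stations, observed)
corrected = apply_correction(np.array([[15.0, 25.0]]), slope, intercept)
print(slope, intercept, corrected)
```

A single global line only removes an overall bias; local departures are what the kriging step described above is meant to address.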

Air quality data

The air quality data produced by AtmoSud are free of charge and available online via our APIs. The NO2 measurement sites are spread throughout the PACA region of France (Fig. 1). They are unevenly distributed over the territory, with the south-west and the coast providing the most data.

Fig. 1

Study area and location of traffic stations (red) and background stations (blue) measuring NO2 in the PACA region

Operated by AtmoSud, these stations have several devices measuring at least NO2 and PM10. Depending on their location, they characterise different influences (urban, traffic…). The measurements are transmitted in real-time every 15 min by the on-line measuring devices (NO2: chemiluminescence analyser, O3: photometric analyser and PM10: FIDAS and CPC).

In this study, we worked on all 22 nitrogen dioxide measurement stations in the PACA region. There are 6 stations under the influence of traffic and 16 rural, urban, and suburban background stations. A nitrogen dioxide measuring station is denoted \({s}_{i}\). Its annual average is \(y\left({s}_{i}\right)\). The daily value considered is the daily hourly maximum. It is seen as the \(p\) th percentile within the annual distribution of daily values and is denoted \({q}_{p}({s}_{i})\). Its rank, \(p\), is the proportion of daily values below the \(p\) th percentile. Throughout this article, "annual value" refers to the average annual concentration and "daily value" to the daily hourly maximum concentration.

The study area is based on a regular grid. For a grid point \({s}_{0}\), its annual value is noted \(y({s}_{0})\) and its daily value \({q}_{p}({s}_{0})\) also seen as the \(p\) th percentile in its annual distribution at grid point \({s}_{0}\).
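The quantities defined above can be computed from an hourly series as in the following sketch. It assumes a complete year of hourly values with no missing data, takes the annual value as the mean of the hourly values, and uses synthetic concentrations; the function names are illustrative only.

```python
import numpy as np

def daily_hourly_max(hourly):
    """Split a 1-D hourly series into days of 24 values and keep each day's maximum."""
    hourly = np.asarray(hourly, dtype=float)
    return hourly.reshape(-1, 24).max(axis=1)

def station_summary(hourly, ranks=range(0, 101, 10)):
    """Return the annual value y(s_i) and the deciles q_p(s_i) of the daily values."""
    daily = daily_hourly_max(hourly)
    annual_mean = float(np.mean(hourly))
    deciles = {p: float(np.percentile(daily, p)) for p in ranks}
    return annual_mean, deciles

# Synthetic hourly NO2 series for one station (not real measurements).
rng = np.random.default_rng(0)
hourly = rng.gamma(shape=2.0, scale=10.0, size=365 * 24)
y_si, q_si = station_summary(hourly)
print(round(y_si, 1), {p: round(v, 1) for p, v in q_si.items()})
```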

Relationship between annual average and hourly maximum

In this section, we carry out an analysis of the concentration pairs observed at the measuring stations, for which we calculate the ratio of their daily and annual values respectively. On the basis of these results, we show that this relationship depends on the range of daily values considered. This range of values is represented by the rank of the daily measurements when they are considered as percentiles of their annual distribution. In this article, we work with deciles instead of percentiles, as this is sufficient to describe the method and fit the model. But in practice, any percentile can be considered.

Consider an annual history of daily values of nitrogen dioxide concentrations from a measuring station. For all pairs of stations \({s}_{i}\) and \({s}_{{i}{\prime}}\) and for a fixed rank p (from 0 to 100 in steps of 10), we calculate the ratio of their daily deciles \({q}_{p}\):

$$\frac{{q}_{p}({s}_{{i}^{\mathrm{^{\prime}}}})}{{q}_{p}({s}_{i})}$$

As well as the ratio of their annual average \(y\):

$$\frac{y({s}_{{i}^{\mathrm{^{\prime}}}})}{y({s}_{i})}$$

Fig. 2 shows that, for the daily deciles \({q}_{80}\), the ratios are lower than the ratios of the means: the spline fit (grey curve) lies below the bisector shown as a dotted line. By calculating these ratios for several deciles, we can observe how these relationships vary (Fig. 3).
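The two ratios can be computed for every pair of stations as sketched below. The sketch assumes ordered pairs (each pair is used in both directions) and hypothetical station names and values; the source does not state how pairs are enumerated.

```python
from itertools import permutations

def pairwise_ratios(annual, deciles, p):
    """For each ordered pair (s_i, s_i'), return the annual ratio and the ratio of deciles of rank p.

    annual:  dict station -> annual mean y(s)
    deciles: dict station -> dict rank -> daily decile q_p(s)
    """
    rows = []
    for s_i, s_j in permutations(annual, 2):
        annual_ratio = annual[s_j] / annual[s_i]
        daily_ratio = deciles[s_j][p] / deciles[s_i][p]
        rows.append((s_i, s_j, annual_ratio, daily_ratio))
    return rows

# Hypothetical values in µg/m3 for a background and a traffic station.
annual = {"background": 10.0, "traffic": 40.0}
deciles = {"background": {80: 30.0}, "traffic": {80: 90.0}}

rows = pairwise_ratios(annual, deciles, p=80)
for row in rows:
    print(row)
```

In this made-up example the annual ratio of the traffic station to the background station is 4 while the ratio of their deciles of rank 80 is 3, which is the kind of point that falls below the bisector.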

Fig. 2

Relationship between the ratios of the annual means and the ratios of daily deciles of rank 80 over the years 2016–2017 (grey curve is a spline function fit)

Fig. 3

Relationship between the ratios of annual averages and daily deciles for 6 deciles represented by their spline fit (over the years 2016–2017)

The representation of these relationships for different daily deciles (Fig. 3) shows that they evolve as follows:

  • For the deciles of low rank (q {10, 20, 30}) the daily ratios are higher than the annual ratios.

  • For the deciles of higher rank (q {50, 70, 100}) the daily ratios are lower than the annual ratios.

Consider a pair of stations with an annual ratio of 4 (Fig. 3). On days when NO2 levels in the air are high, the daily ratio is equal to 2.5 (q 100 curve). On days when NO2 levels are low, the daily ratio is equal to 6.5 (q 10 curve). It can be seen that as NO2 levels increase, the daily ratios decrease. For two measuring stations, the ratio of their daily deciles, therefore, depends on the ratio of their annual averages as well as the rank p. The following relationship can be written:

$$\frac{{q}_{p}({s}_{{i}^{\mathrm{^{\prime}}}})}{{q}_{p}({s}_{i})}=f\left(\frac{y\left({s}_{{i}^{\mathrm{^{\prime}}}}\right)}{y\left({s}_{i}\right)} , p\right)$$
(1)

We suggest a polynomial of degree n for this function \(f\). Equation (1) then becomes:

$$\frac{{q}_{p}({s}_{{i}^{\mathrm{^{\prime}}}})}{{q}_{p}({s}_{i})}=\sum_{j+k<n}{\beta }_{j,k} {\left(\frac{y({s}_{{i}^{\mathrm{^{\prime}}}})}{y({s}_{i})}\right)}^{j} {p}^{k}$$
(2)

With \({\beta }_{j,k}\) coefficients and \(p\in \left[0, 100\right]\).
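Evaluating Eq. (2) amounts to summing monomials in the annual ratio and the rank. The sketch below stores the coefficients in a dictionary keyed by the exponents (j, k), so the set of terms is whatever the coefficient table contains. The coefficient values are hypothetical and are not those of Table 1.

```python
def ratio_model(annual_ratio, p, beta):
    """Evaluate sum of beta[j, k] * annual_ratio**j * p**k, with p on the 0-100 scale."""
    return sum(b * annual_ratio**j * p**k for (j, k), b in beta.items())

# Hypothetical coefficients, chosen only to make the arithmetic easy to follow.
beta_demo = {(0, 0): 1.0, (1, 0): 0.5, (0, 1): -0.01}

high_day = ratio_model(annual_ratio=4.0, p=100, beta=beta_demo)  # 1 + 2 - 1
low_day = ratio_model(annual_ratio=4.0, p=10, beta=beta_demo)    # 1 + 2 - 0.1
print(high_day, low_day)
```

With these made-up coefficients the estimated daily ratio decreases as the rank increases, which is the qualitative behaviour described for Fig. 3.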

Fitting model

In Eq. (2), the \({\beta }_{j,k}\) coefficients are calculated from all the pairs \({s}_{i}\) and \({s}_{{i}{\prime}}\) formed by all the stations in the domain. All the pairs thus formed for the 10 deciles considered constitute a sample of 6250 observations over a 2-year period (2016 and 2017). Annual values and ranks are calculated over the same years. A training sample representing two-thirds of the data is used to fit the model, while the rest of the data is used to calculate the model's performance: correlation, mean error, standard deviation of the error and RMSE (root mean square error). The estimated coefficients are presented in Table 1. For each term of the polynomial, Table 1 gives the degrees j and k, the estimated coefficient, its standard deviation and the result of the Student test.
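The fitting step can be sketched with ordinary least squares. The sketch assumes that the index set j + k < n is taken literally from Eq. (2), uses synthetic ratios instead of the station data, and splits the sample at random into two-thirds for training and one-third for testing. The source does not specify the estimation routine, so `numpy.linalg.lstsq` is an assumption.

```python
import numpy as np

def terms(n):
    """Exponent pairs (j, k) with j + k < n, as written in Eq. (2)."""
    return [(j, k) for j in range(n) for k in range(n) if j + k < n]

def design_matrix(r, p, n):
    """One column per monomial r**j * p**k."""
    r, p = np.asarray(r, dtype=float), np.asarray(p, dtype=float)
    return np.column_stack([r**j * p**k for j, k in terms(n)])

def fit_beta(r, p, ratio, n):
    """Least-squares estimate of the coefficients, returned as {(j, k): value}."""
    coef, *_ = np.linalg.lstsq(design_matrix(r, p, n), ratio, rcond=None)
    return dict(zip(terms(n), coef))

# Synthetic sample of the same size as in the text (not the real data).
rng = np.random.default_rng(42)
size = 6250
r = rng.uniform(0.2, 5.0, size)                       # annual ratios
p = rng.choice(np.arange(10, 101, 10), size).astype(float)  # ranks
ratio = 0.5 + 0.9 * r - 0.004 * r * p + rng.normal(0.0, 0.05, size)

order = rng.permutation(size)
n_test = size // 3
test, train = order[:n_test], order[n_test:]

beta = fit_beta(r[train], p[train], ratio[train], n=3)
pred = design_matrix(r[test], p[test], 3) @ np.array([beta[t] for t in terms(3)])
rmse = float(np.sqrt(np.mean((pred - ratio[test]) ** 2)))
print(len(test), round(rmse, 3))
```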

Table 1 Coefficients values estimated with training data for each term of the polynomial Eq. (2)

Fig. 4 shows the estimates of the decile ratios on the test sample. The correlation is 0.93, the mean error -0.001, the standard deviation of the error 0.246, and the RMSE 0.246. We note that the largest errors are obtained for the lowest ranks and that the mean bias is close to zero for each of the value ranges. We now move on to the estimation of the deciles themselves. Equation (2) becomes:

Fig. 4

a) Estimation of the daily ratio on the test sample of 2083 observations b) distribution of errors according to the ranges of value. Error is computed by subtracting the observed value from the estimated value

$${\widehat{q}}_{p}\left({s}_{{i}^{\mathrm{^{\prime}}}}\right) ={q}_{p}\left({s}_{i}\right)\sum_{j+k<n}{\beta }_{j,k} {\left(\frac{y\left({s}_{{i}^{\mathrm{^{\prime}}}}\right)}{y\left({s}_{i}\right)}\right)}^{j} {p}^{k}$$
(3)

The estimate of the pth decile of station \({s}_{{i}{\prime}}\) is obtained by multiplying the value of the polynomial by the pth decile of station \({s}_{i}\). These results are shown in Fig. 5, where the mean bias per range of values is also close to zero (mean error -0.77, correlation 0.95, standard deviation of the error 8.63, RMSE 8.66).
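The four scores used in this article can be computed as sketched below, with the error defined as estimated minus observed. The sketch assumes a population standard deviation (`ddof=0`), in which case the squared RMSE equals the squared mean error plus the squared standard deviation of the error. The last lines check that the figures reported above are consistent with that identity.

```python
import numpy as np

def scores(estimated, observed):
    """Correlation, mean error, standard deviation of error and RMSE."""
    estimated = np.asarray(estimated, dtype=float)
    observed = np.asarray(observed, dtype=float)
    err = estimated - observed
    return {
        "correlation": float(np.corrcoef(estimated, observed)[0, 1]),
        "mean_error": float(err.mean()),
        "sd_error": float(err.std()),  # population standard deviation (assumption)
        "rmse": float(np.sqrt(np.mean(err**2))),
    }

# Toy values, only to exercise the function.
s = scores([2.0, 4.0, 7.0], [1.0, 2.0, 3.0])
print(s)

# Consistency of the reported decile scores: mean error -0.77, SD 8.63.
rmse_from_reported = (0.77**2 + 8.63**2) ** 0.5
print(round(rmse_from_reported, 2))
```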

Fig. 5

a) Estimation of daily NO2 deciles in µg/m3 on the test sample of 2083 observations b) distribution of errors according to value ranges. Error is computed by subtracting the observed value from the estimated value

The degree of the polynomial is set to 3 to limit the CPU time of the AZUR modeling platform. We compared polynomials of degree 3 and degree 4; the results are shown in Table 2. The RMSE is 8.66 with the polynomial of degree 3 and 8.54 with the polynomial of degree 4. Increasing the degree from 3 to 4 does not yield a sufficient gain to justify the increase in CPU time.

Table 2 Scores on the test sample of the two fitted polynomials of degree 3 and degree 4

Using the model

Let \({s}_{0}\) be a point of the grid. We have an estimate of its annual value \(y({s}_{0})\). The daily value at point \({s}_{0}\) is the unknown to be determined. We note \({\widehat{q}}_{{s}_{i}}\left({s}_{0}\right)\) the estimate of this value made from the station \({s}_{i}\). Equation (3) then becomes:

$${\widehat{q}}_{{s}_{i}}\left({s}_{0}\right)={q}_{p}\left({s}_{i}\right)\sum_{j+k<n}{\beta }_{j,k} {\left(\frac{y\left({s}_{0}\right)}{y\left({s}_{i}\right)}\right)}^{j} {p}_{{s}_{i}}^{k}$$
(4)

Thus, for a given grid point \({s}_{0}\), each station \({s}_{i}\) produces an estimate \({\widehat{q}}_{{s}_{i}}\left({s}_{0}\right)\). The rank \(p\) is determined by the concentration measured at station \({s}_{i}\), based on the distribution of its own daily values over the previous year. To use Eq. (4), it is necessary to make an assumption about the ranks of the daily values at grid points. The estimate \({\widehat{q}}_{{s}_{i}}\left({s}_{0}\right)\) at grid point \({s}_{0}\) corresponds to the percentile with the same rank as that of the measurement at station \({s}_{i}\). This implies that the concentration at the station and the unknown concentration at the grid point have the same rank in their own distributions.
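The rank of a daily measurement can be obtained from the station's reference distribution as in this sketch. It follows the definition given earlier (the proportion of daily values below the value considered) and expresses the result on the 0 to 100 scale; the function name and the sample values are hypothetical.

```python
import numpy as np

def empirical_rank(value, reference_daily_values):
    """Percentage of reference daily values strictly below `value`."""
    ref = np.asarray(reference_daily_values, dtype=float)
    return 100.0 * float(np.mean(ref < value))

# Hypothetical reference distribution (daily hourly maxima in µg/m3).
reference = [10.0, 20.0, 30.0, 40.0, 50.0]
p_today = empirical_rank(35.0, reference)
print(p_today)
```

The same function would apply in forecast mode, with the forecast value in place of the measured one.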

This assumption is taken to hold for the grid points close to the measurement stations. For the other points, we assume that their rank depends on the distance from the surrounding stations (see the Discussion section). To take these two cases into account, we propose a global estimate at \({s}_{0}\), denoted \(\widehat{z}\left({s}_{0}\right)\) and given by Eq. (5), calculated as the inverse distance-weighted average of the estimates \({\widehat{q}}_{{s}_{i}}\left({s}_{0}\right)\).

$$\widehat{z}\left({s}_{0}\right)=\sum_{{s}_{i}\in {E}_{{s}_{0}}}{\lambda }_{i}{\widehat{q}}_{{s}_{i}}\left({s}_{0}\right) \quad \text{with} \quad \sum_{{s}_{i}\in {E}_{{s}_{0}}}{\lambda }_{i}=1$$
(5)

With:

  • \({E}_{{s}_{0}}\): all stations in the whole domain,

  • \({\lambda }_{i}\): weights depending on the distance of \({s}_{i}\) to \({s}_{0}\)

The weights \({\lambda }_{i}\) are calculated from the inverse squared distance between each station \({s}_{i}\) and the grid point \({s}_{0}\).
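Equation (5) with inverse squared distance weights can be sketched as follows. Coordinates are assumed to be planar (for example in metres), and the handling of a grid point that coincides with a station (all the weight goes to that station) is an assumption of this sketch, as are the function names and values.

```python
import numpy as np

def inverse_square_weights(station_xy, grid_xy, eps=1e-12):
    """Weights proportional to 1 / distance**2, normalised to sum to 1."""
    diff = np.asarray(station_xy, dtype=float) - np.asarray(grid_xy, dtype=float)
    d2 = np.sum(diff**2, axis=1)
    if np.any(d2 < eps):
        # Grid point located on a station: give that station all the weight.
        w = (d2 < eps).astype(float)
    else:
        w = 1.0 / d2
    return w / w.sum()

def combine(estimates, weights):
    """Weighted average of the per-station estimates, as in Eq. (5)."""
    return float(np.dot(weights, estimates))

# Two hypothetical stations and their estimates at the grid point (µg/m3).
stations = [(0.0, 0.0), (3.0, 0.0)]
estimates = [10.0, 20.0]

w = inverse_square_weights(stations, (1.0, 0.0))
z_hat = combine(estimates, w)
print(w, z_hat)
```

The grid point lies 1 unit from the first station and 2 units from the second, so the weights are 0.8 and 0.2 and the combined estimate is 12.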

Model property

The model behaves as an interpolator passing through the measurement points. To verify this, an estimate is made for each measuring station, at the station itself. With \({s}_{0}={s}_{i}\), the ratio of annual values is equal to 1 and Eq. (4) becomes:

$${\widehat{q}}_{{s}_{i}}\left({s}_{i}\right) ={q}_{p}\left({s}_{i}\right)\sum_{j+k<n}{\beta }_{j,k} {p}^{k}$$
(6)

The model is then applied to all deciles, i.e. \(p\in \left[0, 100\right]\) with a step size of 10. The estimated daily values produce errors of less than ± 1 µg/m3 for deciles above q30. For the lower deciles, the average error is 2 µg/m3 and can be as high as 5 µg/m3 (Fig. 6). The largest errors are reached in the lowest deciles (Fig. 6b) and for daily values estimated between 30 µg/m3 and 60 µg/m3 (Fig. 6a). Error is computed by subtracting the observed value from the estimated value, which means that for the lowest deciles, the model overestimates.
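As a short derivation from Eq. (6), and with the sign convention used in this article (estimated minus observed), the self-estimation error at a station can be written as:

$${\widehat{q}}_{{s}_{i}}\left({s}_{i}\right)-{q}_{p}\left({s}_{i}\right)={q}_{p}\left({s}_{i}\right)\left(\sum_{j+k<n}{\beta }_{j,k} {p}^{k}-1\right)$$

The error at a given rank is therefore the observed decile multiplied by the deviation of the fitted polynomial from 1 at an annual ratio of 1. It vanishes only at ranks where the polynomial equals 1.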

Fig. 6

a) Self-estimation of daily NO2 deciles in µg/m3 at the measurement station points b) distribution of the errors according to the value ranges. Error is computed by subtracting the observed value from the estimated value

Results

In order to assess the performance of the model for NO2, we calculate cross-validation estimates for the 22 measurement stations over the year 2019. To ensure the independence of the test data from the estimated parameters used by the model, the annual values used correspond to the year 2018, and the ranks are calculated from the distribution of daily values from 2016 to 2017 inclusive.

To compare the results obtained with the suggested method, we perform a kriging with external drift using the annual mean (Beauchamp et al. 2017, Gressent et al. 2020, Lichternstern 2013), in a global neighbourhood, on the same set of stations. The daily variogram is fitted automatically with a zero nugget effect. In both methods, the background and traffic stations are used in the same computation.

Table 3 presents the results of the leave-one-out cross-validation by group of stations: the 16 background sites on the one hand, and the 6 sites under the influence of road traffic on the other. For the background sites, the suggested method has an advantage of 4.8%, with an RMSE of 15.0 compared with 15.8 for the kriging method. The correlation coefficient is 0.81 for the AZUR method and 0.79 for the kriging method.

Table 3 NO2 scores of the two methods by leave-one-out cross-validation for all stations in the network over the year 2019 (correlation coefficient, mean error, standard deviation, root-mean-square error, normalized root-mean-square error, observed annual mean)

For the sites under the influence of road traffic, the RMSE is 9.9% better for the kriging method, with 17.4 compared with 19.1 for the AZUR method. The correlation coefficients are close, with 0.78 for the AZUR method and 0.79 for the kriging method. The mean error indicates minor differences from the annual values measured.

Background and traffic stations do not have the same range of values. We have therefore calculated the relative error given by the NRMSE (Table 3). The relative error is lower for the traffic stations with both methods. This is due to the presence of a background station close to each traffic station: all cities with traffic stations also have a background site.
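A normalised error makes groups with different concentration levels comparable. The sketch below assumes that the NRMSE is the RMSE divided by the observed mean, which is suggested by the presence of the observed annual mean in Table 3 but is not stated explicitly in the text; the values are made up.

```python
import numpy as np

def nrmse(estimated, observed):
    """RMSE divided by the mean of the observations (assumed normalisation)."""
    estimated = np.asarray(estimated, dtype=float)
    observed = np.asarray(observed, dtype=float)
    rmse = np.sqrt(np.mean((estimated - observed) ** 2))
    return float(rmse / observed.mean())

# Hypothetical daily values in µg/m3: same absolute error, different levels.
background = nrmse([12.0, 18.0], [10.0, 20.0])   # RMSE 2, mean 15
traffic = nrmse([42.0, 58.0], [40.0, 60.0])      # RMSE 2, mean 50
print(round(background, 3), round(traffic, 3))
```

With the same absolute error, the group with the higher observed mean gets the lower relative error.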

An example of the visual outputs for the whole region and a zoom on an urban area are shown in Figs. 7 and 8. With the AZUR method, 100,000,000 grid cells are computed for the map below over the whole PACA region (25 m resolution). This takes 7 min on 1 CPU (Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz, 8 cores) with 128 GB of RAM.

Fig. 7

Example of daily NO2 concentrations computed with AZUR method on 28 February 2019 for the PACA region

Fig. 8

Example of daily NO2 concentrations computed with AZUR method on 28 February 2019 zoomed in on the city of Marseille

Discussion

Area of representativeness

Using the AZUR model implies that the closer a grid point is to a station, the closer its daily rank is to the daily rank of the station. This is reflected in Eq. (5) by a greater weight of the station in the daily estimate at the grid point. This raises the question of the representativeness of the ranks in the neighbourhood of the stations. To study this representativeness, we carried out a variogram analysis of NO2 ranks (Beauchamp et al. 2010, Beauchamp et al. 2018, Kracht and Gerboles 2019). We study the range of the variogram, that is, the distance beyond which measuring stations no longer have any influence on their surroundings. To do this, we calculate the daily variogram of ranks for background sites only, for each day of 2019 (Table 4). The range of the variogram varies from day to day, with a median over 2019 of 16 km. In 90% of cases, the range is greater than 6 km. Most of the time, the cities lie within the zone of representativeness of their measuring station.
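The empirical variogram of the daily ranks can be sketched as below for a single day. The sketch uses the classical estimator (half the mean squared difference per distance class), planar coordinates and made-up ranks. Estimating the range would additionally require fitting a variogram model to these points, which is not shown.

```python
import numpy as np

def empirical_variogram(xy, values, bin_edges):
    """Half the mean squared difference of `values` for station pairs in each distance bin."""
    xy = np.asarray(xy, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)     # all unordered pairs
    dist = np.hypot(*(xy[i] - xy[j]).T)
    half_sq = 0.5 * (values[i] - values[j]) ** 2
    which = np.digitize(dist, bin_edges) - 1
    gamma = np.full(len(bin_edges) - 1, np.nan)  # NaN where a bin has no pair
    for b in range(len(gamma)):
        mask = which == b
        if mask.any():
            gamma[b] = half_sq[mask].mean()
    return gamma

# Three hypothetical stations on a line (km) with their ranks for one day.
xy = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
ranks = [0.0, 2.0, 6.0]
gamma = empirical_variogram(xy, ranks, bin_edges=[0.0, 1.5, 3.5])
print(gamma)
```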

Table 4 Percentiles of the daily distribution of ranks variogram range in km for NO2 over the year 2019, calculated with background stations only and with background and traffic stations (nugget forced to 0)

Some grid points lie outside the area of representativeness of any station most of the time. In this case, several stations have a significant weight in the inverse distance-weighted average given by Eq. (5). To illustrate the quality of the estimation at such grid points, we present the cross-validation results for the most isolated background site in the region, Manosque, located 45 km from the nearest station (Table 5). Its NRMSE is 0.39, which is 22% worse than the AZUR NRMSE over all background stations.

Table 5 NO2 scores for AZUR method for 2 groups of stations, cities with background station only and cities with background and traffic stations

Influence of traffic sites

Taking traffic sites into account when calculating the variogram of daily ranks reduces the median range to 8 km (Table 4). This is probably due to the sensitivity of this type of site to variations in road emissions, which implies a reduced area of representativeness for traffic sites (Malherbe and Cárdenas 2005, Minet et al. 2018). To show the sensitivity of the method to the inclusion of traffic sites, we studied the scores for cities with a background site only and for cities with both a traffic site and a background site (Table 5). The average NRMSE is 0.30 for cities with a traffic site and 0.33 for cities with background sites only. There is therefore no significant difference between the two sets of cities as regards the NRMSE for background stations. The score even appears slightly better where a traffic station is present.

Conclusion

In a first analysis, we showed that daily concentrations between two measurement sites depend not only on the ratio of their annual value but are also linked by their rank in their respective distribution. A variographic analysis enabled us to show that this rank has a spatial correlation over several kilometers around the measurement sites. It makes this approach relevant for spatial estimation.

We therefore proposed a statistical method for high-resolution mapping of ambient air quality over a large area. It constructs daily maps for NO2 from an annual map and a network of measuring stations. The advantage of this method is that it does not consume much CPU time and uses little input data compared to classic dispersion models. Moreover, this estimation method has the property of being an exact interpolator. It allows the calculated map to remain faithful to the measurements used to construct it.

The proposed method was compared to a kriging method with external drift. For background sites, the results are similar. Even if there are differences at traffic stations in terms of the standard deviation of the error, the correlations are close. The mean error shows minor differences from the annual values measured. The AZUR approach compares well with kriging, which is a widely used and proven approach.

AZUR was adapted by AtmoSud for forecast mapping. In forecast mode, the daily value measured at the stations is replaced by the forecast value. The rank is then calculated from the position of this forecast value in the distribution of measured daily values. AZUR has also been developed for PM10 and PM2.5. In this case, the daily value to be spatialized is the daily mean. The comparison with external drift kriging was also carried out, and the results show that the model is suitable for these pollutants.

The definition of the representative zones for measurement stations is currently being improved. This work is carried out by deploying low-cost sensors in large numbers and thus studying their correlation with the stations already in place (Schneider et al. 2017; Gressent et al. 2020). In addition, a tool for detecting outliers has been put in place in order to modulate the representativeness zones of the stations on a day-to-day basis according to measurements from neighbouring stations.

This method has been transposed to the Corsica and Hauts-de-France regions (https://ressources.atmo-hdf.fr/mod_jour/polluants/NO2.html). It is easily transposable to observatories that do not have operational high-resolution daily mapping tools.