1 Introduction

Pollen forecasts allow allergy sufferers to minimize exposure and thus symptoms (Tummon et al., 2021). Area-covering pollen forecasts can be provided by numerical pollen forecast models such as COSMO-ART (Vogel et al., 2009) or SILAM (System for Integrated modeLling of Atmospheric coMposition) (Sofiev et al., 2015). However, operational pollen forecast models run freely over the whole pollen season, i.e. the models are not adjusted to actual pollen measurements, which reduces the quality of the forecasts.

The reason for the lack of data assimilation lies in the established pollen monitoring technology. Since the 1950s, Hirst-type pollen monitoring devices (Hirst, 1952; Galán et al., 2014; PD CEN/TS 16868, 2015) have been the standard for pollen measurements. A shortcoming of this technology is that the pollen grains are identified and counted manually under the microscope. As a consequence, data become available with a delay of one to several days, which is too late for pollen data assimilation in numerical pollen forecast models.

In the past years, new monitoring technologies allowing real-time measurements of pollen concentrations have witnessed breakthroughs (Huffman et al., 2020). These technologies emerged from research fields such as robotics, image recognition and laser- and fluorescence-based detection. An overview of the available devices can be found in Clot et al. (2020). Real-time pollen monitoring networks are being established in a growing number of European countries (Tummon et al., 2022), supported by the EUMETNET AutoPollen Programme (Clot et al., 2020).

These developments open new opportunities for numerical pollen forecast models as pollen data assimilation becomes possible. The aim of this study is to develop and test a method to improve numerical pollen forecasts by using actual pollen concentration data in COSMO-ART. We focus on the research question: Is it possible to substantially improve the numerical pollen dispersion model COSMO-ART by incorporating pollen data in hindcast runs as if they were real-time pollen data?

We restrict the study to COSMO-ART, but the method may be implemented in other numerical pollen forecast models as well. Additionally, we investigate neither lead-time dependencies nor ensembles. Instead, we focus on deterministic analysis cycles, giving the paper a case-study character.

2 Data and methods

2.1 Rationale for the calibration approach

The most obvious way to use real-time pollen data in numerical pollen forecast models would be to tune the initial conditions (i.e. the starting concentrations). However, Sofiev (2019) found that the benefits of classical data assimilation are limited to the first few hours of the forecasts. Using SILAM runs, he showed that half of the positive effect is gone within \(\approx\) 5 h, while 90% vanishes in less than 24 h. Yet lead times beyond 5–24 h are of great interest to many user groups.

Consequently, we propose not to adapt the initial conditions but rather to adapt suitable parameterizations of the pollen model (denoted “calibration” hereafter). This way, the benefit of real-time data is maintained until the end of the forecast run, regardless of its length. One crucial question is which parameters are suitable for tuning.

All numerical pollen forecast models include a number of parameterizations when explicitly calculating phenology (the state of the season), pollen emission, transport and sedimentation. Naturally, the most effective way to improve the model output would be to adapt the parameterizations with the largest uncertainties. However, uncertainty attribution is not trivial and uncertain itself. Errors in the phenology model’s prediction of the start of the pollen season result in significant forecast errors, as concentrations may increase sharply at the beginning of the season. In addition, errors at the start of the season tend to degrade the representation of both the peak and the end of the pollen season.

2.1.1 Calibration of the phenology model

The binary character of season timing (emission vs. no emission) and its impact on end users led us to focus on model phenology as a first step. In COSMO-ART, this module is based on growing degree day (GDD) models that are described in Pauling et al. (2014) and Zink et al. (2013). It involves a field of temperature sum thresholds that are variable in space but constant in time.

We suggest making the field of temperature sum thresholds variable in time as well. If the temperature sum thresholds are exceeded but no pollen has been observed up to that date, a temperature increment corresponding to one day is added to the thresholds at the pollen stations. Similarly, the thresholds can be lowered if pollen is observed before they are reached. The new, station-based temperature sum thresholds are then extrapolated onto the whole domain using inverse distance weighting.
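A minimal sketch in R of this station-level update and of the subsequent spatial spreading is given below. It is not the operational COSMO-ART code: the function and variable names, the interpretation of the "lowering" case and the distance-weighting exponent are illustrative assumptions.

```r
# Hypothetical sketch of the station-level threshold update (not operational code)
# gdd_sum:         accumulated temperature sum at a station
# threshold:       current temperature sum threshold at that station
# pollen_observed: TRUE if pollen has been measured up to the current date
# daily_increment: temperature sum corresponding to one day
update_threshold <- function(threshold, gdd_sum, pollen_observed, daily_increment) {
  if (gdd_sum >= threshold && !pollen_observed) {
    threshold + daily_increment   # delay the modelled season start
  } else if (gdd_sum < threshold && pollen_observed) {
    threshold - daily_increment   # advance the modelled season start
  } else {
    threshold
  }
}

# Inverse distance weighting of the station thresholds onto a single grid point
idw <- function(grid_x, grid_y, st_x, st_y, st_val, power = 2) {
  d <- sqrt((st_x - grid_x)^2 + (st_y - grid_y)^2)
  w <- 1 / pmax(d, 1e-6)^power    # guard against division by zero at station locations
  sum(w * st_val) / sum(w)
}
```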

2.1.2 Calibration of the tuning factor

The remaining error of the pollen forecast is hard to attribute. Major sources of uncertainty are the plant distribution, the emission parameterization, the quality of the forecast weather parameters and sub-scale processes. Hence, it is difficult to adapt specific parameterizations in a conceptually sound way. We therefore opted to implement a bias reduction scheme that accounts for most remaining sources of uncertainty.

In all numerical pollen forecast models, there is a tuning factor that aggregates the various factors influencing the pollen emission flux. In COSMO-ART, this is a single species-specific number that was determined from past pollen seasons at pollen stations and then averaged to minimize the overall bias. This number was held constant and applied to the whole model domain; it accounted neither for mast years nor for regional differences. Adapting this number is supported by Sofiev (2019), who found that correcting the total seasonal pollen emission field has a positive impact. However, as whole-season data are not available before or during the pollen season, we instead adapt the tuning factor, which plays a similar role to the total seasonal pollen emission.

We suggest replacing this number with a variable field that is updated whenever new real-time pollen data become available (e.g. every hour, as is the case in our hindcast runs). Initial tests have shown that this number, called the debiasing factor F hereafter, calculated by

$$\begin{aligned} F = (obs/mod)^{1/24} \end{aligned}$$
(1)

where

$$\begin{aligned} obs &= \text {sum of hourly observed concentrations of the last 5 days} \\ mod &= \text {sum of hourly modelled concentrations of the last 5 days} \end{aligned}$$

yields promising results. F values are calculated for each station and subsequently extrapolated onto the whole domain using inverse distance weighting. A main consideration is that the change must not take effect too quickly, to avoid over-reaction (e.g. to a very strong but temporally limited local source). The twenty-fourth root was chosen so that F takes full effect after 24 h (if the routine is called hourly). In addition, species-specific upper and lower limits ensure that F stays within a predefined range. The results, which are sensitive to these implementation details, are discussed in Sect. 3.
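The following R sketch illustrates the calculation of F at a single station. The epsilon guarding against division by zero and the numerical limits are illustrative assumptions; the operational limits are species-specific.

```r
# Sketch of the debiasing factor of Eq. (1) at one station (illustrative values)
debiasing_factor <- function(obs_hourly, mod_hourly,
                             f_min = 0.5, f_max = 2, eps = 1e-3) {
  obs <- sum(obs_hourly, na.rm = TRUE)        # observed concentrations of the last 5 days
  mod <- sum(mod_hourly, na.rm = TRUE)        # modelled concentrations of the last 5 days
  f   <- ((obs + eps) / (mod + eps))^(1 / 24) # 24th root: full effect after 24 hourly updates
  min(max(f, f_min), f_max)                   # keep F within the predefined range
}
```

The station values of F would then be spread onto the model grid with inverse distance weighting, analogous to the temperature sum thresholds.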

2.2 Data

The analyses are based on three data sets in measurement space.

1. Measurements: hourly pollen concentrations from 13 Swiss pollen stations (Basel, Bern, Buchs, Lausanne, Genève, La-Chaux-de-Fonds, Luzern, Locarno, Lugano, Münsterlingen, Neuchâtel, Visp and Zürich). The measurements were made using Hirst pollen traps because no real-time pollen monitoring device was available yet. Species include Alnus, Betula, Corylus and Poaceae.

2. Baseline model runs: this data set includes modelled concentrations at the pollen stations from COSMO-1E, an operational configuration of the COSMO-ART model at MeteoSwiss. The COSMO-1E configuration is based on a horizontal mesh size of 1.1 km and 80 vertical levels. It covers the greater Alpine domain (approx. 1280 km × 860 km). For Alnus (January–March), Corylus (January–March) and Betula (March–June), the database includes the pollen seasons 2020 and 2021. For Poaceae (March–October), the pollen seasons 2019–2020 are available. The runs were performed as 24-h cycles without the calibration module. Hourly concentrations are retrieved from the model output.

3. Calibrated model runs: same as the baseline model configuration but with the calibration module.

The analysis is carried out for each pollen species individually, but for all stations and both years combined.

The temporal resolution for this analysis is daily (i.e. hourly concentrations are converted into daily averages). Sensitivity studies with 12-hourly averages were carried out but are not shown, as the results were similar.

Measurements of low pollen concentrations have a high measurement uncertainty (Adamov et al., 2021). Therefore, comparisons of modelled with measured data for concentrations \(\le 10~Pollen/m^{3}\) should be treated with caution. To calculate the log and relative error metrics, 0.001 was added to the modelled and measured values for numerical reasons.

Missing data occur in the measurements (not in the model data). Daily averages were calculated from the available hourly values if at least 50% of them were present. Otherwise, that day was excluded from the analysis for all three data sets. The impact of missing data was evaluated and can be considered minimal.
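The aggregation and the completeness rule can be summarized by the following R sketch; the data frame and column names are assumptions, not those of the published analysis scripts.

```r
library(dplyr)

# Hypothetical hourly data frame with columns station, species, date, conc (NA = missing hour)
daily <- hourly %>%
  group_by(station, species, date) %>%
  summarise(
    n_valid    = sum(!is.na(conc)),
    daily_mean = if_else(n_valid >= 12,            # at least 50% of 24 hourly values present
                         mean(conc, na.rm = TRUE),
                         NA_real_),
    .groups = "drop"
  )
# Days with missing daily_mean are excluded from all three data sets.
```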

2.3 Statistical analyses

The analysis and the preprocessed data can be freely retrieved in reproducible form on GitHub: https://github.com/sadamov/realtime_calibration.git

Data preparation and visualization were carried out in R 4.2.1 (R Core Team, 2022) using various packages from the Tidyverse (Wickham et al., 2019). The statistical analysis presented in the next paragraphs is based on the methodologies of Adamov et al. (2021).

After an initial residual analysis, the measurements were converted into logarithmic concentrations (base 10) for statistical comparison with log-transformed model data. Even the log concentrations did not always fulfil the requirements of standard statistical methods (i.e. normality of independent errors with constant variance and mean zero). These assumptions were tested with an analysis of variance (ANOVA), quantile–quantile plots of the residuals, Tukey–Anscombe plots of residuals vs. fitted values and index plots. As the data did not fulfil these assumptions, robust and/or simple statistical methods were applied. For every metric and figure, it is stated whether log-transformed or original concentrations were used.
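A sketch of such diagnostics in base R is shown below; the simple linear model relating log-transformed measurements and model values is only for illustration.

```r
# Residual diagnostics on base-10 log concentrations (obs, mod: daily values)
log_obs <- log10(obs + 0.001)
log_mod <- log10(mod + 0.001)
fit <- lm(log_obs ~ log_mod)
res <- residuals(fit)

qqnorm(res); qqline(res)    # quantile-quantile plot of the residuals
plot(fitted(fit), res)      # Tukey-Anscombe plot: residuals vs. fitted values
plot(res, type = "h")       # index plot of the residuals
```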

For the numerical assessment, standard metrics such as bias, standard deviation (SD), mean absolute error (MAE) and root-mean-squared logarithmic error (RMSLE) were calculated. These cover the metrics most commonly used in the weather community. As the data do not fulfil many of the assumptions mentioned above, categorical verification allows for a more detailed assessment. For the assessment of health impact classes, accuracy, Cohen’s kappa, the mean absolute error of ordered levels and a pairwise comparison were carried out. The pairwise comparison and simultaneous confidence intervals for the estimated effects were calculated using the nparcomp package (Konietschke et al., 2015) with the Dunnett method, where the Hirst measurements were chosen as the reference level (see Table 2). The resulting estimator can be interpreted as a proxy for the relative difference in median between modelled and measured concentrations.
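For reference, the concentration-based metrics can be written as in the following sketch, with `mod` and `obs` holding the daily modelled and measured values and the 0.001 offset introduced in Sect. 2.2:

```r
bias  <- mean(mod - obs)                                          # systematic error
sdev  <- sd(mod - obs)                                            # random error
mae   <- mean(abs(mod - obs))                                     # mean absolute error
rmsle <- sqrt(mean((log10(mod + 0.001) - log10(obs + 0.001))^2))  # RMS error of log data
```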

Additional categorical metrics are shown in Appendix A.2. These metrics together allow the reader to get a full picture of the improvements and shortcomings of the new calibration module. The combination of health impact-based and concentration-based metrics allows a good understanding of the model performance and the main areas of further improvement.

The validation at station level (i.e. modelled vs. measured values) is carried out using the same stations as were used for the model calibration. Given that our calibration targets parameterizations of the pollen module and not concentrations directly, the effect of data leakage can be considered reasonably low. Still, as soon as more stations are available, a station split for model validation should be performed.

3 Results and discussion

We present detailed analyses, metrics and graphs for Alnus pollen. For Corylus, Betula and Poaceae, an overview of several statistical metrics is shown in Appendix A.1. Figure 1 shows the raw data as time series for Alnus on a log scale. In January, the level of the calibrated runs is more elevated than that of the baseline runs and thus closer to the measurements. The main pollen season occurs in February in all three data sets. Around February 10, the calibrated runs seem to be higher than both the baseline runs and the measurements, whereas in March, a number of baseline runs appear to be too high. On the whole, the baseline and the calibrated runs are quite similar.

Fig. 1

Time series of daily Alnus pollen concentrations 2020 and 2021. Each line shows the data for one of the 13 stations in one of the observed years of the study

The positive effect of the calibration method is best visible at the beginning (increasing the pollen level) and at the end of the season (decreasing the pollen level). Both the adaptation of the phenology and the debiasing factor have contributed to this result.

The strong peak in the calibrated run (around February 10) is enhanced by the debiasing factor F (cf. Eq. 1) because the model values on the preceding days were too low. A variable tuning factor F increases the range of possible modelled values, making it more likely that the whole range of measurements can be reproduced. On the other hand, the forecasts can also end up further away from the measurements.

A shortcoming of the proposed debiasing method is revealed if the airborne pollen concentrations fluctuate at a frequency similar to the memory of the debiasing scheme (5 days in the current implementation). Compared to the measurements, the modelled fluctuations then tend to be shifted by half a wavelength, resulting in alternating over- and underestimations. In mid-latitudes, the weather often changes on a timescale of a few days, making such fluctuations plausible.

This potential issue has to be weighed against the significant benefits in terms of robustness provided by the 5-day memory. Sofiev et al. (2017) also opted for a 5-day assimilation window when fusing model predictions and measurements into an optimized ensemble. They reported a "robust combination of the models" when this time window was applied. Short-term, local fluctuations of the measurements may not be representative of a larger area. With the 5-day memory, these local fluctuations average out. This is important because the calibrated model values at the stations are subsequently extrapolated onto the model grid. The more stations are available, the more robust the results achieved by the proposed calibration scheme. However, the stations should be well distributed over the area of interest.

We tested the sensitivity of the calibration scheme to the length of the memory and finally optimized this parameter. In the end, it is a trade-off between a delayed reaction time, which induces the wavelength shifts described above, and robustness. This study presents the framework that worked best out of many sensitivity tests carried out for the same years and model domain (see Appendix A for more details about other experimental setups).

Figure 2 illustrates the overall effect of the calibration method. The median of the calibrated run is closer to the median of the measurements than the median of the baseline model. Also, as indicated by the interquartile range, the variation of the calibrated run is more similar to the measurements than the baseline model.

Fig. 2

Boxplot of the daily differences between modelled and measured pollen concentrations. The plot was zoomed in along the y-axis to better visualize the numerous small concentration differences. Some differences of up to \(2000~Pollen/m^3\) in both time series are not shown in this plot

Figure 3 (Bland–Altman plot) shows the dependence of the model/measurement differences on the model/measurement means. This allows the model error to be assessed separately for different concentration levels. A comparison of the two panels in Fig. 3 reveals that the overestimation of the baseline model grows with increasing concentration values, as shown by the Loess smoother (blue line). This feature is much less pronounced in the calibrated run. The blue confidence band shows that the uncertainty of the Loess smoother also grows with increasing values. The conical shape of the point cloud is typical for a comparison of two time series whose difference strongly depends on the magnitude of at least one of the two series. Note that the difference (y-axis) can be at most twice as large as the mean (x-axis) for any given pair of modelled and measured values. Further note that model/measurement means \(>~300~Pollen/m^3\) were omitted from this plot for better visibility of all health impact classes.
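A Bland–Altman panel of this kind can be reproduced with a few lines of ggplot2; the sketch below uses assumed vector names `mod` and `obs` and the zoom range mentioned above.

```r
library(ggplot2)

ba  <- data.frame(avg = (mod + obs) / 2, diff = mod - obs)
lim <- 1.96 * sd(ba$diff)

ggplot(ba, aes(avg, diff)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess") +                  # Loess smoother with confidence band
  geom_hline(yintercept = 0, colour = "red") +     # perfect model/measurement agreement
  geom_hline(yintercept = c(-lim, lim),
             colour = "red", linetype = "dashed") +
  coord_cartesian(xlim = c(0, 300))                # hide means > 300 Pollen/m^3
```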

Fig. 3

Scatterplot of the means of modelled and measured values versus the differences between modelled and measured values (Bland–Altman plot); left panel: baseline model; right panel: calibrated model; the blue line shows the Loess smoother and the red line a theoretical perfect agreement between model and measurements. The dashed red lines show the \(\pm 1.96 * sd\) interval of the differences, within which the points are expected to lie for good agreement between modelled and measured values. The closer the point cloud lies to the solid red line, the better the agreement between model and measurements. This plot was zoomed in along the x-axis to display the range in which most daily values lie. For both modelled time series, some concentration means would reach up to \(2000~Pollen/m^3\)

In both the calibrated and the baseline run, most data points stay within the dashed red lines (\(\pm 1.96 * sd\) interval in Fig. 3). In the calibrated run, the data points are also distributed more symmetrically. These results further illustrate the effect of the calibration scheme. Nevertheless, model/measurement differences can still be large and will be subject to future improvements.

Figure 4 depicts the logarithmic (base 10) measurement/model data as a scatter plot. Both models overestimate the concentrations for values \(<~100~Pollen/m^3\). This feature is more pronounced in the baseline model data. For both models, the spread around the red line is quite large, resulting in rather low correlation coefficients, especially for the baseline model. For values \(>~500~Pollen/m^3\), where the uncertainty of the Loess smoother is large, the baseline model underestimates the concentrations less than the calibrated run.

Fig. 4

Scatter plot of measurements versus model data (left: baseline; right: calibration) based on logarithmic data; the blue line shows the Loess smoother; the red line shows a theoretical perfect correlation of 1. The text boxes show the 95% confidence intervals of the correlation coefficients calculated by different methods (adjusted for multiple comparison). This plot is zoomed in along both axes to focus on log concentrations \(>=~0\) (i.e. concentrations \(>=~1~Pollen/m^3\)). The large number of days with zero measured pollen would otherwise make the plot hard to read, given the discrete nature of Hirst measurements

The noticeable range of the correlation coefficients (Kendall, Spearman and Pearson) indicates considerable uncertainty of the correlation estimate (Fig. 4). Still, the correlation based on the calibrated run is consistently higher than that based on the baseline data, indicating improved results. The absolute increase in the correlation may appear relatively low (roughly 0.06 for Pearson R). However, as logarithmic data were used, the improvement for Pearson R is still considerable.
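As a sketch, the three coefficients can be computed on the log-transformed data as follows; the multiple-comparison-adjusted confidence intervals shown in Fig. 4 are not reproduced here.

```r
sapply(c("pearson", "spearman", "kendall"),
       function(m) cor(log10(obs + 0.001), log10(mod + 0.001), method = m))
```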

It is crucial to know the performance of the models for each health impact class separately. Overall, the error metrics are dominated by days with low pollen occurrence, simply because they are so numerous. For health applications, the higher categories are arguably much more important. Figure 5 shows boxplots for the measurements and both models for each impact class. For the impact classes "weak", "medium" and "strong", the calibrated run outperforms the baseline model both in terms of median and variation. On the other hand, the baseline model performs better for the impact class "very strong". Across all categories, the spread of the error is reduced in the calibrated run.

Fig. 5

Model/measurement comparisons of Alnus pollen concentrations stratified by the health impact classes of Alnus pollen (weak: \(1-10~Pollen/m^3\), medium: \(11-69~Pollen/m^3\), strong: \(70-250~Pollen/m^3\) and very strong: \(>~250~Pollen/m^3\)). Measurements were used for class assignment. The number of measurements in each class is annotated in the top right corner. Note that the x-axes were adjusted to the extent of the health impact class. Some outliers are not shown for better visibility of the boxes
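The class assignment and the two simplest categorical scores of Table 2 could be computed as in the sketch below; the "nothing" class for zero concentrations and the treatment of non-integer daily means at the class boundaries are assumptions.

```r
# Health impact classification based on the thresholds in the Fig. 5 caption
impact_class <- function(conc) {
  cut(conc,
      breaks = c(-Inf, 0, 10, 69, 250, Inf),
      labels = c("nothing", "weak", "medium", "strong", "very strong"),
      ordered_result = TRUE)
}

obs_cls <- impact_class(obs)
mod_cls <- impact_class(mod)
accuracy  <- mean(obs_cls == mod_cls)                              # fraction of correctly forecast classes
mae_class <- mean(abs(as.integer(obs_cls) - as.integer(mod_cls)))  # MAE of ordered levels
```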

Table 1 shows four distinct metrics of model performance. The large reduction of the bias from 30.2 Pollen/\(m^{3}\) (baseline run) to 10.2 Pollen/\(m^{3}\) (calibrated run) is mainly the result of the debiasing method (cf. Eq. 1). However, the standard deviation, which is independent of the bias, was reduced much less. The mean absolute error (MAE), an overall measure that incorporates both systematic (bias) and random errors (standard deviation), was also reduced substantially. The reduction of the root-mean-squared logarithmic error (RMSLE) looks only marginal. However, one should keep in mind that it is based on logarithmic data; hence, this reduction is also considerable.

Table 1 Numerical metrics to compare baseline and calibration runs using Alnus pollen concentrations

The metrics in Table 2 show that the calibrated run outperforms the baseline model also on categorical data. The impact class is correctly forecast by the calibrated run with an accuracy of 50.1%, and the error amounts to only 0.571 impact classes on average (MAE). Both metrics were substantially improved compared to the baseline model (accuracy = 41.8% and MAE = 0.719). The kappa metric confirms these findings (Table 2).

Table 2 Categorical metrics to investigate the performance of the models with respect to health impact classes

The estimator is robust and, from a statistical point of view, the best choice for comparing these three time series. Multiple testing, non-normality and other factors are properly accounted for. The confidence interval of the estimator shows that these results are statistically significant, because it does not include 0.5 (Konietschke et al., 2015), i.e. both the baseline and the calibrated runs differ from the measurements. The difference between the calibrated model run and the measurements is smaller than the difference between the baseline model run and the measurements.

Additional metrics for evaluating health impact classes as well as additional species can be found in Appendix A.1. Most metrics show improvements similar to those for Alnus. However, as shown in Table 7, the results for Corylus and Betula are mixed. One reason may be that the Corylus model was developed recently, so its baseline already profits from developments not yet included in the modelling of the other species (i.e. improved season description curves). In addition, inter-annual variability may substantially influence the results. Early flowering species such as Corylus or Alnus experience more inter-annual variability of the season start date than later flowering species such as Betula or Poaceae. For Betula, the largest errors come from the frequently occurring days falling into the "nothing" or "weak" category. The end of the season was extended too much by the calibration module in both years. As a result, the calibrated model produced pollen concentrations that were too high throughout June. This highlights the need for further research and fine-tuning of the calibration of both the phenology and the strength of the pollen season.

4 Conclusions

We present a method to continuously adapt the parameterizations of the numerical pollen forecast model COSMO-ART. This method was tested in hindcast runs using Hirst-type pollen measurements as a proxy for real-time pollen data. A number of metrics consistently show that the proposed method leads to a substantial improvement of COSMO-ART, which was the main research question of this study. This applies to analyses based on both metric and ordinal data.

Tuning the phenology and subsequently debiasing the pollen emission flux itself involves a number of parameterizations. This study presents only one implementation of this conceptual approach. Fine-tuning of the approach (especially the 5-day memory in the debiasing) is likely to further improve the performance of COSMO-ART. The sensitivity of the tuning factor F to the number of measurement stations and to the spatial interpolation method should be investigated systematically.

To our knowledge, the implementation of the described method is the first ever online incorporation of pollen measurements into a numerical pollen forecast model. The forecast improvements were achieved by regular corrections of the phenological model parameters and the overall tuning factor. As real-time pollen data become increasingly available, further improvements could be achieved in the future.