A real-time calibration method for the numerical pollen forecast model COSMO-ART

Technologies for monitoring pollen concentrations in real-time made substantial advances in the past years and become increasingly available. This opens the possibility to calibrate numerical pollen forecast models in real-time and make a significant step forward regarding the quality of pollen forecasts. We present a method to use real-time pollen measurements in numerical pollen forecast models. The main idea is to calibrate model parameterizations and not to assimilate measurements in a nudging sense. This ensures that the positive effect persists throughout the forecast period and does not vanish after a few forecast hours. We propose to adapt in real-time both the model phenology scheme and the overall tuning factor that are present in any numerical pollen forecast model. To test this approach, we used the numerical pollen forecast model COSMO-ART (COnsortium for Small-scale MOdelling-Aerosols and Reactive Trace gases) on a mesh size of 1.1 km covering the greater Alpine domain. Test runs covered two pollen seasons and included Corylus, Alnus, Betula and Poaceae pollen. Comparison with daily measurements from 13 Swiss pollen stations revealed that the model improvements are large, but fine-tuning of the method remains a challenge. We conclude that the presented approach is a first valuable step towards comprehensive real-time calibration of numerical pollen forecast models.


Introduction
Pollen forecasts allow allergy sufferers to minimize exposure and thus symptoms (Tummon et al., 2021).Area-covering pollen forecasts can be provided by numerical pollen forecast models such as COSMO-ART (Vogel et al., 2009) or SILAM (System for Integrated modeLling of Atmospheric coMposition) (Sofiev et al., 2015).However, operational pollen forecast models run freely over the whole pollen season, i.e. the models are not adapted according to the actual pollen measurements.This reduces the quality of the forecasts.
The reason for the lack of data assimilation is the well-established pollen monitoring technology.Since the 1950s Hirst-type pollen monitoring devices (Hirst, 1952;Galán et al., 2014;PD CEN/TS 16868, 2015) represent the standard of pollen measurements.
Vol:. ( 1234567890) A shortcoming of this technology is that the pollen grains are determined and counted manually under the microscope.As a consequence, data become available with a delay of 1 to several days which is too late for pollen assimilation in numerical pollen forecast models.
In the past years, new monitoring technologies allowing real-time measurements of pollen concentrations witnessed breakthroughs (Huffman et al., 2020).These technologies emerged from research fields such as robotics, image recognition, laser and fluorescence.An overview of the available devices can be found in Clot et al. (2020).There is an increasing number of European countries where real-time pollen monitoring networks are being established (Tummon et al., 2022) supported by the EUMETNET AutoPollen Programme (Clot et al., 2020).
These developments open new opportunities for numerical pollen forecast models as pollen data assimilation becomes possible.The aim of this study is to develop and test a method to improve numerical pollen forecasts by using actual pollen concentration data in COSMO-ART.We focus on the research question: Is it possible to substantially improve the numerical pollen dispersion model COSMO-ART by incorporating pollen data in hindcast runs as if they were real-time pollen data?
We restrict the study to COSMO-ART but the method may be implemented in other numerical pollen forecast models as well.Additionally, we neither investigate lead-time dependencies nor ensembles.Instead, we focus on deterministic analysis cycles giving the paper a case study character.

Rationale for the calibration approach
The most obvious way to use real-time pollen data in numerical pollen forecast models would be tuning the initial conditions (i.e. starting concentrations).However, Sofiev (2019) found that benefits of classical data assimilation are limited to the first few hours of the forecasts.Using SILAM runs, he showed that half of the positive effect is gone within ≈ 5 h, while 90% vanishes in less than 24 h.However, lead times larger than 5-24 h are of great interest to many of the user groups.
Consequently, we propose not to adapt the initial conditions but rather to adapt suited parameterizations (denoted "calibration" hereafter) of the pollen model.This way it is possible to maintain the benefit from real-time data to the end of the forecast run no matter how long this will be.One crucial question is which parameterization parameters are suitable for tuning.
All numerical pollen forecast models include a number of parameterizations when explicitely calculating phenology (the state of the season), pollen emission, transport and sedimentation.Naturally, the most effective way to improve model output would be to adapt parameterizations with the largest uncertainties.However, uncertainty attribution is not trivial and uncertain itself.Errors in the phenology model's ability to predict the start of the pollen season will result in significant forecasting errors as concentrations may increase sharply at the beginning of the season.In addition, errors at the start of the season tend to negatively influence the correctness of both the peak and the end of the pollen season.

Calibration of the phenology model
The binary character of season timing (emission vs. no emission) and the impact on end users led us to focus in a first step on model phenology.In COSMO-ART, this module is based on growing degree day models (GDD) that are described in Pauling et al. (2014) and Zink et al. (2013).This module involves a field of temperature thresholds that are variable in space but constant in time.
We suggest to make the field of temperature sum thresholds variable also in time.If the temperature thresholds are exceeded but no pollen has been observed up to that date, a temperature increment corresponding to 1 day is added to the temperature thresholds at pollen stations.Similarly, the temperature sum thresholds can be lowered.Then, the new, stationbased temperature sum thresholds are extrapolated onto the whole domain using inverse distance weighting.

Calibration of the tuning factor
The remaining error of the pollen forecast is hard to attribute.Major sources of uncertainty arise from the plant distribution, the emission parameterization, the quality of the weather parameter forecasts and subscale processes.Hence, it is hard to adapt specific parameterizations in a conceptually sound way.That is why we opted to implement a bias reduction scheme that accounts for most remaining sources of uncertainties.
In all numerical pollen forecast models, there is a tuning factor that aggregates the various influence factors to pollen emission flux.In COSMO-ART, this is a single species-specific number that was determined based on past pollen seasons at pollen stations and then averaged to minimize the overall bias.This number was held constant and applied to the whole model domain.It did neither account for mast years nor for regional differences.Adapting this number is supported by Sofiev (2019) who found that correction of the total seasonal pollen emission field has a positive impact.However, as whole-season data are not available before or during the pollen season, we use the tuning factor instead that has a similar role as the total seasonal pollen emission.
We suggest to replace this number by a variable field that is updated whenever new real-time pollen data become available (e.g.every hour, as is the case in our hindcast runs).Initial tests have shown that this number, called debiasing factor F hereafter, calculated by where yields promising results.F values are calculated for each station and are subsequently extrapolated onto the whole domain using inverse distance weighting.Main considerations include that the change must not take effect too quickly to avoid over-reaction (i.e.due to a very strong, but temporally limited local source).The twenty-fourth root was chosen so that F takes full effect after 24 h (if the routine is called hourly).In addition, species-specific upper and lower limits of F are implemented to ensure that F stays in a predefined range.The results are sensitive to these implementation details and are discussed in Sect.3.

Data
The analyses are based on three data sets in measurement space.
(1) F = (obs∕mod) 1∕24 obs =sum of hourly observed concentrations of the last 5 days mod =sum of hourly modelled concentrations of the last 5 days The analysis is carried out for each pollen species individually, but for all stations and both years combined.
The temporal resolution for this analysis is daily (i.e.hourly concentrations are converted into daily averages).Sensitivity studies with 12 hourly averages have been carried out, but are not shown as the results were similar.
Measurements of low pollen concentrations have a high measurement uncertainty (Adamov et al., 2021).Therefore, comparison of modelled with measured data for concentrations <= 10 Pollen/m 3 should be considered with a grain of salt.To calculate the log and relative error metrics, 0.001 was added to the modelled and measured values for numerical reasons.
Missing data occur in the measurements (not in model data).Daily averages from hourly values were calculated without missing values if at least 50% of hourly values are present.Otherwise, that specific day was excluded from the analysis for all three data sets.
The impact of missing data was evaluated and can be considered minimal.

Statistical analyses
The analysis and the preprocessed data can be freely retrieved on github in reproducible form: https:// github.com/ sadam ov/ realt ime_ calib ration.git Data preparation and visualization was carried out in R 4.2.1 (R Core Team, 2022) using various packages from the Tidyverse (Wickham et al., 2019).The statistical analysis presented in the next paragraphs is based on the methodologies from Adamov et al. (2021).
After an initial residual analysis, the measurements were converted into logarithmic concentrations for statistical comparison with log-transformed model data (base 10 log).Even the log concentrations did not always fulfil the requirements of standard statistical methods (i.e.assuming normality of independent errors with constant variance and mean zero).These assumptions were tested with an analysis of variance (ANOVA) and quantile-quantile plots of the residuals, Tukey-Anscombe plots of residuals vs. fitted values and index plots.As the data did not fulfil the assumptions mentioned above, robust and/or simple statistical methods were applied.For every metric and figure, it is mentioned whether logtransformed or original concentrations were used.
For the numerical assessment, standard metrics such as bias, standard deviation (SD), mean absolute error (MAE) and root-mean-squared logarithmic error (RMSLE) were calculated.Therewith, the most typical metrics used in the weather community are covered.As the data do not fulfil many of the assumption mentioned above, categorical verification will allow for a more detailed verification.For the assessment of health impact classes, accuracy, Cohen's kappa, mean absolute error of ordered levels and a pairwise comparison were carried out.The pairwise comparison and simultaneous confidence intervals for the estimated effects were calculated  (Konietschke et al., 2015) with the Dunnet method, where the Hirst measurements were chosen as the reference level (see Table 2).The resulting estimator can be interpreted as a proxy for the relative difference in median between modelled and measured concentrations.
Additional categorical metrics are shown in Appendix A.2.These metrics together allow the reader to get a full picture of the improvements and shortcomings of the new calibration module.The combination of health impact-based and concentration-based metrics allows a good understanding of the model performance and the main areas of further improvement.
The validation at station level (i.e.modelled vs. measured values) is carried out using the same stations as were used for the model calibration.Given that our calibration targets parameterizations of the pollen module and not concentrations directly, the effect of data leakage can be considered reasonably low.Still, as soon as more stations are available, a station split for model validation should be performed.

Results and discussion
We present detailed analysis, metrics and graphs for Alnus pollen.For Corylus, Betula and Poaceae, an overview of several statistical metrics is shown in Appendix A.1.Figure 1 shows the raw data as time series for Alnus on a log scale.In January, the level of the calibrated runs is more elevated than the baseline runs and thus closer to the measurements.The main pollen season occurs in February in all three data sets.Around February 10, the calibrated runs seem to be higher than both the baseline runs and the measurements, whereas in March, a number of baseline runs appear to be too high.On a whole, the baseline and the calibration runs are quite similar.
The positive effect of the calibration method is best visible at the beginning (increasing the pollen level) and at the end of the season (decreasing the pollen level).Both the adaptation of the phenology and the debiasing factor have contributed to this result.
The strong peak in the calibrated run (around February 10) is enhanced by the debiasing factor F Fig. 2 Boxplot of the daily differences between modelled and measured pollen concentrations.The plot was zoomed in along the y-axis to better visualize the numerous small concentration differences.There would be some differences as large as 2000 Pollen∕m 3 in both time series that are not shown in this plot (cf.Eq. 1) because of the preceding days with too low model values.A variable tuning factor F increases the range of possible modelled values, making it more likely that the whole range of measurements can be modelled.On the other hand, the forecasts can also be further away from the measurements.
If the airborne pollen concentrations fluctuate on a frequency similar to the memory of the debiasing scheme (5 days in the current implementation), a shortcoming of the proposed debiasing method is revealed.Compared to the measurements, the modelled fluctuations tend to be shifted by half of the wavelength resulting in alternating over-and underestimations.In mid-latitudes, the weather often changes on a timescale of a few days making such fluctuations plausible.
This potential issue has to be weighed against the significant benefits in terms of robustness induced by the 5 days memory.Sofiev et al. (2017) also opted for a 5-day assimilation window when fusing model predictions and measurements into an optimized ensemble.They reported a "robust combination of the models" when this time window was applied.Shortterm, local fluctuations of the measurements may not be representative for a large area.When using the 5 days memory, these local fluctuations average out.This is important because the calibrated model values at stations are subsequently extrapolated onto the model grid.The more stations are available the more robust results can be achieved by the proposed calibration scheme.However, the stations should be well distributed over the area of interest.
We have tested the sensitivity of the calibration scheme to the length of the memory and finally optimized this parameter.In the end, it is a tradeoff between delayed reaction time inducing shifted wavelengths on the one side and robustness on the other side.This study presents the framework that Figure 2 illustrates the overall effect of the calibration method.The median of the calibrated run is closer to the median of the measurements than the median of the baseline model.Also, as indicated by the interquartile range, the variation of the calibrated run is more similar to the measurements than the baseline model.
Figure 3 (Bland-Altman plot) shows the dependence of the model/measurement differences on the model/measurement mean.This allows assessment of the model error for different concentration levels separately.Comparison of the two panels in Fig. 3 reveals that overestimation of the baseline model becomes larger with increasing concentration values as shown by the Loess smoother (blue line).This feature is much less pronounced in the calibrated run.The blue confidence band shows that the uncertainty of the Loess smoother becomes larger with increasing values as well.The conical shape of the point cloud is typical for a comparison of two time series where their difference strongly depends on the magnitude of at least one of the time-series absolute values.Note that the difference (y-axis) can maximally be twice as large as the mean (x-axis) for any given pair of modelled and measured values.Further note that the model/measurement mean values > 300 Pollen∕m 3 were hidden from this plot for better visibility of all health impact classes.
In the calibrated and baseline run, most data points stay within the dashed red lines ( ±1.96 * sd interval in Fig. 3).In the calibrated run, the data The large number of days with zero measured pollen would otherwise make the plot hard to read given the discrete property of Hirst measurements points are also closer to symmetry.These results also illustrate the effect of the calibration scheme.Nevertheless, model/measurement differences can still be large and will be subject to future improvements.
Figure 4 depicts the logarithmic (base 10) measurement/model data as scatter plot.Both models overestimate the concentrations for values < 100 Pollen∕m 3 .This feature is more pronounced in the baseline model data.For both models, the spread around the red line is quite large resulting in rather low correlation coefficients especially for the   4).Still, the correlation based on the calibration run is consistently higher than the ones based on baseline data indicating improved results.The absolute increase in the correlation may appear relatively low (roughly 0.06 for Pearson R).However, as logarithmic data were used, the improvement for Pearson R is still considerable.
It is crucial to know the performance of the models for each health impact class separately.Overall, error metrics are dominated by days with low pollen occurence, simply because they are so numerous.For health applications, higher categories are arguably much more important.Figure 5 shows boxplots for measurements and both models for each impact class.For the impact classes "weak", "medium" and "strong", the calibrated run outperforms the baseline model both in terms of median and variation.On the other hand, the baseline model performs better for the impact class "very strong".Across all categories, the spread of the error is reduced in the calibration run.
Table 1 shows four distinct metrics for testing the model performance.The large reduction of the bias from 30.2 Pollen/m 3 (baseline run) to 10.2 Pollen/m 3 (calibrated run) is mainly the result of the debiasing method (cf.Eq. 1).However, the standard deviation that is independent of the bias was reduced much less.The mean absolute error (MAE), an overall measure that incorporates both systematic (bias) and random errors (standard deviation), was also reduced substantially.
The reduction of the root-mean-square error of the logarithmic data (RMSLE) looks only marginal.However, one should keep in mind that it is based on logarithmic data.Hence, the reduction is also considerable.
The metrics in Table 2 show that the calibrated run outperforms the baseline model also based on categorical data.The impact class is correctly forecast by the calibrated run with an accuracy of 50.1%, the error amounts to only 0.571 impact classes on average (MAE).Both metrics were substantially improved compared to the baseline model (accuracy = 41.8% and MAE = 0.719).The kappa metric confirms these findings (Table 2).
The estimator is a robust method that should be considered the best choice to compare these three time series from a statistical point of view.Multiple testing, non-normality and other factors are accurately treated.The confidence interval of the estimator shows that these results are statistically significant because it does not include 0.5 (Konietschke et al., 2015), i.e. both the baseline and the calibration runs are different from the measurements.The difference between the calibrated model run and the measurements is smaller than the difference between the baseline model run and the measurements.
Additional metrics for evaluating health impact classes as well as additional species can be found in Appendix A.1.Most metrics show similar improvements as for Alnus.However, as shown from Table 7, the results for Corylus and Betula are mixed.Reasons may include that the model of Corylus was recently developed and already its baseline profits from developments that are not yet included in the modelling of the other species (i.e.improved season description curves).In addition, inter-annual variability may substantially influence the results.Early flowering species Table 2 Categorical metrics to investigate the performance of the models with respect to health impact classes The better score in bold font.Accuracy: calculates the percentage of the modelled class matching the measured class.Kappa: similar to accuracy, but without random hits.MAE: mean absolute error when the health impact classes are regarded as an ordered factor with integer values from 0: nothing to 4: very strong.Estimator: robust contrast calculation with the Dunnet method from the R-package nparcomp.The estimator can be interpreted as a proxy for the relative difference in median between model and measurements.If the estimator is > 50% , then the modelled values are larger than the measurements.Lower and upper: the 95% confidence interval of the estimator.If this interval does not include 0.5, the results are statistically meaningful/significant.

Conclusions
We present a method to continuously adapt the parameterization of the numerical pollen forecast model COSMO-ART.This method was tested in hindcast runs using Hirst-type pollen measurements as a proxy for real-time pollen data.A number of metrics consistently show that the proposed method leads to a substantial improvement of COSMO-ART which was the main research question of this study.This applies to analyses based both on metric and ordinal data.
Tuning the phenology and subsequently debias, the pollen emission flux includes a number of parameterisations itself.This study presents only one implementation of this conceptual approach.Fine-tuning of the approach (especially the 5-day memory in debiasing) is likely to further improve the performance of COSMO-ART.The sensitivity of the tuning factor F on the number of measurement stations and the spatial interpolation method should be investigated systematically.
To our knowledge, the implementation of the described method is the first ever online incorporation of pollen measurements into a numerical pollen forecast model.The forecast improvements were achieved by regular corrections of phenological model parameters and the overall tuning factor.As real-time pollen data become increasingly available, further improvements could be achieved in future.
Acknowledgements This paper is a contribution to the EUMETNET (EUropean METeorological NETwork) AutoPollen Programme, which develops a prototype automatic pollen monitoring network in Europe.It covers all aspects of the information chain from measurements to communicating information to the

Declarations Compliance with Ethical Standards
The authors received support from MeteoSwiss for this article, where both authors are employed.Internal review has been carried out by Dr. Regula Gehrig, an editor of Aerobiologia.No funding was received to assist with the preparation of this manuscript.No funding was received for conducting this study.No funds, grants or other support were received.The authors have no relevant financial or non-financial interests to disclose.The authors have no competing interests to declare that are relevant to the content of this article.All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.The authors have no financial or proprietary interests in any material discussed in this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material.If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.To view a copy of this licence, visit http:// creat iveco mmons.org/ licen ses/ by/4.0/.

A.1 Time series at station level for alnus
The first two Figs.(6 and 7) show daily Alnus pollen concentration differences between modelled and measured values (Hirst measurements).
The four Figs.(8)(9)(10)(11) show the temporal evolution of the tuning factor F for different calibration model variants.For three model variants, both years 2020 and 2021 are available, for model variant 3, only 2020 is available.Below are the respective experimental settings for the four model variants.In the current model implementation, four main factors were subject to sensitivity studies.
Four In the operational setup at MeteoSwiss, the calibration is run every hour.In the final variant, the adjustment is set to: Therefore, the adjustment is carried out in hourly intervals, whereas a full adjustment would be reached after 24 h. 4. The data used for the calculation of the change ratio.In the final variant, the past 120 h are used.

Four Model Variants
For each model variant, the respective differences with the final model are denoted.
-Final variant (Fig. 8): This is the variant described in this paper.-Variant 1 (Fig. 9): The tuning factor adjustment is slower, with a 120th root adjustment every hour: Climatological limits for the tuning factor F were not implemented yet.-Variant 2 (Fig. 10): The tuning factor adjustment is faster, with a full adjustment every hour: The change to the tuning factor is calculated based on the past 24 h of measured and modelled values, instead of the past 120 h: A.2 Additional metrics for alnus See Table 3.    A.6 Metrics definitions See Table 13.
The numerical metrics were calculated as follows: -Bias = mean(error) -SD = sd(error) -MAE = mean(abs(error)) -RMSLE = sqrt(mean((log 10 (1 + modelled conc.)−log 10 (1 + measured conc.)) 2 )) Where the error is defined as modelled concentration − measured concentration .For the categorical metrics, the concentrations were transformed to an ordinal scale ranging from 0: nothing to 4: very strong.The error was then calculated between those pseudo-numeric values of modelled and measured impact categories.
The following formulas define the categorical metrics using the most simple binary scenario.In our scenario, there are of course five health impact classes, and the formulas have to be extended accordingly.

Fig. 1
Fig. 1 Time series of daily Alnus pollen concentrations 2020 and 2021.Each line shows the data for one of the 13 stations in one of the observed years of the study

Fig. 3
Fig. 3 Scatterplot of the means of modelled and measured values versus the differences between modelled and measured values (Bland-Altmann plot); left panel: baseline model; right panel: calibrated model; the blue line shows the Loess smother and the red line shows a theoretical perfect agreement between model and measurements.The dashed red lines show the ±1.96 * sd interval of the differences, where the points

Fig. 4
Fig. 4 Scatter plot of measurements versus model data (leftbaseline and right-calibration) based on logarithmic data; the blue line shows the Loess smoother; the red line shows a theoretical perfect correlation of 1.The text boxes show the 95% confidence intervals of the correlation coefficients calculated by different methods (adjusted for multiple comparison).This

Fig. 6
Fig. 6 Time series of daily Alnus pollen concentration differences 2021 at all Swiss stations

Fig. 8
Fig. 8 Time series of daily tuning factor F values at all Swiss stations for Alnus for the final model variant

Fig. 9 Fig. 10
Fig. 9 Time series of daily tuning factor F values at all Swiss stations for Alnus for model variant 1

Fig. 11
Fig. 11 series of daily tuning factor F values at all Swiss stations for Alnus for model variant 3

Table 1
Numerical metrics to compare baseline and calibration runs using Alnus pollen concentrations This table shows the mean values over all stations and two seasons.The model with lower errors in boldfont.

Table 4
Numerical metrics to compare baseline and calibration runs using Betula concentrations This table shows the mean values over all stations and two seasons.The model with lower errors in boldfont.

Table 5
Categorical metrics to investigate the performance of the models with respect to health impact classes for BetulaThe better score in bold font.Accuracy: calculates the percentage of the modelled class matching the measured class.Kappa: similar to accuracy, but without random hits.MAE: mean absolute error when the health impact classes are regarded as an ordered factor with integer values from 0: nothing to 4: very strong.Estimator: robust contrast calculation with the Dunnet method from the R-package nparcomp.The estimator can be interpreted as a proxy for the relative difference in median between model and measurements.If the estimator is > 0.5 , then the modelled values are larger than the measurements.Lower and upper: the 95% confidence interval of the estimator.If this interval does not include 0.5, the results are statistically meaningful/significant

Table 13
Definition of predicted and reference events