1 Introduction

Increasing exposure to solar ultraviolet (UV) radiation raises the incidence of skin cancer and other UV-related human diseases, for example cataracts and photosensitivity disorders [1]. Climate change affects the parameters influencing the ground-level solar UV radiation [2]. However, the relationship between climate change (and ozone depletion/recovery) and solar UV radiation is complex. Its impacts on humans and ecosystems are multifaceted [3, 4].

Precise measurements of ground-level solar UV radiation, together with measurement series spanning several decades, are essential for understanding the correlations between ground-level solar UV radiation and the parameters influenced by climate change, as well as for quantitative statements on trends. They are also important for validating models that calculate ground-level solar UV radiation, e.g., for larger areas of the Earth’s surface or for projections.

Solar UV monitoring networks are operated in many regions of the world, for instance, in Europe [5], the USA [6], Canada [7], India [8], Australia [9], and New Zealand [10]. One of the challenges of solar UV monitoring (networks) is the permanent and continuous data collection over long periods. Data gaps can occur due to incorrect and therefore discarded measurements, maintenance, or longer downtimes of the measurement systems. It is not unusual for data series of erythemal UV radiation measurements to contain data gaps on the order of 30–50 days per year on average [11, 12]. Such data gaps hamper the formation and comparison of monthly or annual means. In the worst case, the gaps can result in incorrect conclusions. In trend analysis, gaps can prevent a trend from being found, cause a trend to be overestimated or underestimated, or even generate a non-existent (artificial) trend. Previous studies analyzing data series of erythemal UV radiation mostly define different conditions for considering or omitting months with gaps [13,14,15,16,17]. However, dealing with data gaps is often a trade-off between a possible bias in the calculated monthly averages and the omission of months. Both can bias the annual mean or further time series analyses. A promising approach to circumvent this trade-off and avoid bias could be the imputation of estimated data into the gaps.

Numerous methods have been published for estimating missing data on ground-level solar UV radiation. They can be roughly divided into two groups: radiative transfer models [18] and empirical or semi-empirical models [19]. Radiative transfer models simulate the scattering and absorption of light on its way through the atmosphere to estimate UV radiation at the Earth’s surface. These models require sufficiently detailed atmospheric input data (e.g., [20,21,22,23,24,25,26]). Empirical models describe a relationship between UV radiation and available meteorological variables. For both groups, the most relevant factors determining UV radiation are solar elevation, total column ozone (TCO), clouds, aerosols, altitude, and albedo [27, 28]. Machine learning techniques may use many more potential predictors (e.g., total column water vapor and geopotential height at daytime) [29]. Some studies investigate the relationship between a single parameter and UV radiation, such as global solar radiation [30], TCO [31], cloud cover conditions [32], and optical air mass [33]. For modeling UV data, several predictors are often used and a two-step procedure is applied: In a first step, UV radiation values are modeled under clear sky conditions. In a second step, the effect of cloudiness on UV radiation is implemented [13, 32, 34,35,36,37].

This study presents and validates an innovative empirical method to estimate missing daily maximum UV-Index values (\(\text {UVI}_{\text {max}}\)) [38] and daily erythemal radiant exposure (\(\text {H}_{\text {er,day}}\)) values (integral over the day) [39]. Local correlations of predictors to available \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\) values are used to estimate these values at all sky conditions in one step. In addition to this model-based imputation, we present an average-based imputation method for days when predictors are missing and a combination of the methods (mixed imputation) for the best practice. Furthermore, it is investigated if the model-based imputation method can also be used as a tool to identify systematic errors at and between calibration steps in long-term UV data series.

2 Materials and methods

2.1 Data material

2.1.1 Data series of daily maximum UV Index and daily erythemal radiant exposure

The imputation methods presented in this paper are applied to patchy long-term measurement-based data series of \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\) values. In the context of the data series used, \(\text {UVI}_{\text {max}}\) values are the daily maxima of a 30-min moving time average of the measured erythemal irradiance multiplied by a constant equal to \(40\,\text {m}^2/\text {W}\) [38]. The \(\text {H}_{\text {er,day}}\) values are given in SED. One SED is equivalent to an erythemal radiant exposure of \(100\,\text {J}/\text {m}^2\) [39]. Even though \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\) values are not directly measured, the abbreviated term ’measured data’ as opposed to ’modeled data’ is used in this study. For the application and validation of the imputation methods, the data series of a UV measuring station (51.5 N, 7.5 E, 130 m a.s.l.) in Dortmund is primarily used. This measuring station has been operated for over 25 years by the German Federal Institute for Occupational Safety and Health (Bundesanstalt für Arbeitsschutz und Arbeitsmedizin, BAuA) and is part of the German Solar UV Monitoring Network under the direction of the German Federal Office for Radiation Protection (Bundesamt für Strahlenschutz, BfS). The station is equipped with a double monochromator system (spectroradiometer) which records the solar UV spectrum every 6 min. The erythemal irradiance at each measurement time point is obtained by weighting the spectra with the erythema function and subsequently integrating over the wavelength. The data have been subjected to a strict quality assessment by BfS, which includes scientific quality control and correction of the raw data based on calibration and measurement system data as well as a Fraunhofer lines comparison for a wavelength shift. If erroneous data could not be reliably corrected, the data were not used to calculate \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\).
The time series covers the period from August 1996 to December 2022. The values for \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\) are missing on 11.6% and 12.5% of the days, respectively, due to either measuring system failure or measurement values not passing quality control. The percentage differences result from the lack of daily integral values on days when the measurement system was offline in the morning or afternoon. However, if sufficient measurement data are available between the downtimes, it is possible to accurately determine the \(\text {UVI}_{\text {max}}\) value as the daily maximum on these days. The gray colored bars in Fig. 1 illustrate days without data in the time series of \(\text {UVI}_{\text {max}}\) values. It can be noticed that there are periods with data gaps, ranging from single days to several weeks.
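The definitions of \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\) given above can be illustrated with a minimal sketch. It assumes that one day of erythemal irradiance values (in W/m²) is available as a gap-free list sampled every 6 min, so that 5 consecutive values form the 30-min window; the function names and the rectangle-rule integration are illustrative assumptions, not the processing chain used by BfS.

```python
from typing import List, Optional

UVI_PER_W_M2 = 40.0   # conversion constant from the text, in m^2/W
SED_J_M2 = 100.0      # 1 SED = 100 J/m^2
STEP_S = 6 * 60       # assumed sampling interval: one spectrum every 6 min
WINDOW = 5            # 5 samples at 6-min spacing cover a 30-min window


def uvi_max(e_er: List[float]) -> Optional[float]:
    """Daily maximum of the 30-min moving average of the erythemal
    irradiance (W/m^2), multiplied by 40 m^2/W."""
    if len(e_er) < WINDOW:
        return None  # not enough samples for a single 30-min window
    means = [
        sum(e_er[i:i + WINDOW]) / WINDOW
        for i in range(len(e_er) - WINDOW + 1)
    ]
    return UVI_PER_W_M2 * max(means)


def h_er_day(e_er: List[float]) -> float:
    """Daily erythemal radiant exposure in SED, using a simple
    rectangle-rule integral over the day (an assumption of this sketch)."""
    return sum(e_er) * STEP_S / SED_J_M2
```

For example, a constant erythemal irradiance of 0.1 W/m² corresponds to a UV Index of 4, and ten such 6-min samples (one hour) accumulate 3.6 SED.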

Fig. 1 Period of the Dortmund \(\text {UVI}_{\text {max}}\) data set with data gaps (gray bars)

To demonstrate that the imputation method presented here can also be applied to UV data series from stations of other UV monitoring networks, the UV data series (\(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\)) of the Royal Meteorological Institute (RMI) of Belgium in Uccle (district of Brussels, 50.8 N, 4.3 E, 100 m a.s.l.) is used for the same period [13]. The Uccle station is equipped with two Brewer instruments (#016 and #178) that are integrated into the European network EUBrewNet [40]. The measurement interval is 30 min. In the Dortmund data set, the \(\text {UVI}_{\text {max}}\) is calculated by averaging 5 erythemal irradiance values, which is possible because of the higher measurement frequency. In contrast, the daily \(\text {UVI}_{\text {max}}\) value of the Uccle data set corresponds to the daily maximum value without averaging. If data from Brewer #016 were available, these data were used; otherwise, the data from Brewer #178 were taken (measuring in parallel from 2002 to 2022). Owing to the two measurement systems, the values for \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\) are missing on just 0.5% and 2.8% of the days, respectively.

2.1.2 Data series of additional variables

In addition to the time series of \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\), values of global radiation and TCO are used as predictors for estimating UV data. For daily TCO values, we use data from the Tropospheric Emission Monitoring Internet Service (TEMIS) [41]. The TEMIS data set is available without gaps for both locations (Dortmund and Uccle) and periods. For the global radiation data at the Dortmund location, we use a data set measured by the German Meteorological Service (Deutscher Wetterdienst, DWD) at a station in Bochum (10 km away from the UV monitoring station). This data set contains the hourly sum of global radiation as well as the hourly sunshine duration in minutes [42]. After a plausibility check, the daily sum of sunshine hours, the daily maximum, and the daily integral of the global radiation were calculated. The global radiation data for Uccle were provided by the RMI and originate from the same station as the UV data. They are available with a temporal resolution of 30 min. Therefore, a 30-min value can be used as the daily maximum value instead of an hourly value as with the global radiation data at the Dortmund location. For Uccle, analogous to Dortmund, the daily integral of the global radiation over time is calculated.

2.2 Methods

For imputation, the data of long-term data series of \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\) themselves are used to estimate the values at the data gaps. We distinguish between three different methods: a model-based imputation, utilizing predictors that are correlated with the local UV values in an empirical model; an average-based imputation based on a statistical approach of averaging available local UV measurement data without predictors; and a mixed imputation combining both methods.

2.2.1 Model-based imputation

The goal of the developed model-based imputation is to take the local solar radiation conditions on the days with missing values into account in the estimation. This is achieved using two selected parameters as predictors correlated to \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\) values: first, the TCO, which directly influences the UV radiation passing through the atmosphere; second, the global radiation, which primarily reflects the cloud conditions. For \(\text {UVI}_{\text {max}}\) estimation, the daily maximum of global radiation is used as the global radiation parameter; for the \(\text {H}_{\text {er,day}}\) estimation, the daily integral of global radiation is used instead. In addition, two time variables are considered in the model: the day in the calendar year, which can be seen as a proxy for the course of the solar elevation during the year, and the month of a particular day as an effect modifier. Other potential parameters, such as sunshine duration or aerosol optical depth (AOD), are not included in the model. These variables are closely related to or partially captured by global radiation.

More precisely, for model-based imputation of \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\) values, an additive model is estimated by the function gam{mgcv} [43] in the software R [44]. In contrast to radiative transfer models, which aim to replicate the physical processes of UV radiation passing through the atmosphere, our statistical regression model estimates the relationships between the variables purely based on the available data at the Earth’s surface. The formulas given for \(\text {UVI}_{\text {max}}\) also apply to \(\text {H}_{\text {er,day}}\). For \(uvi_t\), the \(\text {UVI}_{\text {max}}\) on day t, a normal distribution with a time-independent variance \(\sigma ^2\) is assumed under the condition that the values for all predictors \(x_t = (DayInYear_t, TCO_t, Global_t, Month_t)\) are known:

$$\begin{aligned} uvi_t|x_t&\sim N(\mu _t,\sigma ^2) \end{aligned}$$
(1)
$$\begin{aligned} \mu _t&= f(DayInYear_t) + TCO_t*Month_t + Global_t*Month_t. \end{aligned}$$
(2)

The expected value \(\mu _t\) of this distribution is additively composed of several components. First, the magnitude of the \(\text {UVI}_{\text {max}}\) depends significantly on the variable DayInYear, which specifies the position of the day within the calendar year. For example, March 2 is the 62nd day of the year in a leap year and the 61st day of the year otherwise. The relationship between the \(\text {UVI}_{\text {max}}\) and the DayInYear cannot be assumed to be linear, since the \(\text {UVI}_{\text {max}}\) initially tends to increase over the course of the year, typically peaking during summer days, and then tends to decrease again. Therefore, for the effect of DayInYear on \(\text {UVI}_{\text {max}}\), a function f(DayInYear) is assumed which is estimated as a penalized spline with basis functions \(B_1,\ldots ,B_{20}\) of degree 3, basis coefficients \(\gamma _1,\ldots ,\gamma _{20}\), and a penalty term of order 2 [45]:

$$\begin{aligned} f(DayInYear)=\sum _{j=1}^{20} \gamma _jB_j(DayInYear). \end{aligned}$$
(3)

In addition, the effect of TCO and global radiation on the \(\text {UVI}_{\text {max}}\) is assumed to be linear conditional on all other predictors. Since the correlations of UV data (\(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\)) with TCO and global radiation vary over the year, these variables are included in the model as an interaction with variables indicating the calendar month. In this way, the effects of TCO and global radiation may be different in each month. Therefore, in Eq. (2), the term \(TCO*Month\) is a shorthand notation and means that TCO, all monthly variables, and the interaction of these variables are included linearly in the model (correspondingly for \(Global*Month\))

$$\begin{aligned}&\beta _1 \cdot TCO + \beta _2 \cdot Jan + \ldots + \beta _{12} \cdot Nov \\&\quad + \beta _{13} \cdot TCO \cdot Jan + \ldots + \beta _{23} \cdot TCO \cdot Nov. \end{aligned}$$
(4)

Here, \(\beta _1,\ldots ,\beta _{23}\) symbolize the regression coefficients. Note that the calendar month is effect-coded [46]. That means that actually 11 variables \(Jan,\ldots , Nov\) are used to represent the variable calendar month. For example, if the particular day is in January, then \(Jan=1\) and all other month variables Feb to Nov are 0. A special case is the month December: If it is a day in December, then all variables Jan to Nov are \(-1\).
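The effect coding of the calendar month and the resulting contribution of the term \(TCO*Month\) in Eq. (4) can be made concrete with a small sketch. The function names and the coefficient values in the usage note are hypothetical; the study itself estimates the coefficients with gam{mgcv} in R, which this sketch does not reproduce.

```python
from typing import List


def effect_code(month: int) -> List[int]:
    """Effect-coded month variables (Jan, ..., Nov) for month = 1..12.

    January..November: the matching variable is 1, all others are 0.
    December: all eleven variables are -1.
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    if month == 12:
        return [-1] * 11
    return [1 if m == month else 0 for m in range(1, 12)]


def tco_month_term(tco: float, month: int, beta: List[float]) -> float:
    """Evaluate Eq. (4); beta[0] is beta_1, ..., beta[22] is beta_23."""
    if len(beta) != 23:
        raise ValueError("expected 23 coefficients")
    codes = effect_code(month)
    main = beta[0] * tco                                        # beta_1 * TCO
    months = sum(b * c for b, c in zip(beta[1:12], codes))      # beta_2..beta_12
    inter = sum(b * tco * c for b, c in zip(beta[12:23], codes))  # beta_13..beta_23
    return main + months + inter
```

With this coding, the TCO slope in month m (January to November) is \(\beta _1 + \beta _{12+m}\), whereas the slope in December is \(\beta _1 - \sum _{j=13}^{23}\beta _j\), so that \(\beta _1\) represents the TCO effect averaged over the twelve months.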

The model and its parameters are estimated based on all days for which all variables are available. Finally, this estimation model is used to predict \(\text {UVI}_{\text {max}}\) values based on TCO and global radiation in the case of missing \(\text {UVI}_{\text {max}}\) values

$$\begin{aligned} \widehat{uvi}_t = \hat{\mu }_t. \end{aligned}$$
(5)

Here, \(\hat{\mu }_t\) denotes the estimated value of \(\mu _t\) which serves as the estimated \(\text {UVI}_{\text {max}}\) value \(\widehat{uvi}_t\).

2.2.2 Average-based imputation

The model-based imputation presented above can only be performed for days on which the predictors TCO and global radiation are available. The average-based imputation addresses this problem. It provides an approach for estimating \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\) values on days when the predictors are missing. The idea of average-based imputation is to estimate the missing \(\text {UVI}_{\text {max}}\) values by averaging similar available \(\text {UVI}_{\text {max}}\) values: If the \(\text {UVI}_{\text {max}}\) is missing on a particular day, the available \(\text {UVI}_{\text {max}}\) values of the d days before and after that day are used. In addition, the available \(\text {UVI}_{\text {max}}\) values of the same period in the y years before and after that year are also used to account for the typical local annual variation of the UV data. Thus, a missing \(\text {UVI}_{\text {max}}\) value \(uvi_t\) is estimated as follows:

$$\begin{aligned} \widehat{uvi}_t(y,d) = \frac{1}{\mid U(y,d) \mid }\sum _{u \in U(y,d)}uvi_u. \end{aligned}$$
(6)

Here, \(U(y,d)\) is the set of all available \(\text {UVI}_{\text {max}}\) values selected by the parameters d and y, and \(\mid U(y,d) \mid \) is the number of these values. As an example for \(y=2\) and \(d=1\): If the \(\text {UVI}_{\text {max}}\) is missing on 1st September 2007, the available \(\text {UVI}_{\text {max}}\) values on days 8/31/2005, 9/1/2005, 9/2/2005, \(\ldots \), 8/31/2009, 9/1/2009, and 9/2/2009 are averaged. This imputation strategy is also applicable when values for a longer period are missing instead of a single value. The mean is then calculated based on the \(\text {UVI}_{\text {max}}\) values from other years.
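The construction of the set \(U(y,d)\) and the average in Eq. (6) can be sketched as follows. The sketch assumes that the series is stored as a dictionary mapping calendar dates to values, with missing days either absent or set to None; the function names and the handling of 29 February in non-leap years are illustrative assumptions and are not taken from the study.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def candidate_dates(t: date, y: int, d: int) -> List[date]:
    """Dates within +/- d days of the same calendar day in the years
    t.year - y, ..., t.year + y (the year of t itself included)."""
    out: List[date] = []
    for year in range(t.year - y, t.year + y + 1):
        try:
            centre = t.replace(year=year)
        except ValueError:
            # Assumption: map 29 February to 28 February in non-leap years.
            centre = date(year, 2, 28)
        for offset in range(-d, d + 1):
            out.append(centre + timedelta(days=offset))
    return out


def average_impute(t: date, series: Dict[date, Optional[float]],
                   y: int, d: int) -> Optional[float]:
    """Mean of all available values in U(y, d); None if the set is empty."""
    values = [
        series[u] for u in candidate_dates(t, y, d)
        if u != t and series.get(u) is not None
    ]
    return sum(values) / len(values) if values else None
```

For the example in the text (\(y=2\), \(d=1\), 1st September 2007), `candidate_dates` yields 15 dates from 8/31/2005 to 9/2/2009, of which only those with an available value enter the mean.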

The parameters y and d are optimized separately for each variable (\(\text {UVI}_{\text {max}}\) in Dortmund, \(\text {H}_{\text {er,day}}\) in Dortmund, \(\text {UVI}_{\text {max}}\) in Uccle, \(\text {H}_{\text {er,day}}\) in Uccle). The variable-dependent optimization of both parameters has the advantage that the site-related variability of \(\text {UVI}_{\text {max}}\) or \(\text {H}_{\text {er,day}}\) but also the different variability between \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\) is optimally taken into account. For optimization, various values of \(y=2,3,4,\ldots \) and \(d=0,1,2,\ldots \) are tested on a random training data set Train-Av. Here, \(y\ge 2\) is assumed, and thus, at least 5 years are always included to ensure that sufficient values are available to determine the mean. The values for y and d with the minimal mean square error (MSE, see Sect. 2.3.3) for a validation data set Val-Av are selected. When constructing the data sets Train-Av and Val-Av, the test data set Test-Imp and the validation data set Val-Imp are also taken into account. These are used to compare the imputation methods and are described in Sect. 2.3. Table 1 provides an overview of the sizes of all these data sets.

Table 1 Number of available and missing values and size of the various training, test and validation data sets for selected variables. The sum of the values in Val-Imp and Test-Imp gives the number of available values in the total data set Total from 14.08.1996 to 31.12.2022 (9,636 days). Train-Av and Val-Av together form the test data set Test-Imp

The validation data set Val-Av is constructed as follows: First, 2000 days are selected from all 9636 days (14.08.1996 − 31.12.2022) using a simple random sample. These days form the validation data set Val-Av if the values are not missing and if they are contained in the test data set Test-Imp. This ensures that both validation data sets Val-Imp and Val-Av are disjoint, and thus, the imputation methods and the parameters of the average-based imputation are validated on different data sets. All remaining non-missing values of Test-Imp represent the training data set Train-Av. Thus, Train-Av and Val-Av together form the test data set Test-Imp.
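The construction of Train-Av and Val-Av described above can be sketched in a few lines. The sketch assumes that days are represented by integer indices 0, ..., 9635 and that `test_imp` contains the indices of all non-missing days of Test-Imp; the function name and the fixed random seed are hypothetical.

```python
import random
from typing import Set, Tuple


def split_test_imp(n_days: int, test_imp: Set[int],
                   n_sample: int = 2000, seed: int = 0
                   ) -> Tuple[Set[int], Set[int]]:
    """Return (train_av, val_av) as disjoint subsets of test_imp."""
    rng = random.Random(seed)
    # Simple random sample of days from the whole observation period.
    sampled = set(rng.sample(range(n_days), n_sample))
    # Only sampled days that are non-missing members of Test-Imp are kept.
    val_av = sampled & test_imp
    # All remaining non-missing days of Test-Imp form the training set.
    train_av = test_imp - val_av
    return train_av, val_av
```

Because Val-Av is drawn only from Test-Imp, it cannot overlap with Val-Imp, and the union of the two returned sets is again Test-Imp.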

2.2.3 Mixed imputation

Finally, we propose a third imputation method that is a mixture of the average-based and the model-based imputation: If the \(\text {UVI}_{\text {max}}\) or \(\text {H}_{\text {er,day}}\) value is missing on day t, we use the model-based imputation if the information on TCO and global radiation is available on day t. If no information on TCO or global solar radiation is available on day t, we use the average-based imputation.
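The decision rule of the mixed imputation can be written as a short dispatcher. In this sketch, `model_predict` and `average_estimate` are hypothetical callables standing in for the fitted model of Sect. 2.2.1 and the average of Sect. 2.2.2; missing predictors are assumed to be represented by None.

```python
from typing import Callable, Optional


def mixed_impute(tco: Optional[float],
                 global_rad: Optional[float],
                 model_predict: Callable[[float, float], float],
                 average_estimate: Callable[[], Optional[float]]
                 ) -> Optional[float]:
    """Use the model-based estimate if both predictors are available on
    the day in question, otherwise fall back to the average-based one."""
    if tco is not None and global_rad is not None:
        return model_predict(tco, global_rad)
    return average_estimate()
```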

2.3 Approach for validation of imputation methods

The three imputation methods were validated in two different ways. One approach uses a randomly selected validation data set from the station in Dortmund (Sect. 2.3.1). To show the applicability of the imputation methods to other UV monitoring data sets (with a different measurement system and measurement interval), we additionally performed the validation with a data set from the station in Uccle (Sect. 2.3.2). For each of these data sets, selected measured UV values (\(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\)) were deleted and then imputed using each of the three imputation methods. The validation data set Val-Imp is composed of the days with deleted values. The data from all other days form the test data set Test-Imp and are used by the imputation methods to estimate the deleted values. Thus, the sum of the values in Test-Imp and Val-Imp gives the number of available values in the total data set Total (Table 1).

2.3.1 Random data deletion

The data deletion method for Dortmund aims to delete value ranges of different lengths randomly. Therefore, the UV values (\(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\)) were deleted for randomly determined periods during the entire observation period of the UV data set. The deletion periods have a length of \(x=1,2,\dots ,20\) days. How often a deletion period of length x is chosen is specified by the frequency h(x), which is calculated from

$$\begin{aligned} h(x)=\text {round}(100^{1-(x-1)\cdot 0.05}). \end{aligned}$$

Fig. 2 Frequency of the length of the deletion periods for the Dortmund \(\text {UVI}_{\text {max}}\) data set: intended for deletion (red) and actually deleted \(\text {UVI}_{\text {max}}\) values (black shaded) due to gaps in the data set

Thus, a single UV value has to be deleted 100 times, the UV values of two consecutive days have to be deleted 79 times, and so on. Altogether, 482 periods with a total of 2255 days are intended for deletion. For each of these deletion periods, the last day was randomly selected from the entire observation period and the associated UV value was deleted. Depending on the length of the period, the UV values of the preceding days were also deleted. Since a value to be deleted was sometimes missing anyway, deletion periods can be shorter than originally planned. On the other hand, the drawn last days of deletion periods can be so close to each other that deletion periods merge, resulting in longer deletion periods than originally planned. These two aspects lead to even more variation and are therefore considered positive. The \(\text {H}_{\text {er,day}}\) values are deleted on the same days as the \(\text {UVI}_{\text {max}}\) values. Since the days with originally missing \(\text {UVI}_{\text {max}}\) values are not always the same as those with originally missing \(\text {H}_{\text {er,day}}\) values, the deletion method gives slightly different results for \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\). Finally, a total of 379 periods with 1828 daily \(\text {UVI}_{\text {max}}\) values and 384 periods with 1807 \(\text {H}_{\text {er,day}}\) values were effectively deleted. Figure 2 shows the actual frequency of the length of the deletion period for the \(\text {UVI}_{\text {max}}\) values (black shaded bars) compared to the intended deletion periods (red bars). Figure 3 illustrates the deletion periods (red bars) in the Dortmund \(\text {UVI}_{\text {max}}\) data set along with the gaps in the measurement data (gray bars).
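The frequencies h(x) and the stated totals of 482 periods and 2255 days can be reproduced with a few lines of code. The second function is only a sketch of the random draw: it assumes days are indexed 0, ..., n_days - 1 and uses a hypothetical seed, so it does not reproduce the actual deletion periods of the study.

```python
import random
from typing import Dict, Set


def h(x: int) -> int:
    """Number of deletion periods of length x days."""
    return round(100 ** (1 - (x - 1) * 0.05))


def planned_periods() -> Dict[int, int]:
    """Map each period length x = 1..20 to its frequency h(x)."""
    return {x: h(x) for x in range(1, 21)}


def draw_deletion_days(n_days: int, seed: int = 0) -> Set[int]:
    """Draw the last day of every period at random and mark that day and
    the x - 1 preceding days for deletion. Overlapping periods merge
    automatically because the days are collected in a set."""
    rng = random.Random(seed)
    deleted: Set[int] = set()
    for x, count in planned_periods().items():
        for _ in range(count):
            last = rng.randrange(n_days)
            deleted.update(range(max(0, last - x + 1), last + 1))
    return deleted


periods = planned_periods()
total_periods = sum(periods.values())                # 482
total_days = sum(x * c for x, c in periods.items())  # 2255
```

The number of distinct days returned by `draw_deletion_days` is at most 2255, since periods can overlap or be cut off at the start of the observation period; days that were missing anyway reduce the effective number further, as described in the text.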

Fig. 3 Period of the Dortmund \(\text {UVI}_{\text {max}}\) data set with data gaps (gray bars) and intended periods for deletion (red bars)

2.3.2 Systematic data deletion

With the second data deletion approach, we want to validate with a real case of gaps in a UV data set. For this purpose, we select the dates with missing data in the Dortmund data set (Fig. 1, gray bars) as the dates to be deleted in the Uccle data set. This results in 1104 deleted \(\text {UVI}_{\text {max}}\) values and 1161 deleted \(\text {H}_{\text {er,day}}\) values in the Uccle data set.

2.3.3 Validation criteria

The deleted measured values \(\theta _t\) are imputed by each of the three imputation methods. The imputed values \(\hat{\theta }_t\) are then compared to the measured values by determining prediction errors. For this purpose, the following measures are used:

$$\begin{aligned} \text {Bias/Mean}&= 100 \cdot \frac{\frac{1}{N}\sum _{t=1}^N(\hat{\theta }_t - \theta _t)}{\bar{\theta }} \\ \text {Mean Square Error (MSE)}&= \frac{1}{N}\sum _{t=1}^N(\hat{\theta }_t - \theta _t)^2 \\ \text {Root Mean Square Error (RMSE)/Mean}&= 100 \cdot \frac{\sqrt{\frac{1}{N}\sum _{t=1}^N(\hat{\theta }_t - \theta _t)^2}}{\bar{\theta }} \\ \text {Mean Bias Error (MBE)}&= 100 \cdot \frac{1}{N}\sum _{t=1}^N\frac{\hat{\theta }_t - \theta _t}{\theta _t} \\ \text {Mean Absolute Bias Error (MABE)}&= 100 \cdot \frac{1}{N}\sum _{t=1}^N\frac{\mid \hat{\theta }_t - \theta _t \mid }{\theta _t} \\ \text {Diff0.5}&= 100 \cdot \frac{1}{N}\sum _{t=1}^N I(\mid \hat{\theta }_t - \theta _t\mid \le 0.5). \end{aligned}$$

Here, N denotes the number of values, \(\bar{\theta }\) is the mean of measured values \(\theta _t\), and I(boolean) is the indicator function which is one if the boolean is true and zero otherwise. The additional measure Diff1 corresponds to Diff0.5 but with condition of \(\le 1\).
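The validation criteria translate directly into code. The following sketch assumes two equally long lists of imputed and measured values with strictly positive measured values (required by MBE and MABE, which divide by \(\theta _t\)); the function name, the dictionary keys, and the `tol` parameter are hypothetical.

```python
from math import sqrt
from typing import Dict, List


def prediction_errors(est: List[float], meas: List[float],
                      tol: float = 0.5) -> Dict[str, float]:
    """Prediction errors of imputed values `est` against measured `meas`.

    tol = 0.5 gives Diff0.5, tol = 1 gives Diff1.
    """
    if len(est) != len(meas) or not meas:
        raise ValueError("need two non-empty lists of equal length")
    n = len(meas)
    diffs = [e - m for e, m in zip(est, meas)]
    mean_meas = sum(meas) / n
    mse = sum(d * d for d in diffs) / n
    return {
        "bias_mean": 100 * (sum(diffs) / n) / mean_meas,
        "mse": mse,
        "rmse_mean": 100 * sqrt(mse) / mean_meas,
        "mbe": 100 * sum(d / m for d, m in zip(diffs, meas)) / n,
        "mabe": 100 * sum(abs(d) / m for d, m in zip(diffs, meas)) / n,
        "diff": 100 * sum(1 for d in diffs if abs(d) <= tol) / n,
    }
```

Bias/Mean, RMSE/Mean, MBE, MABE, and the Diff measures are returned in percent, whereas MSE keeps the squared unit of the input values.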

Criteria such as Bias/Mean or MBE examine whether an estimation approach exhibits a systematic bias. Both criteria are not appropriate for assessing the estimation accuracy, as even very imprecise estimators can have a bias of zero if the positive and negative deviations cancel each other out. However, MSE, RMSE/Mean, and MABE, which take absolute or quadratic deviations into account, are suitable for describing the estimation accuracy. The cruder measures Diff0.5 and Diff1 describe the estimation error more clearly.

2.4 Further application: identifying systematic errors in UV data sets

If the validation of the developed model-based imputation method indicates that local data can be estimated with a very low bias, the method might also be suitable for checking a data set to determine whether measurement or calibration errors are present and what type of errors they are. Basically, in a calibration step (CS), deviations/errors of the measuring system are disclosed by comparison with a reference source. If a deviation is determined, it is a challenging task to get enough information for a reliable correction of a systematic error in previous data. There are several possible cases illustrated in Fig. 4. One possibility is that the measuring device was not calibrated correctly at the previous calibration step (CS1) and a systematic deviation occurred (case a). Another possibility is that an event (e.g., unintended change of the measuring device) arose between the current calibration step (CS2) and the previous calibration step (CS1) that led to systematic measurement errors from this point on (case c). It is also conceivable that the measuring device still worked correctly at the beginning, but then a continuously increasing deviation from the true values occurred (cases b+d).

Fig. 4 Different cases of systematic errors in a data set determined by deviation at a calibration step (CS)

Model-based imputation in combination with a structural break analysis [47] could be used to investigate which of these errors is most plausible. This will be illustrated by an application case study: A comparison of the annual courses of the UV data in Dortmund with those in Uccle showed relatively similar courses for all years. Only in 2015 and the first half of 2016, the values for both \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\) in Dortmund were significantly lower than the values in Uccle, which could not be explained by differences in local influencing parameters, such as global radiation and ozone. However, these deviations could be related to the calibrations of the measurement system, which took place on 21.01.2015 (CS1) and on 06.07.2016 (CS2). The calibration data from CS2 confirm that there is a deviation from the reference, but this information is not sufficient for a reliable correction of previous data. It could be that either an error occurred at the measurement system between these two dates (Fig. 4, cases b, c, d) or that the calibration on 21.01.2015 was already faulty (Fig. 4, case a).

To investigate this, the values for \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\) in the period between CS1 and CS2 were deleted and replaced by the values estimated by the model-based imputation. Thus, for the days under consideration, values for \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\) are available that would have been expected according to the global radiation and ozone situation on these days. These values were compared with the actual (possibly erroneous) original values. Together with a structural break analysis, it can now be evaluated whether there is a data error and which type of error is most plausible (Fig. 4).
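The idea of comparing measured and imputed values between two calibration steps can be illustrated with a deliberately simple stand-in for a structural break analysis. The sketch below is not the method of [47]; it assumes a chronologically ordered list of daily differences (measured minus imputed value) and performs a brute-force search for a single shift in the mean. The function name and the `min_seg` parameter are hypothetical.

```python
from typing import List, Tuple


def best_mean_shift(residuals: List[float],
                    min_seg: int = 5) -> Tuple[int, float]:
    """Return (k, sse): the split index k that minimizes the summed squared
    error when residuals[:k] and residuals[k:] are each described by their
    own constant mean."""
    def sse(seg: List[float]) -> float:
        mean = sum(seg) / len(seg)
        return sum((v - mean) ** 2 for v in seg)

    n = len(residuals)
    if n < 2 * min_seg:
        raise ValueError("series too short for the chosen min_seg")
    best_k, best_sse = min_seg, float("inf")
    for k in range(min_seg, n - min_seg + 1):
        total = sse(residuals[:k]) + sse(residuals[k:])
        if total < best_sse:
            best_k, best_sse = k, total
    return best_k, best_sse
```

Under these assumptions, a clear shift located inside the period between CS1 and CS2 would point to an event between the calibrations (case c), a constant offset over the whole period without such a shift would be more compatible with a faulty calibration at CS1 (case a), and a gradually growing difference would suggest a drift (cases b and d). A proper analysis would additionally test whether a detected break is statistically significant.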

3 Results

3.1 Validation of imputation methods

In the Dortmund validation data set, the values of the predictors TCO and global radiation are missing on 5.3% (\(\text {UVI}_{\text {max}}\)) and 5.4% (\(\text {H}_{\text {er,day}}\)) of the selected validation days. On these days, model-based imputation is not possible and the average-based imputation is used for the mixed imputation. Since the number of missing values is low, the results of mixed imputation are similar to the results of model-based imputation. There are no missing values of the predictors in the validation data for Uccle. Thus, there is no need to use the mixed imputation.

For the average-based imputation, the following optimal values were obtained for the parameters y and d, which define the number of years and days considered for averaging (see Sect. 2.2.2): \(y=2, d=6\) (\(\text {UVI}_{\text {max}}\) in Dortmund), \(y=2, d=5\) (\(\text {H}_{\text {er,day}}\) in Dortmund), \(y=3, d=4\) (\(\text {UVI}_{\text {max}}\) in Uccle), and \(y=2, d=2\) (\(\text {H}_{\text {er,day}}\) in Uccle).

Fig. 5 Density estimation of all measured patchy daily \(\text {UVI}_{\text {max}}\) values (8522 values) in the complete data set for Dortmund (gray curve), density estimation of the 1828 measured daily \(\text {UVI}_{\text {max}}\) values which were selected for validation (black curve), density estimation of the 1828 values estimated by the average-based (green), the model-based (red), and the mixed imputation (blue); description of the data sets in Table 1

To show that the selection of the validation data can be considered representative, the density functions of all measured patchy daily \(\text {UVI}_{\text {max}}\) values of the Dortmund data set and the corresponding validation data are compared in Fig. 5. There is hardly any difference between the estimated density function of the original patchy data set (gray curve) and that of the data selected as validation data (black curve). The other curves show the estimated density functions of the data estimated by the model-based (red curve), average-based (green curve), and mixed (blue curve) imputation methods. All three imputation methods show a very good overall approximation to the original data. Only the average-based imputation tends to slightly underestimate high \(\text {UVI}_{\text {max}}\) values. When using the average-based method, estimated values in the range from 5 to 6 occur more frequently than would be expected. This phenomenon is discussed in detail in Sect. 4.

Fig. 6

Histogram of the estimation error for daily \(\text {UVI}_{\text {max}}\) values of the Dortmund data set. Also plotted are the 5% and 95% quantiles in which 90% of the 1731 non-missing values lie

The prediction quality of the imputation methods is illustrated in Fig. 6. The histogram shows the distribution of the estimation error of the average-based and model-based imputation for the \(\text {UVI}_{\text {max}}\) validation data of Dortmund. The distribution for model-based imputation is much more tightly concentrated around zero than that for average-based imputation, i.e., lower estimation errors are more common for model-based imputation than for average-based imputation.

Table 2 Prediction errors for \(\text {UVI}_{\text {max}}\) of the three imputation methods (average-based, model-based, mixed) for the Dortmund validation data (random data deletion) and the Uccle validation data (systematic data deletion)

To quantify the results of the validation, the measures set out in Sect. 2.3.3 are calculated. Table 2 shows the results for \(\text {UVI}_{\text {max}}\) for both Dortmund and Uccle, while Table 3 shows the corresponding results for \(\text {H}_{\text {er,day}}\). For Uccle, only the average-based and the model-based results are presented: since TCO and global radiation are available on all validation days, model-based and mixed imputation results are identical.

Table 3 Prediction errors for \(\text {H}_{\text {er,day}}\) of the three imputation methods (average-based, model-based, mixed) for the Dortmund validation data (random data deletion) and the Uccle validation data (systematic data deletion)
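To make the measures reported in Tables 2 and 3 more tangible, the following Python sketch computes relative error measures for paired measured and estimated daily values. The binding definitions are those of Sect. 2.3.3; the formulas used here (sign convention estimate minus measurement, normalization by the mean of the measured values, inclusive thresholds for Diff0.5 and Diff1) and the function name `validation_measures` are assumptions for illustration.

```python
from math import sqrt


def validation_measures(measured, estimated):
    """Return relative error measures (in %) for paired daily values."""
    n = len(measured)
    # Assumed sign convention: estimate minus measurement.
    errors = [e - m for m, e in zip(measured, estimated)]
    mean_measured = sum(measured) / n
    bias = sum(errors) / n
    rmse = sqrt(sum(err ** 2 for err in errors) / n)
    return {
        "BIAS/MEAN": 100 * bias / mean_measured,
        "RMSE/MEAN": 100 * rmse / mean_measured,
        # Mean absolute error relative to each measured value (needs measured > 0).
        "MABE": 100 * sum(abs(err) / m for err, m in zip(errors, measured)) / n,
        # Share of days with an absolute error of at most 0.5 and 1, respectively.
        "Diff0.5": 100 * sum(abs(err) <= 0.5 for err in errors) / n,
        "Diff1": 100 * sum(abs(err) <= 1.0 for err in errors) / n,
    }


# Toy values, not taken from the data sets of this study.
measured = [2.0, 4.0, 5.0, 8.0]
estimated = [2.5, 3.0, 5.0, 6.0]
measures = validation_measures(measured, estimated)
```

Positive and negative errors cancel in BIAS/MEAN but not in RMSE/MEAN or MABE, which is why a nearly unbiased method can still show noticeable values for the latter two measures.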

3.2 Applications

3.2.1 Filling data gaps

The main objective of the developed imputation methods is to prevent a bias of the monthly and annual means due to data gaps. Using the example of the \(\text {UVI}_{\text {max}}\) data set from Dortmund, it is shown how the imputation methods fill the data gaps and how this affects the formation of monthly and annual means.

Fig. 7

Annual course of \(\text {UVI}_{\text {max}}\) at Dortmund in 1999

Figure 7 illustrates the functioning of the respective imputation methods for the \(\text {UVI}_{\text {max}}\) values from Dortmund in 1999. The black curve shows the available daily \(\text {UVI}_{\text {max}}\) values. The gray areas mark the days on which \(\text {UVI}_{\text {max}}\) values are missing. The data gap between the 15th of April and the 6th of July is clearly visible. The available data result in an April monthly mean of 2.4 and an annual mean for 1999 of 2.2. These values are too low, as values are systematically missing in the second half of April and in the early summer months. The average-based imputation leads to considerably higher means (2.9 for April and 2.7 for the whole of 1999). Model-based and mixed imputation lead to even higher means of 3.1 for April and 2.8 for the whole year, which are 29% and 27% higher, respectively, than the means calculated with patchy data. In September, a 9-day gap occurred during a sunny weather phase. Therefore, the monthly means calculated with average-based imputation (3.2) and model-based imputation (3.5) differ more than in April.
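The percentages quoted for 1999 follow directly from the stated means. The short Python sketch below recomputes them; the function name `relative_increase` is introduced here for illustration only.

```python
def relative_increase(imputed_mean, patchy_mean):
    """Relative change of the imputed mean with respect to the patchy mean, in %."""
    return 100 * (imputed_mean - patchy_mean) / patchy_mean


april = relative_increase(3.1, 2.4)   # model-based vs. patchy April mean
annual = relative_increase(2.8, 2.2)  # model-based vs. patchy annual mean for 1999
print(round(april), round(annual))    # prints "29 27"
```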

Fig. 8

Annual mean of \(\text {UVI}_{\text {max}}\) (left axis) and proportions of missing daily values per year (right axis) at Dortmund from 1997 to 2022

Significant deviations between the means based on the available values and those based on the completed (imputed) data can also be observed for other years. Figure 8 shows the annual mean of the \(\text {UVI}_{\text {max}}\) in Dortmund (left axis) and the proportions of missing daily values per year (right axis) from 1997 to 2022. Besides 1999, larger deviations can be observed in 2004 and 2007. In 2007, the gap is similar to that in 1999 (Fig. 1). In 2004, data from winter are missing, which leads to a large bias in the annual mean for patchy data. Both the average-based imputation (green curve) and the model-based imputation (red curve) make the outliers in the annual means disappear. In practice, the mixed imputation is a good compromise. For the Dortmund data set, model-based imputation is used for 97% of the missing values, and average-based imputation for the remaining ones.
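The logic of the mixed imputation can be sketched in a few lines of Python: measured values are kept, gaps are filled with the model-based estimate where predictor data exist, and with the average-based estimate otherwise. The list-based data layout, the use of `None` for missing entries, and the function name `mixed_imputation` are assumptions for this illustration.

```python
def mixed_imputation(measured, model_estimate, average_estimate):
    """Fill gaps day by day and record which method supplied each value.

    All arguments are lists with one entry per day; None marks a missing value.
    model_estimate is None on days without predictor data (TCO, global radiation).
    """
    filled, method = [], []
    for obs, model, avg in zip(measured, model_estimate, average_estimate):
        if obs is not None:
            filled.append(obs)
            method.append("measured")
        elif model is not None:
            filled.append(model)
            method.append("model")
        else:
            filled.append(avg)
            method.append("average")
    return filled, method


# Toy values: days 2 and 3 are gaps, predictors are available only on day 2.
measured = [3.0, None, None, 2.0]
model_estimate = [None, 3.4, None, None]
average_estimate = [2.8, 2.9, 3.0, 3.1]
filled, method = mixed_imputation(measured, model_estimate, average_estimate)
```

Recording the method per day makes it straightforward to report the share of gaps filled by each method, such as the 97% quoted above for Dortmund.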

Fig. 9

Annual course of \(\text {UVI}_{\text {max}}\) at Dortmund in 2021 with monthly mean values of the original data (black) and estimated data (red)

Figure 9 presents another example of the annual course of \(\text {UVI}_{\text {max}}\) at Dortmund, here for 2021. The focus is not on comparing the imputation methods among themselves and with patchy data (as in Figs. 7 and 8), but rather on illustrating the difference between model-based imputation and the original data in the actual course of the year. For the imputation process, quarterly data were systematically removed. Subsequently, for each day, the \(\text {UVI}_{\text {max}}\) was estimated using the model-based method, based on all remaining data in the measurement series. The year 2021 was chosen as a representative example because 99% of the predictor values were available and the weather during the summer months was very changeable, with sunny and cloudy phases alternating frequently. The model-based imputation was able to replicate this well. Below the annual course, Fig. 9 shows the resulting monthly mean values of the estimated and original data.

3.2.2 Identifying systematic errors in UV data sets

To show a possible further application of the imputation method, namely identifying possible systematic errors in UV data sets, a case study is described in Sect. 2.4. The (possibly erroneous) original values of \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\) in the period between CS1 and CS2 were compared to the values estimated by the model-based imputation.

A structural break analysis [47] revealed no break points for either the \(\text {UVI}_{\text {max}}\) deviations or the \(\text {H}_{\text {er,day}}\) deviations. It can therefore be assumed that cases c and d did not occur within the CS1–CS2 period. Linear regressions of the relative deviations of the original values from the predicted values as a function of time yielded very low adjusted R-squared values for both the \(\text {UVI}_{\text {max}}\) deviations and the \(\text {H}_{\text {er,day}}\) deviations (\(R^2=0.02\) for \(\text {UVI}_{\text {max}}\), \(R^2=0.05\) for \(\text {H}_{\text {er,day}}\)). This indicates that there is no temporal pattern in the deviations. Thus, there is also no evidence that the deviations tend to increase over time (which excludes cases b and d).
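The regression check described above can be illustrated with a minimal Python sketch that fits a straight line to the relative deviations over time and returns the slope together with the adjusted R-squared. It uses ordinary least squares with a single predictor; the function name `trend_check` and the toy numbers are assumptions and do not reproduce the data of the case study.

```python
def trend_check(times, deviations):
    """Fit deviations = a + b * time by least squares.

    Returns the slope b and the adjusted R-squared for one predictor.
    """
    n = len(times)
    t_mean = sum(times) / n
    d_mean = sum(deviations) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (d - d_mean) for t, d in zip(times, deviations))
    slope = sxy / sxx
    intercept = d_mean - slope * t_mean
    ss_res = sum((d - (intercept + slope * t)) ** 2
                 for t, d in zip(times, deviations))
    ss_tot = sum((d - d_mean) ** 2 for d in deviations)
    r_squared = 1 - ss_res / ss_tot
    adjusted = 1 - (1 - r_squared) * (n - 1) / (n - 2)
    return slope, adjusted


times = [0, 1, 2, 3, 4, 5]
# Constant offset of about -8% with scatter but no drift (toy values).
slope_flat, adj_flat = trend_check(times, [-8, -7, -9, -8, -7, -9])
# Deviation growing steadily over time (toy values).
slope_trend, adj_trend = trend_check(times, [0, -2, -4, -6, -8, -10])
```

A value near zero (or even slightly negative) for the adjusted R-squared, as in the first toy series, corresponds to a constant offset without drift, whereas a value near one would point to deviations that change systematically over time.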

However, evidence of a general underestimation in the measurements was found. The comparison showed that the original values were too low by about 8% (BIAS/MEAN = \(-\)7.9% for \(\text {UVI}_{\text {max}}\) and \(-\)8.4% for \(\text {H}_{\text {er,day}}\)). These deviations are too high to be explained by general measurement inaccuracy. If the deviation is in the range of the deviation determined at calibration point 2 (CS2), a calibration error at CS1 can be assumed (case a). In addition, a similar analysis of the previous and subsequent calibration periods can support this conclusion. Supported by the model-based imputation method, a reliable data correction can be made for the entire period between CS1 and CS2.

4 Discussion

Previous studies analyzing long-term \(\text {UVI}_{\text {max}}\) or \(\text {H}_{\text {er,day}}\) data series handled data gaps quite differently: For analyses based on monthly means, months are often excluded if data are available for fewer than 10 days [13, 14, 48] or fewer than 15 days [15, 16]. However, a bias in monthly means can still occur if the values are systematically missing over longer periods. In a recent publication, [17] therefore introduce an additional condition that requires at least 5 days from each third of the month. In particular, this excludes months with longer continuous outages from further analysis. In the study of [48], months with more than 10 values that would fall under such an additional criterion are not excluded but corrected with a gap correction factor derived from a climatological observation. This mainly corrects the effect of the changing maximum daily solar altitude during a month. However, if bad/cloudy weather or sunshine phases occur during the gap periods, a bias remains that such a climatological approach does not account for.
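The exclusion rules summarized above can be written as a simple check. The following Python sketch combines a minimum number of available days with the additional requirement of a minimum number of days in each third of the month. The split into days 1–10, 11–20, and 21 to the end of the month, the default thresholds, and the function name `month_is_usable` are assumptions for illustration; the cited studies may define the details differently.

```python
def month_is_usable(available_days, min_days=10, min_per_third=5):
    """Decide whether a month enters the analysis.

    available_days: iterable of day-of-month numbers (1-31) with valid data.
    """
    days = set(available_days)
    per_third = [
        sum(1 for day in days if day <= 10),
        sum(1 for day in days if 11 <= day <= 20),
        sum(1 for day in days if day >= 21),
    ]
    return len(days) >= min_days and all(c >= min_per_third for c in per_third)


# A month with data only on days 1-14, i.e. a long outage afterwards.
first_half_only = range(1, 15)
passes_simple_rule = month_is_usable(first_half_only, min_per_third=0)
passes_strict_rule = month_is_usable(first_half_only)

# Fifteen days spread evenly over the month.
spread_out = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15, 21, 22, 23, 24, 25]
spread_passes = month_is_usable(spread_out)
```

The first example passes the simple 10-day rule but fails the stricter rule, which shows the trade-off: either a potentially biased monthly mean is accepted, or the month is lost for the analysis.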

The application of the developed imputation methods shows how the described challenges in data gap handling can be solved. The average-based imputation (Fig. 7, green curve) works like a moving average, resulting in a smooth curve. The method is similar to the climatological method of [48]. In contrast to [48], the days and years included for averaging have been optimized in the average-based imputation. The use of daily information on global solar radiation and ozone values in model-based imputation (Fig. 7, red curve) results in a much more realistic curve that reflects daily weather conditions. A comparison of estimated and original \(\text {UVI}_{\text {max}}\) data in 2021 (Fig. 9) shows that the model-based imputation can replicate sunny/cloudy weather phases well. Therefore, bias introduced by these weather phases can be avoided when filling gaps with the model-based imputation method.

The quantitative validation study confirms that all imputation methods estimate the measured values quite well (Tables 2 and 3). The low BIAS/MEAN for all three imputation methods applied to both data sets indicates nearly unbiased estimation. The criteria RMSE/MEAN, MBE, and MABE are significantly lower for the model-based and the mixed imputation than for the average-based imputation. The MABE values indicate that the model-based imputation overestimates or underestimates the \(\text {UVI}_{\text {max}}\) values by about 15 to 16% on average, while the \(\text {H}_{\text {er,day}}\) values are overestimated or underestimated by 11–12% on average. This implies a reduction of the MABE values by a factor of 4–5 (\(\text {H}_{\text {er,day}}\)) and 2–3 (\(\text {UVI}_{\text {max}}\)) compared to the MABE values that occur with an average-based imputation. The difference between MABE \(\text {UVI}_{\text {max}}\) and MABE \(\text {H}_{\text {er,day}}\) is mainly attributed to the predictor global radiation. The different measurement intervals of the global radiation and solar UV measurements have a greater influence on the daily maximum than on the daily integral. According to Diff0.5, for model-based imputation, the absolute difference between measured and estimated \(\text {UVI}_{\text {max}}\) is lower than 0.5 in about 80% of the cases. In nearly every case (Diff1\( =94/96\)%), the absolute difference is at most 1. In Dortmund, the validation shows similar results for the mixed imputation and the model-based imputation, because on 95% of the validation days, all predictor data were available and the model-based imputation was applied. Overall, the comparison of the three proposed methods shows that the model-based imputation performs better than the average-based imputation. The model-based imputation benefits from considering the local solar conditions on the days with missing values. For example, sunny/cloudy weather phases [49, 50] or low/high ozone [51] are considered, which the average-based imputation cannot account for. Another drawback of the average-based imputation is its tendency to slightly underestimate high \(\text {UVI}_{\text {max}}\) values, as averaging pulls the estimates toward the middle of the distribution. Very high values can, therefore, not be reproduced. Model-based and mixed imputation have no difficulty in this regard. However, the average-based imputation has the advantage of not requiring any additional predictor data. In principle, it can therefore always provide a value typical for the season.

Finally, we recommend using model-based imputation if the necessary predictors are available. To get a sense of the predictive quality of this model-based imputation, the estimation results were compared to other approaches from the literature. Specifically, the proposed model-based imputation was compared to two empirical models for solar UV data estimation by [34] and [13]. In a first step, [34] proposed an exponential decay model for ultraviolet erythemal (UVER) irradiance, depending on optical air mass. In the second step, a power relation between the UVER irradiance under clear sky conditions multiplied by global radiation and the actual UVER irradiance is used. [13] used the covariates global radiation, TCO, and AOD in a linear model. According to the validation results (Table 3), the model-based imputation has a lower BIAS/MEAN (\(-\)0.7 to 1.6%) than the model of [34] (below 4%). With an MBE of about 2–4%, the model-based imputation shows a slight overestimation, while the model of [13] has a slight tendency to underestimate (MBE = \(-\)3%). As with [34], the RMSE/MEAN values of the proposed model-based imputation are below 18%. The relative estimation error MABE is in a range of around 11 to 12% for the model-based imputation, indicating better prediction compared to the model of [13] (MABE = 18%).

Another comparison was made with an advanced radiative transfer model estimation of the \(\text {UVI}_{\text {max}}\). The state-of-the-art procedure UV-Index Operating System (UVIOS) developed by [26] combines information on geophysical input parameters from various modeled and satellite-based data sources to provide real-time \(\text {UVI}_{\text {max}}\) estimates. [26] demonstrated that UVIOS can predict \(\text {UVI}_{\text {max}}\) values that deviate by less than 0.5 from the measured value 80% of the time under clear-sky conditions. For the comparison, the Diff0.5 and Diff1 results of UVIOS from different locations in Europe are compared with the validation results of this study in Table 4. It should be noted that the two methods were not evaluated on the same measured values. Nevertheless, a rough comparison is possible. With the model-based imputation, the absolute deviation between measured and estimated \(\text {UVI}_{\text {max}}\) is at most 0.5 on 80% of the days. In 94–96% of cases, it is at most 1. This shows that the model-based imputation offers similarly good prediction quality overall as UVIOS, although it uses much less information. In some cases, model-based imputation is even better than UVIOS. In particular, for Uccle, the prediction error of model-based imputation is lower than that of UVIOS.

Table 4 Prediction errors for \(\text {UVI}_{\text {max}}\) in %: comparison of the proposed model-based imputation for Dortmund and Uccle (first two rows) to UVIOS for several European stations (all other rows) in [26]

Other approaches, such as radiative transfer models or hybrid deep learning methods (see, e.g., [29]), which take more parameters as predictors, have the potential for lower estimation errors. However, their implementation requires significantly more effort, and some of the more general approaches cannot consider local conditions, which are typically crucial. For the presented model-based imputation, only the data of two predictors (TCO, global radiation) must be available. Thus, the imputation methods offer a sufficient approximation for data gap filling with far less effort.

As a further application, the model-based imputation method was successfully tested as a tool for detecting calibration errors and systematic measurement errors. While the method does not replace a thorough and regular measurement data check and calibration, it can be applied to identify possible systematic deviations in long-term series of UV measurement data, to confirm suspected errors quasi-independently of the calibration step, or to validate corrections. To demonstrate the limits but also the potential of the model-based imputation, each year in turn was deleted from a 26-year \(\text {UVI}_{\text {max}}\) measurement data series, imputed, and compared with the measured data (Fig. 10). For 2015 and 2016, the results before the correction of the calibration error described above (Sect. 3.2.2) are additionally shown hatched.

Fig. 10

Comparison of the measured data to the estimated data for \(\text {UVI}_{\text {max}}\) and \(\text {H}_{\text {er,day}}\) in Dortmund. For estimation, first, all measured values in a specific year were deleted. Second, these values were imputed by model-based imputation. Finally, the bias from measured to estimated values is calculated and set in relation to the mean value of the estimated values. This procedure was performed successively for each year. The hatched bars show the errors before the correction of the calibration error

The resulting annual mean variations fall within a range typically achieved in measurement comparisons of spectral UV monitoring systems [52]. The range is sufficient to detect most measurement or calibration errors. For the years that have deviations greater than 4% in Fig. 10 (highlighted in gray), it can be determined that, with one exception, these are years in which the predictors (global radiation or TCO) had unusually high or low values and/or in which relatively large amounts of UV data are missing for comparison with the estimated data. However, a more accurate analysis can be achieved if the exact time periods between calibration steps are deleted, as in the presented case study. Further improvement can be expected when primarily the days on which the sun is shining or clear sky days are considered, since the systematic error is more prominent on these days than on cloudy days. Finally, it should be noted that the results in Fig. 10 are not representative of the gap treatment use case (Sect. 3.2.1), which would always provide significantly higher accuracy based on the available data.
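The year-by-year procedure behind Fig. 10 can be outlined in Python as follows. The sketch deletes one year at a time, obtains estimates for that year from an imputation routine, and relates the mean difference between measured and estimated values to the mean of the estimates. The sign convention (measured minus estimated), the data layout, and both function names are assumptions; `mean_of_other_years` is a deliberately crude, hypothetical stand-in for the model-based imputation and serves only to make the example runnable.

```python
def yearly_relative_bias(series, impute_year):
    """Leave-one-year-out comparison of measured and estimated values.

    series:      dict year -> list of measured daily values (None = missing).
    impute_year: callable(series, year) -> list of estimates for that year,
                 computed without using that year's measurements.
    Returns dict year -> bias in % of the mean estimated value.
    """
    result = {}
    for year, measured in series.items():
        estimated = impute_year(series, year)
        pairs = [(m, e) for m, e in zip(measured, estimated)
                 if m is not None and e is not None]
        if not pairs:
            continue  # nothing to compare in this year
        bias = sum(m - e for m, e in pairs) / len(pairs)
        mean_estimated = sum(e for _, e in pairs) / len(pairs)
        result[year] = 100 * bias / mean_estimated
    return result


def mean_of_other_years(series, year):
    """Hypothetical placeholder imputer: day-wise mean of all other years."""
    others = [values for y, values in series.items() if y != year]
    estimates = []
    for i in range(len(series[year])):
        vals = [v[i] for v in others if v[i] is not None]
        estimates.append(sum(vals) / len(vals) if vals else None)
    return estimates


# Toy series in which the 2015 measurements are 8% too low.
toy = {2015: [1.84, 1.84], 2016: [2.0, 2.0], 2017: [2.0, 2.0]}
bias_by_year = yearly_relative_bias(toy, mean_of_other_years)
```

In the toy series, the year with the artificially lowered measurements stands out with a bias of about \(-\)8%, while the other years show smaller deviations of opposite sign because the faulty year contaminates their estimates.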

The available data also define the limits of the imputation methods. With large data gaps, few suitable predictor values (global radiation and TCO), and only short-term UV data series, the accuracy of data estimation and imputation decreases. However, the examples chosen here demonstrate that the imputation methods are very well suited for filling data gaps in typical long-term UV measurement series with typical proportions of data gaps.

5 Conclusion

Long-term UV data series from ground-based measurements are essential to evaluate whether and how climate change affects parameters influencing the UV exposure of the population. In this context, it is of great importance to avoid a bias that may result from data gaps. To fill gaps in the data series of the daily maximum of UV Index and the daily erythemal radiant exposure, we have developed and successfully validated a simple but effective strategy of data imputation based on available local data and their statistical correlations. The model-based imputation method performs significantly better than the average-based method; nonetheless, the average-based method can be used without additional data available. The combined method is considered the best in practice. The imputation strategy with a simplified approach achieves sufficient performance for the use case of filling data gaps. The model-based imputation method can additionally be used as a tool to identify systematic errors at and between calibration steps in long-term erythemal UV data series.