1 Introduction

High-resolution, regularly gridded temperature maps are essential for the construction of climatologies (Newman et al. 2015; Mohr and Tveito 2008; van den Hurk et al. 2006). Several scientific fields use these maps, examples are: the calculation of evapotranspiration (Enku and Melesse 2014; Lofgren et al. 2011; Hiemstra and Sluiter 2011), ecological relationships with bird and plant species occurrence (Vasseur et al. 2014), and the timing of pollen release (van Vliet et al. 2002). On the larger scale, homogenized gridded time series have been constructed to study temperature changes (van der Schrier et al. 2011). For example, the central Netherlands has warmed approximately 1 °C over the twentieth century (van Oldenborgh and Van Ulden 2003). These authors have used data from the automatic weather stations (AWSs) to estimate a regional representation of the warming trends. Western Europe has been warming much faster than climate models projected (van Oldenborgh et al. 2009). The resolution of climate models is increasing and therefore there is a need for high resolution temperature products for their evaluation.

For the construction of gridded temperature maps, air temperature observations from AWSs alone are often not sufficient to capture local variations, such as temperature differences between cities and the countryside. In the Netherlands, the temporal resolution of temperature measurements is high, but the spatial density is limited. The 34 AWSs are approximately 30km apart (Fig. 1), which allows for a reasonable representation of the spatial variation on country scale. Temperature variations at the scale of cities and forests are not captured. Although the Netherlands has a mild maritime climate, the average daily maximum canopy urban heat island (UHI) amounts to 2–3 °C, and in some cases above 5 °C (Steeneveld et al. 2011; Dirksen et al. 2019). Previous studies have established a physically based semi-empirical relationship to calculate the UHI. Theeuwes et al. (2017) have correlated the UHI to the sky view factor, vegetation fraction, solar irradiance, diurnal temperature range (DTR), and wind speed. Also several previous interpolation studies on country scale have used predictors in addition to AWS temperature measurements. The temperature within the Netherlands has been interpolated using the distance to the sea as variable (Sluiter 2012). The Norwegian meteorological institute used station elevation and average and lowest height of the surroundings, station latitude (which in this case is an approximation of the distance to the sea), and station longitude as independent predictors for the temperature interpolation (Mohr and Tveito 2008). Additionally, the Finnish meteorological institute added the relative land/water cover as a predictor (Aalto et al. 2013). Within Croatia soil temperature interpolation algorithms used latitude, longitude, distance from the sea, elevation, time, solar irradiance, and the MODIS Land Surface Temperature (LST) product (Hengl et al. 2012). Satellite datasets are of additional value as they provide local spatial information between the measurement locations. In addition to the variables used for temperature interpolation in these different countries, Carlson and Boland (1978) stated that surface roughness and moisture availability are essential to determine the correct temperature.

Fig. 1
figure 1

Meteorological observations in the Netherlands used for the interpolation and validation of the methods. In red the 34 KNMI AWSs which were used for the temperature interpolation. The stations are spread almost equally throughout the country, although they slightly favor the western part. Meteorological stations networks used for the validation: in blue the Wunderground stations and in orange the Wageningen University Observatory (WUR) station. For the GGD network, see Fig. 2. A full overview of the metadata can be found in Table 1

Different interpolation techniques have been used to grid the air temperature with several predictors, these include: regression kriging, residual kriging, space-time kriging, and generalized additive models (Sluiter 2012; Mohr and Tveito 2008; Aalto et al. 2013; Hengl et al. 2012). One of the problems with the kriging interpolation techniques is the fitting of the variogram model, which is in most cases fitted for the whole domain (Haas 1990). This would assume the spatial correlation is similar for the whole domain and is in that sense domain dependent. Kuhn and Johnson (2016) introduced predictive modeling with different techniques such as linear regression and multiple adaptive regression splines which can handle more local variations. To prevent over-fitting of the models, it is optional to add feature selection (Kuhn and Johnson 2016). Cross-validation has been commonly used to validate the interpolation technique (Sluiter 2012; Aalto et al. 2013; Hengl et al. 2012; Mohr and Tveito 2008). The temperature will be predicted at the left out locations, which assumes that the training and test points are spatially independent. This assumption is not always justified. Ideally the temperature predictions would not only be validated at the representative AWS locations. Crowd-sourced weather observations networks like Wunderground have a high spatial coverage within the Netherlands, but long-term records are not available. These citizen weather observations are often used in urban heat island studies, since the coverage of traditional AWS is often not sufficient (Theeuwes et al. 2017; Chapman et al. 2017). These networks are not calibrated and therefore the stations need to be selected carefully (Bell et al. 2015; Theeuwes et al. 2017).

Extending these previous ideas of spatial predictors, our goal is to reconstruct daily temperature patterns in the Netherlands using eight different predictors: DTR, population density, elevation, albedo, surface solar irradiance, roughness, precipitation, and normalized difference vegetation index. As interpolation techniques, linear regression and multiple adaptive regression splines are compared. As reference interpolation, we used ordinary kriging, which only uses the AWS temperature observations. To validate the models, we will use both cross-validation and crowd-sourced measurements. The temperature is reconstructed on a 1-km grid for the period 1990–2017.

The structure of the article is as follows. Section 2 describes the temperature observations, predictors, interpolation techniques, and validation methods. The results are described in Section 3 which contains spatial climatology patterns from the different interpolation techniques and a detailed validation. A discussion is presented in Section 4 together with possible applications of the interpolation techniques for local warming trends. Conclusions are drawn in Section 5.

2 Methodology

2.1 Meteorological observations

Daily air-temperature measurements from the 34 AWS locations in the Netherlands are used for the interpolation (Fig. 1). The station spreading slightly favors the west coast which is densely populated. The air temperature is measured at a height of 1.5m. The AWSs are maintained according to WMO standards. The total uncertainty of the temperature measurements, derived from the external errors, measurement uncertainty, and calibration error, is estimated at 0.13 °C (Bijma 2012). Homogeneity adjustments were made for the five longest time records: De Kooy, De Bilt, Eelde, Vlissingen, and Maastricht. Using parallel measurements, the time series were corrected on a daily basis (Brandsma 2016). There have been a few relocations of other AWS measurement sites for which we have not corrected.

In the Netherlands, a huge amount of citizen weather data available is available, although of varying qualities. From the Wunderground network,Footnote 1 19 stations have been selected which serve as an independent validation (Fig. 1). Comparing official and amateur temperature observations, the Davis Vantage instruments show the smallest absolute measurement errors around 0.2 °C and standard deviations around 0.3 °C (Bell 2014). The daytime solar irradiance–induced error was estimated on average between 0.6 and 0.7 °C (Cornes et al. 2019). The Davis Vantage (DV) temperature sensor is a PN junction silicone diode type with a measurement range from -40 up to 65 °C (Davis Instruments 2014). To ensure spatial coverage throughout the country, also one other system was selected, the Oregon Scientific Professional Weather Center (OSPWC) which has a thermo-hygro THGR800 sensor. This sensor measures from -30 up to 60 °C with an accuracy between 1 and 2 °C (Oregon Scientific 2009). The Wunderground network provides no high-quality sensors in Amsterdam (capital of the Netherlands).

However, another network, operated by the local health service (GGD), is available since 2014 in this region (van der Zee and Helmink 2017). Next to each of the 20 air-quality measurement locations, temperature sensors were installed (Fig. 2). The temperature was measured with the BX-592 system (Met One Instruments 2016)Footnote 2; the BX-592 system has the temperature sensor inside a white temperature hut and measures within the range from -30 up to 50 °C. This system has not been validated in the field by Bell (2014) or others. According to the instruction manual, the measurement error of the BX-592 system is 0.2 °C and should be of similar quality as the Davis Vantage instruments.

Fig. 2
figure 2

Meteorological observations used for the validation in the Amsterdam region from the GGD network. A full overview of the metadata can be found in Table 1

Rural measurements from station Haarweg near Wageningen University from 2001 until 2012 were also used for validation purposes (Jacobs et al. 2010). Currently, the new rural station Veenkampen, also from Wageningen University, is sited nearby (Fig. 1). Observations since 2011-06-01 were used for the validation. Compared with the Pt500 resistance which is used for the AWS stations, the Pt100 sensor is a bit more vulnerable due to its thinner platinum wire. The resistance of the Pt500 and Pt100 is respectively 500Ω and 100Ω. A complete metadata description of the meteorological observations used for our analysis is included in Table 1.

Table 1 Overview of the meteorological observations. The Wunderground network was abbreviated as “Wund.gd.”

The citizen weather data was quality controlled to avoid extreme outliers and implausible measurements. Temperatures below -40 °C and above 45 °C are rejected. If a value is repeated 8 times or more in a row, it was also rejected. The citizen weather data differs in temporal resolution for each station and is not always continuous in time. If the time between sequential measurements was more than 1 h, no daily average was calculated, additionally only sequential series longer than 1 week were selected. Next, the filtered temperature measurements were linearly interpolated to regular 10-min timestamps from which daily averages were calculated.

2.2 Spatial predictors

The temperature measurements from the AWSs were interpolated with auxiliary datasets. The climatological values for the predictors at the locations of the 34 AWSs are included in Table 2. The following datasets are taken into account: DTR, population density, elevation, albedo, solar irradiance, roughness, precipitation, and normalized difference vegetation index (NDVI) (Fig. 3). All the predictors are regridded on a 1-km grid, which is considered to be the highest possible resolution with this input data. The spatial resampling of the data was performed in R using the raster package using bilinear resampling. Additionally, the predictors were centered and scaled. To prevent over-fitting, for each day, highly correlated predictors (with a correlation coefficient of > 0.75) are excluded.

Fig. 3
figure 3

Centered and scaled auxiliary datasets, in units standard deviations, which were used as additional information for the temperature interpolation. The datasets were centered and scaled using R. First, for each individual layer, the means are subtracted. Second, the layers are scaled by their standard deviation. For the plotting routine, the time varying variables are averaged over time. The eight different datasets are: a diurnal temperature range, b population density, c elevation, d albedo, e solar irradiance at the surface, f roughness, g precipitation, and h normalized difference vegetation index

Table 2 Overview of climatological values of the prediction datasets at the AWS locations. The numbering is according to Fig. 1

2.2.1 Diurnal temperature range

The DTR is an essential variable for local temperature analysis (Theeuwes et al. 2017); here we will use the DTR as a background field. The DTR was calculated as the maximum temperature difference for the AWSs during the corresponding day. The temperature differences were interpolated using ordinary kriging. For the temperature interpolation, we used a monthly mean climatology. The 30-year mean range is shown in Fig. 3a. The pattern has a northwest-southeast gradient with relatively high values inland. Averaged over a longer time period, the DTR correlates strongly to the distance from the sea.

2.2.2 Population density

The population density is highest in the western part of the country, where the capital Amsterdam and main harbor of Rotterdam are situated (Fig. 3b). The relation between temperature and city population were initially established by Oke (1982). On a local scale, the population density in the Dutch urban areas is positively correlated to the temperature (Steeneveld et al. 2011). As a measure of the urbanization and UHI effect, we therefore use the population density.Footnote 3 The population data from central agency for statistics (CBS)Footnote 4 has been documented in van Leeuwen et al. (2017). The dataset includes yearly averages from 2000 until 2014 with an original resolution of 100m by 100m. Before 2000 the population density of 2000 was used, similarly, between 2015 and 2017 the year 2014 was used. Hereby we assume that the population patterns are similar outside of the datasets range.

2.2.3 Elevation

Lapse rates are typically in the order of 0.6–1.0 °C per 100m. Most of the Netherlands is relatively flat, though some hills are present in the east and southeast of the country, with 322m as the highest elevation. We hypothesise that although height differences are small they still are of relevance to the temperature pattern within the Netherlands.

Highly accurate elevation measurements, covering the entire country excluding water bodies, from aircraft measurements have a point density of 6–10 points per m2 (Sitek et al. 2006; Isenburg 2013).[3] The elevation points are relative to the average sea level of the North-sea and have an error of 3cm (van der Zon 2013). Processing this 1.5-TB dataset and resampling it to the temperature prediction resolution of 1km would be computationally expensive; therefore, we have pre-processed the 2.5-m contours derived data (Fig. 3c).

2.2.4 Solar Irradiance

The solar irradiance at the surface relates to the amount of energy which is available for surface heating (Müller et al. 2015) and is therefore essential for the temperature analysis. The predominant pattern of the solar irradiance is the north-south gradient though from the climatology some coastal effects and height influences can be distinguished.

Geostationary satellites provide observations four times an hour within the Netherlands, from which incoming short-wave radiation is derived. The products from SARAH (Müller et al. 2015) and SICCS (Greuell et al. 2013) are both based on radiative transfer modeling and, combined, they provide an almost complete temporal coverage between 1990 and 2017. From 1990 to 2013, SARAH1 was used, and from 2014 onward SICCS was used as a predictor. SARAH1 has a spatial resolution of 0.05 degree; validation with the baseline surface radiation network (BSRN) shows a bias of 1.3W/m2 (Müller et al. 2015). SICCS has a similar resolution of approximately 6km and a bias of 6W/m2 (Greuell et al. 2013). Comparing the patterns of the overlapping time periods of SARAH and SICCS, we found a median Pearson correlation coefficient of 0.94. In case the solar irradiance was not available, the monthly climatology from SARAH 1990–2013 was used as a trend. An example of the SARAH solar irradiance is included in Fig. 3d.

2.2.5 Albedo

Part of the solar irradiance is reflected back towards into the atmosphere. The amount of reflection is strongly related to the surface albedo (Moody et al. 2007). Albedo satellite measurements rely on cloud-free observations. During periods with a snow cover, the albedo is higher than in periods without snow. Occasionally, mainly during winter months, the Netherlands is covered with snow, influencing the temperature patterns both day and night. Therefore, on days with snow cover, the snow albedo from the International Geosphere-Biosphere Program (IGBP) was used. This dataset consists of 16 different surface classes with related snow albedos. The snow albedo differs among surfaces, e.g., under high vegetation the maximum albedo is lower (Moody et al. 2007). The snow-free 5-year climatological albedo from MODIS has a spatial resolution of 1km and a temporal resolution of 16 days (Moody et al. 2008), see Fig. 3e. On days with a snow cover or partial snow cover, the daily albedo (αd) was estimated from the IGBP snow albedo and snow-free albedo as:

$$ \alpha_{d} = C_{sn}*\alpha_{sn+veg}+(1-C_{sn})*\alpha_{clim} $$
(1)

where αsn+veg is the snow albedo depending on the vegetation, αclim is the 16 days climatological albedo of the surface without snow valid for the corresponding day, and Csn is the snow cover fraction which was calculated as:

$$ C_{sn} = min\left( 1,\frac{D_{sn}}{0.1}\right) $$
(2)

where Dsn is the interpolated snow depth in meters (Dutra et al. 2010). The snow depth was interpolated, using a thin plate spline function (from the Rs fields package) and the 325 precipitation stations from the KNMI volunteer network.

2.2.6 Surface roughness

The surface roughness is the main contributor to differences in surface temperature between forested and open land (Burakowski et al. 2018). Above the canopy layer, the forest area experiences additional mixing. Due to the vegetation, mixing of the lower air is limited. The air temperature is not only cooler due to shadowing effect but part of the year evapotranspiration lowers the forests’ temperatures. The forested areas experience less mixing below the canopy and therefore warmer nighttime temperatures and cooler daytime temperatures compared with open land. As a first-order estimate of this complex interaction, we use the surface roughness length.

The roughness length is based on land-use classification, which also forms the largest uncertainty. Lindenberg (2011) found improvements of the wind speed simulations using the roughness length based on the CORINE database compared with the USGS land-use data. CORINE uses 44 different land-use classes and has a resolution of 100m, derived from the European land-use database. For our analysis, we used the CORINE-based surface roughness.Footnote 5 Considering seasonal changes in roughness lengths, a different surface roughness is used during summer and winter (Fig. 3f).

2.2.7 Precipitation

The gridded daily precipitation files are based on 325 precipitation stations.Footnote 6 The manual precipitation network has remained almost the same over the entire period. Most of the measurement locations are in gardens of houses or near farm lands at a height of 0.40m above the ground. The precipitation is measured every 24 h at 8:00 UTC (Brandsma 2014). The measurements were interpolated using ordinary kriging (Sluiter 2012). From the daily data, the monthly mean precipitation was calculated, thereby providing an estimate of soil moisture availability. Precipitation and temperature over land are generally negatively correlated. This relationship however depends on the latitude; for the higher latitudes, the correlation is inversed (Trenberth and Shea 2005). This relationship is neither trivial nor fully understood because other variables such as the origin of the moist may play a key role. Despite this unclear relationship, precipitation is a key variable for the air temperature. The climatological precipitation patterns generally are wetter near the coast and dryer inland (Fig. 3g).

2.2.8 Normalized difference vegetation index

Vegetation absorbs and reflects solar irradiance and influences the heat exchange and therefore it influences the near surface temperature (Deng et al. 2018). Day and nigh-time temperatures are influenced by evapotranspiration and the isolating effect of vegetation. The NDVI is a commonly used vegetation index which we will use as a proxy for these processes. The NDVI mainly distinguishes between cities and vegetated areas. In the sections below, the different datasets are described accordingly.

The NDVI is derived from satellite images using visible red (0.6 − 0.7μm) and near infrared (0.7 − 1.1μm) wavelengths. The monthly NDVI product MOD13A3 (Didan and Huete 2006) from 2000 to 2016 has a resolution of 1km. For this product, the satellite cloud-free images are selected, which are corrected for the atmospheric influences and nadir-adjusted (Huete and Justice 1999). From MOD13A3 a monthly climatology was calculated by averaging 16 years of data (Fig. 3h).

2.3 Interpolation techniques

The interpolation/regression techniques ordinary kriging (ok), linear regression (lm), and multiple adaptive regression splines (MARS) will be compared (Table 3). The data analysis is performed in R, which is a programming language with useful statistical and geospatial techniques. The ok interpolation is implemented in the Rs automap package (Hiemstra 2015) and gstat package (Pebesma and Graeler 2017). The Rs caret package offers a large variety of regression techniques from which lm and MARS are further explored (Kuhn and Johnson 2016). The independent citizen weather observations are compared with the different methods. The RMSE and bias were calculated for all the citizen weather stations.

Table 3 Models used to predict the temperature field

2.3.1 Ordinary kriging

Kriging interpolation calculates spatial correlations between the observed and surrounding values. Simple kriging assumes that the covariance between the locations only depends on the distance between the locations and the mean residual is zero. Additionally, ok also assumes that the trend is a known mean value (m(x)). The estimated value at an unmeasured location (Z(x)) is calculated as:

$$ Z(x)=m(x)+e(x) $$
(3)

where m(x) describes the trend and e(x) is the spatially dependent residual (Hiemstra and Sluiter 2011), also known as the error term. In order to minimize the error, an exponential variogram model is fitted through the residuals. The kriging interpolation uses leave one out cross-validation (LOOCV ) to determine model performance, according to Hiemstra (2015).

2.3.2 Linear regression model

The 1.5m temperature (T1.5m) will be correlated to the auxiliary data using lm. The linear model is fit as follows:

$$ T_{1.5m} = \beta_{0} + \beta_{1} V_{1} + \beta_{2} V_{2} + \beta_{i} V_{i} + \beta_{...} V_{...} + ... $$
(4)

where β0 is the fitted constant and βi is the fitted value to the variable (Vi). A summary of the model settings which were used for this analysis is shown in Table 3. To prevent over-fitting, the linear model uses recursive feature elimination (rfe), which uses a backwards selection. After excluding predictors, the feature importance is recalculated for the remaining predictors. Optimization of the linear model is based on tuning RMSE values (Kuhn and Johnson 2016). RMSE values for all models are calculated as:

$$ RMSE = \sqrt{\frac{1}{n} \sum\limits_{i=1}^{n}(\overset{-}{T}_{i}-T_{i})^{2}} $$
(5)

where n is equal to the number of observations, \(\overset {-}{T}_{i}\) is the observed temperature, and Ti is the predicted temperature. Detailed descriptions and examples are included in Kuhn and Johnson (2016).

2.3.3 Multiple adaptive regression splines

The function BagEarthGCV described by Kuhn and Johnson (2016) is a wrapper function around the MARS from Friedman (1991). This extended linear model enables nonlinear fitting between the multiple predictors. The MARS model uses the build in general cross-validation (GCV) statistics to prune the model with backward feature selection. Mathematical details and a full derivation of the model can be found in Friedman (1991).

3 Results

Before running lm and MARS, predictors with a 75% correlation or higher were excluded from the analysis. In 93.6% of the cases, no predictors were excluded beforehand. For 3.5% of the cases, the DTR range was excluded, followed by the irradiance (2.0%), elevation (0.4%), and precipitation (0.4%). Only for a few cases multiple predictors were excluded, these include the DTR and precipitation (0.01%) and the precipitation and irradiance (0.01%). So, in most of the cases, all the predictors are considered in the analysis.

Looking into the overall variable importance of the predictors for the lm and MARS models shows different results. For the lm predictions, the overall variable importance equals: albedo (75.3%), DTR (6.7%), elevation (4.1%), population density (3.5%), irradiance (3.6%), NDVI (2.9%), precipitation (2.1%), and roughness (1.8%). The overall variable importance for MARS equals: albedo (8.2%), DTR (19.0%), elevation (12.1%), population density (12.3%), irradiance (17.3%), NDVI (12.2%), precipitation (11.3%), and roughness (7.5%).

3.1 Spatial patterns

The daily 10%, 50%, and 90% temperature quantiles represent the lowest temperatures which occur in winter time, the yearly averaged temperature, and the warmest temperatures which occur in summer time, respectively. The daily temperature patterns for all models are similar on the relatively larger spatial scale (Fig. 4). In the southwestern part, the average temperatures amount to 10.5–11.0 °C. In the northeast, temperatures are colder and amount to 9.5–10.0 °C. The 10% temperature quantile, or the lowest temperatures, has a strong east-west pattern. The median temperatures pattern has a southwest-northeast gradient. The 90% temperature quantile, or the warmest temperatures, has a northwest-southeast pattern. However, on a local spatial scale, ok does not show variations in temperature, and this in contrast with lm and MARS. During the summer months, lm and MARS have an average UHI of 0.5 °C. In the southwestern part, a higher UHI, up to 1.0 °C, is found for MARS. The peat areas are on average 0.3–0.4 °C cooler. The central forest areas are in MARS 0.1–0.2 °C cooler. During the winter months, the UHI for lm and MARS is around 0.8 °C and 1.0 °C respectively.

Fig. 4
figure 4

Climatology expressed as 10%, 50%, and 90% temperature quantiles: a lower temperatures, b median temperatures, and c the highest temperatures for the three different methods ordinary kriging (ok), linear regression (lm), and multiple adaptive regression splines (earth)

3.1.1 Low temperatures

The 10% temperature quantile varies between 1.0 °C in the eastern part of the country and 3.5 °C in the western coastal area. The ok model has a strong gradient in the southwestern part; this strong gradient is not supported by lm and MARS. Both lm and MARS have temperatures between 2.0 and 2.5 °C outside city areas near the coast. Within the denser populated areas, lm and MARS predict temperatures respectively around 3.0 °C and 3.5 °C. In the southeastern part, an elevated area (Fig. 3), MARS predicts 0.5 °C lower temperatures than ok and lm.

3.1.2 Median temperatures

The 50% temperature quantile in the northeast is around 10.0 °C and in the southwest 10.5–11 °C. The densely populated areas are approximately 0.5 °C warmer in lm and MARS. Similarly to the 10% temperature quantile, MARS predicts the lowest temperatures in the southeastern part (10.1 °C), and differences are 0.3 °C with lm and 0.6 °C with ok. The lm predictions are for the central elevated area 0.3 °C higher temperatures compared with ok and MARS. The peat areas in the central part of the Netherlands have, for lm and MARS, lower temperatures.

3.1.3 High temperatures

The 90% temperature quantile varies between 17.5 °C in the North and 19.0 °C in the southwest. Consistent with the 10% and 50% temperature quantiles, the city effect is pronounced for lm and MARS. In both cases, the temperatures are predicted on average 0.5 °C higher. Compared with MARS, the lm predicts higher temperatures in the central elevated area (18.6 °C compared with 18.3 °C). The earths temperatures in the peat areas are lower.

3.2 Model performance and comparison

The median RMSE LOOCV of the daily interpolations for ok, lm, and MARS are (with the standard deviation between brackets): 2.56 (3.57), 0.71 (0.38), and 0.60 (0.35) respectively. The mean absolute error equals respectively 0.34 (0.27), 0.57 (0.30), and 0.46 (0.26). For new predictions at the leave out locations of the AWSs, MARS is the most stable.

The observations are compared with the model predictions (Fig. 5). The bias at the locations of the AWSs equals zero; this is as expected. The RMSE at the AWSs is 0.52 °C for ok, 0.58 °C for lm, and 0.76 °C for MARS. This is also reflected in the broader density spreading around the 1:1 line. The Haarweg station, which is situated in a similar setting as the AWSs, shows similar results. Here, the bias is non-zero and equals for ok, lm, and MARS respectively 0.04 °C, 0.09 °C, and 0.08 °C. The RMSE values for Haarweg are smaller than the AWS stations; ok, lm, and MARS have a RMSE of 0.36 °C, 0.46 °C, and 0.46 °C. Comparing the predictions with the citizen weather observations shows a different result. The ok interpolation has a cold bias of 0.86 °C for the GGD observations. Both lm and MARS have a smaller bias of 0.54 °C and 0.43 °C. RMSE values are around 1.2 °C. The Wunderground observations have smaller RMSE with values around 1.0 °C. Similar to the GGD observations, ok has the largest bias of 0.47 °C. For lm and MARS, the bias equals 0.29 °C and 0.25 °C.

Fig. 5
figure 5

Predicted versus observed temperature values: a at the locations of the AWSs, validated for the entire time period, and b for the different citizen weather observations. The GGD observations were mainly after 2014. The Haarweg observations include measurements from 2001 onward and the first Wunderground observations start in 2008

The difference in variable importance (see Section 3) also explains the prediction differences: MARS predicts higher temperature in cities and lower temperatures in elevated areas since these are considered to be more important. Within cities the MARS predictions are closer to the observed citizen weather temperatures than lm, and this suggests that the variable importance of MARS is more realistic.

4 Discussion

Regression kriging has been used by Sluiter (2012), Mohr and Tveito (2008), and Aalto et al. (2013) and has also been explored in our analysis (not shown). Because of the large number of variables, the variogram fitting proves to be unstable. Regression kriging does not have a built-in feature selection such as lm and MARS. The additional predictors used for the lm and MARS resulted in a higher spatial variability which is also reflected in the RMSE values of the AWS stations. The comparison with independent citizen weather observation has shown that both lm and MARS have a smaller bias compared with the ok method which does not use additional predictors. Errors for higher temperatures could possibly be caused by the non-optimal or absent ventilation systems (Bell et al. 2015).

It may be possible that the high-resolution temperature grids that we present here can be used to analyze local warming trends. For the station De Bilt, located in the center of the Netherlands, van Oldenborgh and Van Ulden (2003) fitted a linear warming trend from which 1 °C is contributed to warming, including an increase in southwesterly circulation, and 0.6 °C is the standard deviation error or noise factor. Between 1975 and 2010, the warming in the Bilt was 0.049 °C per year (van der Schrier et al. 2011). For our time period (1990–2017), we found a linear warming trend in de Bilt of 0.025 °C per year. Analyzing the linear warming trend from 1990 to 2017 from our gridded temperature dataset shows an average warming of 0.027 °C per year with a noise factor of 0.64 °C for ok. A possible lack of significant spatial variability in the ok temperature grids is due to the inherited spatial smoothing of ok. The fitted trends for lm and MARS have similar noise values though their warming trends do show a spatial pattern. The first and third quantiles for lm and MARS are respectively 0.026–0.029 °C and 0.027–0.031 °C. These variations are supported by the yearly linear warming trends including:

  • The AWS stations near the coast: De Kooy (0.029 °C), Vlissingen (0.024 °C), and Rotterdam (0.025 °C).

  • A bit further inland the warming trend of AWS Schiphol was estimated at 0.031 °C.

  • In the central part of the country, the warming trends of De Bilt, Cabauw, and Deelen were 0.025 °C, 0.030 °C, and 0.021 °C respectively.

  • In the east, the AWS Twenthe had a warming trend of 0.026 °C.

  • In the northern part of the country, warming trends of the AWS stations Leeuwarden and Eelde were 0.032 °C and 0.030 °C respectively.

It has been suggested that the warming near the coast is slower than further inland. The North Sea is likely to delay the warming in the coastal regions. Also, additional warming due to an increase in southwesterly winds, as suggested by van Oldenborgh and Van Ulden (2003), is possibly stronger inland. Besides these large-scale variations in warming trends, also variations near the expanding cities can be observed. It is unclear to what extent this trend is artificial (induced by the input from population density) or representative because long-term records in cities are not available to support the trends.

Several recent studies have been dedicated to compare and validate observational data with numerical weather prediction (NWP) model or reanalysis data (Mohammadi et al. 2017; Carrera et al. 2019; Heintzman 2019; Rontu et al. 2019). Observational data helps to identify model biases and allows for bias corrections. Also interpolated or gridded products have been compared with models (Hutchison et al. 2017; Jain and Flannigan 2017; Ouyang et al. 2018; Krauskopf and Huth 2019; López Gómez et al. 2020). This comparison is complex because it not only highlights model biases but also deficits from interpolated grids. Therefore, Krauskopf and Huth (2019) compared different sources of (gridded) observations and reanalysis. In certain regions (e.g., the Pyrenees or Scandinavia) and on small scales, the largest differences were found. Our interpolation product, combined with independent citizen weather observations, can be used to improve high-resolution weather forecasts and to improve the quantification of the urban heat island effect.

The interpolation method presented here could also be applied to different countries or regions provided that the same predictors are available. It does however require an observational network which is representative for the interpolation area. If the network is too sparse, over-fitting may become an issue. It is therefore of great importance to validate the interpolation with independent observational data such as citizen weather observations.

5 Conclusions

With a final resolution of 1km, the daily air temperature within the Netherlands was reconstructed for the period 1990–2017. The goal of this study was to develop a new methodology; therefore, we used eight different spatial predictors and compared linear regression and multiple adaptive regression splines, to reconstruct daily temperature patterns. As a reference run, ordinary kriging, which does not use spatial predictors, was compared with the more advanced methods. The spatial predictors included: diurnal temperature range (DTR), population density, elevation, albedo, solar irradiance, roughness, precipitation, and normalized difference vegetation index (NDVI). These were fit to the temperature observations of the 34 automatic weather stations (AWSs), which are spread throughout the country. In order to prevent over-fitting, highly correlated predictors were excluded beforehand and the fitting routine used backwards feature selection. The models were validated using leave one out cross validation (LOOCV) for each of the 34 AWS locations. The RMSE for ordinary kriging (ok) was 2.6 °C. Using linear model (lm) and multiple adaptive regression splines (MARS) resulted in a significantly smaller RMSE: 0.7 °C and 0.6 °C, respectively. Since the AWS locations are used during the interpolation, the predictions were validated with citizen weather observations (including stations from the Wunderground network), GGD network, and Wageningen University Observatory. In contrast to the ok model, the lm and MARS models are 0.2–0.5 °C less biased compared with citizen weather temperature observations.

On country scale, temperature patterns are similar, though the temperature gradient found by ok is generally stronger. Temperature patterns of the median climatology show a southwest-northeast gradient. In the southwest, the highest mean temperatures occur. The 10% temperature quantile has a strong east-west temperature gradient, with the lowest temperatures in the east. The 90% temperature quantile has the highest temperatures in the southwest and lowest in the northeast. Both the temperature patterns from lm and MARS show local temperature variations. The MARS temperature predictions are lower in peat areas in the central part of the country and elevated areas. Higher temperatures are found in densely populated areas which is supported by the crowd-sourced measurements.

There is a preference towards the MARS model as it is closer to the citizen weather temperature observations. Additionally, the scale-ability of the MARS algorithm for larger areas with different climatological influences makes it preferable above lm.

Regional and local variations in warming trends are not captured with ok. However, our new interpolation methodology, with explanatory predictors, feature selection, and independent validation, does show local spatial variability. The warming trend from 1990-2017, from the temperature grids of lm and MARS, show a warming between 0.027 °C per year and 0.031 °C per year. Although some small-scale variations in warming trends are left unexplained, we have been able to validate regional-scale trends; these results are the first step towards local warming trend analysis. With the improved representation of local temperature variations, the high-resolution maps presented here can have applications in urban heat island studies, local trend analysis, and model evaluation.