Reassessment of the homogenization of daily maximum temperatures in the Netherlands since 1901

In 2016, the Royal Dutch Meteorological Office (KNMI) homogenized the daily temperature records for the Netherlands from 1901 to 1950 to allow a realistic comparison of the temperatures from 1901 to the present. The homogenizations for the main station De Bilt were carried out using a Percentile Matching Method (PMM) with one reference station and a 56-month reference period. In this study, it is shown that the corrections in the number of tropical days (maximum temperature ≥ 30 °C) depend strongly on the choice of the reference station and the length of the reference period. A total of 116 different variants of the homogenization of De Bilt were carried out, using all combinations of five reference stations, five reference periods, two ways to calculate percentiles, and two ways to smooth the data. The parameters used for the KNMI’s current homogenization of De Bilt result in a very sharp decrease of tropical days, which is not replicated by the majority of the 116 variants. Moreover, after homogenization, De Bilt appears to be an outlier compared to the other meteorological stations. Therefore, the current homogenized estimates of tropical days for De Bilt should be treated with considerable caution.


Introduction
In the Netherlands, there are five long-term active principal meteorological stations with temperature records from the beginning of the twentieth century up to present. These stations are Den Helder/De Kooy, De Bilt, Groningen/Eelde, Vlissingen/Souburg, and Maastricht/Beek (Fig. 1). All five stations have long time series with daily mean, minimum and maximum temperatures (Tm, Tn, Tx): De Bilt from 1901 onward and the other four stations from 1906.
Relocation took place at all five stations. Groningen station was moved to airport Eelde in 1951, about 10 km south of the city. Den Helder station was moved to airport De Kooy in 1972, about 5 km to the southeast; Maastricht station was moved to airport Beek in 1951, about 9 km northeast of Maastricht; Vlissingen station was temporarily located at Souburg (2 km further north) from 1947 to 1958. During these relocations, parallel measurements were carried out at the old and the new locations. For Groningen-Eelde and Maastricht-Beek, parallel data are available for a period of 6-7 years. A sixth station, Winterswijk, was also used in this study. It has a longterm station record starting in the beginning of the twentieth century, but was discontinued in 1970 (the station record spans the period . None of these stations remained unchanged regarding location and/or screen type during the whole period. See Fig. 1  Data of De Bilt are important, not only because of its long temperature series but also because it is one of the six stations on which the Central Netherlands Temperature record (CNT) is based. The CNT is a monthly daily mean temperature time series that is representative for a region in the center of the Netherlands. It has been constructed to study large-scale temperature changes and facilitate comparisons with climate models (Van der Schrier et al. 2011). Of importance is also that in the Netherlands, a national heat wave is defined only on the basis of the data of De Bilt.

Homogenization
Long instrumental temperature records are essential for the validation of climate models, for assessing long-term trends, and for detection of regional climate change (Venema et al. 2012). However, a problem with many long-term temperature series is the presence of non-climatological biases. External changes can cause jumps and trend breaks that are non-climatic. Peterson et al. (1998) and Trewin (2010) discussed these possible changes that can introduce nonrandom systematic biases into a data set.
Step biases are often introduced by station move or change in instrumentation or shelter. Trend biases are often associated with gradual changes in the environment of the station over multiple years, e.g., the growth of vegetation nearby (a micro-climate bias), or urbanization bias (a local climate bias).
It is likely that the stations of Groningen, Den Helder, Vlissingen, De Bilt, and Maastricht were affected by step biases as well as urbanization bias. Groningen (1951), Den Helder (1972), and Maastricht (1951 were all moved to airport locations in the neighborhood; Vlissingen (1958) regained its previous location at the harbor. In De Bilt, the screen moved twice, in 1950 and 1951. The old location was in a park with high trees, close to the buildings of the KNMI. The temporary new location in 1950 was in the same park but at greater distance from the buildings. The final location was situated in an open field on the KNMI grounds. Although De Bilt is not situated in a city, the distance to Utrecht is small and a continuing urban bias has been estimated. Brandsma et al. (2003) found for De Bilt station during the twentieth century an urbanization bias of 0.10 K ± 0.06 K. Koopmans et al. (2012) calculated for De Bilt an urban heat effect of 0.32 K in the previous century.
It is crucial that a data set reflects climate changes and not changes in observation conditions. Therefore, homogenization of temperature data sets is generally needed for construction of long-term datasets to be used for monitoring climate change. Applying homogenization to a data set typically starts with detection of inhomogeneities which is   -1906October 1944-May 1945: no data, 1947-1958: city station moved to airport Souburg, 4 km East Winterswijk 1- 1-19011- November 1944: no data, Possible inhomogeneities in 1950. Closed 1971 a well-described problem in climate science. Reviews on detection techniques include those of Reeves et al. (2007), Li and Lund (2012), Lindau and Venema (2015), and Acquaotta et al. (2016). The next step in the process of homogenization is to remove inhomogeneities by adjusting the data. Homogenization has frequently been applied to data sets of monthly and annual mean temperatures and is described in various studies. In recent decades, a significant number of algorithms have been developed to remove inhomogeneities. Venema et al. (2012) did a blind intercomparison and validation study of monthly homogenization algorithms. The algorithms were validated against a realistic benchmark dataset containing real inhomogeneous data and added random independent inhomogeneities. The relative homogenization algorithms developed to work with an inhomogeneous reference performed best. In relative homogenization, neighboring stations are exposed to nearly the same climate signal. As a consequence, the differences between those stations can be used to detect inhomogeneities.
Studying developments in daily temperatures and extremes requires a homogeneous set of daily maximum temperatures (Tx) and minimum temperatures (Tn), not just mean temperatures (Tm). Homogenization of temperatures at a daily time scale is much more difficult than at monthly or annual scales because of the day-byday changing meteorological situation. Della-Marta and Wanner (2006) used a nonlinear model to estimate the relationship between a candidate station and a highly correlated reference station. Trewin and Trevitt (1996) compared three homogenization methods for daily and monthly temperature series at two sites during a period of overlapping records and found that the "frequency distribution matching" method was superior in the case of extreme events. Mestre et al. (2011) used a method, SPLIDHOM, that relies on an indirect non-linear regression method, with reliable estimation being ensured by cubic smoothing splines. Trewin (2013) used the percentile-matching (PMM) method, which applies adjustments to daily data depending on their position in the frequency distribution.
According to Brandsma (2016), all five Dutch principal stations show step biases due to relocation. As previously noted, the De Bilt series was also affected by a screen change in 1950. In order to be able to make realistic comparisons of the data of the first half of the twentieth century with recent data, the Dutch Meteorological Office (KNMI) homogenized the data of all five stations using the percentile matching method (PMM), as described by Brandsma (2016). For Groningen/Eelde, Maastricht/Beek, Den Helder/De Kooy, and Vlissingen/Souburg, parallel measurements were available for an overlapping period of 4-7 years to homogenize the data.
For De Bilt, parallel measurements were only available for the screen change in 1950, but not for the relocations in 1950 and 1951. Therefore, the KNMI used data of Eelde  Figure 2 shows the ratio of days with Tx ≥ 25 °C and Tx ≥ 30 °C for the five principal stations before and after 1950. After homogenization, De Bilt appears to be an outlier compared with the other stations. In this study, we try to understand why the homogenization applied by Brandsma (2016) for De Bilt led to this apparently anomalous result. Specifically, we will examine the PMM homogenization algorithm as applied by Brandsma (2016) and investigate the effect of several choices Brandsma made in the algorithms. The KNMI homogenized daily Tm, Tx, and Tn values at the five stations. We are concerned here with the effects of homogenization on Tx statistics. Since De Bilt is the only station where parallel measurements are not applied and the homogenization results differ notably from the other stations, we only deal with the effect of the homogenization on the Tx measurements for De Bilt. However, we would encourage further research into the effects on Tx of the other stations a well.

Methods
The PMM algorithm has different variants that have been described well by various authors, among others Squintu et al. (2020), Trewin (2013), Kuglitsch et al. (2009), Della-Marta andWanner, (2006), Cao et al. (2016). We focus here on the variants that Brandsma (2016) used for homogenizing daily temperatures: (1) The overlap case, when parallel data of two situations at (approximately) the same location are available.
(2) The non-overlap case, when there are no parallel data, so that data from reference stations must be used.
In the overlap case, for every month, percentiles of Tm, Tx, and Tn are calculated for the two situations. The difference between the monthly percentiles leads to a correction factor that can be applied to the data from the old situation, to make the old data comparable to the new situation. This procedure was followed by Brandsma (2016) for the homogenization of four stations: Maastricht/Beek, Den Helder/De Kooy, Vlissingen/Souburg, and Groningen/Eelde. For De Bilt, Eelde was used as reference station with two overlapping periods: 1946-1950 and 1952-1956. This is the non-overlap case of the PMM algorithm, where the data for the reference stations must be homogeneous, while the candidate station has an inhomogeneity between the two periods.
The procedure that Brandsma (2016) applied is as follows: (1) Calculate for each month the percentiles (5, 10,…90, 95) in the overlapping periods of the daily temperatures before and after the inhomogeneity, both for the candidate station and the reference station.
(2) Calculate for each month before and after the inhomogeneity the difference (ΔT1, ΔT2) between the percentiles of the candidate station and the reference station.
(3) Calculate for each month and percentile the correction for the inhomogeneity as ΔT = ΔT1-ΔT2. Smooth the columns and rows of the resulting 19 × 12 matrix. For this smoothing, he used a Loess smoothing with a span of 0.6 and degree = 2, the latter meaning that the local regression is done with a quadratic function. Trewin (2013) stated that because of the distance between a reference station and the candidate station, it is preferable not to use one reference station, but a set of reference stations located in different directions from the candidate station. For the final corrections, the results of the homogenizations with the separate reference stations can be combined by taking the median.
Della-Marta and Wanner (2006) conclude about using one or more reference stations: "Since the method only uses one reference station to estimate the adjustments, it is important that all inhomogeneities in that series be accurately determined for the creation of the HSPs [homogeneous subperiods]. If more than one suitable reference station is available, a comparison between the adjustments from each could be made. If the results were consistent then we may be more confident that […] the use of this method [is] justified." Brandsma's method slightly differs from the method described by Della-Marta and Wanner. Brandsma (2016) described the homogenization of the Dutch stations in a technical report and provided additional details in a personal communication to the authors of this paper. Important additional details are: (1) The reference periods were 1-1-1946 to 31-8-1950 and 1-1-1952 to 31-8-1956. Brandsma did so to include as much overlapping data as possible: Eelde started on 1-1-1946 and the Pagoda screen in De Bilt was abandoned on 31-8-1950. This choice results in reference periods of 56 months, with an overrepresentation of calendar months 1-8.
(2) Because of the occasionally large differences between percentiles of adjacent months, Brandsma calculated the percentiles for each month from running groups of three months: for January DJF, for February JFM, etc. (3) Because these choices seem somewhat arbitrary and not underpinned, we decided to explore the effect of other choices: (4) Not only Eelde as reference station but also the other three principal stations as well as Winterswijk (five variants), (5) Reference periods of 48, 56, 60, 120, and 168 months (five variants), (6) Calculating the percentiles for each month from single months and from running groups of three months (two variants), (7) Loess smoothing not only with 2 nd degree regression but also with linear regression (two variants).
We used an R-code script similar to the one Brandsma (2016) used and executed 116 runs with all combinations of the above-mentioned variants.
Because the homogenized De Bilt record seems to be an outlier regarding the number of tropical days (see Fig. 2), we suggest that Brandsma's procedure may have led to an overcorrection. We explored this possibility by comparing the average differences between homogenized Tx of De Bilt and the other stations. Tx-corrections per percentile were calculated as ΔT = ΔT1-ΔT2, where ΔT1 = the average Tx De Bilt minus average Tx reference station over 1935-1949, and ΔT2 = likewise but over 1952-1966. We define a relative "overcorrection" as being when ΔT is positive, and a relative "undercorrection" when ΔT is negative. The relative corrections were calculated for percentiles 1, 2, 3, 4, 5, 10,…90, 95, 96, 97, 98, 99 and the results were smoothed with the Loess-function, span = 0.4, degree = 1.

Data
We used as much as possible uninterrupted data sets of the Dutch stations. All of them were obtained from the KNMIwebsite or were provided by Brandsma on our request. The data sets are presented by KNMI as homogenized. Only the data of Winterswijk seem to have an inhomogeneity around 1950. Nevertheless, we used these data because no homogenization is available, and because these data were also used by KNMI for the construction of the Central Netherlands Temperature record (CNT) by Van Schrier et al. (2011). The following data sets were used: • De Bilt 1-1-1901/31-8-1950: measured data using Pagoda screen on the old location. • De Bilt 1-1-1901/31-12-1951: homogenized data with Eelde as reference station. • De Bilt 1-1-1952/31-12-1995: measured data using Stevenson screen on the final location. • Groningen/Eelde 1-3-1906/31-12-1965: data from city station Groningen homogenized with Eelde as reference station. • Eelde 1-1-1946/31-12-1965: measured data from airport station. • Maastricht/Beek 1-1-1906/31-12/1965: data from city station Maastricht homogenized with Beek as reference station. • Beek 1-1-1946/31-12-1965: measured data from airport station. • Den Helder 1-1-1935/31-12-1965: measured data from city station. • Vlissingen 1-1-1935/31-12-1965: measured data from city station. This data series was interrupted 1-1-1947/31-12-1958, for which period homogenized data of airport station Souburg were used. • Winterswijk 1-1-1935/31-12-1965: measured data. Figure 3 shows the results of 116 variants of the homogenization for De Bilt. Long-term records are sensitive to changes in degree of urbanization and siting exposure. This may probably have influenced most of the data sets used here, so it could be impossible to make reliable adjustments to daily records from 1901 to 1951 on the basis of overlapping data from shorter subseries. The results presented here only show the effect of the choice of the reference stations, length of the reference period, and the way of calculating percentiles and smoothing.

Results and discussion
It is obvious from this figure that the choice of the reference station has considerable effect on the results. Using Groningen/Eelde or Vlissingen as reference station gives the smallest number of 1901-1951 tropical days for De Bilt. Using Maastricht/Beek as reference station gives much higher numbers. Using Den Helder or Winterswijk gives approximately the same results, intermediate between Groningen/Eelde and Maastricht/Beek.
The KNMI decided to use only Eelde as reference station for the homogenization of De Bilt because these two stations are located at comparable distances to large water surfaces. This is not quite true, because De Bilt is close to the former inland sea (Zuiderzee), which had an open connection to the North Sea until 1932, when the sea was closed by a 30 km long dike. Between 1940 and 1970, the former sea was partly impoldered. The effects of these geographical changes on the local climate are absent at station Eelde. Remarkably, using Vlissingen as reference station (a coastal station in the south west) gives very similar results as using Eelde as reference. On the contrary, using Den Helder (a coastal station to the north west) and Winterswijk (an inland station to the east) also gives similar results, but higher than when Eelde or Vlissingen are used. The conclusion could be, that an ideal reference station to homogenize the data of De Bilt does not exist. In the absence of a better reference station, the median of the results with all surrounding stations could be used, as was done by Trewin (2013) with the Australian data. An alternative could be to leave the historical data unchanged, and deduce climatic trends from the separate data of all stations.
The length of the reference period used seems to noticeably influence the spread of the results. With periods of 48-60 months, the largest spread is found. Any number of tropical days between 65 and 140 can be found. However, when applying longer reference periods, the results converge strongly. With a reference period of 168 months, the median of the number of tropical days found for De Bilt is 113 with margins 104-119. This result deviates considerably from the present KNMI data (76 tropical days before 1950).
Calculating the monthly percentiles from running groups of three months has in some cases considerable effect, but in other cases less effect, as Table 2 shows. Coincidentally, this effect is most pronounced when using Groningen/Eelde as reference station with a 56-month reference period as was done by Brandsma (2016). In that case, the number of tropical days found for De Bilt decreases from 93 to 76. On the other hand, when using Maastricht/Beek as reference station, this way of calculating the percentiles leads to an increase of the number of tropical days for De Bilt from 125 to 137. With a reference period of 168 months, the effect of the method of calculation percentiles has much smaller effects with all reference stations.
The degree of the Loess smoothing has a moderate downward effect on the number of tropical days found, except for Den Helder as reference station. From Fig. 3 and  Beek-m 1-1 Beek-m 1-3 Beek-m 2-1 Beek-m 2-3 Brandsma, 2016  Bilt from 1901 to 1951 lead to a very low number of tropical days. With Groningen/Eelde as reference station, 56-month reference period, percentiles from groups of three months and quadratic smoothing 76 tropical days are found. However, with different choices, different results can be found. Using other reference stations leads to 80-137 tropical days; using different reference periods leads to 81-112 tropical days. Taking percentiles from single months leads to 100-104 tropical days, and applying a linear Loess smoothing instead of quadratic smoothing gives 81-104 tropical days.
These results show that homogenization of temperature extremes by using the PMM is only possible when the results are reported with a wide uncertainty range. As was shown in Fig. 2, the homogenized data of De Bilt appear to be an outlier regarding the number of tropical days, compared to the other stations.
In Fig. 4, we explore the difference between station De Bilt and the other stations in two 15-year periods before and after 1950 for the lower percentiles. This shows that up to the 80 th percentile the corrections for De Bilt do not differ much from the other stations. The median of the difference between De Bilt and the other stations is about − 0.2 °C. However, for the highest percentiles, the difference increases strongly, up to + 0.7° for the highest percentiles. Note that tropical days are all above the 98 th percentile, and warm days (T ≥ 25 °C) are above the 95% percentile. This suggests that the large reduction of tropical days in Brandsma's   1935-1949 and 1952-1966. The solid curve shows the median of the five series homogenization is caused by a statistical effect that occurs especially at the high end of the temperature distribution. A possible explanation could be that Brandsma (2016) calculated monthly percentiles from running groups of three months. When, for example, the percentiles of the month of May are calculated for April-June, the highest percentile is dominated by June days. This will cause a bias in the temperature distribution for May, so that the highest Tx for May days will be corrected as if they were June days, but this effect will be similar for the reference station and for the two reference periods. A bias could occur if this effect changes in a different way for the two reference periods at different reference stations. Considering these complicated consequences, it seems advisable not to use percentiles from groups of three months. Brandsma (2016) used this procedure because he found large differences between percentiles of adjacent single months and adjacent percentiles. This problem can be overcome by using longer reference periods. In Fig. 3, it is shown that with long reference periods, the results converge to a rather narrow interval of 104-119 tropical days.
The effect of the smoothing procedure is quite small. To smooth data, the Loess function uses extrapolation of local trends. This extrapolation can be performed with degree 1 (linear) or 2 (quadratic). Quadratic regression may in some cases give a somewhat better data fit, but it requires much more computing power. As Table 2 shows, the effect of quadratic regression in the PMM homogenization results in most cases in a lower number of tropical days. To our knowledge, there is no climatological or statistical argument to use a degree 2 smoothing.

Conclusions
The homogenization of the daily maximum temperatures at the five principal stations in the Netherlands has caused a significant decrease in the number of tropical days from 160 to 76 at the main station De Bilt in the period 1901-1950. As a result, De Bilt has become an outlier compared to the other stations. We reassessed the homogenization procedure using the same percentile matching algorithms as used by KNMI.
Since there were no parallel measurements for location changes around 1950 at De Bilt, KNMI used data of only one other station, Eelde located 148 km to the northeast, with two reference periods of 56 months before and after 1950. For our reassessment, we executed 116 runs of the homogenization algorithm, using five other reference stations, five reference periods up to 168 months, and different procedures for calculating percentiles and smoothing the data. We conclude that the results of the homogenization strongly depend on the choice of these parameters.
Only with a minority of the runs, we found as little tropical days before 1951 as KNMI. Alone with Eelde as a reference station, we found several combinations of reference period and calculation of percentiles that result in much higher numbers of tropical days for De Bilt (up to 116) before 1951. With other reference stations, we found up to 140 tropical days for De Bilt before 1951.
Major causes of the strong reduction of tropical days in the homogenization by KNMI are the calculation of percentiles from running groups of three months and the reference period of 56 months that is too short to correct daily Tx-values over a period of 50 years. Comparison of the Tx-corrections for De Bilt with the corrections for the other stations suggests that the average Tx-corrections for De Bilt do not deviate strongly from those of the other stations, but for the highest percentiles of Tx, the corrections are much higher than could be expected on the basis of the other stations.
Our observations suggest that the homogenization by KNMI of the temperature extremes in De Bilt from 1901-1950 needs to be reconsidered.

Detailed information
Annex 1 specifies the results of the 116 runs of the PMM algorithm that were used for Fig. 3 and Table 2. Annex 2 specifies the number of hot days from three time series: the originally measured data of De Bilt, the homogenized data by Brandsma, and the median of our runs with 168month reference period.
Author contribution All authors contributed to the study conception and design. Data collection and analysis were performed by Frans Dijkstra and Jan Ruis; description of the stations in the Netherlands and general literature review were performed by Rob de Vos. The first draft of the manuscript was written by Jan Ruis, modified by Frans Dijkstra, and all authors commented on further versions of the manuscript. Communication with external experts was done by Marcel Crok. All authors read and approved the final version of the manuscript.
Funding The research was supported by a grant from Clintel Foundation, Amsterdam.

Data availability
The data used in this research can be downloaded from https:// klima atgek. nl/ wordp ress/ data/.

Code availability
The codes used can be downloaded from https:// klima atgek. nl/ wordp ress/ data/.

Competing interests
The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.