1 Introduction

Quantifying spatial and temporal variability in near-surface air temperature is essential to the study of many biophysical phenomena, including energy (McVicar and Jupp 1999; Parton and Logan 1981) and water balances (Guerschman et al. 2022; McVicar and Jupp 2002; Vaze et al. 2013), wildfire (Brown et al. 2016), agricultural productivity (Holzworth et al. 2018; Walter 1967) and biogeographic distributions (Kearney and Porter 2009). Often there are trade-offs between spatial resolutions and temporal frequencies in climatological datasets, and therefore the availability of suitable products to study sub-diurnal (i.e., time-steps shorter than a 24-hour period) environmental dynamics can be limited. Quantifying air temperature at sub-diurnal time-steps enables better representation of diurnal processes (Dai 2023), whereas finer spatial resolutions can better represent the effect of lapse rates on temperature in topographically variable regions (Hutchinson 1991; McVicar et al. 2007). This increased spatial and temporal granularity can also enable better linkages with secondary data sources, such as remote sensing analyses where data are acquired at specific times.

Several post-processing techniques can be applied to achieve the desired spatial and sub-diurnal precision. For example, spatial downscaling (e.g., Wang et al. 2016) or temporal interpolation (e.g., Parton and Logan 1981) can be used to modify the spatial and temporal characteristics of existing products. Despite the availability of such methods, finding an adequate balance between the spatial resolution, temporal frequency, currency, and accuracy of air temperature datasets can be challenging (e.g., Kettle and Thompson 2004; Li et al. 2020; Pan et al. 2012).

Empirical temporal interpolation using daily air temperature extrema (minimum, maximum) as input has been used for over 130 years to model sub-diurnal air temperature (Strachey 1886). These models typically apply harmonic functions to daily (or longer) air temperature observations (i.e., measured once or summarised to one value per time-step) and measures of solar geometry (e.g., timing of sunrise and sunset), and assume that the timing of minimum and maximum air temperatures is constant with respect to local solar time (Table 1). The input data are relatively simple and can be easily computed or accessed from meteorological databases (e.g., daily air temperature extrema from gridded datasets, station observations), making these models practical to implement. Despite this accessibility, these models require strong assumptions about the timing of minimum and maximum air temperatures, need calibration for best results, and cannot represent frontal systems (Cesaraccio et al. 2001; Parton and Logan 1981). These models are also usually calibrated for a specific location or a small set of locations (e.g., Gholamnia et al. 2019) despite being applied more broadly in practice, and do not explicitly consider spatial relationships between observations.

Table 1 Summary of studies using spatial (S) and/or temporal (T) methods for interpolating sub-diurnal near-surface air temperature (Ta). The studies are ordered by temporal interpolation then spatial interpolation, then within each interpolation type they are ordered chronologically by publication year. Temporal interpolation methods require minimum (Tamin) and maximum (Tamax) daily air temperature. Spatial interpolation methods include numerical weather prediction (NWP) and inverse distance weighting (IDW). The interpolated area is reported as “not applicable” (n/a) for site-based analyses

Spatial interpolation is widely used for the development of climate datasets from regional to global scales (Cornes et al. 2018; Harris et al. 2020; Jeffrey et al. 2001; Jones et al. 2009; McVicar et al. 2008; Thornton et al. 2021). Variables are commonly available at daily or longer time-steps; however, there have been applications at sub-diurnal time-steps (see Table 1). Air temperature is often provided as extrema at daily or longer time-steps (e.g., monthly), and as such is modelled as a mosaic of observations that likely occurred at different times. Models are calibrated using high-quality observations, ideally with sufficient density and representation to characterise important climatic gradients (e.g., lapse rates, sea breezes, synoptic patterns). The most common spatial interpolation methods include thin plate splines and kriging (Hutchinson 1991; Matheron 1962), though many others have been developed (Hengl et al. 2018; Li and Heap 2014; Sekulić et al. 2020). Spatial interpolation draws on observations from surrounding locations and typically shows strong statistical performance in climate applications (Hutchinson et al. 2009; Jones et al. 2009), though model skill is dependent upon the quality and density of input data (e.g., Stewart et al. 2017), in addition to appropriate representation of process gradients (e.g., elevation for estimating lapse rates, distance from coast for sea breeze modelling). Historically, the limited availability of high-frequency climate observations (relative to daily or monthly) has been one barrier to applying these methods at sub-diurnal time-steps over many years at continental extents.

Sub-diurnal meteorological fields can be accessed via climate reanalysis products, which are developed using physical models in conjunction with assimilated observations (e.g., Gelaro et al. 2017; Hersbach et al. 2020; Kobayashi et al. 2015; Muñoz-Sabater et al. 2021; Su et al. 2019). These datasets quantify many atmospheric variables, are regional to global in extent, range from decades to over a century in length, and often provide sub-diurnal time-steps. Reanalyses preserve the physical relationships between variables and include multiple vertical layers representing increasing altitudes, but are sensitive to spatial-temporal patterns in assimilated observations (as with spatial interpolation) and often have very coarse spatial resolution (0.1° to 2.5° grid cells, equivalent to ~11 km to 277 km at the equator). The coarse spatial resolution means that many fine-scale features are not well represented in the reanalysis output without post-processing (e.g., Guo et al. 2022; Karger et al. 2017; Politi et al. 2021), which can be limiting when analysing reanalysis output alongside remotely sensed data and/or studying near-surface processes in regions of highly variable topography. Furthermore, the delay between the current date and reanalysis data availability can range from several months to years (e.g., Muñoz-Sabater et al. 2021; Saha et al. 2014; Su et al. 2019), limiting suitability for studying recent events.

The increased availability of hourly observations across Australia in recent decades (Australian Bureau of Meteorology 2023; Trewin 2012) provides an opportunity to demonstrate spatial interpolation of near-surface air temperature at sub-diurnal time-steps. This enables the representation of frontal systems, avoiding many of the assumptions associated with temporal interpolation of air temperature, such as the (consistent) timing of daily minimum and maximum air temperatures. Spatial interpolation also allows for the development of high-resolution surfaces better suited for use with remote sensing data, and can be rapidly updated, enabling the study of recent events (e.g., extreme weather). Our aim was to develop and evaluate a high-resolution spatial dataset of hourly air temperature calibrated using weather station observations distributed across Australia. Our specific objectives, which form the basis for sub-headings in the following Methods, Results and Discussion sections, were to:

  i) describe the development and statistical performance of spatially interpolated hourly air temperature surfaces for Australia;

  ii) evaluate the performance of our method against temporal interpolation using daily extrema air temperature observations; and

  iii) compare our spatially interpolated hourly product against high-quality, contemporary regional and global reanalysis products.

Many studies have investigated spatial and temporal air temperature interpolation in isolation (Table 1), but few, if any, have analysed the relative benefits of using one approach over another. These analyses, in conjunction with validation against independent observations and comparisons with reanalysis products, demonstrate the feasibility and suitability of spatial interpolation for generating high quality hourly air temperature surfaces.

2 Study region and materials

Hourly air temperature (at ~ 1.5 m height) observations recorded across Australia between January 1990 and November 2019 were obtained from the Australian Bureau of Meteorology (BoM). The dataset included records for 621 stations. The number of these stations where meteorological parameters are available at hourly time-steps has steadily increased over recent decades (Figure S1; Australian Bureau of Meteorology 2023; Trewin 2012). Hourly air temperature observations were available for just 35 of these stations in 1990, increasing to 328 stations in 2000 and 568 stations in 2019. All records were converted from their respective Australian time zones, accounting for daylight saving time as needed, to Coordinated Universal Time (UTC) to ensure they were temporally aligned.

For independent validation, hourly air temperature observations from 24 OzFlux weather stations (see Beringer et al. 2016; https://www.ozflux.org.au/) and 4 CosmOz weather stations (see Hawdon et al. 2014; https://cosmoz.csiro.au/) were compiled. The OzFlux stations monitor energy, carbon, and water fluxes across Australian ecosystems and contribute to the global FLUXNET network. The CosmOz stations are a network of cosmic ray probes used to measure average soil moisture over large areas (~ 30 ha footprint). Observations at the CosmOz stations were in some cases recorded at irregular time-steps and were rounded to the closest hour. The OzFlux observations are often made at multiple heights up an instrumentation mast. In all cases, we took those measurements closest to standard station height (i.e., 1.5 m). The difference between observation height and standard station height was recorded and is presented alongside the validation results. The spatial distribution of weather stations used herein is illustrated in Fig. 1.

Fig. 1

Spatial distribution of hourly near-surface (1.5 m) air temperature observations used for calibration (i.e., observations used for model fitting) and validation (i.e., observations used for independent model validation) of spatial interpolations. The number of stations has fluctuated over time and therefore these points include a mosaic of observation dates between 01/Jan/1990 and 14/Nov/2019

Surveyed site elevation data, sourced from station metadata, were used for all model cross-validation, validation site predictions and model fitting. The GEODATA 9-second Digital Elevation Model (DEM; Hutchinson et al. 2008), reprojected to Australian Albers (EPSG: 3577) equal area projection (with 1 km resolution) was used for gridded spatial interpolation. Distance to the generalised coast data was obtained from ANUClimate v2.0 (Hutchinson et al. 2021).

Two contemporary reanalysis products were used to evaluate our spatially interpolated surfaces: (i) the 12 km Bureau of Meteorology Atmospheric high-resolution Regional Reanalysis for Australia (BARRA-R; Su et al. 2021); and (ii) ERA5-Land (Muñoz-Sabater et al. 2021). Screen temperature and 2 m air temperature were obtained for BARRA-R and ERA5-Land, respectively, at 4 times of the day from 01/Jan/2015 to 31/Dec/2018. The analyses were restricted to this period to ensure consistent seasonal summaries.

Minimum and maximum daily air temperature observations were obtained from the SILO patched point dataset (https://www.longpaddock.qld.gov.au/silo/point-data/; Jeffrey et al. 2001). The SILO database provides quality controlled and gap-filled daily records of meteorological variables across Australia. Daily data were acquired for 387 stations located across Australia (see Figure S2) that could be matched by station identifier to the co-occurring hourly records.

3 Methods

3.1 Spatial interpolation of hourly air temperature

Two distinct approaches to spatial interpolation of hourly near-surface air temperature were evaluated: (i) climatologically aided spatial interpolation (CASI); and (ii) direct spatial interpolation (DSI). CASI (Willmott and Robeson 1995) involves the separate interpolation of a stable long-term base climatology and corresponding anomalies. For DSI, models are fitted using all available station observations and covariates at each time-step (i.e., allowing model responses to vary with specific weather conditions).

We implemented both the CASI and DSI methods for analyses at sub-diurnal time-steps using hourly records. We produced and statistically evaluated interpolated hourly air temperature grids following eight key steps for CASI, and two key steps for DSI. For CASI, these steps were to: (C1) gap fill observations to mitigate potential bias introduced from temporally incomplete records; (C2) calculate hourly air temperature climatologies for every 5 day-of-year (DOY) period; (C3) fit and evaluate spline models for climatologies; (C4) interpolate every 5th DOY hourly climatologies with quart-variate thin plate splines; (C5) calculate hourly anomalies (i.e., hourly deviations from climatology); (C6) fit and evaluate spline models for anomalies; (C7) interpolate anomalies with bi-variate thin plate splines; and (C8) add the interpolated climatologies to the interpolated anomalies to produce hourly surfaces. For DSI, these steps were to: (D1) fit and evaluate quart-variate spline models; and (D2) interpolate hourly air temperature. Each step is discussed in detail below.

Gaps in the hourly observations were filled (Step C1) using a regression patching procedure (based on Hopkinson et al. 2012; Stewart and Nitschke 2017) to mitigate potential bias in climatologies calculated with temporally incomplete records. Stations with at least 220 observations for any specific hour ± 5 DOYs (20 years by 11 days) were considered “long-term stations” and used as reference points for gap-filling. Linear regression was used to model missing observations from the closest 10 long-term stations where at least 44 observations (4 years by 11 days) co-occurred. Only models with statistically significant (F-test; α = 0.01) linear relationships were retained. Gaps were filled using the model achieving the highest F-statistic for each time-step. This procedure was applied iteratively across stations until all possible gaps were filled.
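The regression patching logic can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary-based data layout, and the critical value `f_crit` (an approximate 1% threshold for roughly 44 paired observations) are all assumptions introduced here.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares fit y = a + b * x; returns (a, b, F statistic)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        return my, 0.0, 0.0  # constant predictor: no usable relationship
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_reg = sum((a + b * xi - my) ** 2 for xi in x)
    f_stat = float("inf") if ss_res == 0 else ss_reg / (ss_res / (n - 2))
    return a, b, f_stat


def fill_gap(target, references, t, min_pairs=44, f_crit=7.3):
    """Estimate target[t] from the reference station with the highest F statistic.

    target and each item of references map time-step -> temperature (gaps absent).
    f_crit is an assumed critical value standing in for the significance test.
    Returns None when no reference station qualifies.
    """
    best = None
    for ref in references:
        if t not in ref:
            continue  # this reference has no observation at the gap
        shared = [k for k in target if k in ref]
        if len(shared) < min_pairs:
            continue  # too few co-occurring observations
        a, b, f_stat = fit_line([ref[k] for k in shared],
                                [target[k] for k in shared])
        if f_stat >= f_crit and (best is None or f_stat > best[0]):
            best = (f_stat, a + b * ref[t])
    return None if best is None else best[1]
```

In practice `references` would hold the 10 closest long-term stations, and the routine would be repeated over stations until no further gaps can be filled.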

Climatologies were calculated for each station to quantify long-term average air temperatures (Step C2). Climatologies were developed for each hour, centred on every 5th DOY (n = 73, i.e., every 5th day for 365 days) between 01/Jan/1990 and 31/Dec/2019 (i.e., 30 years) to represent shifting solar geometry and seasonal changes in temperature (n climatologies = 1,752, calculated as 73 by 24 h). For each hour we included all available observations within ± 5 DOYs (i.e., 11 days including the central day) across all available years. This maximised the number of data points available to build reliable climatologies (i.e., n = 330, 11 days by 30 years). This 11-day aggregation window was chosen to minimise the effect of changes in solar geometry on air temperature for specific times of the day. The time of sunrise and sunset (when air temperatures may vary rapidly) varies by less than 10 min anywhere in Australia for all 11-day periods, well within the hourly interpolation time-step.
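The windowing arithmetic can be illustrated with a short sketch. The choice of DOY 1 as the first centre, the wrap-around at the end of the year, and the `(year, doy, hour)` keying of observations are assumptions made for illustration; leap-day handling is ignored.

```python
from statistics import mean

N_DAYS = 365


def centre_doys(step=5, first=1):
    """Centres of the 5-day periods (starting at DOY 1 is an assumption)."""
    return list(range(first, N_DAYS + 1, step))


def window_doys(centre, half_width=5):
    """DOYs within +/- half_width of centre, wrapping around the year end."""
    return [(centre - 1 + k) % N_DAYS + 1
            for k in range(-half_width, half_width + 1)]


def climatology(obs, centre, hour):
    """Mean temperature for one hour across the 11-day window and all years.

    obs maps (year, doy, hour) -> air temperature in degrees Celsius.
    """
    window = set(window_doys(centre))
    values = [v for (_, doy, h), v in obs.items()
              if h == hour and doy in window]
    return mean(values) if values else None


n_climatologies = len(centre_doys()) * 24  # 73 periods by 24 hours
```

With 30 years of complete records, each climatology would summarise 11 × 30 = 330 values, matching the figure quoted above.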

Climatologies were calculated for each hour and 5th DOY period at every weather station, where at least 110 (10 years by 11 days) observed or gap-filled values were available (mean = 313.2, standard deviation = 20.4 values) and where no more than 90% of the values were estimated (mean = 30.5%, standard deviation = 20.1%). Gap-filled values were only used to calculate climatologies. Note: climatologies were only retained for stations that met the above criteria for all time periods (i.e., n = 1,752 values per station, calculated as 73 by 24 h) to ensure consistency across space and time for the subsequent spatial interpolations. A total of 505 stations met these criteria for building the climatologies (see Fig. 1). Records at an additional 116 locations that did not meet these criteria were retained to calibrate and cross-validate the hourly interpolations (Fig. 1).
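The inclusion criteria amount to two simple checks, sketched below. The function names and the `counts` layout are hypothetical; only the thresholds (110 values, 90% estimated, 1,752 periods) come from the text.

```python
def keep_climatology(n_observed, n_gap_filled,
                     min_values=110, max_filled_fraction=0.90):
    """True if one hour-by-period climatology has enough supporting values."""
    total = n_observed + n_gap_filled
    if total < min_values:
        return False
    return n_gap_filled / total <= max_filled_fraction


def keep_station(counts, n_required=73 * 24):
    """True only if every one of the 1,752 climatologies passes.

    counts maps (centre_doy, hour) -> (n_observed, n_gap_filled).
    """
    return (len(counts) == n_required
            and all(keep_climatology(o, g) for o, g in counts.values()))
```

Requiring all periods to pass, rather than retaining partial stations, keeps the set of stations identical across every climatology surface.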

Hourly air temperature climatologies for every 5th DOY were fitted (Step C3) and spatially interpolated (Step C4) with quart-variate thin plate smoothing splines (easting, northing, elevation, and a coastal distance index as independent spline variables) using ANUSPLIN v4.4 (Hutchinson and Xu 2013). The smoothing parameters for each surface were selected via Generalised Cross Validation. We included 95% of data points as knots for all climatologies (see Figure S3 for results identifying the optimal % of data points included as knots). Elevation was exaggerated by a factor of 100 relative to the coordinate system to represent the differences in horizontal and vertical synoptic scales as is typical for spline-based climate interpolation (see Hutchinson et al. 2009). The coastal distance index (CDI) was calculated as:

$$\mathrm{CDI} = e^{-\mathrm{D2C}/d} \tag{1}$$

where D2C is distance to the generalised coast (in units of km; Hutchinson et al. 2021) and d is the parameter controlling the rate at which the index decays (see Figure S4). Higher values of d cause the coastal distance index to decay more slowly with distance from the coast. The CDI was included to represent the effect of coastal weather (e.g., sea breeze) on air temperature (Abbs and Physick 1992; Daly et al. 2002; Hutchinson et al. 2021; Miller et al. 2003).
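As a small worked example of Eq. 1, the sketch below evaluates the index for a few values of d. The function name and the sample distances are illustrative only.

```python
import math


def coastal_distance_index(d2c_km, d):
    """Eq. 1: exponential decay of coastal influence with distance (km)."""
    return math.exp(-d2c_km / d)


# Index at 10 km inland for three illustrative decay parameters.
examples = {d: coastal_distance_index(10.0, d) for d in (3, 5, 50)}
```

The index equals 1 at the coast and approaches 0 inland; for a fixed distance it is larger when d is larger, which is the slower decay described above.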

Climatologies for each time-of-day and 5th DOY were iteratively re-fitted with 20 potential values of d between 3 and 50 (see Figure S4 for specific values) to determine the optimal decay rate for the coastal distance index (n climatologies = 35,040, calculated as 73 5th DOYs by 24 h by 20 d values). The optimal d was evaluated using two methods. The first approach was to simply pool all cross-validation results by each unique value of d and empirically determine the best performing value for the full set of climatologies. The second approach was to pool cross-validation results by each time-of-day, 5th DOY, and unique value of d. The optimal d value that minimised root mean squared error (RMSE) for each combination of time-of-day and 5th DOY was then selected to enable a more specific time-varying response to coastal weather conditions to be fitted. Filtering processes were then implemented to ensure the resulting interpolations did not contain step-changes that could negatively affect model predictions. Time-varying d climatologies were first filtered so that only those values that reduced RMSE by at least 3% relative to the optimal fixed value (determined by the first, simpler approach) were considered for a specific time-of-day and 5th DOY. These filtered d values were then smoothed using a focal mean across all times-of-day ± 1 h and all 5th DOYs ± 5 DOYs. The filtered and smoothed values of d were used to generate a CDI for each specific time-of-day and DOY combination.
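The two selection approaches and the subsequent filtering can be sketched as below. This is an interpretation under stated assumptions rather than the authors' code: cells that do not reach the 3% threshold are assumed to fall back to the fixed value, the focal mean is assumed to wrap cyclically in both hour and period, and pooling by a plain mean of per-cell squared error assumes equal sample sizes per cell.

```python
def select_d(mse, hours=24, periods=73, min_gain=0.03):
    """Choose fixed and time-varying decay parameters from cross-validation.

    mse maps (hour, period, d) -> cross-validated mean squared error.
    Returns (fixed_d, smoothed), where smoothed maps (hour, period) -> d.
    """
    d_values = sorted({d for (_, _, d) in mse})

    # Approach 1: a single d minimising RMSE pooled over all climatologies.
    def pooled_rmse(d):
        cells = [mse[(h, p, d)] for h in range(hours) for p in range(periods)]
        return (sum(cells) / len(cells)) ** 0.5

    fixed_d = min(d_values, key=pooled_rmse)

    # Approach 2: per-cell optimum, kept only if RMSE improves by >= min_gain.
    chosen = {}
    for h in range(hours):
        for p in range(periods):
            best = min(d_values, key=lambda d: mse[(h, p, d)])
            rmse_best = mse[(h, p, best)] ** 0.5
            rmse_fixed = mse[(h, p, fixed_d)] ** 0.5
            gain = (rmse_fixed - rmse_best) / rmse_fixed if rmse_fixed > 0 else 0.0
            chosen[(h, p)] = best if gain >= min_gain else fixed_d

    # Focal mean over +/- 1 h and +/- 1 period (i.e., 5 DOYs), wrapping.
    smoothed = {}
    for h in range(hours):
        for p in range(periods):
            block = [chosen[((h + dh) % hours, (p + dp) % periods)]
                     for dh in (-1, 0, 1) for dp in (-1, 0, 1)]
            smoothed[(h, p)] = sum(block) / len(block)
    return fixed_d, smoothed
```

The smoothing step is what prevents an isolated optimum in one hour and period from producing a step-change between adjacent surfaces.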

Hourly air temperature anomalies were then calculated (Step C5) as:

$$A_{h} = {Ta}_{h} - {Ta}_{c} \tag{2}$$

where h is the hour of the observation, c is the climatology for the same hour-of-day and closest 5th DOY and A is the anomaly (i.e., difference between the observation and corresponding climatology). Point interpolations for each of the 1,752 climatologies were performed at each of the 116 stations (n = 203,232) not meeting the inclusion criteria for climatologies to enable calculation of anomalies. Anomaly models were fitted (Step C6) and spatially interpolated (Step C7) using bi-variate thin plate splines as a function of easting and northing and 80% of data points as knots (see Figure S3) to mitigate the potential for model instability and exact interpolation. The final hourly air temperature surfaces were calculated by adding the spatially interpolated climatology (for the same hour, and closest 5th DOY period) and anomaly (Step C8).
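Equation 2 and the recombination in Step C8 are inverse operations, as the following sketch shows. The period centres (DOY 1, 6, ..., 361) and the circular distance used to find the closest centre are assumptions for illustration.

```python
def closest_5th_doy(doy, step=5, first=1, n_days=365):
    """Nearest period centre, measuring distance around the year end."""
    def circular_distance(centre):
        diff = abs(doy - centre)
        return min(diff, n_days - diff)
    return min(range(first, n_days + 1, step), key=circular_distance)


def anomaly(ta_h, ta_c):
    """Eq. 2: observation minus the matching climatology."""
    return ta_h - ta_c


def casi_temperature(interpolated_climatology, interpolated_anomaly):
    """Step C8: add the two interpolated components back together."""
    return interpolated_climatology + interpolated_anomaly
```

At a station, adding the anomaly back to its climatology recovers the observation exactly; between stations, the two components are interpolated separately before being summed.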

Hourly air temperature was ‘directly’ (DSI) fitted (Step D1) and spatially interpolated (Step D2) using quart-variate thin plate splines (easting, northing, elevation, and a coastal distance index as independent spline variables) with hourly observations as input (i.e., without any anomaly calculation). As with CASI, 80% of the hourly observations were used as knots (see Figure S3). The optimal d values for DSI were assessed using a similar process as with the climatologies (see Step C3/C4 above) but using hourly cross-validation statistics between 01/Jan/2015 and 31/Dec/2018 in place of the climatologies. Hourly cross-validated predictions were aggregated to the closest 5th DOY when assessing d, both for consistency with the climatologies and to increase the number of available samples for calculating reliable cross-validation statistics. There was considerable variability but fewer large outliers in d when evaluated for DSI and therefore values were not filtered for % change in RMSE, but the focal mean step was applied (i.e., across all times-of-day ± 1 h and all 5th DOYs ± 5 DOYs). DSI was additionally modelled with d values obtained from the analysis of climatologies to determine the effectiveness of using more temporally stable coastal proximity indices.

Statistical performance was evaluated for each of the climatologies (representative of 1990–2019) and all hours available between 01/Jan/2000 and 14/Nov/2019 (174,064 h over 7,258 days). The number of hourly observations across Australia has steadily increased over time (Figure S1), therefore this period was chosen to provide conservative error estimates that reflect our ability to interpolate historical data. Hourly CASI and DSI were compared against one another; however, only the best performing model was retained for the remainder of the analyses. Hourly air temperature surfaces (1 km resolution) and point predictions at validation sites were interpolated for each hour between 01/Jan/2015 and 14/Nov/2019 (1,779 days = 42,696 h). This shorter period was chosen for validation against independent OzFlux and CosmOz station observations, and comparison with alternative methods (i.e., empirical models, reanalysis data) to better represent the higher station density that is currently available (and likely available on an ongoing basis).
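The day and hour counts for these periods follow from inclusive date arithmetic, as in this short check (the variable names are illustrative).

```python
from datetime import date


def inclusive_days(start, end):
    """Number of calendar days from start to end, counting both ends."""
    return (end - start).days + 1


evaluation_days = inclusive_days(date(2000, 1, 1), date(2019, 11, 14))
validation_days = inclusive_days(date(2015, 1, 1), date(2019, 11, 14))
validation_hours = validation_days * 24
```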

Model performance was evaluated on leave-one-out cross-validated predictions generated by ANUSPLIN during the model fitting procedure (Fig. 2, orange symbols). Hourly CASI was evaluated using both the cross-validated climatology and cross-validated anomaly together to ensure a conservative estimate of error. Independent observations at the OzFlux and CosmOz stations were evaluated against point interpolations for each hour (i.e., using the fitted model from the best performing of the CASI and DSI workflows). The coefficient of determination (R2), root mean squared error/deviation (RMSE/RMSD) and mean error (bias) were used to quantify agreement between observed and cross-validated values (Willmott 1982). The term RMSE is used where statistical comparisons are made between observations (i.e., ground truth) and modelled estimates, and the term RMSD is used for statistical comparisons between two modelled estimates. Changes in the absolute value of bias are reported when directly comparing different analyses (e.g., CASI versus DSI). The same metrics were used for all subsequent comparisons. Confidence intervals were given to 1 standard deviation and all results were presented in UTC + 9 unless otherwise specified. This offset was selected to optimally align with the time zones used across Australia, which vary from UTC + 8 in the west to UTC + 11 in the east during daylight saving time (in the austral summer).
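The three agreement metrics can be written compactly as below. Two conventions are assumed here because the text does not state them: bias is taken as predicted minus observed, and R2 is computed as the squared Pearson correlation.

```python
import math
from statistics import mean


def bias(observed, predicted):
    """Mean error (assumed sign convention: predicted minus observed)."""
    return mean(p - o for o, p in zip(observed, predicted))


def rmse(observed, predicted):
    """Root mean squared error (RMSD when both inputs are model estimates)."""
    return math.sqrt(mean((p - o) ** 2 for o, p in zip(observed, predicted)))


def r_squared(observed, predicted):
    """Squared Pearson correlation (assumed definition of R2)."""
    mo, mp = mean(observed), mean(predicted)
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)
```

Note that a constant offset leaves this form of R2 at 1 while RMSE and bias both reflect it, which is one reason the three metrics are reported together.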

Fig. 2

Workflow for climatologically aided spatial interpolation (CASI, workflow C) and direct spatial interpolation (DSI, workflow D) of hourly air temperature across Australia. Numbers indicate the order of data processing steps. Green symbols are associated with air temperature climatologies, blue with CASI, and orange with DSI. Rectangles correspond to processes and parallelograms to data. ANUSPLIN programs used in steps C3, C4, C6, C7, D1 and D2 are capitalised and underlined. Note that step C5 requires point-based interpolated climatologies (via LAPPNT) for stations not meeting minimum criteria for calculating climatologies as part of step C2. The coastal distance index varies by time-of-day and day-of-year and is calibrated as part of steps C3 and D1

3.2 Comparing hourly spatial interpolation with temporal interpolation of daily air temperature

Empirical estimates of hourly air temperature were modelled using cross-validation predictions of daily minimum and maximum air temperature following Parton and Logan (1981), which has been previously used in Australia (Holzworth et al. 2014; McVicar and Jupp 1999). The model (herein denoted PL81) uses a truncated sine function to model daytime temperature and an exponential decay function to model night-time air temperature. We parameterised the model by empirical analysis of the time lag between solar noon and maximum air temperature (Figure S5), and sunrise and minimum air temperature (Figure S6) for each calibration and validation station. These parameters were determined seasonally, and typically varied by an hour or less at any specific site (Figure S7). Local solar time (LST) was calculated based on hourly time-steps in UTC to ensure air temperatures produced by PL81 were temporally aligned with the available observations.
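The general shape of a sine-plus-exponential diurnal model of this kind is sketched below. This is a simplified illustration in the spirit of PL81, not the calibrated model used in this study: the default values of `a`, `b` and `c` are placeholders, and hours before sunrise reuse the same day's sunset temperature as a stand-in for the previous evening.

```python
import math


def pl81_temperature(hour, tmin, tmax, sunrise, sunset,
                     tmin_next=None, a=1.5, b=2.2, c=0.0):
    """Air temperature (deg C) at a local solar hour in [0, 24).

    a: lag (h) of maximum temperature after solar noon
    b: night-time decay coefficient
    c: lag (h) of minimum temperature after sunrise
    The defaults are placeholders, not calibrated values.
    """
    day_length = sunset - sunrise
    night_length = 24.0 - day_length
    t_start = sunrise + c  # time of the daily minimum

    def sine_curve(h):
        m = h - t_start  # hours since the minimum
        return tmin + (tmax - tmin) * math.sin(
            math.pi * m / (day_length + 2.0 * a))

    if t_start <= hour <= sunset:
        return sine_curve(hour)  # daytime: truncated sine

    t_sunset = sine_curve(sunset)
    if hour > sunset:
        n = hour - sunset  # hours after sunset
        floor = tmin if tmin_next is None else tmin_next
    else:
        n = hour + 24.0 - sunset  # before sunrise, counted from last sunset
        floor = tmin
    # Night-time: exponential decay from the sunset temperature.
    return floor + (t_sunset - floor) * math.exp(-b * n / night_length)
```

Passing `tmin_next` lets the evening curve decay towards the following day's minimum, which mirrors the use of both the current and next day's minimum described in this section.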

The PL81 analyses used cross-validated and point predictions of daily minimum and maximum air temperature, as the daily temperature extrema are not necessarily well captured by data at (relatively) infrequent time-steps and therefore additional biases may be introduced by using hourly records. Daily minimum and maximum air temperatures were also interpolated using quart-variate thin plate splines (full spline dependence on easting, northing, elevation, and coastal distance index) using 80% of the observations as knots. The d value used for transforming the coastal distance index was independently assessed for daily air temperature (as described for the hourly climatologies and DSI), though with the time-varying analysis aggregated by the closest 5th DOY only. PL81 modelling used cross-validated predictions of daily air temperature from calibration stations (n = 391 stations) and point predictions for all remaining stations (n = 231 stations), including the validation stations (Figure S2).

Hourly air temperature was modelled with PL81 iteratively for each day in the comparison period (01/Jan/2015 to 14/Nov/2019), using cross-validated minimum air temperature of both the current and next day in separate runs to ensure a smooth transition in the diurnal air temperature profile between days. Hourly predictions modelled using PL81 were compared against the air temperature observations (at both calibration and validation sites) and performance statistics were compared against cross-validation predictions from hourly spatial interpolations for the same comparison period. Hourly observations were first converted to LST to align timestamps. Statistical comparisons of PL81 and observed air temperature were calculated for each hour, the daily mean, at the time of sunrise − 1 h, and at the time of solar noon + 1.5 h. The latter two times were selected to explore the potential biases present at the typical time of daily minimum and maximum air temperature, respectively. These periods were evaluated using the closest time available per day, as UTC time will vary with respect to a fixed solar time (e.g., changes in sunrise and sunset times as a function of DOY and latitude). Cross-validation predictions from the hourly spatial interpolations were compared by converting them to LST and evaluating the difference in statistical performance, relative to PL81, by hour-of-day and day-of-year.
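Aligning UTC time-steps with solar time, and then picking the closest available step to a target such as solar noon + 1.5 h, can be sketched as follows. The conversion uses local mean solar time (4 minutes per degree of longitude) and ignores the equation of time, which is an assumption of this sketch and may differ from the calculation used in the study.

```python
from datetime import datetime, timedelta


def mean_solar_time(utc_dt, longitude_deg):
    """Local mean solar time from a naive UTC datetime and east longitude."""
    return utc_dt + timedelta(minutes=4.0 * longitude_deg)


def closest_step(solar_hours, target_hour):
    """Index of the time-step whose solar hour is nearest the target."""
    def circular_distance(h):
        diff = abs(h - target_hour) % 24.0
        return min(diff, 24.0 - diff)
    return min(range(len(solar_hours)),
               key=lambda i: circular_distance(solar_hours[i]))
```

For example, `closest_step` can be given the solar hours of the 24 UTC time-steps in one day together with a target of 13.5 to find the step nearest solar noon + 1.5 h.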

3.3 Comparing hourly spatial interpolation with reanalysis products

The spatially interpolated air temperature surfaces were compared against two contemporary reanalysis products: (i) BARRA-R (regional; Su et al. 2021); and (ii) ERA5-Land (global; Muñoz-Sabater et al. 2021). Each of the corresponding spatially interpolated surfaces was reprojected to the native resolution of BARRA-R (0.11°) and ERA5-Land (0.10°) for subsequent analyses. Statistical comparisons of both pairs of products (i.e., [i] CASI/DSI and BARRA-R; and [ii] CASI/DSI and ERA5-Land) were calculated for each pixel through the available paired time points, and for the seasonal means across all pixels at four times-of-day (03:00, 09:00, 15:00, 21:00 UTC + 9). Hourly records from the calibration and validation (OzFlux and CosmOz) stations were then compared against both reanalysis outputs and the spatial interpolations to assess the accuracy of each dataset relative to the observations.
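Because the source data are stored in UTC, the four comparison times need to be mapped from UTC + 9 back to UTC before matching time-steps across products. A minimal sketch using a fixed offset (the function name is illustrative):

```python
from datetime import date, datetime, timedelta, timezone

UTC_PLUS_9 = timezone(timedelta(hours=9))


def comparison_times_utc(day, local_hours=(3, 9, 15, 21)):
    """UTC datetimes for the four comparison times on a UTC + 9 calendar day."""
    return [
        datetime(day.year, day.month, day.day, h, tzinfo=UTC_PLUS_9)
        .astimezone(timezone.utc)
        for h in local_hours
    ]


times = comparison_times_utc(date(2015, 1, 1))
```

Note that 03:00 UTC + 9 falls on the previous calendar day in UTC, which matters when pairing the first time-step of the comparison period.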

4 Results

4.1 Spatial interpolation of hourly air temperature

Hourly air temperature and climatologies were best interpolated using time-varying estimates of d for calculating the coastal distance index (Fig. 3e, f). Time-varying d values improved interpolation performance for hourly climatologies by up to 22.4% (Fig. 3c, i.e., DOY 306, 17:00) in comparison to fixing d at 5 (Fig. 3a), and were most effective in the afternoon and evening of the warmer DOYs (i.e., DOYs 250–50, 15:00–21:00, Δ RMSE = -10.4% ± 5.9%). Hourly DSI showed similar patterns in d when evaluated over 4 years; however, the values were more variable than those for the climatologies (Fig. 3d). Separate evaluation of DSI with the climatological d and hourly d revealed little difference in statistical performance (Fig. 3, Figure S8, Δ RMSE = -0.06% ± 0.17%), and therefore the climatological d was applied to the DSI for all subsequent analyses. As with the climatologies, the DSI improved most in the afternoon and evening of the warmer DOYs (i.e., DOYs 250–50, 15:00–21:00, Δ RMSE = -2.1% ± 1.5%). Further performance improvements were found for DSI throughout the year in the early morning (Fig. 3f, i.e., 07:00–08:00, Δ RMSE = -0.6% ± 0.3%), reflecting the lower optimal d values at these times (Fig. 3d).

Fig. 3

Cross-validation performance for climatologies (a, e) and direct spatial interpolation (DSI) of hourly air temperature (b, f) interpolated using quart-variate thin plate splines (full spline dependence on easting, northing, elevation, and coastal distance index) with fixed (a, b) and time-varying d values (c, d) that control the rate at which the coastal distance index decays. Higher values of d cause the coastal distance index to decay more slowly with distance from the coast. The d values estimated from DSI of hourly air temperature (d) are presented but were not used due to short range temporal variability. Time-varying changes (denoted by Δ) in root mean squared error (RMSE, e, f) were calculated by comparing stable d values obtained from the climatological analyses (c) with fixed values (a, b). The climatological d values (c) were first filtered to exclude results that did not improve performance by at least 3% in comparison to the pooled value (i.e., d = 5, see a). A focal mean filter was applied (± 5 DOYs and ± 1 h) to smooth both the climatology and hourly d values. Dashed red lines correspond to the best performing fixed d value when pooling all results. Climatology results (a, c, e) are underpinned by 17,695,200 cross-validated predictions (calculated as 73 instances of every 5th DOY by 24 h by 505 stations by 20 d values). Hourly results (b, d, f) are underpinned by 364,041,740 cross-validated predictions (35,064 h over 1,461 days by 20 d values for a variable number of stations per time-step)

Hourly air temperature climatologies (1990–2019; see Fig. 4) were best interpolated at daytime during the cooler DOYs (i.e., DOYs 150–230, 08:00–16:00, mean = 15.43 °C, R2 = 0.99, RMSE = 0.60 °C); however, they also performed worst during the early hours of the morning at the same time of year (01:00–07:00, mean = 9.57 °C, R2 = 0.95, RMSE = 1.08 °C, Fig. 5). Cross-validation statistics show a bimodal pattern in diurnal performance during the warmer DOYs (i.e., DOY 340–055; Fig. 5). During these warmer DOYs, performance was best when temperatures increase in the hours after sunrise (i.e., 06:00–09:00, mean = 21.9 °C, R2 = 0.99, RMSE = 0.51 °C) and in the evening (i.e., 17:00–23:00, mean = 22.5 °C, R2 = 0.99, RMSE = 0.66 °C), with performance being weaker in the early hours of the morning (i.e., 00:00–05:00, mean = 19.0 °C, R2 = 0.98, RMSE = 0.65 °C) and during the afternoon (i.e., 12:00–16:00, mean = 26.9 °C, R2 = 0.98, RMSE = 0.76 °C). There was a small positive bias during the hours prior to sunrise on the cooler DOYs (i.e., DOY 100–300) and a marginal negative bias during the daytime hours throughout the year, but the overall magnitude of mean error was very small (< 0.03 °C; Fig. 5).

Fig. 4

Examples of hourly air temperature climatologies (every 5th DOY, 01/Jan/1990 to 14/Nov/2019) interpolated across Australia for four times-of-day, and four days-of-year. There are 1,752 climatologies (calculated as 73 by 24, see Section 3a) generated with the CASI workflow during Step C4 (see Fig. 2). These plots illustrate 16 examples of typical spatial, seasonal, and sub-diurnal variation

Fig. 5

Heatmaps of pooled cross-validation statistics for hourly air temperature climatologies (every 5th DOY, 01/Jan/1990 to 14/Nov/2019) interpolated across Australia (n = 505 stations). Statistics presented include the (a) coefficient of determination (R2), (b) root mean squared error (RMSE), (c) mean of observations, and (d) bias (i.e., mean error), and are derived by analysing the output of the CASI workflow from Step C4 (see Fig. 2). There are 884,760 observations underpinning each of these plots (calculated as 73 instances of every 5th DOY by 24 h by 505 stations)

Pooled statistics across all stations and hours show strong performance for air temperature climatologies interpolated for seasonal (R2 = 0.98 to 0.99, RMSE = 0.66 °C to 0.86 °C, Bias = -0.01 °C to -0.00 °C) and annual (R2 = 0.99, RMSE = 0.75 °C, Bias = -0.00 °C) aggregation periods (Table 2). Cross-validation performance remained strong when pooled annually for each station (Fig. 6). Statistical performance was best at a daily time-step (Fig. 6), where observations and predictions were aggregated to a daily mean value prior to calculating each metric (R2 = 0.99 ± 0.02, RMSE = 0.43 °C ± 0.27 °C, Bias = -0.00 °C ± 0.43 °C). Performance when hourly observations were diurnally maximum (R2 = 0.99 ± 0.05, RMSE = 0.50 °C ± 0.39 °C, Bias = 0.00 °C ± 0.52 °C) was typically more reliable than at the diurnal minimum (R2 = 0.99 ± 0.03, RMSE = 0.75 °C ± 0.52 °C, Bias = -0.02 °C ± 0.82 °C) across all stations (Fig. 6). Hourly statistics were within the range of performance at diurnal minima/maxima (R2 = 0.99 ± 0.05, RMSE = 0.66 °C ± 0.35 °C, Bias = -0.00 °C ± 0.43 °C; Fig. 6).
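For readers wishing to reproduce statistics of this kind, the three metrics can be computed from paired observations and cross-validated predictions as sketched below. The sketch assumes that R2 is the squared Pearson correlation and that bias is the mean of predicted minus observed values; neither convention is stated explicitly here, and the function name and sample values are hypothetical.

```python
import math

def cross_validation_stats(observed, predicted):
    """Return R2, RMSE and bias for paired observations and predictions."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    bias = sum(errors) / n                            # mean error (assumed sign)
    rmse = math.sqrt(sum(e * e for e in errors) / n)  # root mean squared error
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r2 = cov * cov / (var_o * var_p)                  # squared Pearson correlation
    return {"r2": r2, "rmse": rmse, "bias": bias}

# Hypothetical hourly values (°C) at one station
obs = [10.0, 12.0, 15.0, 19.0, 22.0]
pred = [10.5, 11.5, 15.5, 18.5, 22.5]
stats = cross_validation_stats(obs, pred)
```

Pooled statistics apply the function to all station-hours at once, whereas the per-station summaries (mean ± standard deviation) apply it to each station separately before summarising.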

Table 2 Pooled seasonal and annual cross-validation statistics for hourly air temperature interpolated across Australia
Fig. 6

Violin plots of pooled cross-validation statistics for hourly air temperature climatologies (every 5th DOY, 01/Jan/1990 to 14/Nov/2019) interpolated at individual weather stations (n = 505 stations) across Australia. Statistics presented include the (a) coefficient of determination (R2), (b) root mean squared error (RMSE), (c) mean of observations, and (d) bias (i.e., mean error), and are derived by analysing the output of the CASI workflow from Step C4 (see Fig. 2). Daily mean statistics are calculated using the mean observed and predicted hourly climatology for each day. Daily minimum (denoted min) / maximum (denoted max) statistics are calculated using the observed and predicted minimum/maximum hourly climatology for each available day. There are 884,760 observations in the hourly data (calculated as 73 instances of every 5th DOY by 24 h by 505 stations) with 36,865 observations for the daily data (calculated as 73 instances of every 5th DOY by 505 stations). The shape of the violin illustrates the mirrored kernel density estimate of values. Boxes represent the 25th, 50th and 75th percentiles. Whiskers extend to the largest value no further than 1.58 times the interquartile range. Only two of the 505 stations achieved R2 < 0.87 and RMSE > 1.94 °C for hourly climatologies

Seasonally, DSI of hourly air temperature achieved marginally higher R2 (up to 0.01) and lower RMSE (0.03 °C to 0.04 °C) than CASI (Table 2). Time-series of cross-validation statistics pooled by month show an increase in the performance of DSI over time (Fig. 7a, c, e; mean R2 = 0.94 / 0.96, RMSE = 1.67 °C / 1.52 °C for 2000–2003 / 2015–2018). DSI consistently performed better than CASI after January 2004 (Fig. 7b, d, f) when the number of hourly observations reached ~ 400 per hour (Figure S1), though overall the differences were small (mean Δ R2 < 0.01, Δ RMSE < 0.08 °C). Statistical performance of DSI tended to decrease with lower observation density (Fig. 8a, c, e; mean R2 = 0.94 / 0.88, RMSE = 1.42 °C / 1.91 °C, n = 497 / 103, where mean distance to the closest 10 stations is ≤ 200 km / > 200 km), and CASI trended towards slightly better performance on average where observation density was low (mean Δ R2 = 0.01, Δ RMSE = -0.09 °C, where mean distance to the closest 10 stations is > 200 km). DSI outperformed CASI at weather stations located at elevations exceeding 800 m (Fig. 8b, d, f; n = 26, mean Δ R2 = 0.04, Δ RMSE = -0.40 °C). The spatial distribution of error for each station using DSI, and comparison with CASI, are illustrated in Figure S9.
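The observation density measure used above (mean distance to the closest 10 stations) can be illustrated with a short sketch. It assumes great-circle (haversine) distances on a spherical Earth of radius 6371 km; the distance method actually used is not stated here, and the station identifiers and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed spherical Earth radius

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def mean_distance_to_nearest(stations, k=10):
    """Map station id -> mean distance (km) to its k closest other stations."""
    result = {}
    for sid, (lon, lat) in stations.items():
        distances = sorted(
            haversine_km(lon, lat, lon2, lat2)
            for other, (lon2, lat2) in stations.items() if other != sid
        )
        nearest = distances[:k]
        result[sid] = sum(nearest) / len(nearest) if nearest else math.nan
    return result

# Hypothetical stations as (longitude, latitude) in decimal degrees
stations = {"A": (149.0, -35.0), "B": (150.0, -35.0), "C": (149.0, -34.0)}
density = mean_distance_to_nearest(stations, k=2)
sparse = {sid for sid, dist in density.items() if dist > 200.0}
```

Stations whose mean distance exceeds 200 km would fall into the low-density group referred to above.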

Fig. 7

Cross-validation statistics for (a, c, e) direct spatial interpolation (DSI) of hourly air temperature from 01/Jan/2000 to 14/Nov/2019, pooled by month, and (b, d, f) compared with climatologically aided spatial interpolation (CASI). Statistics presented include the coefficient of determination (R2) for (a) DSI (b) compared to CASI, root mean squared error (RMSE) for (c) DSI (d) compared to CASI, bias (i.e., mean error) for (e) DSI, and (f) difference in absolute bias for DSI compared to CASI. Differences between DSI and CASI (denoted by Δ) are calculated by subtracting CASI statistics from DSI statistics (i.e., DSI minus CASI). The red dashed horizontal line in (b), (d) and (f) indicates where there is no difference in performance between DSI and CASI. There are 77,106,396 observations (174,064 h for a variable number of stations per time-step) underpinning each of these plots

Fig. 8

Relationship between cross-validation statistics and observation density for (a, c, e) direct spatial interpolation (DSI) of hourly air temperature and (b, d, f) change relative to climatologically aided spatial interpolation (CASI) at individual weather stations (n = 621) across Australia, 01/Jan/2000 to 14/Nov/2019. Statistics presented include the coefficient of determination (R2) for (a) DSI (b) compared to CASI, root mean squared error (RMSE) for (c) DSI (d) compared to CASI, bias (i.e., mean error) for (e) DSI, and (f) difference in absolute bias for DSI compared to CASI. Differences between DSI and CASI (denoted by Δ) are calculated by subtracting CASI statistics from DSI statistics (i.e., DSI minus CASI). Points are coloured by station elevation. There are 77,106,396 observations (174,064 h for a variable number of stations per time-step) underpinning each of these plots. The lines of best fit (blue lines) and 95% confidence intervals (grey shading) are fitted with a loess smoother (span = 0.75)

DSI of hourly air temperature achieved lower R2 and higher RMSE than the climatologies (Δ R2 = -0.03 and Δ RMSE = 0.81 °C on an annual basis, respectively; Table 2); however, there were similar trends in temporal patterns (Fig. 9) and distribution of per-station performance statistics (Fig. 10). The overall magnitude of hourly bias was negligible when pooled by time-of-day and day-of-year (< |0.03| °C; Fig. 9d). As with the climatologies, statistical performance per station was best when aggregating air temperature to daily values (R2 = 0.96 ± 0.06, RMSE = 0.91 °C ± 0.35 °C, Bias = -0.01 °C ± 0.47 °C). Performance when hourly observations were diurnally maximum (R2 = 0.92 ± 0.10, RMSE = 1.25 °C ± 0.52 °C, Bias = -0.18 °C ± 0.56 °C) was also better than the diurnal minimum (R2 = 0.90 ± 0.10, RMSE = 1.72 °C ± 0.70 °C, Bias = 0.21 °C ± 0.96 °C). Statistical performance was strong for individual stations at an hourly time-step (R2 = 0.93 ± 0.08, RMSE = 1.51 °C ± 0.42 °C, Bias = -0.01 °C ± 0.47 °C).

Fig. 9

Heatmaps of pooled cross-validation statistics for direct spatial interpolation (DSI) of hourly air temperature (01/Jan/2000 to 14/Nov/2019) across Australia (n = 621 stations). Statistics presented include the (a) coefficient of determination (R2), (b) root mean squared error (RMSE), (c) mean of observations, and (d) bias (i.e., mean error), and are derived by analysing the output of the DSI workflow from Step D1 (see Fig. 2). There are 77,106,396 observations (174,064 h for a variable number of stations per time-step) underpinning each of these plots. For this period, as the number of stations in the observation network changes over time (as mentioned in Section 2), the mean number of stations contributing observations each hour is 443.0 with a standard deviation of 67.7

Fig. 10

Violin plots of pooled cross-validation statistics for direct spatial interpolation (DSI) of hourly air temperature (01/Jan/2000 to 14/Nov/2019) at individual weather stations (n = 621) across Australia. Statistics presented include the (a) coefficient of determination (R2), (b) root mean squared error (RMSE), (c) mean of observations, and (d) bias (i.e., mean error), and are derived by analysing the output of the DSI workflow from Step D1 (see Fig. 2). Daily mean statistics are calculated using the mean observed and predicted hourly air temperature for each day. Daily minimum (denoted min) / maximum (denoted max) statistics are calculated using the observed and predicted minimum / maximum hourly air temperature for each available day. There are 77,106,396 observations (174,064 h / 7,258 days for a variable number of stations per time-step) underpinning each of these plots. The shape of the violin illustrates the mirrored kernel density estimate of values. Boxes represent the 25th, 50th and 75th percentiles. Whiskers extend to the largest value no further than 1.58 times the interquartile range

Independent validation showed that hourly DSI of air temperature was reliable at most locations when compared against independent observations from the 24 OzFlux and 4 CosmOz stations (R2 = 0.92 ± 0.07, RMSE = 1.78 °C ± 0.62 °C, Bias = -0.03 °C ± 0.93 °C; Table 3; see Fig. 1 for station locations). Validation measurements taken above 13.8 m in height (n = 8) achieved lower R2 (0.88 ± 0.09) and higher RMSE (1.88 °C ± 0.33 °C) on average than those at lower heights (R2 = 0.94 ± 0.06, RMSE = 1.74 °C ± 0.71 °C; n = 20). Cape Tribulation in far north Queensland (145.38 °E, 16.11 °S) gave the worst validation performance (R2 = 0.68, RMSE = 2.41 °C, Bias = 0.83 °C); however, there were large differences in measurement height (43.5 m). Seasonal summaries of validation performance for each station (Table S1) show the weakest performance from June to August (in the austral winter), consistent with the cross-validation statistics (Table 2).

Table 3 Validation error statistics for direct spatial interpolation (DSI) of hourly air temperature compared with field-based observations at 28 stations between 01/Jan/2015 and 14/Nov/2019. Note the “Δ height a” column refers to the difference in instrumentation height relative to the calibration stations used in this study (1.5 m). The terms “height” and “elevation” are used as per McVicar and Körner (2013)

Validation statistics across each station (Fig. 11) were best when first aggregating air temperature to daily values (R2 = 0.95 ± 0.08, RMSE = 0.99 °C ± 0.67 °C, Bias = -0.03 °C ± 0.93 °C). Spatial interpolations frequently demonstrated negative bias with poorer performance at the time of minimum temperature (R2 = 0.90 ± 0.09, RMSE = 1.61 °C ± 0.51 °C, Bias = -0.31 °C ± 0.97 °C), and positive bias with better performance at the hour of maximum temperature (R2 = 0.94 ± 0.08, RMSE = 1.50 °C ± 1.13 °C, Bias = 0.54 °C ± 1.52 °C). As with the cross-validation results, hourly interpolations were associated with low bias and intermediate performance (R2 = 0.92 ± 0.07, RMSE = 1.78 °C ± 0.62 °C, Bias = -0.03 °C ± 0.93 °C) in comparison to hourly observations at the diurnal maximum and minimum.

Fig. 11

Violin plots of pooled validation statistics for direct spatial interpolation (DSI) of hourly air temperature (01/Jan/2015 to 14/Nov/2019; a total of 1,779 days) at independent OzFlux (n = 24) and CosmOz (n = 4) stations (see Fig. 1 for locations). Statistics presented include the (a) coefficient of determination (R2), (b) root mean squared error (RMSE), (c) mean of observations, and (d) bias (i.e., mean error), and are derived by evaluating point interpolation from Step D2 (see Fig. 2) of the DSI workflow against validation observations. Daily mean statistics are calculated using the mean observed and predicted hourly air temperature for each day. Daily minimum (denoted min) / maximum (denoted max) statistics are calculated using the observed and predicted minimum / maximum hourly air temperature for each day. Each of these plots is underpinned by 763,347 observations at an hourly time-step (see Table 3), and 32,211 aggregated observations at a daily time-step. The shape of the violin illustrates the mirrored kernel density estimate of values. Boxes represent the 25th, 50th and 75th percentiles. Whiskers extend to the largest value no further than 1.58 times the interquartile range
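The daily aggregation described in this caption can be sketched as below. The sketch assumes that the daily minimum and maximum are taken independently from the observed and predicted hourly series (rather than sampling the prediction at the hour of the observed extremum), which is one reading of the caption. The function name and sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def daily_summaries(records):
    """Aggregate (timestamp, observed, predicted) hourly records by day.

    Returns {date: {"mean": (obs, pred), "min": (obs, pred), "max": (obs, pred)}}.
    """
    by_day = defaultdict(list)
    for timestamp, observed, predicted in records:
        by_day[timestamp.date()].append((observed, predicted))
    summaries = {}
    for day, pairs in by_day.items():
        obs = [o for o, _ in pairs]
        pred = [p for _, p in pairs]
        summaries[day] = {
            "mean": (sum(obs) / len(obs), sum(pred) / len(pred)),
            "min": (min(obs), min(pred)),   # extrema taken independently (assumed)
            "max": (max(obs), max(pred)),
        }
    return summaries

# Hypothetical hourly records (°C) for one station on one day
records = [
    (datetime(2015, 1, 1, 3), 12.0, 13.0),
    (datetime(2015, 1, 1, 6), 10.0, 11.0),
    (datetime(2015, 1, 1, 12), 18.0, 17.0),
    (datetime(2015, 1, 1, 15), 24.0, 23.0),
]
summary = daily_summaries(records)[datetime(2015, 1, 1).date()]
```

Each daily pair can then be passed to the same error metrics as the hourly data, which is why the daily sample sizes are roughly 1/24 of the hourly ones.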

4.2 Comparing hourly spatial interpolation with temporal interpolation of daily air temperature

Spatial interpolation of daily minimum and maximum air temperature, a key input to the PL81 model, showed reliable cross-validation performance on an annual basis (Table S2; R2 = 0.94 / 0.98, RMSE = 1.81 °C / 1.29 °C, Bias = -0.00 °C / 0.00 °C). Optimised selection of the d parameter (used to transform the coastal distance index) showed little improvement relative to a fixed value (Figure S10c, d; Δ RMSE < 0.5%), and high variability across DOY (Figure S10, f). The d parameter was therefore fixed at 11 and 23 for minimum and maximum temperature, respectively (Figure S10a, b).

Empirical time-of-day interpolation using PL81 typically performed best near solar noon (i.e., 12:00–14:00 LST, mean = 23.21 °C, R2 = 0.94, RMSE = 2.19 °C, Bias = 1.37 °C) and in the early hours of the morning (i.e., 00:00–05:00 LST, mean = 15.45 °C, R2 = 0.89, RMSE = 2.54 °C, Bias = -1.42 °C) across all days-of-year (Fig. 12). PL81 showed strong positive biases during the day (i.e., 07:00–14:00 LST, mean = 20.73 °C, R2 = 0.92, RMSE = 2.62 °C, Bias = 1.82 °C) and strong negative biases in the evening and early morning (i.e., 18:00–04:00 LST, mean = 17.06 °C, R2 = 0.89, RMSE = 2.77 °C, Bias = -1.74 °C). Cross-validated DSI consistently performed better than PL81 (i.e., when subtracting statistics for PL81 from those for DSI), particularly during post-sunrise warming (07:00–11:00 LST, mean Δ R2 = 0.03, Δ RMSE = -1.42 °C) and afternoon cooling (16:00–19:00 LST, mean Δ R2 = 0.07, Δ RMSE = -1.36 °C) cycles (Fig. 12).
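To make the structure of this class of model concrete, the following is a minimal sketch of a sine-exponential time-of-day interpolation in the spirit of Parton and Logan (1981): a truncated sine curve between the time of minimum temperature and sunset, and an exponential decay overnight. The default coefficients (a, b, c) are assumed values of the kind reported for air temperature at 1.5 m and should be checked against the original paper. The use of the same day's minimum for the night-time branch and the function name are simplifications for illustration; this is not the calibration or implementation used in this study.

```python
import math

def pl81_temperature(hour, t_min, t_max, sunrise, sunset,
                     a=1.86, b=2.20, c=-0.17):
    """Air temperature (°C) at a local solar hour from daily extrema.

    a: lag of the maximum after solar noon (h), b: night-time decay
    coefficient, c: lag of the minimum relative to sunrise (h).
    All three defaults are assumptions to be verified.
    """
    day_length = sunset - sunrise
    night_length = 24.0 - day_length
    t_min_time = sunrise + c                 # time of minimum temperature

    def daytime(h):
        m = h - t_min_time                   # hours since the minimum
        return (t_max - t_min) * math.sin(math.pi * m / (day_length + 2 * a)) + t_min

    if t_min_time <= hour <= sunset:
        return daytime(hour)
    t_sunset = daytime(sunset)               # temperature at sunset
    # Hours elapsed since sunset, wrapping past midnight
    n = hour - sunset if hour > sunset else hour + 24.0 - sunset
    return t_min + (t_sunset - t_min) * math.exp(-b * n / night_length)

# Hypothetical day: extrema of 10 °C and 25 °C, sunrise 06:00, sunset 18:00
hourly = [pl81_temperature(h, 10.0, 25.0, 6.0, 18.0) for h in range(24)]
```

Because the timing of the extrema is fixed relative to sunrise and solar noon, any day on which the observed extrema occur at other times (e.g., during frontal passages) produces systematic errors of the kind reported above.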

Fig. 12

Heatmaps of pooled error statistics for hourly air temperature (01/Jan/2015–14/Nov/2019), modelled by applying cross-validation and point predictions of minimum and maximum temperatures to the time-of-day interpolation method (PL81) developed by Parton and Logan (1981) at select weather stations across Australia (n = 566; see Figure S2 for locations). Statistics presented include the coefficient of determination (R2) for (a) PL81 (b) compared to cross-validated direct spatial interpolation (DSI), root mean squared error (RMSE) for (c) PL81 (d) compared to cross-validated DSI, (e) mean of observations, and (f) bias (i.e., mean error) for PL81. Differences between DSI and PL81 (b, d, denoted by Δ) are calculated by subtracting PL81 statistics from DSI statistics (i.e., DSI minus PL81, blue shading indicates DSI performs better). There are 22,335,650 observations (42,798 h for a variable number of stations per time-step) underpinning each of these plots

PL81 performance across each of the calibration stations (n = 566; Fig. 13) was limited in the hours following solar noon (i.e., solar noon + 1.5 h, R2 = 0.89 ± 0.13, RMSE = 2.25 °C ± 0.54 °C, Bias = 1.61 °C ± 0.76 °C) and prior to sunrise (i.e., sunrise − 1 h, R2 = 0.85 ± 0.11, RMSE = 2.14 °C ± 0.58 °C, Bias = -0.64 °C ± 1.01 °C), despite the temporal proximity to the timing of minimum and maximum air temperature (Figure S5, Figure S6). Hourly PL81 predictions (R2 = 0.83 ± 0.13, RMSE = 2.68 °C ± 0.69 °C, Bias = -0.21 °C ± 0.95 °C) performed less effectively than the corresponding DSI predictions at the calibration stations (DSI minus PL81: Δ R2 = 0.10 ± 0.07, Δ RMSE = -1.16 °C ± 0.54 °C, Δ Abs. Bias = -0.16 °C ± 0.47 °C). With the exception of bias, statistical performance at each of the validation stations was consistently better with DSI (mean Δ R2 = 0.07 ± 0.06, Δ RMSE = -1.02 °C ± 0.43 °C, Δ Abs. Bias = -0.04 °C ± 0.27 °C) than with PL81 (Table 4; mean R2 = 0.85 ± 0.10, RMSE = 2.80 °C ± 0.65 °C, Bias = -0.00 °C ± 0.98 °C).

Fig. 13

Violin plots of error statistics for hourly air temperature (01/Jan/2015 to 14/Nov/2019) estimated by applying the cross-validated minimum and maximum temperatures to the time-of-day interpolation method developed by Parton and Logan (1981) at weather stations across Australia (n = 566, see Figure S2). Statistics presented include the (a) coefficient of determination (R2), (b) root mean squared error (RMSE), (c) mean of observations, and (d) bias (i.e., mean error), and are derived by evaluating PL81 predictions against validation observations. There are 22,335,650 hourly observations, and 930,264 / 930,710 observations at sunrise − 1 h / solar noon + 1.5 h (respectively) underpinning each of these plots (with a variable number of stations per time-step). The shape of the violin illustrates the mirrored kernel density estimate of values. Boxes represent the 25th, 50th and 75th percentiles. Whiskers extend to the largest value no further than 1.58 times the interquartile range

Table 4 Validation error statistics for temporal interpolation of hourly air temperature using PL81 when compared with independent field-based observations at 28 stations between 01/Jan/2015 and 14/Nov/2019, and the difference in performance when compared against direct spatial interpolation (DSI). Differences between DSI and PL81 (denoted by Δ) are calculated by subtracting PL81 statistics from DSI statistics (i.e., DSI minus PL81, where positive values indicate DSI performs better for Δ R2, and negative values indicate DSI performs better for Δ RMSE and Δ |Bias|)

4.3 Comparing hourly spatial interpolation with reanalysis products

Comparisons of mean air temperature, aggregated to seasonal and annual summaries at each of four times-of-day (i.e., 03:00, 09:00, 15:00 and 21:00 UTC + 9), showed strong agreement between the coarsened DSI surfaces and both BARRA-R (Table 5; R2 ≥ 0.93, RMSD ≤ 1.41 °C, Bias ≤ |0.92| °C) and ERA5-Land (Table 6; R2 ≥ 0.94, RMSD ≤ 1.62 °C, Bias ≤ |1.29| °C). Deviations were typically lowest during the daylight hours (09:00–15:00 UTC + 9) for both reanalyses. DSI showed positive biases relative to BARRA-R that were most pronounced during the austral warmer months (September to February; Table 5) after sunset (21:00 UTC + 9, Bias = 0.55 °C to 0.92 °C) and in the early hours of the morning (03:00 UTC + 9, Bias = 0.55 °C to 0.78 °C). When compared against ERA5-Land, DSI showed negative biases that were most pronounced during the austral cooler months (March to August; Table 6), also after sunset (21:00 UTC + 9, Bias = -1.29 °C to -1.26 °C) and in the early hours of the morning (03:00 UTC + 9, Bias = -1.24 °C to -1.15 °C).

Table 5 Statistical comparison of mean air temperature across all available grids (01/Jan/2015 to 31/Dec/2018) for coarsened DSI surfaces (0.11°) assessed against BARRA-R during each season at four times of the day (n = 57,818 pixels). Bias is calculated by subtracting BARRA-R from DSI (i.e., DSI minus BARRA-R, values > 0 °C indicate higher air temperatures for DSI)
Table 6 Statistical comparison of mean air temperature across all available grids (01/Jan/2015 to 31/Dec/2018) for coarsened DSI surfaces (0.10°) assessed against ERA5-Land for each season at four times of the day (n = 69,286 pixels). Bias is calculated by subtracting ERA5-Land from DSI (i.e., DSI minus ERA5-Land, values > 0 °C indicate higher air temperatures for DSI)

The spatial distribution of comparative statistics between coarsened DSI assessed against both BARRA-R and ERA5-Land, calculated on a pixel-by-pixel basis through time and by season, are illustrated in Figs. 14 and 15, respectively. Spatially autocorrelated patterns in bias and statistical deviations in many coastal regions are present in each comparison. DSI resulted in higher air temperature relative to BARRA-R (0.14 °C to 0.38 °C) and lower air temperature relative to ERA5-Land (-0.44 °C to -0.18 °C) per season across all pixels.

Fig. 14

Statistical comparison of hourly air temperature (DSI, resampled to 0.11°) when assessed against BARRA-R reanalysis between 01/Jan/2015 and 31/Dec/2018 at four times of day (i.e., 03:00, 09:00, 15:00 and 21:00 UTC + 9), reported seasonally (n = 57,818 pixels). Bias is calculated by subtracting BARRA-R from DSI (i.e., DSI minus BARRA-R, values > 0 °C indicate higher air temperatures for DSI). The four rows represent the four seasons, with the statistical variables shown in each column, as outlined above the legend in each case

Fig. 15

Statistical comparison of hourly air temperature (DSI, resampled to 0.10°) when assessed against ERA5-Land reanalysis between 01/Jan/2015 and 31/Dec/2018 at four times of day (i.e., 03:00, 09:00, 15:00 and 21:00 UTC + 9), reported seasonally (n = 69,286 pixels). Bias is calculated by subtracting ERA5-Land from DSI (i.e., DSI minus ERA5-Land, values > 0 °C indicate higher air temperatures for DSI). The four rows represent the four seasons, with the statistical variables shown in each column, as outlined above the legend in each case

Statistics comparing the agreement of the spatial interpolations, BARRA-R and ERA5-Land with the calibration and validation observations are illustrated in Fig. 16. Spatially interpolated data showed a stronger fit to the observations used in model calibration (for those that could be compared; n = 424 stations; see Figure S11) on average than BARRA-R (i.e., when subtracting statistics for BARRA-R from those for DSI, Δ R2 = 0.05 ± 0.05, Δ RMSE = -1.06 °C ± 0.70 °C, Δ Abs. Bias = -0.38 °C ± 0.70 °C) and ERA5-Land (Δ R2 = 0.07 ± 0.04, Δ RMSE = -1.22 °C ± 0.57 °C, Δ Abs. Bias = -0.41 °C ± 0.59 °C). Marginal improvements were found for the validation observations (n = 28 stations) when subtracting performance statistics for BARRA-R (Δ R2 = 0.01 ± 0.04, Δ RMSE = -0.27 °C ± 0.45 °C, Δ Abs. Bias = -0.13 °C ± 0.47 °C) and ERA5-Land (Δ R2 = 0.00 ± 0.05, Δ RMSE = -0.12 °C ± 0.54 °C, Δ Abs. Bias = -0.23 °C ± 0.47 °C) from DSI. The spatial distribution of pooled error statistics for BARRA-R and ERA5-Land at each of the calibration and validation sites is mapped in Figures S12 and S13. The differences in RMSE between DSI and both BARRA-R and ERA5-Land are reported seasonally at four times-of-day (i.e., 03:00, 09:00, 15:00 and 21:00 UTC + 9) in Figures S14 and S15.

Fig. 16

Statistical evaluation of near-surface air temperature by comparing pixel values from direct spatial interpolation (DSI; y-axes), BARRA-R (x-axes, a, c, e) and ERA5-Land (x-axes, b, d, f) to observations from 424 Bureau of Meteorology stations used for calibration and 28 OzFlux/CosmOz stations (see Figure S11 for locations) used for validation between 01/Jan/2015 and 31/Dec/2018 (all points represent individual stations). Data points were included for each station at four times of the day (i.e., 03:00, 09:00, 15:00 and 21:00 UTC + 9). Note that data points were only considered where values were available for all three products (i.e., interpolated, BARRA-R, ERA5-Land), leading to the exclusion of 197 calibration stations due to dates of operation (n = 68) or differences in spatial resolution / land mass delineation (n = 129). Each of these plots is underpinned by 2,441,276 observations. The lines of best fit (represented by the solid lines) and 95% confidence intervals (shown by the grey shading) are fitted with a loess smoother (span = 1.3). DSI performed better than reanalyses where data points fall on the side of the black dotted 1:1 line indicated by the red ‘bullseye’ (i.e., the circled cross) symbol. The legend in (a) applies to all other parts. Note that, to optimally show both the calibration and validation datasets in each sub-part, the x-axis and y-axis have different ranges, as using the same range for both axes would ‘compress’ the data

Statistical performance of cross-validated DSI was higher overall than both BARRA-R (Δ R2 = 0.02 ± 0.05, Δ RMSE = -0.37 °C ± 0.70 °C, Δ Abs. Bias = -0.23 °C ± 0.78 °C) and ERA5-Land (Δ R2 = 0.03 ± 0.03, Δ RMSE = -0.53 °C ± 0.60 °C, Δ Abs. Bias = -0.25 °C ± 0.70 °C) at the Bureau of Meteorology calibration stations (Fig. 17; n = 424). High elevation stations (> 800 m, n = 24) validated poorly overall (Figure S16) and in comparison to cross-validated predictions from DSI (Fig. 17) for both BARRA-R (Δ R2 = -0.05 ± 0.06, Δ RMSE = 1.69 °C ± 1.84 °C, Δ Abs. Bias = 1.78 °C ± 2.06 °C) and ERA5-Land (Δ R2 = -0.05 ± 0.04, Δ RMSE = 1.69 °C ± 1.49 °C, Δ Abs. Bias = 1.82 °C ± 1.74 °C). Statistical performance of reanalysis products did not show clear improvements in regions of lower observation density (i.e., mean distance to closest 10 stations > 200 km) when compared to cross-validated DSI (Fig. 17; BARRA-R, Δ R2 = -0.01 ± 0.06, Δ RMSE = 0.17 °C ± 0.57 °C, Δ Abs. Bias = 0.09 °C ± 0.54 °C; ERA5-Land, Δ R2 = -0.02 ± 0.03, Δ RMSE = 0.16 °C ± 0.45 °C, Δ Abs. Bias = -0.01 °C ± 0.43 °C). The relationship between statistical performance and observation density at the validation sites was variable (Fig. 18) and sample sizes were limited when the mean distance to the closest 10 stations was > 200 km (n = 4).

Fig. 17

Relationship between observation density and the difference in validation statistics for (a, c, e) BARRA-R and (b, d, f) ERA5-Land when compared to cross-validation statistics for direct spatial interpolation (DSI) at Bureau of Meteorology calibration stations (n = 424), 01/Jan/2015 to 31/Dec/2018. Differences between DSI and BARRA-R / ERA5-Land (denoted by Δ) are calculated by subtracting reanalysis statistics from DSI statistics (i.e., DSI minus BARRA-R, and DSI minus ERA5-Land). Points are coloured by station elevation. There are 2,316,687 observations underpinning each of these plots. The lines of best fit (blue lines) and 95% confidence intervals (grey shading) are fitted with a loess smoother (span = 0.75)

Fig. 18

Relationship between observation density and the difference in validation statistics for BARRA-R (a, c, e) and ERA5-Land (b, d, f) when compared to DSI at OzFlux and CosmOz validation stations (n = 28), 01/Jan/2015 to 31/Dec/2018. Differences between DSI and BARRA-R / ERA5-Land (denoted by Δ) are calculated by subtracting reanalysis statistics from DSI statistics (i.e., DSI minus BARRA-R, and DSI minus ERA5-Land). Points are coloured by station elevation. There are 124,589 observations underpinning each of these plots. The lines of best fit (blue lines) and 95% confidence intervals (grey shading) are fitted with a loess smoother (span = 0.75)

5 Discussion

5.1 Spatial interpolation of hourly air temperature

Direct spatial interpolation (DSI), using an optimal number of observation points as knots (Hutchinson 1995; Hutchinson et al. 2009; Johnson et al. 2016; Price et al. 2000), was most effective in generating high quality predictions of hourly air temperature across Australia. Our pooled hourly cross-validation results (R2 = 0.96, RMSE = 1.56 °C) compared well against those reported by Webb and Minasny (2020; R2 = 0.89 to 0.91, RMSE = 1.6 °C to 1.7 °C), who spatially interpolated air temperature across Australia at 30 min time-steps between 01/Jan/2019 and 31/Dec/2020. Our study built upon this previous research, as we: (i) evaluated a longer analysis period (i.e., 01/Jan/2000 to 31/Dec/2019); (ii) compared both CASI and DSI to evaluate the relative strengths of each approach (e.g., Table 2; Figs. 7 and 8; Figure S9); (iii) designed the study for the development of stable long-term climatologies and multi-year historical datasets; (iv) validated DSI with independent station observations; and (v) used a time-varying coastal distance index to represent the effects of continentality. Coastal weather can have large impacts on the quality of spatial climate data products (Daly 2006; Daly et al. 2002, 2003; Hutchinson et al. 2021; Jones et al. 2009), and coastal proximity metrics have been reported to increase statistical performance of monthly mean minimum and maximum air temperature interpolation by up to 25% (Hutchinson et al. 2021).

Sea breeze systems, which develop when temperature (and associated pressure) gradients at the land-water interface cause cool air over the ocean (or water bodies) to move inland, can exert a strong influence on coastal weather (Abbs and Physick 1992; Miller et al. 2003; Simpson 1994). They typically begin early in the day, when air temperatures over land exceed those over water, and can persist well into the night under suitable conditions (Miller et al. 2003). Sea breezes can bring cool, moist air up to several hundred km inland (Abbs and Physick 1992; Clarke 1955; Simpson et al. 1977), and tend to be stronger in sub-tropical and tropical climates than at mid-latitudes, and in the afternoon and evening during warm months (Abbs and Physick 1992; Azorin-Molina et al. 2011; Miller et al. 2003). These systems can be difficult to model due to interacting factors such as synoptic scale wind and cold fronts, coastline morphology, and topographic features (Abbs and Physick 1992; Azorin-Molina and Chen 2009; Miller et al. 2003).

Temporal patterns in the coastal distance indices developed herein reflect the timing and expected behaviour of sea breeze systems, with the strongest inland propagation of cool air inferred (i.e., with high values of d) during the afternoon and evening in spring and summer. Interpolation performance improved most (up to 22.4% for climatologies and 7.5% for DSI) in late spring, when the coastal distance index decays very slowly with distance from the coast (Fig. 3; Figure S4) and air temperature near the coast decreases over time more rapidly than further inland (see Figure S17). These same periods correspond to times when Webb and Minasny (2020) reported large errors in coastal regions, indicating that coastal distance metrics often play a key role in improving interpolation performance for meteorological variables. The variability in d found when calibrating the coastal distance index using DSI (e.g., Fig. 3d) reflects the difficulty in predicting sea breeze systems (Miller et al. 2003). Calibrating the coastal distance index with long-term stable climatologies was essential in identifying a generalizable temporal structure that performed well even when applied to DSI (e.g., see Fig. 3 and Figure S10). The low optimal values of d (restricting coastal influences) in the morning hours are potentially associated with the convergence of land and sea breezes, and provided a marginal but consistent improvement in DSI (Fig. 3f). Our time-varying coastal distance index provides a parsimonious method for capturing sea breeze dynamics in coastal regions, and additional improvements (e.g., varying d in both space and time) may be possible with further research.

Interpolation of hourly air temperature performed best overall with DSI; however, CASI can improve stability for interpolating some climate time-series (Hutchinson et al. 2021; Jeffrey et al. 2001), and enables blending of different datasets (Funk et al. 2015; Harris et al. 2020; Karger and Zimmermann 2018) or interpolation methods (Jones et al. 2009; Raupach et al. 2012). We found that CASI tended to perform better in data-sparse times (Fig. 7) and locations (Fig. 8); however, this finding depends on the specific methods applied (e.g., independent spline variables used for modelling anomalies), and the density and spatial autocorrelation structure of the observations (Hofstra et al. 2008, 2010; Jeffrey et al. 2001). A considerable limitation of CASI, however, is that model responses to environmental gradients (e.g., environmental lapse rates with elevation, influence of coastal proximity) are fixed according to the climatology (e.g., identical lapse rates at the same time of year in a specific location) when interpolating anomalies using positional coordinates only. While we interpolated anomalies as a function of positional coordinates only, the large difference in error between DSI and CASI for high elevation stations suggests that our anomaly interpolation did not adequately represent variability of environmental lapse rates for air temperature, and may benefit further from incorporating additional independent spline variables (i.e., elevation, coastal distance indices when sea breezes are more likely; Hutchinson et al. 2021). This limitation can be addressed with DSI, but can come at the cost of reduced model stability and increased sensitivity to poor quality data (e.g., Jeffrey et al. 2001).
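The anomaly-based logic of CASI can be illustrated with a minimal sketch. It assumes that anomalies are station observations minus a climatology, that anomalies are interpolated from positional coordinates only, and that the climatology at the target location is then added back. Inverse-distance weighting is used here purely as a simple stand-in for the spline models used in this study, and the function names (`idw`, `casi_predict`) are hypothetical.

```python
import math

def idw(x, y, points, values, power=2.0):
    """Inverse-distance-weighted estimate at (x, y); a stand-in interpolator."""
    num = den = 0.0
    for (px, py), v in zip(points, values):
        dist = math.hypot(x - px, y - py)
        if dist == 0.0:
            return v  # target coincides with a station
        w = dist ** -power
        num += w * v
        den += w
    return num / den

def casi_predict(x, y, stations, observations, station_climatology,
                 target_climatology):
    """Climatologically aided estimate: interpolated anomaly + climatology."""
    # Anomalies carry the departure from the long-term mean at each station.
    anomalies = [o - c for o, c in zip(observations, station_climatology)]
    # Anomalies are interpolated from position only, so any elevation or
    # coastal effect enters solely through the target climatology.
    return target_climatology + idw(x, y, stations, anomalies)

stations = [(0.0, 0.0), (10.0, 0.0)]
estimate = casi_predict(5.0, 0.0, stations, [12.0, 15.0], [10.0, 14.0], 8.0)
```

Because the target climatology is the only term that varies with elevation or coastal proximity in this sketch, the response to those gradients is identical for every time-step sharing that climatology, which is the limitation described above.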

We found that error was most pronounced during the coolest parts of the day in winter. This is consistent with our daily air temperature extrema analyses (Table S2) and many interpolation studies, where minimum temperature performs poorly in comparison to maximum temperature (Hutchinson et al. 2009; Jeffrey et al. 2001; Jones et al. 2009; Mark et al. 2002; Webb and Minasny 2020). There are several potential reasons why hourly air temperature was least performant during night-time in winter. These include: (i) lower spatial autocorrelation ranges for minimum temperature (Jones and Trewin 2000); (ii) the occurrence of temperature inversions (e.g., driven by katabatic winds and cold air pools) that commonly develop under clear, calm conditions and can confound air temperature lapse rate estimates (Stewart et al. 2017; Trewin 2005; Whiteman et al. 1999); (iii) the latent heat of condensation (Hutchinson et al. 2009); and (iv) associated humidity dynamics where saturated air lowers the (wet) adiabatic lapse rate. While it is difficult to attribute the change in performance to any one factor, the patterns identified in our cross-validation statistics provide insights into specific times when further improvements in spatial interpolation performance may be achieved.

Independent validation using the OzFlux and CosmOz stations further supported the use of spatial interpolation as a viable option for generating air temperature surfaces at sub-diurnal time-steps. Overall, the statistical performance at these validation stations was strong, despite differing site conditions (i.e., different types of forested and agricultural ecosystems; Beringer et al. 2016; Hawdon et al. 2014) and height at which observations were made (Table 3). This indicates that DSI can also play a role in gap-filling sub-diurnal air temperature at field sites. The resultant hourly 1 km near-surface air temperature grids can be used in numerous applications, such as being coupled with Himawari geostationary remotely sensed imagery (Bessho et al. 2016) to monitor sub-diurnal processes such as cloud presence and cloud type (Qin et al. 2019), incoming shortwave radiation (Qin et al. 2021), land surface temperature (Yu et al. 2024) and vegetation dynamics. With Himawari-8 becoming operational in July 2015, there are adequate numbers of hourly stations to support the generation of hourly air temperature grids for Australia; see Table S3.

5.2 Comparing hourly spatial interpolation with temporal interpolation of daily air temperature

Temporal interpolation of daily air temperature consistently performed poorly in comparison to DSI. The differences in statistical performance between PL81 and DSI were lowest around the time of sunrise and after solar noon (Fig. 12b, d), when minimum and maximum air temperatures typically occur. This is unsurprising given that cross-validated predictions of daily minimum and maximum air temperatures performed well statistically (see Table S2) when compared against previous studies for Australia (RMSE = 1.7 °C to 2.0 °C and RMSE = 1.2 °C to 1.7 °C, respectively; Jeffrey et al. 2001; Jones et al. 2009) and were used as a key input to the PL81 model. It also demonstrates that the empirical parameterisation of PL81 was effective for accurately estimating the timing of minimum and maximum temperature (see Figure S1, S6). PL81 performed poorly by comparison at all other times, showing strong positive biases during the day and strong negative biases overnight (Fig. 12f). This pattern of bias indicates that PL81 does not accurately represent the rate of sub-diurnal temperature change. While further improvements may be possible by tuning the exponential decay parameter, this would only reduce error overnight. The errors accumulated due to the rate of air temperature change given by the truncated sine curve and exponential decay curve were a key source of poor model performance. Our findings are supported by Reicosky et al. (1989), who noted that temporal interpolation of daily air temperature extrema is useful for many (not all) applications, but it is unlikely to be appropriate when accurate air temperatures are required for specific times.

These findings were expected given at least two limitations of PL81: (i) the inability of daily air temperature extrema to represent natural variability in sub-diurnal air temperature; and (ii) functional form (i.e., truncated sine curve and exponential decay curve) and model parameterisation. The former limitation was addressed with spatial interpolation, where frontal systems and sub-diurnal variability can be represented by geostatistical modelling using hourly observations. The latter limitation was mitigated in part by station-specific calibration of the PL81 parameters. Here we have empirically determined the time-lag between solar noon and maximum air temperature (i.e., PL81 parameter ‘a’, Figure S5), and sunrise and minimum air temperature (i.e., PL81 parameter ‘c’, Figure S6), but recognise that some performance improvement may be possible with station-specific calibration of the exponential decay rate (i.e., PL81 parameter ‘b’). Given the inability of PL81 to characterise frontal systems, and of the truncated sine curve to accurately represent post-sunrise warming, such calibration would not alter our conclusions. The climatologies produced as part of this study can, however, provide opportunities to develop and parameterise temporal interpolation models.
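For readers unfamiliar with this class of model, the functional form discussed above can be sketched as follows. This is a simplified illustration of a truncated sine curve by day and an exponential decay curve by night, in the spirit of Parton and Logan (1981), and is not the implementation used in this study. It assumes local solar time, a symmetric day centred on solar noon, and a single pair of daily extrema. The default values for a, b and c are illustrative placeholders only, and the function name `pl81_hourly` is hypothetical.

```python
import math

def pl81_hourly(hour, t_min, t_max, day_length, a=1.8, b=2.2, c=0.0):
    """Estimate air temperature (deg C) at a local solar hour in [0, 24).

    a: lag (h) of maximum temperature after solar noon
    b: dimensionless night-time exponential decay coefficient
    c: lag (h) of minimum temperature after sunrise
    """
    sunrise = 12.0 - day_length / 2.0
    sunset = 12.0 + day_length / 2.0
    t_of_min = sunrise + c            # time of minimum temperature
    amplitude = t_max - t_min

    def daytime(h):
        m = h - t_of_min              # hours since the minimum
        return t_min + amplitude * math.sin(math.pi * m / (day_length + 2.0 * a))

    if t_of_min <= hour <= sunset:
        return daytime(hour)

    # Night: decay from the sunset temperature towards the minimum.
    n = (hour - sunset) % 24.0        # hours since sunset, wrapping past midnight
    night_length = 24.0 - day_length
    t_sunset = daytime(sunset)
    return t_min + (t_sunset - t_min) * math.exp(-b * n / night_length)

# Example: 12 h day, minimum 10 deg C, maximum 25 deg C.
diurnal_cycle = [pl81_hourly(h, 10.0, 25.0, 12.0) for h in range(24)]
```

In this sketch the shape of the curve between the extrema is determined entirely by the sine and exponential terms, so a front passing mid-afternoon cannot be represented however well the daily minimum and maximum are known. The exponential curve also does not reach the minimum exactly at the end of the night, which is one reason the decay coefficient b is typically calibrated.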

5.3 Comparing hourly spatial interpolation with reanalysis products

Overall, there was good agreement between spatially interpolated air temperature and both reanalysis products. Except for bias, pooled statistical metrics typically showed lower errors relative to the point-based cross-validation; however, these analyses were conducted at coarser spatial scales and for only a subset of our analysis period (i.e., 1 January 2015 to 31 December 2018). Interpolated surfaces were slightly warmer on average (i.e., ≤ 0.5 °C) than BARRA-R in the evening and early hours of the morning (Table 5), in contrast to previous analyses (Fig. 5 of Su et al. 2019) showing BARRA-R produced marginally warmer minimum air temperatures on average (i.e., < 0.2 °C) than interpolated daily minimum air temperature. While direct comparisons are difficult to make, these differences may be in part explained by: (i) differences in the analysis period; (ii) inability of observations at regular time-steps to capture daily air temperature extrema; and (iii) the (6-hourly) data assimilation mechanisms used in BARRA-R. The interpolated surfaces were on average cooler than ERA5-Land (-0.44 °C to -0.18 °C; see Table 6; Fig. 15), consistent with previous analyses of ERA5 products (Dee et al. 2011; Hersbach et al. 2020) that showed positive biases relative to Australian climate products (Su et al. 2021; Su et al. 2019). Spatial analyses showed autocorrelated biases that are expected given the differences between modelling techniques. For example, interpolated surfaces are driven by digital elevation models, whereas atmospheric models are sensitive to land characteristics and assimilate many data sources. This was clearly demonstrated in ERA5-Land, where the higher RMSD across Australia’s largest salt lakes (Fig. 15) was absent from BARRA-R (Fig. 14), in which they were treated as a bare soil surface (Su et al. 2019).
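The sign convention implied by these comparisons can be made explicit with a short sketch. It assumes that bias is computed as interpolated minus reanalysis, so that negative values indicate cooler interpolated surfaces, and that both products have already been resampled to a common grid and time-step. The function name `bias_and_rmsd` and the sample values are hypothetical.

```python
import math

def bias_and_rmsd(interpolated, reanalysis):
    """Mean bias and root-mean-square difference for co-located values."""
    diffs = [a - b for a, b in zip(interpolated, reanalysis)]
    bias = sum(diffs) / len(diffs)            # negative: interpolated is cooler
    rmsd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return bias, rmsd

# Hypothetical co-located values (deg C) for two grid cells.
bias, rmsd = bias_and_rmsd([10.0, 12.0], [11.0, 13.0])
```

A uniform offset such as this yields a bias and RMSD of equal magnitude, whereas spatially structured differences (e.g., over salt lakes) inflate RMSD without necessarily changing the pooled bias.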

Point-based analyses using both the calibration and independent validation datasets (Fig. 16) showed that the DSI surfaces better represented the observations used for model calibration than either BARRA-R or ERA5-Land. Overall, this finding was the same for validation stations, although there was greater variability across locations. Each gridded product was evaluated at its native resolution, and therefore this analysis represented how well each product reproduced ground-based measurements in the absence of any downscaling. Several high elevation stations were among the worst performing when compared against both reanalyses (Fig. 17); however, this is likely a result of the comparatively coarse spatial resolution of BARRA-R (0.11°) and ERA5-Land (0.10°). BARRA-C (~ 1.5 km resolution), a regionally downscaled version of BARRA-R, has been shown to better represent these observations for stations at elevations above 500 m and/or proximal to the coast (i.e., within ~ 150 km). We did not perform a direct comparison for two key reasons: (i) BARRA-C does not cover our whole study extent; and (ii) the reported magnitude of improvement relative to BARRA-R (Fig. 2 of Su et al. 2021) is unlikely to substantially impact upon our findings.

Despite the tendency for spatial interpolation performance to decrease as observation density decreases (e.g., Fig. 8), we did not find a clear relationship with observation density when comparing the statistical performance of BARRA-R and ERA5-Land with (cross-validated) DSI (see Fig. 17). This suggests that DSI still performs well for interpolating hourly air temperature in sparser regions of the network, and part of the trend towards decreased performance may be an artefact of very high-quality predictions when dense observations are available. Similar patterns are found in the validation analyses, where relative performance of DSI increases with station density (mean distance to closest 10 stations < 100 km) and then levels off; however, the number of samples in low density regions is limited. Our results demonstrate that spatial interpolation can provide substantial accuracy advantages in situations where hourly, high spatial resolution air temperature data are required to support analyses (Fig. 16). Overall, spatial interpolation remains a parsimonious and computationally efficient method for accurately quantifying sub-diurnal near-surface air temperature dynamics.
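A station density measure of the kind referred to above (mean distance to the closest 10 stations) could be computed along the following lines. This is a minimal sketch that assumes great-circle (haversine) distances on a spherical Earth and a station list that excludes the site being evaluated. The function names and the example coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance (km) between two points in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def mean_distance_to_nearest(site, stations, k=10):
    """Mean distance (km) from a (lat, lon) site to its k nearest stations."""
    dists = sorted(haversine_km(site[0], site[1], s[0], s[1]) for s in stations)
    nearest = dists[:k]   # uses fewer than k stations if the list is shorter
    return sum(nearest) / len(nearest)

# Hypothetical site and three stations spaced along the equator.
density_km = mean_distance_to_nearest((0.0, 0.0), [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0)], k=2)
```

Smaller values indicate denser parts of the network, so under this measure a threshold such as 100 km separates well-observed sites from sparsely observed ones.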

6 Conclusion

Direct spatial interpolation (with an appropriate knot parameter selection) was effective for modelling near-surface hourly air temperature spatio-temporal dynamics over a continental scale (i.e., Australia). This was demonstrated by strong statistical performance achieved by cross-validation, at independent validation stations, against temporal interpolation techniques, in comparison with two atmospheric reanalyses, and when evaluated against similar interpolation studies. The methods developed herein: (i) improved model performance with time-varying coastal distance indices; (ii) avoided the limitations of temporal interpolation; (iii) were efficient in comparison to computationally expensive and data intensive reanalyses; and (iv) maximised preservation of information contained in the observational record as evidenced by the point-based analyses. Future work could use more complex models (e.g., machine learning) to incorporate land surface processes into spatially interpolated datasets, and/or downscale existing reanalysis products for further study. The density and frequency of observations currently available enable the development of historical and future hourly air temperature surfaces, which will support numerous scientific applications.