Downscaling long lead time daily rainfall ensemble forecasts through deep learning

Skilful and localised daily weather forecasts for upcoming seasons are desired by climate-sensitive sectors. Various General circulation models routinely provide such long lead time ensemble forecasts, also known as seasonal climate forecasts (SCF), but require downscaling techniques to enhance their skills from historical observations. Traditional downscaling techniques, like quantile mapping (QM), learn empirical relationships from pre-engineered predictors. Deep-learning-based downscaling techniques automatically generate and select predictors but almost all of them focus on simplified situations where low-resolution images match well with high-resolution ones, which is not the case in ensemble forecasts. To downscale ensemble rainfall forecasts, we take a two-step procedure. We first choose a suitable deep learning model, very deep super-resolution (VDSR), from several outstanding candidates, based on an ensemble forecast skill metric, continuous ranked probability score (CRPS). Secondly, via incorporating other climate variables as extra input, we develop and finalise a very deep statistical downscaling (VDSD) model based on CRPS. Both VDSR and VDSD are tested on downscaling 60 km rainfall forecasts from the Australian Community Climate and Earth-System Simulator Seasonal model version 1 (ACCESS-S1) to 12 km with lead times up to 217 days. Leave-one-year-out testing results illustrate that VDSD has normally higher forecast accuracy and skill, measured by mean absolute error and CRPS respectively, than VDSR and QM. VDSD substantially improves ACCESS-S1 raw forecasts but does not always outperform climatology, a benchmark for SCFs. Many more research efforts are required on downscaling and climate modelling for skilful SCFs.


Introduction
Seasonal climate forecasts (SCF) can provide great value to many socioeconomic sectors such as agriculture, construction, mining, tourism, energy, and health (Merryfield et al. 2020;Manzanas 2020). As in the driest inhabited continent, e.g., Australia's agricultural system is heavily dependent on rainfall. The potential annual value added from skilful SCFs for the whole of Australia could be around A$1.6 billion for the agricultural sector and A$192 Extended author information available on the last page of the article million for the construction sector (The Centre for International Economics 2014). Thus, various ensemble SCFs based on General Circulation Models (GCMs) are routinely produced around the world (Hudson et al. 2017;Merryfield et al. 2020;Saha et al. 2014;Johnson et al. 2019). Despite ongoing development in GCMs, their typical grid resolutions ( $ 100 km) still limit their direct application to weather-sensitive sectors (Baño-Medina et al. 2020;Schepen et al. 2020;Kusunose and Mahmood 2016;Luo 2016). The barriers could be overcome via downscaling techniques which generate more skilful and localised forecasts by making use of local observations (Maraun and Widmann 2018;Bettolli et al. 2021).
Due to the challenges caused by spatial-temporal variabilities of climate variables, especially precipitation, a large number of downscaling techniques have been developed, including dynamical downscaling (Ratnam et al. 2017;Thatcher and McGregor 2009;Luo 2016), statistical downscaling (Maraun and Widmann 2018), and the recent development of deep-learning-based downscaling. Dynamical downscaling uses a physics-based climate model, such as the conformal cubic atmospheric model (CCAM) (Thatcher and McGregor 2009), forced by boundary conditions from a GCM, to simulate atmospheric conditions at a finer resolution. Statistical downscaling builds empirical relationships between GCM raw hindcasts and historical observations and then uses them to remove systematic biases, adjust the uncertainty spread, and restore local daily variability of raw forecasts (Ahmadalipour et al. 2018;Maraun and Widmann 2018;Ş an et al. 2022;Shao and Li 2013;Crimp et al. 2019). A typical example is quantile mapping (QM) which assumes that the distribution of model-simulated data at a given location should preserve the distribution of observed data (Michelangeli et al. 2009;Li and Jin 2020). Comparisons between traditional statistical and dynamical downscaling suggest that neither group of methods are superior, however, in practice, computationally cheaper statistical methods are widely used (Baño-Medina et al. 2020). Pre-engineered predictors and relationships limit these statistical downscaling techniques to exploit various spatio-temporal dependencies, and then their abilities to capture information beyond prior knowledge (Baño-Medina et al. 2020;Liu et al. 2020). Automatic feature extraction and selection integrated into the modelling process with deep learning, especially convolutional neural networks (CNNs), have achieved notable success in modelling data with spatial context, recently in climate science (Reichstein et al. 2019). Deep learning has been successfully used in precipitation nowcasting (Shi et al. 2015;Espeholt et al. 2022), which predicts rainfall intensity in a region over 3-6 h, and precipitation parameterisations from GCMs (Pan et al. 2019). More related to this study, several downscaling techniques have been developed based on single image super-resolution (SISR) techniques. For example, DeepSD was proposed by augmenting multi-scale topography into stacked super-resolution convolutional neural networks (SRCNN) (Vandal et al. 2017). DeepSD succeeded in downscaling daily rainfall data to 12.5 km. For long-term climate projection, Rodrigues et al. (2018) proposed a very deep CNN-based SISR strategy to interpolate low-resolution 125 km weather data to 25 km output for weather forecasts. Baño-Medina et al. (2020) assessed CNN methods with three convolutional layers followed by different connection layers for downscaling 200 km reanalysis precipitation to 50 km observational grids over Europe. Super-resolution deep residual network (SRDRN) was proposed based on a deep CNN with residual blocks and batch normalisation for downscaling daily precipitation and temperature (Wang et al. 2021). It leaves behind biascorrection. Liu et al. (2020) presented YNet which consists of an encoder-decoder-like architecture with residual learning through skip connections and fusion layers to enable the incorporation of topological and climatological data as auxiliary inputs for downscaling. It was tested on monthly precipitation means that have different characteristics from daily data.
Downscaling SCFs looks similar to SISR as both aim at getting high-resolution images from low-resolution ones if climate variable data are treated as images (Liu et al. 2020). However, there are several differences.
1. Inputs and outputs in downscaling SCFs are from different sources, such as low-resolution forecasts from GCM vs historical weather data (Liu et al. 2020). In SISR, the low-resolution images and high-resolution target ones are arguably from the same source, e.g., the high-resolution images are often aggregated to form low-resolution images as the training inputs . As far as we know, almost all the deeplearning-based downscaling techniques focused on such a simplified situation (Vandal et al. 2017;Rodrigues et al. 2018;Liu et al. 2020;Wang et al. 2021). 2. Bias and displacement in space or time are common in SCFs, especially for precipitation, due to the inherent complexity of our climate system. To mitigate these mismatch issues, multiple possible forecast trajectories are provided as a practical standard for short or long lead time forecasts (Hudson et al. 2017;Merryfield et al. 2020;Johnson et al. 2019). Therefore, downscaling performance should be evaluated in terms of both forecast accuracy and overall ensemble forecast skill by considering forecast uncertainty (Grimit et al. 2006;Li and Jin 2020;Kusunose and Mahmood 2016). The latter is predominant in the literature (Grimit et al. 2006;Ferro et al. 2008;Schepen et al. 2020) but, as far as we know, has never been used in deep-learningbased downscaling. 3. Downscaling precipitation can use auxiliary variables (Maraun and Widmann 2018;Bettolli et al. 2021). Rainfall events are often associated with other climate variables, e.g., intense low-pressure systems (Pan et al. 2019;Baño-Medina et al. 2020;Liu et al. 2020), which are found often beneficial for downscaling (Baño-Medina et al. 2020;Liu et al. 2020).
To address these differences, we model downscaling ensemble forecasts as a SISR problem with an additional target on maximising ensemble forecast skills. To leverage advanced deep learning techniques, for downscaling long lead time daily precipitation forecasts for the whole of Australia (Sect. 2), we choose very deep super-resolution (VDSR) (Kim et al. 2016) from outstanding SISR techniques as a suitable candidate for downscaling based on the continuous ranked probability score (CRPS), a widely used ensemble forecast skill metric (Grimit et al. 2006;Li and Jin 2020;Ferro et al. 2008;Schepen et al. 2020). Raw precipitation forecasts from GCMs are partially parameterised and are usually considered less reliable compared to directly resolved variables, such as pressure and temperature (Pan et al. 2019). To improve its downscaling performance, we incorporate other resolved climate variables into VDSR and propose a very deep statistical downscaling (VDSD) model. The VDSD structure is again finalised based on CRPS (Sect. 3). It is tested on real-world application scenarios for downscaling 60 km SCFs to 12 km. Leave-one-year-out cross-validation results illustrate its better performance than VDSR and two classical downscaling techniques in terms of both forecast accuracy and ensemble forecast skills. In addition, its performance is better than or comparable with climatology, a benchmark for long lead time climate forecasts. VDSD does not always outperform climatology (Sect. 4). Many more research efforts are required on downscaling and climate modelling for skilful SCFs.
2 Data and pre-processing 2.1 ACCESS-S1 forecast and calibrated data This work focuses on downscaling daily rainfall forecasts for the whole of Australia. We use daily rainfall retrospective forecasts from Australia's operational seasonal climate forecast system, the Australian Community Climate and Earth-System Simulator Seasonal model version 1 (ACCESS-S1) (Hudson et al. 2017; Bureau National Operations Centre 2019), which is used for climate outlooks on multi-week through to seasonal timescales. Its atmospheric model has enhancements to the ensemble generation strategy to make it appropriate for sub-seasonal forecasting and large ensembles. The resolution of the atmospheric model is raised to 0.6 , nearly 60 km in the mid-latitudes. The hindcast data 1 of ACCESS-S1, from 1990 to 2012, are publicly available. 2 Within each year, it has forecasts on 48 different initialisation dates (i.e. 1st, 9th, 17th, and 25th of each calendar month). Its forecasts have 11 ensemble members, each of which provides a full description of the evolution of weather for the upcoming 217 days. Daily precipitation data from ACCESS-S1 are based on the BoM's day definition of 9 am to 9 am (local time). Three precipitation forecasts for 7 Jan 2012 made on 1 Jan 2012 with a lead time of 6 days are illustrated in the second column of Fig. 1. ACCESS-S1 data also provides a calibrated version. For each forecast initialisation date, lead time, and grid point location, it has a calibrated function to downscale to a 5 km resolution (Bureau National Operations Centre 2019). For a given forecast day, the calibration functions first carry out spatial interpolation using bilinear interpolation to high spatial resolution and then apply QM to adjust the bias and spread between observations and forecasts in the other 22 years. Bilinear interpolation is to interpolate an image using repeated linear interpolation. It first linearly interpolates a low-resolution image in one direction, and then in the second direction. QM downscaling for a location can be formulated as is the inverse function of F o , and F f and F o indicate the cumulative distribution functions (CDFs, aka quantile functions) of raw forecasts x f and observations x o respectively (Maraun and Widmann 2018). The empirical distributions of raw forecasts and observations over a 15-days reference period are used as the estimates of F f and F o (Li and Jin 2020; Bureau National Operations Centre 2019). We use the calibrated data for forecast skill comparison, denoted as QM hereafter.

BARRA reanalysis data
Bureau's Atmospheric High-Resolution Regional Reanalysis for Australia (BARRA), 3 is a regional numerical climate forecast model using the Australian Community Climate and Earth-System Simulator-Regional (ACCESS-R), Australia's first reanalysis model of the atmosphere (Su et al. 2019). Through assimilating local surface observations and locally derived wind vectors that are not available to global reanalysis models, BARRA reaches a good tradeoff between the spatial resolution and consistency with precipitation observations (Acharya et al. 2019). Its spatial resolution of 0.12 , is realised in the whole region of Australia and New Zealand. Six-hour accumulated precipitation, obtained from BARRA from 1 Jan 1990 to 31 Dec 2013, is aggregated to daily frequency by taking the sum of the four 6-h grid point values within each 9 am to 9 am 24-h window.

Pre-processing
We choose a region from 9 S to 43.7425 S and 112.9 E to 154.25 E as our study region, covering all the Australian landmass (see, Fig. 1). As pre-processing, we crop all the climate variable surfaces to the same area defined in the case study region. These climate variables have different value ranges. For example, precipitation ranges from 0 to 900 mm per day, and geopotential height at 850 hPa ranges from 1200 to 1600 m. To bring climate variables to have the same value range of [0, 1] during learning, we carry out simple linear normalisation.
To facilitate 4-time image super-resolution, we generate two versions of ACCESS-S1 forecasts via bicubic interpolation (BI). One is 12 km, used as inputs for VDSR and our proposed VDSD, and the second is 48 km, used as inputs for other deep learning models. We pair the ACCESS-S1 forecasts made on date i with forecast lead time l day with the BARRA reanalysis data on date dð¼ i þ lÞ together for training/validation/test. There are about 2.62 million image pairs for each spatial resolution. To save training time, we only use the first seven lead time forecast pairs for each initialisation date in the training period. Images in the four columns are the high-resolution image from BARRA reanalysis data, ensemble member forecasts from ACCESS-S1 after bicubic interpolation, and downscaled results from VDSR and VDSD respectively. Only the 1st, 2nd, and 11th members are illustrated in the figure

Methodology
By treating ACCESS-S1 raw forecast and BARRA reanalysis daily data as low-and high-resolution images respectively, we model our SCF downscaling problem as a single image super-resolution (SISR) problem with an additional target on maximising ensemble forecast skill to leverage advanced deep learning techniques and mitigate the mismatch issues. We first briefly describe and review SISR techniques.

Image super-resolution and deep learning
SISR is to recover a high-resolution image H a from a lowresolution one L a . L a is often regarded as the result of a degradation function L a ¼ D H a ; c ð Þ with parameter c . Most super-resolution data sets are obtained by aggregation or degradation mapping from high-resolution images . Given low-and high-resolution image pairs fL a ; H a g a¼1;...;p , SISR is to find a super-resolution mapping function F with parameters h to generate, from low-resolution ones L a , high solution images S a as close as possible to H a The simplest SISR techniques are spatial interpolation, such as nearest neighbour interpolation, bilinear interpolation, and BI. BI uses cubic splines or other polynomial techniques to interpolate data on a two-dimensional regular grid, which could sharpen or enlarge images. BI can consider more neighbouring grid points, and get smoother images with fewer interpolation artefacts than bilinear interpolation. BI is often considered to be the baseline for spatial downscaling of precipitation fields (Vandal et al. 2017). Since Dong et al. (2014) first introduced super-resolution CNN (SRCNN), deep-learning-based SISR techniques have been widely developed and achieved marvellous improvements in terms of image or perceptual quality ). Most of them are based on CNNs (Liu et al. 2020). As one of the most popular deep learning neural networks, a CNN is mainly used in processing gridlike data such as images. Besides an input and an output layer, a CNN has one or more hidden layers. The hidden layers could be of several different types: convolution, activation, pooling, normalisation, and fully-connected one. Its core building block, a convolution layer, uses multiple filters to slide across the height and width of a matrix or matrices, which is input for this layer, to generate a spatial representation of the receptive region of these filters. The spatial representation forms feature maps, which are output to feed the next layer. Multiple convolution layers allow us to study features in quite different local areas, and re-using filters reduces the number of model parameters for training (Dong et al. 2014;Wang et al. 2020). These SISR models use several additional network design techniques, such as gradient clipping and residual learning in very deep super-resolution (Kim et al. 2016), residual dense block in Dense feature fusion (DFF) (Zhang et al. 2018b), and attention mechanism in residual channel attention network (RCAN) (Zhang et al. 2018a). RCAN achieve state-of-the-art in terms of image quality measured by peak signal-to-noise ratio (PSNR) ). To generate more realistic images, encoderdecoder networks and generative adversarial networks (GAN), are later employed in super-resolution GAN (SRGAN) (Ledig et al. 2017) and enhanced SRGAN (ESRGAN) (Wang et al. 2018), and multiple semantic information used in (Rad et al. 2019). ESRGAN outperforms various models in terms of image perceptual quality (Wang et al. 2018).
By leveraging these SISR models for maximising ensemble forecast skills, we develop our SCF downscaling techniques in two steps. First, we select representative deep learning techniques based on their outstanding SISR performance, and train them to generate a high-resolution precipitation image from each low-resolution forecast ensemble member and then choose the one with the highest ensemble forecast skill for the whole of Australia on a separate validation data set (Sect. 3.2). Secondly, based on the selected deep learning structure, we incorporate other resolved climate variables and propose VDSD to enhance downscaling (Sect. 3.3).

Model selection for downscaling ensemble forecasts
SCFs, covering long lead times from weeks to multiple months, are located at the transition between weather forecasting and climate projection, and have been a big challenge in the weather and climate communities for years (Merryfield et al. 2020 hereafter for simplicity. CRPS considers both forecast bias and forecast uncertainty of ensemble members. It reaches its minimum 0 when all the forecasts are identical with the observation, and increases with forecast bias and spread of the ensemble forecast. As the initialisation conditions in SCFs vary from one initialisation date to another, these ensemble members do not correlate across initialisation dates. Instead of generating an aggregated forecast from an ensemble of forecasts e.g., in Liu et al. (2020), we need to generate one highresolution forecast precipitation image from each low-resolution forecast image, such that these high-resolution forecasts can be used directly by applications, such as feeding into biophysical models (Schepen et al. 2020;Basso and Liu 2019;Luo 2016;Jin et al. 2022). Thus, our downscaling problem can be defined as follows. For lowresolution output images from GCMs, precipitation image X ði;l;eÞ 2 R m 0 Ân 0 and other climate variable images Z ði;l;eÞ 2 R m 0 Ân 0 Âp concerning a target high-resolution image Y ðdÞ 2 R mÂn ; we would like to find such a function G, which generates high-resolution precipitation image as the same resolution as Y ðdÞ , where w ði;lÞ j;k is the weight for the ensemble forecast made on date i, lead time l at location (j, k), and d ¼ i þ l. We use w ði;lÞ j;k 1 for this study for simplicity. The downscaling problem is modelled as a SISR problem (Eq. 3) but optimising the average ensemble forecast skill CRPS (Eq. 4). Our deep learning downscaling solution is to provide a SISR model, i.e., the function G, to generate a high-resolution precipitation imageŶ ði;l;eÞ from each low-resolution precipitation forecast image X ði;l;eÞ , as well as other lowresolution climate variable Z ði;l;eÞ from a climate model. Thus, E high-resolution forecastsŶ ði;l;eÞ È É e¼1;...;E , corresponding to an ensemble of E low-resolution forecasts X ði;l;eÞ È É e¼1;...;E , can form a high-resolution ensemble forecast for a forecast target day dð¼ i þ lÞ. Averaging over the CRPS of these ensemble forecasts on the validation days is used as an optimisation objective to finalise the deep learning network architecture.
To determine such a good function G and its parameter h, as illustrated in Fig. 2, we take a relatively simple twostep procedure: the first step is to find a suitable deep learning model as F in Eq. 1 according to the average CRPS, and then incorporate extra variables Z ði;l;eÞ to enhance its downscaling performance. Before that, as illustrated in Fig. 2, we split the forecast data into two groups according to their initialisation dates. In the model selection and development steps in this and next subsections, we partition all the initialisation dates in the 23 years of ACCESS-S1 randomly into two groups. The first group has 1056 initialisation dates and image pairs from this group are used for the model training. The image pairs from the remaining 48 initialisation dates are left for forecast skill validation. The forecast skill CRPS is averaged over the forecasts with lead times up to 30 days. For the performance test in the next section, we put initialisation dates in 1 year out for testing and the other year data for model parameter training to facilitate a fair comparison with Climatology and QM. In the first stage of SISR model selection, we treat our downscaling problem as image super-resolution and employ three SISR models, VDSR (Kim et al. 2016), RCAN (Zhang et al. 2018a), and ESRGAN (Wang et al. 2018). They are chosen because of their outstanding performance on SISR (Wang et al. , 2018. The parameter learning is based on image super-resolution, and our deep learning-based problem becomes: where L is the pixel-level loss function, such as L1, ; or PSNR between high-resolution images Y ðdÞ and superresolution imagesŶ ði;l;eÞ from Eq. 3; k is a trade-off parameter and L h is the high-level image loss like perceptual and/or adversarial loss (Wang et al. 2018). During our model training, we used the default settings of the three SISR models. For example, VDSR used the whole images, and RCAN and ESRGAN used cropped high-resolution patches with the spatial size of 192 Â 192 and 128 Â 128 respectively. Their mini-batch size was 64, 16 and 16 respectively (Wang et al. 2018;Kim et al. 2016;Zhang et al. 2018a). Restricted by the computational resources we could access, the mini-batch of RCAN and ESRGAN was hard to increase. On the separate validation data set for lead times up to 30 days, the average forecast CRPS of trained VDSR, RCAN, and ESRGAN across Australia is 1.38, 1.53, and 1.68 respectively. The forecast skill of RCAN and ESRGAN is not as good as QM with a forecast CRPS of 1.39. For our downscaling problem, both ESRGAN and RCAN were found slow to converge, partially because cross-channel dependency mechanisms became useless for our data. VDSR converged fast and outperformed QM. We selected the VDSR model for further development. We also tried a few different settings of VDSR, such as with 8, 12, 15, 20, 30, or 36 convolutional layers. Its forecast CRPS on the validation data set decreased from 8 to 20 layers, and after that, did not change much on the validation data set. We stuck with 20 layers as Kim et al. (2016) did for SISR.

Very deep statistical downscaling (VDSD)
As we discussed earlier, other climate variables such as temperature or air pressure could influence precipitation and have often been used for precipitation simulation and downscaling (Pan et al. 2019;Baño-Medina et al. 2020). To further improve downscaling performance of VDSR, we include these climate variables in our very deep statistical downscaling (VDSD). The climate variables, different from precipitation, are resolvable in climate modelling and often have more reliable forecasts (Pan et al. 2019;Baño-Medina et al. 2020;Merryfield et al. 2020). Including these climate variables and their combination into VDSR, we have 15 variants of VDSR, indicated as VDSRv1 to VDSRv15, as in Fig. 2. The overall structure of finalised VDSD is illustrated in Fig. 3 in which Geopotential Height (ZG) at 850 hPa is incorporated as the additional input for the 9th ensemble members as in Fig. 1. VDSD, modified from VDSR, mainly has three parts: input, intermediate feature extraction, and output layers. It can take precipitation images and other climate images, such as ZG as input in Eq. 3. These input images have been pre-processed with the same spatial resolution as highresolution output images, and the same normalised value range (0 to 1, detailed in Sect. 2.3). These input images go through multiple feature extraction layers/blocks. These feature extraction layers have both convolution and activation modules, while the output layer only has a convolution module to generate a residual precipitation image. Adding back the interpolated raw precipitation image, indicated by the rightwards arrow with tip downwards in Fig. 3, it finally generates an output image at the same resolution as the target image. VDSD, similar to VDSR, does not have pooling and normalisation layers, as it maintains residual learning which has been widely demonstrated to contribute to robust and speedy training for SISR (Kim et al. 2016;Wang et al. 2020).
As shown in Fig. 3, two or more input images X lr and Z lr , which represent the raw climate forecasts after upsampling (i.e., with 48 km spatial resolution for our applications), first go through the input layer. This layer has a convolution layer and a rectified linear unit (ReLU) activation layer that forces a negative input to zero and leaves a positive input unchanged. The convolution layer has 64 3 Â 3 matrix filters that are slid across the input image and multiplied with the input image to produce 64 first-level feature images. Then the ReLU layer performs the ReLU function (max(0, x)) to force negative values from the feature images to be zero. The operation can be formulated as where M 0 is the first level feature images generated by the input layer, and ReLU() and Conv() are ReLU and convolution layers that perform the ReLU function and 2-dimensional convolution respectively. Each convolutional layer has 64 filters and produces 64 feature images. The filter size is set to be 3 Â 3. Both the padding and stride step lengths are 1. Therefore, the size of each feature image is the same as the size of high-resolution images. Suppose the size of input images is m Â nÂ2, then the size of the feature image generated by the input layer is m Â nÂ64. These basic features then go through multiple intermediate blocks.
The intermediate blocks are identical and each of them consists of a convolutional layer, which extracts deeper spatial features, and a ReLU layer, which introduces nonlinearity and interaction. Each convolutional layer takes 64 feature images from the previous block as input. Therefore, the operation of each intermediate block is the same, which can be written as where M t represents the tth level feature image, and B is the operation of an intermediate block.
The output layer is a convolutional layer that converts 64 high-level feature images into a residual image-that is to use spatial patterns discovered to predict the difference between upsampled low-resolution rainfall forecast and the target image. Finally, the residual image is added to the upsampled precipitation input image to generate a superresolution precipitation forecast. The unknown parameters, Fig. 3 The structure of the VDSD model, modified from VDSR (Kim et al. 2016), where È represents element-wise matrix addition with input precipitation image, orange blocks indicate layers of the neural network, blue rectangles indicate feature images, input and output daily rainfall data images are on the left-hand and right-hand sides respectively. For easy understanding, these input/output images are shown in the original scale, instead of the normalised scale used in downscaling such as these in filters, will be learned from training image pairs.
A convolution layer is used in each of input, intermediate feature extraction, and output layers. A convolution layer, together with a ReLU layer, in each feature extraction block, uses 64 filters with a size of 3 Â 3 to slide through its input images to generate different spatial features with different receptive fields. The receptive field on the original input images gradually becomes larger after passing through one feature extraction block. Around 20 feature extraction blocks can generate spatial features from quite different local areas from both precipitation and the other climate variable images. These features are tuned during model training by adjusting the parameters in the filters. These features may capture potentially complex spatial patterns, which are expected to improve downscaling.
We again used the forecast CRPS across the whole of Australia on the separate validation data set to finalise VDSD. We tested two different types of variants of VDSD.
(1) One added extra input images for downscaling. We tried four climate variables from ACCESS-S1 and their combinations: ZG, daily maximum temperature, daily minimum temperature, and sea level pressure. There are total 15 variants of VDSR. Our results showed that adding more climate variables than ZG rarely improved CRPS, and often deteriorated its ensemble forecast skills. (2) The other was to try two different loss functions in Eq 5, i.e., L1 or L2. It leads to two more variants as in Fig. 2. L1 gave a slightly lower average forecast CRPS of 1.36 for lead times up to 30 days and would be used in our finalised VDSD model. Thus, there are two main differences between VDSD and VDSR. VDSD uses extra input images, and L1, instead of L2 as the learning objective.

Test results and discussions
To illustrate the downscaling performance of VDSR and VDSD whose network structure has been finalised in Sect. 3, we used the last 3-year retrospective forecast data for cross-validation. To facilitate a fair comparison with the ACCESS-S1 calibrated data set (i.e., the QM method), we conducted two leave-one-year-out tests. (1) We took forecasts made on the 48 initialisation dates in 2012 for testing and the other forecasts made before 2012 for training the downscaling models' parameters. Daily BARRA precipitation data between 1 Jan 2012 and 29 July 2013 were included in the test set as the ACCESS-S1 forecasts made on 25 Dec 2012 cover up to 29 July 2013 for its 217-day lead time forecasts. (2) We left forecasts made on the 48 initialisation dates in 2010 as the second test set and took ACCESS-S1 forecasts made in other years, i.e, 1990-2009 and 2011-2012, as training data. Daily precipitation data between 1 Jan 2010 and 29 July 2011 were used in the test. In total, around 1152 daily precipitation images from BARRA were used in the prediction performance test.

Downscaling performance metrics
A benchmark for SCFs for a given year is to use observations on the same day of other years except for the target year in a base period to form an ensemble forecast, which is often called climatology (Li and Jin 2020;Schepen et al. 2020). In this study, we use 1990-2012 as the base period, and thus there are 22 ensemble members in our climatology ensemble forecasts.
As in Sect. 3.2, the average CRPS of ensemble forecasts for each grid point (k, j) on test data is treated as an overall ensemble forecast skill assessment. For each grid point, averaging across all the initialisation dates in the test year, we obtain the average CRPS of a forecast model m for a lead time l, CRPS ðmÞ l;k;j . For further comparison with climatology and easy understanding, we calculate the CRPS skill score for model m against the CRPS of climatology as follows.
The model with a higher CRPS skill score is preferred. The skill score ranges from À1 to 1 and reaches its maximum of 1 when the CRPS is 0, i.e., a perfect forecast where each forecast is identical to its associated observation. The skill score is zero if a forecast has the same average CRPS as climatology. A positive CRPS skill score indicates the downscaled forecast is better than the climatology model, and vice versa.
As downscaling techniques are often assessed by MAE (Liu et al. 2020;Wang et al. 2020), we also use another comparison metric, average MAE, which is defined as Similarly, a higher MAE skill score is preferred.
For a fair comparison, skill scores presented in the following subsections exclude locations on the ocean as QM has results only on the Australian continent and Tasmania.

Results for forecasts made in 2012
Three typical ensemble members forecast on 1 Jan 2012 for 7 Jan 2012, downscaled by BI, VDSR and VDSD respectively, are illustrated in Fig. 1. VDSR keeps similar precipitation area patterns as BI-downscaled ACCESS-S1 forecasts and often has more areas with precipitation. Downscaled results of VDSD follow precipitation patterns of the raw forecasts and more likely reduce precipitation amount for 7 Jan 2012. VDSD could adjust the precipitation area shapes to some degree. For example, for ensemble members 2 and 11, VDSD substantially reduces the precipitation along the 30 S latitude line, which brings its downscaled precipitation images closer to the observations. 5 Averaging across 48 initialisation days in 2012, we calculate its average CRPS skill score for each grid point for different lead times. Figure 4 illustrates average CRPS skill scores across the whole of Australia by the four downscaling models along with different lead times up to 216 days. VDSR has the highest scores in the first three lead times, and then VDSD becomes the best of the four models for almost all the other lead times. For example, for the lead time of 6 days (some typical downscaled results are illustrated in Fig. 1), its CRPS skill scores for the four models are spatially visualised in Fig. S1. 6 For most locations on the Australian land (except north-western Australia, the eastern seaboard of Australia, and Tasmania), VDSD has a positive CRPS skill score. It has very high skills for locations in the central part of the Australian mainland where its three counterparts perform badly. The average CRPS skill score of VDSD is 5:69 Â 10 À2 . It is higher than 2:13 Â 10 À2 , À8:50 Â 10 À3 and À1:21 Â 10 À1 of VDSR, QM and BI respectively (the second column, Table 1). Averaging across the 217 different lead times, VDSD has positive CRPS skill scores for most locations in Australia, while its three counterparts have negative scores for most locations (Fig. S2). Their mean CRPS skill scores are 5:63 Â 10 À3 ; À2:54 Â 10 À2 ; À1:05 Â 10 À1 , and À1:42 Â 10 À1 , respectively. That means only VDSD is on average better than the climatology for most locations. Among the four downscaling techniques, only VDSD is better than climatology on average as its positive CRPS skill score.
To check the performance of these downscaling techniques for sub-seasonal forecasts, Fig. 5 illustrates the average CRPS skill scores for the first 45 lead times. The skill scores of BI are around their mean of À1:39 Â 10 À1 for most locations in Australia. QM has some improvement with a mean of À7:40 Â 10 À2 . For most locations, VDSR has skill scores close to 0 with a mean of À4:65 Â 10 À3 . VDSD has positive skill scores for most locations on the Australian continent, with a mean of around 2:76 Â 10 À2 . VDSD still has negative skill scores along the eastern coastline, north-western parts of Australia, and Tasmania. VDSD has better forecast accuracy in terms of average MAE (Fig. S3). Except for the lead time of 0, VDSD has the lowest MAE values. Its average MAE is 1.37 mm/day for the first 45 lead times. VDSR comes second with an average MAE of 1.67 mm/day. They are much smaller than the other three methods, QM, BI and climatology, with the average MAE of 2.06, 2.29, and 2.20 mm/day respectively. The MAE skill scores of the four downscaling methods are 4:12 Â 10 À1 , 2:30 Â 10 À1 , 7:29 Â 10 À2 , and À9:90 Â 10 À2 respectively as listed in the third column, Table 1.
Similar to CRPS skill scores, the two deep learning methods hold their improvement for the long lead times in terms of the MAE skill scores as illustrated in Fig. 6. Except for the first six lead times, BI has negative MAE skill scores. QM often has positive MAE skill scores. Both deep learning models, VDSR and VDSD have substantial improvements for all the different lead times. Averaging across these 217 different lead times, the MAE skill scores of VDSD, VDSR, QM and BI are around 4:17 Â 10 À1 , 2:29 Â 10 À1 , 4:00 Â 10 À2 , and À9:48 Â 10 À2 respectively. Figure 7 illustrates the average CRPS skill scores along with different forecast lead times based on SCFs made on the 48 different initialisation days in 2010. For most of the 217 different lead times, VDSD has the highest CRPS skill score among the four models. For example, for the lead time of 6 days, the CRPS skill scores of two deep learning models are positive in most locations in Australia (Fig. S4). In comparison, QM and BI have negative CRPS skill scores in various locations. VDSD has higher CRPS skill scores than VDSR in the southeast and south-central Australia though both have quite similar spatial patterns. On average, the average CRPS skill scores of VDSD and VDSR are 5:34 Â 10 À2 and 3:97 Â 10 À2 . They are substantially higher than À3:01 Â 10 À2 and À8:79 Â 10 À2 of QM and BI respectively ( Table 2). The average CRPS skill scores of VDSD and VDSR across the 217 different lead times are often positive or close to zero for most locations on Australian land (Figs. S5d and S5c), and both QM and BI are normally in the negative domain (Figs. S5b and S5a). The mean CRPS skill scores for VDSD, VDSR, QM and BI across Australian land and 217 lead times are À1:02 Â 10 À2 , À2:53 Â 10 À2 , À6:46 Â 10 À2 , and À6:52 Â 10 À2 respectively (Table 2). VDSD is about 1:51 Â 10 À2 higher skill score than VDSR, and 5:50 Â 10 À2 higher than both the traditional downscaling techniques. VDSD is slightly worse than climatology on average for the 217 different lead times. Note that VDSD has 11, less than 22 in climatology, ensemble members that can lead to a few percentage points lower CRPS skill score (Ferro et al. 2008;Li and Jin 2020).

Results for forecasts made in 2010
For sub-seasonal forecasts, the average CRPS skill scores for the first 45 different lead times are 1:38 Â 10 À2 , À1:02 Â 10 À3 , À6:62 Â 10 À2 , À9:06 Â 10 À2 , respectively, for VDSD, VDSR, QM and BI. As illustrated in Fig. 8, for most locations in Australia, both VDSR and VDSD have CRPS skill scores between À0:10 and 0.10 while VDSD has slightly higher CRPS skill scores in northern and eastern Australia. From the forecast accuracy, VDSR and VDSD have the average MAE values for the first 45 lead times around 2.14 and 2.07 mm/day respectively. They are smaller than 2.92, 3.16 and 2.74 mm/day obtained by QM, BI, and climatology, respectively. Thus VDSD, VDSR, QM and BI again have MAE skill scores from high to low, as listed in Table 2.
The four downscaling techniques keep their MAE skill score orders almost for all the 217 different forecast lead times as illustrated in Fig. 9. VDSD and VDSR always have positive skill scores. Except for the first eight lead times, both QM and BI have negative skill scores. Averaging across these 217 lead times, the MAE skill scores of these four models are 2:41 Â 10 À1 , 2:13 Â 10 À1 , À9:14 Â 10 À2 and À1:97 Â 10 À1 respectively. VDSD has a relatively small improvement against VDSR, and both are much better than climatology. Considering these results, we conclude VDSD is comparable with climatology in terms of both forecast accuracy and ensemble forecast skill for the SCFs made on the 48 initialisation dates in 2010.

Discussions
From Figs. 4, 6, 7, and 9, we can see that VDSD normally has the highest average CRPS and MAE skill scores among  the four downscaling models for different forecast lead times. The skills of ACCESS-S1 raw forecasts are often a bit away from climatology, indicated by the negative CRPS and the negative MAE skill scores of BI. VDSD often makes substantial improvements on downscaled forecasts. As listed in the last row of Table 1, VDSD has about 1:47 Â 10 À1 higher CRPS and 5:12 Â 10 À1 higher MAE skill scores than BI averaging across the whole of Australia and the 217 lead times on the SCFs made in 2012. The improvement can be slightly higher for the first 45 lead times. Similarly, for 2010 in Table 2, VDSD has about 5:50 Â 10 À2 higher CRPS skill scores and 4:38 Â 10 À1 higher MAE skill scores on average. Thus, the bias and displacement issues in the ACCESS-S1 raw forecasts are mitigated to some degree. VDSR also makes substantial improvements over BI, though it performs slightly worse performance than VDSD. The improvement of VDSD and VDSR may come from their large receptive fields of around 492 km Â 492 km due to the 20 convolutional layers, which could benefit local forecasts from GCM's forecast skills on a large scale. The relatively small image patches used for training the other two outstanding SISR models, RCAN and ESRGAN, would fail to capture GCM's forecast skills on a large scale. The further improvement of VDSD over VDSR illustrates the usefulness of incorporating other climate variables into downscaling.  Though VDSD makes a substantial improvement, for the SCFs made in 2010, its average overall ensemble forecast skill, CRPS skill score, is slightly worse than climatology in general. There are some possible reasons. (1) For forecasts made in 2010, rainfall data from 1 Jan 2010 to 29 July 2011 are used for skill assessment. The years 2010 and 2011 are the third-and second-wettest calendar years on record for Australia, with 703 mm and 708 mm respectively. Both are well above the long-term average of 465 mm due to the La Niña event peak. 7 The La Niña event peak in 2012 is much weaker and made 2012 relatively easier to forecast. That means the training data for the models to test on 2010 and 2011 have relatively less precipitation, hence VDSD intends to move in that direction, which deteriorated its performance for the SCFs made in 2010. (2) The climatology benchmark we have used has 22 ensemble members. Such a double ensemble size can lead to a few percentage points higher CRPS skill scores (Ferro et al. 2008). (3) The host climate model ACCESS-S1 may perform worse in 2010 than in 2012, on e.g., geopotential heights. For both test settings, VDSD's final performance still heavily depends on ACCESS-S1 raw forecasts. Therefore, to generate skilful SCFs, more research efforts should be put into improving both deep- learning downscaling methods like VDSD and climate models like ACCESS-S1.
To compare other downscaling techniques for Australia, the deep learning techniques are preliminarily examined against station-based techniques: Extended Copula-based Post-Processing (ECPP) (Li and Jin 2020), and an improved analogue method (Shao and Li 2013), as well as dynamic downscaling CCAM (Thatcher and McGregor 2009). The analogue method (Shao and Li 2013) is a specific implementation of the well-known K-Nearest Neighbours (KNN) method in the machine learning community. As most station-based downscaling techniques are time-consuming to train models (like ECPP) or to make forecasts (like analogue) for the whole of Australia, we only choose five weather stations for a simple comparison. These five stations are from three different states in Australia and have different climate characteristics (Table S4). The forecast lead time is up to 28 days as the dynamic downscaling like CCAM generates a huge amount of data for its 11 runs with different boundary conditions from 11 ensemble members for each SCF initialisation date. Similarly, the initiation date is restricted to the first day of each month in 2012. The CCAM downscaling used in this comparison employed a better soil initialisation with spinup soil temperature and moisture, which have improved forecast skills. The average CRPS skill score for a single lead time is averaged from the five locations and 12 different initialisation days. For the first week (with a lead time of up to 6 days), all the downscaling techniques except BI have a positive skill score (the second column in Table S5). ECPP performs best, followed by two deep learning downscaling techniques VDSD and VDSR. For the first 2 weeks (lead time up to 13 days), VDSD performs best on average, followed by VDSR, ECPP and QM. Averaging over 29 different lead times, VDSD is better than ECPP, QM and VDSR that have similar forecast skills as climatology. Analogue, BI and CCAM have negative skill scores as the raw seasonal forecasts lose skills with a longer lead time for these five weather stations. VDSD often has better forecast skills than its counterparts for the three different lead time periods. These improvements are relatively small compared with the large standard deviation of CRPS skill scores (Table S5), though we can claim at least that VDSD has comparable forecast skills as its competitors.
Another comparison can be made from the perspective of computational time. Table 3 lists the average computation time required for both training and operation/test where 11 ensemble members for 217 days forecasts from ACCESS-S1 are downscaled. We used the following hyper-parameters in both VDSR and VDSD. The number of epochs was 50, for which we normally observed the

Conclusion
Downscaling long lead time daily rainfall ensemble forecasts has been modelled as a single image super-resolution (SISR) problem with an additional target on maximising overall ensemble forecast skill-continuous ranked probability score (CRPS). To leverage advanced deep learning techniques, we have applied three outstanding SISR techniques to generate high-level features and learn non-linear relationships automatically, and chosen very deep superresolution (VDSR) as the most suitable model. The selection has been based on average CRPS on a randomly selected validation data set. We have incorporated an extra climate variable, geopotential height, into VDSR and established the very deep statistical downscaling (VDSD) model with an expectation to enhance downscaling. Both deep learning models have finalised their structures based on the average CRPS on the validation data. On leave-oneyear-out cross-validation for 48 ensemble SCFs made in 2012 and 2010, VDSD has outperformed VDSR and two traditional downscaling techniques QM and BI in terms of both forecast accuracy and skill on the whole of Australia. With positive CRPS skill scores in general in Australia, VDSD has outperformed climatology, a benchmark for long lead time ensemble climate forecast, in 2012 and the first 45 lead times in 2010, though its improvement has become smaller with longer forecast lead time. As evidenced by its forecast skill improvement over the ACCESS-S1 forecasts in 2012 and 2010, optimising CRPS in our model development might have mitigated the mismatch issues between the ACCESS-S1 low-resolution raw forecasts and high-resolution observations to some degree. A simple comparison with two station-based statistical downscaling methods, an improved analogue method and ECPP, and one dynamic downscaling model CCAM on five representative weather stations has further demonstrated VDSD's advantages. Compared with the CRPS skill score variabilities, the improvement of VDSD over VDSR or ECPP is relatively small. VDSD is comparable with, if not better than, various downscaling counterparts in terms of forecast skill and accuracy. On the other hand, both VDSD and VDSR, after lengthy training over a large amount of data, have downscaled long lead time daily precipitation very fast. They have been normally much faster than sophisticated station-based downscaling techniques like ECPP and analogue, and dynamic downscaling like CCAM for the whole of Australia. Thus, deep learning models, such as the proposed VDSD, have demonstrated their potential for possible operational use in the future.
Though deep-learning-based downscaling methods can provide more skilful high-resolution SCFs to drive impact models or biophysical models, the accuracy and overall forecast skills of these SCFs may still not be high enough for direct use by wider communities such as agriculture and hydrology (Kusunose and Mahmood 2016;Luo 2016). There are several directions to move the proposed technique for daily operation in the future. Station-based precipitation observations have not been assimilated in BARRA and its grid precipitation may not be very consistent with on-the-ground observations (Acharya et al. 2019). To remove such inconsistency, station-based downscaling techniques might further improve long lead time forecasts. Integrating station-based techniques with deep learning and conduct comprehensive comparison are subject to future work. VDSD only downscales to 12 km, which should be further enhanced to a higher resolution for real-world applications, by including other inputs (Vandal et al. 2017;Pan et al. 2019). Deep-learning-based downscaling still requires a lot of time for model development, structure finalisation, and parameter training, and we leave downscaling techniques for other climate variables as future work. For a fair comparison (and saving training time), we have only included the forecasts with forecast lead times less than 7 days into the training data. It would lead our models to take less attention to correcting inherent biases of GCM's long lead time forecasts. Skilful SCFs depend more heavily on progresses made in climate modelling. Thus, to deliver skilful SCFs to final applications, climate modelling and deep learning communities should collaborate closely for further development.

Declarations
Conflict of interest The authors have no relevant financial or nonfinancial interests to disclose.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons. org/licenses/by/4.0/.