TRU-NET: A Deep Learning Approach to High Resolution Prediction of Rainfall

Climate models (CM) are used to evaluate the impact of climate change on the risk of floods and strong precipitation events. However, these numerical simulators have difficulties representing precipitation events accurately, mainly due to limited spatial resolution when simulating multi-scale dynamics in the atmosphere. To improve the prediction of high resolution precipitation, we apply a Deep Learning (DL) approach that takes as input CM simulations of the model fields (weather variables) that are more predictable than local precipitation. To this end, we present TRU-NET (Temporal Recurrent U-Net), an encoder-decoder model featuring a novel 2D cross attention mechanism between contiguous convolutional-recurrent layers to effectively model multi-scale spatio-temporal weather processes. We use a conditional-continuous loss function to capture the zero-skewed patterns of rainfall. Experiments show that our model consistently attains lower RMSE and MAE scores than a DL model prevalent in short term precipitation prediction and improves upon the rainfall predictions of a state-of-the-art dynamical weather model. Moreover, by evaluating the performance of our model under various training and testing data formulation strategies, we show that there is enough data for our deep learning approach to output robust, high-quality results across seasons and varying regions.


Introduction and Background
Across the globe, society is becoming increasingly prone to extreme precipitation events due to climate change. The United Nations stated that flooding was the most dominant weather-related disaster over the 20 years to 2015, affecting 2.3 billion people and accounting for $1.89 trillion in reported economic losses (Wallemacq and Herden, 2015). With the increase in the monetary and societal risk posed by flooding (Shukla et al., 2019), CM predictions of extreme precipitation events are an important resource in guiding the decisions of policy-makers.
State-of-the-art regional climate models (RCM) typically run at horizontal resolutions of ∼2-25 km for simulations. These simulations provide an approach to get local detail, but they are computationally expensive and must be developed separately for each CM (May, 2004, IPCC, 2007). Alternatively, cheaper statistical methods can be applied to the output of any coarser resolution CM to achieve quantification of model uncertainty and explore a wide range of scenarios. The main aim of this paper is to create a model that can produce high resolution predictions for daily rainfall across the UK, by using low resolution CM simulations of model fields (weather variables) as input. It should be noted that the input simulations do not include precipitation, but include other weather variables which, unlike precipitation, are well simulated. Once trained, our model can be applied to the output of any CM simulation of the future. This allows us to produce computationally cheap long-term forecasts of high resolution precipitation, which will help to diagnose changes in precipitation events due to climate change.
To do this, we use the ERA5 reanalysis dataset (Hersbach et al., 2020) as an analogue for CM output. Reanalysis data is based on a weather forecast model, running at a similar resolution to a CM, to which observations are constantly assimilated to yield the best estimate of the weather state. We take these historical weather state estimates and use them to predict the high resolution precipitation observations made available by the E-obs dataset (Cornes et al., 2019).
More concretely, our input data is formed as a time series of daily records, indexed by day $t \in [1, T]$, containing 6 key model fields (air temperature, specific humidity, longitudinal and latitudinal components of wind velocity at 850 hPa, geopotential height at 500 hPa and total column water vapour in the entire vertical column), each defined on a (20×21) grid representing the UK at approximately 65 km spatial resolution. By stacking together the six model fields, we have a (20, 21, 6) tensor representing the UK weather state at 6-hour intervals. Our model will therefore take as input a sequence of daily model field observations $X_t \in \mathbb{R}^{4 \times 20 \times 21 \times 6}$, from which it will output a prediction for the true daily total precipitation (mm), $Y_t \in \mathbb{R}^{100 \times 140}$, defined on a (100, 140) grid over the UK with approximately 8.5 km spatial resolution.
Interpreting $X_t$ as a spatio-temporal sequence of low resolution 3D images and $Y_t$ as a spatio-temporal sequence of high resolution 2D images, our task can be interpreted as a combination of Sequence Transduction and Image Super-Resolution. DeepSD (Vandal et al., 2017) utilised a popular Image Super-Resolution model to downscale 2D precipitation images up to a factor of 4x. Vandal et al. (2018) extended DeepSD by incorporating an optimization scheme utilising a conditional-continuous (CC) loss function for improving the modelling of zero-skewed precipitation events. In the related sequence transduction task of precipitation nowcasting, Shi et al. (2015) introduced the Convolutional Long Short-Term Memory (ConvLSTM) cell to simultaneously model the behaviour of weather dynamics in space and time, using convolutions to incorporate the surrounding flow-fields and a recurrent structure to model the temporal dependencies of weather. Extending upon this, the encoder-decoder ConvLSTM (Shi et al., 2015) is also able to represent weather dynamics defined on multiple spatial scales due to the use of successive convolution based layers, with each successive layer modelling larger scale dynamics.
We extend the encoder-decoder ConvLSTM by adding the ability to represent weather dynamics defined on multiple scales in time with our proposed TRU-NET model, a Convolutional Gated Recurrent Unit (ConvGRU) based encoder-decoder that includes the following three features:
1. A novel Fused Temporal Cross Attention (FTCA) mechanism that improves upon existing methods (Jauhar et al., 2018, Zhao et al., 2019, Liu et al., 2019b) for modelling multiple temporal scales of weather dynamics using a stacked recurrent structure.
2. An encoder-decoder structure adapting the U-NET (Ronneberger et al., 2015) structure, by contracting or expanding the temporal dimensions as opposed to the spatial dimensions.
3. A conditional continuous (Husak et al., 2007, Vandal et al., 2018) loss training scheme to improve the prediction of extreme precipitation events, by decoupling the modelling of low intensity precipitation events and high intensity precipitation events.
We present TRU-NET and the conditional continuous loss in Section 2 and discuss the model training details in Section 3. In Section 4, we perform experiments to compare TRU-NET to baselines and investigate TRU-NET's performance on out-of-sample forecasting tasks. Finally, we perform an ablation study to compare our proposed FTCA against alternative existing methods. The results of these experiments show that:
- Our novel model, TRU-NET, achieves a lower RMSE and MAE than both a state-of-the-art dynamical weather forecasting model's coarse precipitation prediction and a hierarchical ConvGRU model.
- The quality of TRU-NET's predictions is stable when evaluated on out-of-sample weather predictions formed by time periods outside of the training set.
- Our proposed FTCA outperforms existing methods for decreasing the temporal scale modelled by successive recurrent layers in a stacked recurrent structure.

Temporal Recurrent U-NET
Our TRU-NET model, visualised in Figure 1, maps the 6-hourly low resolution model fields, $X_t \in \mathbb{R}^{4 \times 20 \times 21 \times 6}$, to a representation capturing variability on 6-hourly, daily and weekly time scales, and then uses these representations to output a prediction, $\hat{Y}_t \in \mathbb{R}^{100 \times 140}$, for daily total rainfall.
Fig. 1: TRU-NET architecture. This figure depicts the Conditional Continuous variant, which outputs values for rain level and rain probability for 28 consecutive days. The Sequence Length of 3D tensors between layers contracts/expands through the encoder/decoder. This relates to an increasing/decreasing of the temporal scales modelled. The horizontal direction from left to right indicates time.
As a first step within TRU-NET, we map the input data of the coarse grid onto the fine grid using bi-linear upsampling.
The encoder contains a stack of 3 bi-directional ConvGRU layers. Within the encoder, these 3 layers map the input into coarser spatial/temporal scales, from six-hourly/8.5km, to daily/34km, and to weekly/136km. These scales are aligned to the timescales associated with extreme rainfall events in the UK (Burton, 2011). To achieve this reduction in the temporal scales modelled by contiguous encoder layers, we propose a novel Fused Temporal Cross Attention (FTCA) mechanism, as shown in Figure 2.
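The upsampling step can be illustrated with a minimal pure-Python sketch. It assumes a corner-aligned interpolation convention and a single 2D field; the function name `bilinear_upsample` and the synthetic test field are illustrative, and the authors' TensorFlow implementation may use a different alignment convention.

```python
def bilinear_upsample(coarse, out_h, out_w):
    """Resample a 2D list-of-lists onto an (out_h, out_w) grid.

    Assumes corner alignment: the first and last coarse points map exactly
    onto the first and last fine points. Requires at least 2 points per axis.
    """
    in_h, in_w = len(coarse), len(coarse[0])
    fine = []
    for i in range(out_h):
        # Fractional row position of fine row i within the coarse grid.
        y = i * (in_h - 1) / (out_h - 1)
        y0 = min(int(y), in_h - 2)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1)
            x0 = min(int(x), in_w - 2)
            wx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = (1 - wx) * coarse[y0][x0] + wx * coarse[y0][x0 + 1]
            bottom = (1 - wx) * coarse[y0 + 1][x0] + wx * coarse[y0 + 1][x0 + 1]
            row.append((1 - wy) * top + wy * bottom)
        fine.append(row)
    return fine


# A synthetic (20 x 21) coarse field mapped onto the (100 x 140) fine grid.
coarse_field = [[float(r + c) for c in range(21)] for r in range(20)]
fine_field = bilinear_upsample(coarse_field, 100, 140)
```

In the full model this operation would be applied to each of the six model fields at every six-hourly time step.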
The decoder maps the latent representation captured at the weekly scale back to the daily scale before feeding it to an output layer for daily rain prediction.
Due to memory constraints we do not input the full $(28 \cdot 4 \times 100 \times 140 \times 6)$-dimensional model fields at once. In space, we extract stencils of 16 × 16 grid-points as input and predict precipitation over the stencil of 4 × 4 grid-points in the centre of the input stencil. In time, TRU-NET processes 28 days' worth of information at a time, generating an output of total daily precipitation for all 28 days in each application of TRU-NET (sequence length J = 28). This naturally leads to a lack of information on the past for the first time steps (days 1, 2, 3, ...) and a lack of information on the future for the last time steps (days ..., 26, 27, 28). In future studies, this could be avoided by streaming the input data and only making predictions for the time steps in the centre of each sequence. In the following, we describe each of the main components of TRU-NET in more detail.
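The spatial stencil extraction can be sketched as follows. This is a minimal illustration, assuming the input has already been upsampled to the fine grid and that stencils are tiled with a stride equal to the target size; the stride, the helper names and the treatment of the domain border are assumptions not stated in the text.

```python
INPUT_SIZE = 16    # side length of the input stencil (grid points)
TARGET_SIZE = 4    # side length of the central target stencil
MARGIN = (INPUT_SIZE - TARGET_SIZE) // 2   # 6 grid points of context per side


def crop(field, top, left, size):
    """Return a (size x size) window of a 2D list-of-lists."""
    return [row[left:left + size] for row in field[top:top + size]]


def stencil_origins(height, width, stride=TARGET_SIZE):
    """Top-left corners of all input stencils that fit inside the grid.

    With stride == TARGET_SIZE the central targets tile the interior without
    overlap (an assumption; border points are not covered in this sketch).
    """
    rows = range(0, height - INPUT_SIZE + 1, stride)
    cols = range(0, width - INPUT_SIZE + 1, stride)
    return [(top, left) for top in rows for left in cols]


def input_and_target(inputs, targets, top, left):
    """Pair a 16x16 input stencil with the 4x4 target at its centre."""
    x = crop(inputs, top, left, INPUT_SIZE)
    y = crop(targets, top + MARGIN, left + MARGIN, TARGET_SIZE)
    return x, y


# Synthetic (100 x 140) grid whose values encode their own position.
grid = [[r * 140 + c for c in range(140)] for r in range(100)]
origins = stencil_origins(100, 140)
x_patch, y_patch = input_and_target(grid, grid, *origins[0])
```

Each input stencil thus carries 6 grid points of surrounding context on every side of the region being predicted.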

Encoder
The encoder of our TRU-NET model, as shown in Figure 1, has $L = 3$ ConvGRU layers, where the $l$-th layer decreases the sequence length by a factor $m_l$ ($m_1 = 1$, $m_2 = 4$, $m_3 = 7$). This results in the number of units in each ConvGRU based layer decreasing in the manner 112 → 28 → 4, corresponding to six-hourly, daily and weekly temporal resolutions.
The conventional ConvGRU is a recurrent neural network designed to model spatial-temporal information. In a conventional ConvGRU layer, each unit $i$ shares its trainable weight matrices $\{W_k, U_k, b_k : k \in \{z, r, \tilde{A}\}\}$ with the other units in the layer, and collectively they are described as having tied weights. Each unit $i$ takes two inputs, namely the previous state $A_{i-1}$ and the input in the current time step $B_i$, and outputs a state $A_i$ with dimensions $(h_a, w_a, c_a)$, as detailed below. Here, $z_i$ is the update gate, $r_i$ is the reset gate, $\tilde{A}$ is the cell state, and • and * denote the Hadamard product and convolution, respectively.
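The sequence-length arithmetic above can be checked with a short sketch. The function name `encoder_sequence_lengths` is illustrative; the numbers come directly from the text (28 days of six-hourly inputs and reduction factors 1, 4 and 7).

```python
DAYS = 28
STEPS_PER_DAY = 4                  # six-hourly inputs
REDUCTION_FACTORS = [1, 4, 7]      # per-layer factors m_1, m_2, m_3


def encoder_sequence_lengths(input_length, factors):
    """Number of recurrent units in each encoder layer."""
    lengths = []
    current = input_length
    for factor in factors:
        if current % factor != 0:
            raise ValueError("sequence length must be divisible by the factor")
        current //= factor
        lengths.append(current)
    return lengths


lengths = encoder_sequence_lengths(DAYS * STEPS_PER_DAY, REDUCTION_FACTORS)
print(lengths)  # [112, 28, 4]
```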
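For reference, one common formulation of the ConvGRU update, written with the symbols defined above, is given below. This is a standard textbook form and an assumption here: the paper's own Equation 2 is not reproduced in this text and may differ, for example in whether $z_i$ or $1 - z_i$ weights the previous state. $\sigma$ denotes the logistic sigmoid and $\circ$ the Hadamard product.

```latex
\begin{align}
z_i &= \sigma\left(W_z * B_i + U_z * A_{i-1} + b_z\right) \\
r_i &= \sigma\left(W_r * B_i + U_r * A_{i-1} + b_r\right) \\
\tilde{A}_i &= \tanh\left(W_{\tilde{A}} * B_i + U_{\tilde{A}} * (r_i \circ A_{i-1}) + b_{\tilde{A}}\right) \\
A_i &= z_i \circ A_{i-1} + (1 - z_i) \circ \tilde{A}_i
\end{align}
```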
When mapping an input from one time scale to another, e.g. generating the daily time scale tensor for day t from a sequence of 4 corresponding six-hourly time scale tensors, a simple approach is to average the 4 six-hourly tensors. However, such a simple aggregation strategy ignores the influence of the daily time scale tensor from the previous day t − 1. We instead propose Fused Temporal Cross Attention (FTCA), as a better aggregation strategy based on the cross attention mechanism.
In the final two ConvGRU layers of the encoder, FTCA is fused into the ConvGRU in order to aggregate the inputs from the previous layer to generate a representation for the current layer. The ConvGRU with FTCA is illustrated in Figure 2 and explained in the following subsection.

Convolutional Gated Recurrent Unit with Fused Temporal Cross Attention (ConvGRU w/ FTCA)
In the conventional ConvGRU, the $i$-th unit of the $l$-th layer, denoted as $D_{l,i}$, takes two inputs, the previous state $A_{i-1}$ and the input in the current time step $B_i$. In our setup here, however, we stack ConvGRU layers with different temporal scales. As such, the input in the current time step to $D_{l,i}$ is no longer a single tensor but an ordered sequence of tensors, $B_{l,i}$, as shown in Figure 2(a), where the input $B_{l,i}$ consists of $T_b$ time-aligned outputs from the $(l-1)$-th ConvGRU layer, i.e., $B_{l,i} \equiv \{A_{l-1,j} : j = 1, \ldots, T_b\}$. For example, if the $l$-th layer has the daily time resolution, then the $(l-1)$-th layer would have the six-hourly time resolution, so that $T_b = 4$. To aggregate this sequence, we propose a Fused Temporal Cross Attention (FTCA) mechanism to calculate a weighted average of $B_i$. Here, we use $A_{i-1}$ to derive a query tensor and $B_i$ to derive both a key tensor and a value tensor. The query tensor is compared with the key tensor to generate weights, which are used to aggregate the elements of the value tensor into the final aggregated representation of $B_i$. Afterwards, the ConvGRU operations in Equation 2 are resumed. The FTCA related operations for unit $i$ are decomposed into the following three steps:
- Downscaling representations: On $A_i$ and $B_i$, we first perform a 3D average pooling (3DAP) with a pool size of $M \times M \times 1$ and transform them to matrices $A^{PF}_i$ and $B^{PF}_i$. The use of 3D average pooling is motivated by the high spatial correlation within a given feature map, due to the spatially correlated nature of weather, and by the need to reduce the computational expense of the matrix multiplication.
- Similarity calculation using relative attention score (RAS): We then compute a matrix of weights $S^{(1, T_b)}$, corresponding to the $T_b$ vectors in $K$ (Equation 4). Note that we use the relative attention score (RAS) function (Shaw et al., 2018) to compute the similarity in Equation 4. Generally, the inner product function is used to calculate the similarity scores between $Q$ and each vector $K_b$ (Vaswani et al., 2017). RAS extends this inner product scoring function by considering the relative position of each vector $K_b$ to one another. In our case, this position relates to the temporal position of $K_b$ relative to the other members of $K$. To facilitate this, we also learn vectors $a^K_b$ which encode the relative position of each $K_b$.
- Informative representation: Finally, the new informative representation $B$ is learnt using two trainable convolution weight matrices, $W$, with $c_f$ filters and a set of trainable vectors $a^V_i \in a^V$, encoding the relative position of each vector $V_i \in V$.
We also use Multi-Head Attention (MHA), which allows the attention mechanism to encode multiple patterns of information by using $H$ heads, each with its own $\{W_Q, W_K, W_{V_1}\}$, and performing $H$ parallel cross-attention calculations. The different values of $\{W_Q, W_K, W_{V_1}\}$ across the heads capture different patterns or relationships in the data, whereas using a single head leads to less diverse and less informative patterns being captured.
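The aggregation idea behind these steps can be sketched for a single head in pure Python. The sketch follows the general relative-position attention form of Shaw et al. (2018), with scores $q \cdot (k_b + a^K_b) / \sqrt{d}$ and an output of $\sum_b s_b (v_b + a^V_b)$. It is an illustration under assumptions rather than the paper's exact Equation 4: plain vectors stand in for the pooled and flattened tensors, and the learned projections $W_Q$, $W_K$ and $W_V$ and the final convolutions are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def relative_cross_attention(query, keys, values, rel_keys, rel_values):
    """Aggregate T_b value vectors into one, guided by a single query.

    query      : vector derived from the previous state of the current layer
    keys/values: T_b vectors derived from the lower layer's outputs
    rel_keys   : T_b position vectors added to the keys before scoring
    rel_values : T_b position vectors added to the values before averaging
    """
    d = len(query)
    scores = [
        dot(query, [k + a for k, a in zip(key, rel)]) / math.sqrt(d)
        for key, rel in zip(keys, rel_keys)
    ]
    weights = softmax(scores)
    aggregated = [0.0] * len(values[0])
    for w, value, rel in zip(weights, values, rel_values):
        for c in range(len(aggregated)):
            aggregated[c] += w * (value[c] + rel[c])
    return weights, aggregated


# Four six-hourly vectors aggregated into one daily vector (T_b = 4).
query = [1.0, 0.0]
keys = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
no_rel_k = [[0.0, 0.0]] * 4   # zero position vectors: plain dot-product attention
no_rel_v = [[0.0]] * 4
weights, aggregated = relative_cross_attention(query, keys, values, no_rel_k, no_rel_v)
```

With identical keys the weights become uniform and the result reduces to the simple average discussed earlier, which is the baseline that the attention-based aggregation generalises.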

Decoder
The decoder is composed of one Dual State ConvGRU (dsConvGRU) layer and an output layer which outputs predictions for the rain level $\hat{Y}_{t:t+28}$ for 28 consecutive days. If the conditional-continuous framework is in use, a second output layer outputs the corresponding predictions for the probability of rainfall $\hat{r}_{t:t+28}$, as illustrated in Figure 1.
dsConvGRU: As illustrated in Figure 1, the inputs to the dsConvGRU layer come from the 2nd and the 3rd Encoder layers, while the output of the dsConvGRU layer is a sequence of 28 tensors which form a latent representation for the 28 days of the target daily precipitation $Y_{t:t+28}$. As the dsConvGRU layer contains 28 units, we must expand the 3rd Encoder layer's output from sequence length 4 to sequence length 28. To do this, we repeat every element in the sequence of length 4 seven times, as in Tai et al. (2015). As such, each unit in the dsConvGRU layer receives an input from the temporally aligned unit in the 3rd Encoder layer.
Extending Equation 2, the dsConvGRU augments the conventional ConvGRU by replacing the input $B_i$ with two separate inputs $B_{i,(1)}$ and $B_{i,(2)}$, each possessing the same dimensions as $B_i$. The $i$-th unit of the dsConvGRU layer therefore takes three inputs, $A_{i-1}$, $B_{i,(1)}$ and $B_{i,(2)}$, and outputs a state $A_i$.
Finally, referring to Equation 2, we calculate two sets of values, one for each of the two inputs.
Output Layer: As we need to output two sequences of values, rainfall probabilities $\hat{r}_{t:t+28}$ and rainfall values $\hat{Y}_{t:t+28}$, for the conditional-continuous framework discussed in Section 2.4, our model contains a separate output layer stacked over the dual-state ConvGRU layer for each output. Each output layer contains two 2D convolution layers, with 32 and 1 filters respectively and a kernel shape of (3, 3).
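The temporal expansion by repetition can be sketched as follows; the function name and the string placeholders standing in for tensors are illustrative.

```python
def expand_by_repetition(sequence, repeats):
    """Repeat every element of a sequence `repeats` times, preserving order."""
    return [element for element in sequence for _ in range(repeats)]


# Four weekly representations expanded to 28 daily-aligned inputs.
weekly = ["week1", "week2", "week3", "week4"]
daily_aligned = expand_by_repetition(weekly, 7)
```

Days 1 to 7 then all receive the first weekly representation, days 8 to 14 the second, and so on.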

Conditional Continuous (CC) Augmentation
To reflect the zero-skewed nature of rainfall data, due to many days without rainfall, a conditional continuous (CC) distribution (Husak et al., 2007, Stern and Coe, 1984) is often used to model precipitation. These distributions can be interpreted as the composition of a discrete component and a continuous distribution to jointly model the occurrence and intensity of rainfall:
$$p(y_t) = (1 - r_t)\,\delta(y_t) + r_t\, g(y_t),$$
where $\delta$ is the Dirac function such that $\int_{-\infty}^{\infty} \delta(x)\,dx = 1$, $r_t$ is the probability of rain on the $t$-th day and $g(\cdot)$ is a Gaussian distribution with unit variance and the predicted rainfall $\hat{Y}_t$ as its mean. Therefore $(1 - r_t)\,\delta(y_t)$ models the no rain events, while $r_t\, g(y_t)$ models the rain events.
This conditional-continuous distribution requires our model to output a prediction, $\hat{r}_t$, for the probability of rain occurring as well as a prediction, $\hat{Y}_t$, for the level of rainfall conditional on day $t$ being a rainy day. To facilitate the requirement of two outputs, $\hat{Y}_t$ and $\hat{r}_t$, we augment the decoder to contain a second identical output layer. In this case, the TRU-NET model has a branch-like structure, with $\hat{r}_t$ and $\hat{Y}_t$ the respective outputs of each of these branches.
During training, we sample one set of $[\hat{Y}_t, \hat{r}_t]$ per prediction and use a loss function that combines the binary cross entropy on the predictions for whether or not it rained (the first term) with a squared error on the predicted conditional rainfall intensity (the second term).
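A minimal sketch of such a two-term loss is given below. It assumes an unweighted sum of the two terms, a squared error that is only active on days with observed rain, and probability clipping for numerical stability; the exact weighting and any target transformation in the authors' loss are not recoverable from the text, so these are assumptions, as is the function name `cc_loss`.

```python
import math


def cc_loss(rain_prob, rain_level, observed, eps=1e-7):
    """Conditional-continuous style loss averaged over predictions.

    rain_prob : predicted probabilities of rain, each in [0, 1]
    rain_level: predicted rainfall conditional on rain (mm)
    observed  : observed rainfall (mm); zero means a dry day
    """
    total = 0.0
    for p, y_hat, y in zip(rain_prob, rain_level, observed):
        p = min(max(p, eps), 1.0 - eps)    # avoid log(0)
        rained = 1.0 if y > 0.0 else 0.0
        # First term: binary cross entropy on rain occurrence.
        occurrence = -(rained * math.log(p) + (1.0 - rained) * math.log(1.0 - p))
        # Second term: squared error on intensity, counted on rainy days only.
        intensity = rained * (y_hat - y) ** 2
        total += occurrence + intensity
    return total / len(observed)


good = cc_loss([0.9, 0.1], [5.0, 0.0], [5.5, 0.0])
bad = cc_loss([0.1, 0.9], [1.0, 0.0], [5.5, 0.0])
```

Because the intensity term is masked on dry days, the predicted rain level for a dry day does not affect the loss, which is what decouples the occurrence and intensity predictions.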

Monte Carlo Model Averaging (MCMA)
When training with dropout, each of the $n$ weights in the neural network has a probability $p$ of being masked. As such, there are $2^n$ possible models, defined by the combination of weights that can be masked. When sampling predictions from the model, it is infeasible to sample from each of the $2^n$ variations. Instead, we can form a sample of predictions from a random selection of the $2^n$ possible models and calculate the average of the sample. More formally, MCMA is the process of using dropout during both training and testing. During training, dropout is performed with a fixed probability $p$ of masking weights. During testing, we draw a set of samples from our model for each prediction, using a different dropout mask on the model's weights for each sample. Each dropout mask uses the same masking probability, $p$, as was used during training. We then calculate the mean of these samples to arrive at a model averaged prediction. Experiments in (Srivastava et al., 2014, §7.5) show that this method is effective for sampling from neural networks trained with dropout. During inference, we use the MCMA framework to produce samples $[\hat{r}^i_t, \hat{Y}^i_t]$, $i \in I$, for each observed rainfall $Y_t$. For each observation, we calculate a final prediction $\hat{Y}_t$ for $Y_t$.
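The averaging procedure can be illustrated with a toy linear model in place of TRU-NET. The sketch assumes inverted-dropout rescaling of the kept weights and averages a single scalar output; how the sampled rain probabilities and rain levels are combined into the final prediction is not specified in this text and is not modelled here. All names are illustrative.

```python
import random


def dropout_forward(weights, inputs, p, rng):
    """One stochastic forward pass of a toy linear model with weight dropout."""
    keep = 1.0 - p
    total = 0.0
    for w, x in zip(weights, inputs):
        if rng.random() >= p:            # weight kept with probability 1 - p
            total += (w / keep) * x      # inverted-dropout rescaling (assumption)
    return total


def mc_model_average(weights, inputs, p, num_samples, seed=0):
    """Average predictions over randomly sampled dropout masks."""
    rng = random.Random(seed)
    samples = [dropout_forward(weights, inputs, p, rng) for _ in range(num_samples)]
    return sum(samples) / num_samples, samples


toy_weights = [1.0, 2.0, 3.0]
toy_inputs = [1.0, 1.0, 1.0]
averaged, samples = mc_model_average(toy_weights, toy_inputs, p=0.2, num_samples=2000)
```

The spread of `samples` also gives a rough indication of the variability induced by the dropout masks, although the text only uses the mean.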

Experimental Setup
This section describes the data used for performance evaluation, baseline models used for comparison, and model hyper-parameter setup.

Baseline Models
We compare TRU-NET with the following baselines:
Integrated Forecast System (IFS): The IFS is a numerical weather prediction system which solves the physical equations of atmospheric motion. IFS is used for operational weather predictions at the European Centre for Medium-Range Weather Forecasts (ECMWF). It is also used to generate the ERA5 reanalysis data which is used as input data for TRU-NET. While the input fields are a product of the data assimilation process of ERA5, precipitation predictions are also available, diagnosed from short-term forecast simulations with IFS which use ERA5 as initial conditions. Two forecast simulations are started each day, at 6 am and 6 pm. We extract the precipitation fields for the first 12 hours of each simulation to reproduce daily precipitation. This is presently the optimal way to derive meaningful precipitation predictions from a dynamical model that is consistent with the large-scale fields in the ERA5 reanalysis data. The ERA5 and precipitation data are available on a grid with 31 km resolution. However, our target is to use model fields from climate models as input, which are typically run at coarser resolution. We therefore map the ERA5 data onto the grid that is used in the HadGEM3 climate model (Murphy et al., 2018).
Hierarchical Convolutional GRU (HCGRU): The general structure of the HCGRU, illustrated in Figure 9 in the Appendix, has been used successfully in precipitation nowcasting (Shi et al., 2015, 2017), wherein it outperformed an Optical Flow algorithm produced by the Hong Kong Observatory. Our implementation contains 4 ConvGRU layers and an output layer, matching the number of layers in our TRU-NET model. Prior to the first layer, we reduce the input sequence from length 112 to 28 by concatenating blocks of 4 sequential elements. Each of the 4 ConvGRU layers contains 28 recurrent units, with each recurrent unit containing convolutional operations with 80 filters and a kernel shape of (4, 4). Skip connections exist over the final 3 ConvGRU layers, and a final skip connection exists from the output of the first ConvGRU layer to the input of the output layer. The output layer follows the same formulation as in TRU-NET, with two 2D convolution layers.

Data
Our input data comprises the following 6 model fields: air temperature, specific humidity, longitudinal and latitudinal components of wind velocity at 850 hPa, geopotential height at 500 hPa and total column water vapour in the entire vertical column, each defined on a (20×21) grid representing the UK at approximately 65km spatial resolution chosen to match that used in the UK Climate Projections datasets (Murphy et al., 2018).
These locations were chosen as they are important population centres that sample a wide breadth of locations across the UK. Further, collectively these locations possess varied meteorological profiles, depicted in Figure 3. For example, the percentage of days with rainfall >10 mm (R10) ranges from 2.4% to 11.9%, and the average rainfall conditional on an R10 event ranges from 13.8 mm to 16.5 mm.
During testing, we either test on the whole UK, region by region, or test on a single location such as a city. For single location testing, we extract the nearest grid point to the centre of the given location.

Hyperparameter Settings
For TRU-NET and HCGRU, the dropout rate used for the output layer and Fused Temporal Cross Attention is 0.2. For the remaining weights in the ConvGRU-based layers, dropout rates of 0.2 and 0.3 were used. During training we used global norm gradient clipping with the RectifiedAdam optimizer (Liu et al., 2019a), featuring gradient warm up. Parameters for RectifiedAdam were selected as follows: β1 = 0.9, β2 = 0.99, ε = 5e−8, a maximum learning rate of 1e−3, a minimum learning rate of 8e−4 and 20 total warmup steps with 13 steps of increase.
We trained all models in Python, using TensorFlow, and executed our experiments on a server with 4 NVIDIA GTX 1080 GPUs. We also utilize mixed precision training. The models were trained for a maximum of 300 epochs, with early stopping.
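Global norm gradient clipping rescales all gradients jointly so that their combined L2 norm does not exceed a threshold. The pure-Python sketch below mirrors the rule used by TensorFlow's `tf.clip_by_global_norm`; the clipping threshold used by the authors is not stated, so the value of 1.0 in the example is an assumption.

```python
import math


def clip_by_global_norm(gradients, clip_norm):
    """Scale a list of gradient vectors so their joint L2 norm <= clip_norm.

    Returns the clipped gradients and the global norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for tensor in gradients for g in tensor))
    # The scale is 1.0 when the global norm is already within the threshold.
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in tensor] for tensor in gradients]
    return clipped, global_norm


# Two toy "tensors" whose joint norm is 5.0, clipped to a norm of 1.0.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], clip_norm=1.0)
```

Unlike per-tensor clipping, this preserves the relative direction of the full gradient vector.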

Seasonal breakdown for all of the UK
We use the following metrics to evaluate the performance of each model: Root Mean Squared Error (RMSE), RMSE for days of observed rainfall over 10 mm (R10 RMSE) and Mean Absolute Error (MAE). We present these metrics for each season, where the seasons have been defined as Spring (March, April, May), Summer (June, July, August), Autumn (September, October, November) and Winter (December, January, February). In Table 1(a) we observe that the TRU-NET CC model generally outperforms alternative models in terms of RMSE and MAE. Further, the CC variants of TRU-NET and HCGRU achieve a better R10 RMSE than their non conditional continuous counterparts.
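The three metrics can be sketched in a few lines of Python. The function names are illustrative, and the R10 subset is taken to be days with observed rainfall strictly above 10 mm, following the definition above.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def mae(predicted, observed):
    """Mean absolute error."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def r10_rmse(predicted, observed, threshold=10.0):
    """RMSE restricted to days with observed rainfall above the threshold."""
    pairs = [(p, o) for p, o in zip(predicted, observed) if o > threshold]
    if not pairs:
        return float("nan")    # no heavy-rain days in this sample
    return rmse([p for p, _ in pairs], [o for _, o in pairs])


predicted_mm = [1.0, 12.0, 14.0]
observed_mm = [0.0, 10.0, 18.0]
scores = (rmse(predicted_mm, observed_mm),
          mae(predicted_mm, observed_mm),
          r10_rmse(predicted_mm, observed_mm))
```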

City-wise breakdown
In the previous sub-section, we presented seasonal performance metrics for each model tested on the whole country. Here we focus on the predictive errors on 5 specific cities across the range of precipitation profiles displayed in Figure 3. The chosen cities can be divided into two groups: those with lower rainfall characteristics (London, Birmingham and Manchester) and those with high rainfall characteristics (Cardiff and Glasgow). These locations have been chosen in order to discern whether the quality of predictions over a region is related to the region's precipitation profile. The following tables present the predictive scores of the TRU-NET CC model trained on data from 16 locations over the time span from 1979 to 2013. The results are presented in Table 2, where we provide the performance of the IFS model as the second number in each cell.
We observe that both TRU-NET CC and IFS generally achieve lower RMSE scores during the Spring and Summer months, which have less rainfall. By observing the Mean Error (ME), we notice that our model generally under-predicts rainfall for cities with high average rainfall (Glasgow and Cardiff) and over-predicts rainfall for cities with low average rainfall (London, Birmingham, Manchester). When comparing TRU-NET's predictions to IFS predictions, we notice a significant number of cases wherein both TRU-NET and IFS predict rainfall higher than 0 mm for days where the observed rainfall is 0 mm. However, as can be seen from the vertical blue cloud of points to the left of each sub-figure, TRU-NET's log-transformed predictions for non-rainy days spread up to 2.75, while those of IFS perform worse and spread up to 3.4.

Distribution of Predictions
For observed rainfall events between 10 and 19 mm/day we notice that both TRU-NET and IFS slightly under-predict the observed rainfall by a similar amount. However, TRU-NET's predictions have less variance than the IFS predictions, which routinely produce predictions significantly below or above the y=x line. This is highlighted by the large vertical spread of IFS predictions, in Figure 4 (b), between observed rainfall of 10mm/day and 3.
For observed rainfall events above 20 mm/day, we notice that TRU-NET under-predicts rainfall events more than IFS. We believe that the rarity of rainfall events above 20 mm/day in the training set has negatively impacted TRU-NET's ability to learn these relationships, whereas IFS is based on the underlying physical equations.

Cross Correlation across Predictions for Cities
Here, we check the spatial structure of the predictions via cross-correlation plots for TRU-NET CC Normal predictions on the central point within pairs of cities. We use Leeds as our base location and compute pairwise cross-correlations with the following six locations: Bradford (13km), Manchester (57km), Liverpool (104km), Edinburgh (261km), London (273km) and Cardiff (280km), where each bracketed number is the distance of the location from Leeds. The cross-correlations are shown in Figure 6. As expected, we observe that the lag 0 cross-correlation between the predicted daily rainfall for cities decreases as the cities become increasingly distant from each other.
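A lagged cross-correlation between two daily rainfall series can be computed as in the sketch below. It uses the common sample estimator normalised by the full-series standard deviations; the exact estimator used to produce the paper's plots is not stated, so this choice, the function name and the toy series are assumptions.

```python
import math


def cross_correlation(x, y, lag):
    """Sample cross-correlation between x[t] and y[t + lag], for lag >= 0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance over the overlapping part, normalised by the full length n.
    cov = sum((x[t] - mean_x) * (y[t + lag] - mean_y) for t in range(n - lag)) / n
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    return cov / (sd_x * sd_y)


# Toy daily rainfall (mm) at a base city and at a nearby city.
base_city = [0.0, 1.0, 4.0, 2.0, 0.0, 3.0, 5.0, 1.0]
near_city = [0.5, 1.5, 3.0, 2.5, 0.0, 2.0, 4.0, 1.5]
xcf = [cross_correlation(base_city, near_city, lag) for lag in range(4)]
```

Evaluating the function for lags 0 to 28 and for each city pair would yield curves of the kind described above.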

Investigation of TRU-NET's Limitations
The high temporal correlation in weather data reduces the effective sample size and creates the risk that any neural network trained on N consecutive years will only learn a limited set of weather patterns. The reliability of the DL model's extrapolation to out-of-sample predictions (new weather patterns) is more doubtful because DL models do not aim to learn the underlying physical equations, unlike numerical weather algorithms.
Fig. 4: Distribution of Predictions: These figures illustrate the distribution of predicted rainfall against observed rainfall for the TRU-NET CC and IFS models from Section 4.1.1. The dashed red line shows the mean and standard deviation of predictions in 3 mm/day intervals of observed rainfall. The purple line indicates the boundary for rainfall events with at least 10mm/day. For illustrative purposes, we sub-sample every 25th pair of prediction and observed value. The log transform used is log(y + 1).
The three experiments introduced below evaluate the robustness of TRU-NET's out-of-sample predictive ability.
Here, we present XCF up to lag 28. The orange line provides the same statistics, except with the true observed rainfall values. The red line provides the 5% significance threshold, above which we can assume there is significant correlation.

Varied Time Span Experiment
Here, we fix the test set to span the years 2014 to August 2019 and vary the number of years, starting from 1979, used to train our TRU-NET CC model. We measure the training set size by years and by unique test datums. As our model operates on extracted temporal patches from the coarse grid, the number of unique datums in a training set is proportional to the product of the number of years we choose to train on and the number of locations included in the training set. In Figure 7, we observe a downward trend in RMSE and MAE as the number of years and unique test datums increases. The fact that the RMSE reaches its lowest value for the largest dataset indicates that increasing our dataset by using more locations could achieve further improvements in our model's predictive ability.

Forecasting Range Evaluation Experiment
Here, we evaluate the change in quality of predictions at increasingly larger temporal distances from the time covered by the training set. We train a TRU-NET CC model using data between 1979 and 1997 and then calculate annual RMSE, R10 RMSE and MAE metrics for each calendar year of predictions between 1998 and 2018. In the results, illustrated in Figure 8, the R10 RMSE shows no clear upward or downward trend throughout the whole test period, while the RMSE and MAE decline until 2005, after which the score remains steady. This indicates that our model's predictive ability is robust to at least 21 years of weather pattern changes due to climate change and natural decadal variability.

Varied Time Period Experiment
To judge the extent to which TRU-NET's predictive ability depends on the time period it is trained on, we divide the 40 year dataset into 4 sub-datasets of 10 years each. The first sub-dataset (DS1) corresponds to the years 1979-1988, the second sub-dataset (DS2) to the years 1989-1998, the third sub-dataset (DS3) to the years 1999-2008 and the fourth sub-dataset (DS4) to the years 2009-2018. We set up a K-fold cross validation based experiment by training a separate TRU-NET CC model on each of DS1, DS2, DS3 and DS4, creating four models M1, M2, M3 and M4. Appendix Table 5 shows the results from testing each model on the out-of-sample datasets. For each evaluation metric we perform a Tukey HSD test, which is used to accept or reject the claim that the means of two groups of values are not significantly different from each other. In our case, we use the models (M1-4) as treatment groups, and each model's predictive scores form a group of observations. The Tukey HSD test then compares each pair of models for a significant difference between the means of their reported predictive scores.
The Tukey HSD results for each evaluation metric and all pairs of models are presented in Table 3(a,b,c). The first two columns indicate the models under comparison and the third the mean difference in their predictive scores. The rightmost column confirms whether or not there is a significant difference (sig. diff) between the performance of the corresponding pair of models. We observe that the predictive performance of each pair of models is not significantly different. This implies that our TRU-NET CC model is fairly invariant to the period of data it is trained on.

Ablation: Fused Temporal Cross Attention
In this section, we investigate the efficacy of our FTCA relative to other methods for achieving the multi-scale hierarchical structure in the TRU-NET Encoder. More concretely, we replace the Fused Temporal Cross Attention with concatenation, the last element method (Jauhar et al., 2018, Zhao et al., 2019) and temporal self attention (Liu et al., 2019b). We also examine the effect of changing the number of heads in FTCA. Table 4 shows that our model achieves lower RMSE than other methods of achieving the multi-scale hierarchical structure. Furthermore, we notice strong performance relative to the self attention variant, which has the same model size. This highlights the importance of using information from higher spatio-temporal scales to guide the aggregation of information from lower spatio-temporal scales in our TRU-NET model.
Table 4: Ablation Study: Here, we evaluate the predictive performance of alternative methods to achieve the multi-scale hierarchical structure in the TRU-NET CC Encoder. We evaluate these TRU-NET CC variants using a training set consisting of the years 1979 till 2013. The test set is composed of data between the dates 2013 till August 2019 for the whole UK. We observe that our proposed Temporal Cross Attention (T Cross-Attn) with 8 heads outperforms other methods.

Conclusion and Future work
In this work we present TRU-NET, featuring a novel Fused Temporal Cross Attention mechanism to improve the modelling of processes defined on multiple spatio-temporal scales. We utilise a conditional-continuous loss function to obtain predictions for zero-skewed rainfall events. For the prediction of local precipitation for all seasons over the whole UK, our model achieves a 10% lower RMSE than a hierarchical ConvGRU model and a 15% lower RMSE than a dynamical weather forecasting model (IFS) initialised 0-24h before each precipitation prediction. After further analysis, we observe that TRU-NET attains lower RMSE scores than IFS when predicting rainfall events up to and including 20 mm/day, which comprises the majority of rainfall events. However, after this point TRU-NET under-predicts rainfall events to a higher degree than IFS.
We address concerns regarding the suitability of DL approaches to precipitation prediction (Rasp et al., 2018, 2020), given the limited amount of training data. We show that the current amount of data available is sufficient for a DL approach to produce quality predictions. The current work used deterministic models and readily available reanalysis data as an analogue for climate model output. Future work could utilise probabilistic neural network methods, such as Monte Carlo Dropout (Gal and Ghahramani, 2015) or Horseshoe Prior (Ghosh et al., 2018), as well as data from climate simulations to simulate risks of severe weather under varying climate scenarios. Further, methods combining Extreme Value Theory and machine learning (Ding et al., 2019) could be used to improve TRU-NET's ability to predict rainfall events over 20mm/day.
The code used to train and evaluate our models can be downloaded from https://github.com/Akanni96/TRUNET.