On the suitability of deep convolutional neural networks for continental-wide downscaling of climate change projections

In a recent paper, Baño-Medina et al. (Configuration and Intercomparison of deep learning neural models for statistical downscaling. preprint, 2019) assessed the suitability of deep convolutional neural networks (CNNs) for downscaling of temperature and precipitation over Europe using large-scale ‘perfect’ reanalysis predictors. They compared the results provided by CNNs with those obtained from a set of standard methods which have been traditionally used for downscaling purposes (linear and generalized linear models), concluding that CNNs are well suited for continental-wide applications. That analysis is extended here by assessing the suitability of CNNs for downscaling future climate change projections using Global Climate Model (GCM) outputs as predictors. This is particularly relevant for this type of “black-box” models, whose results cannot be easily explained based on physical reasons and could potentially lead to implausible downscaled projections due to uncontrolled extrapolation artifacts. Based on this premise, we analyze in this work the two key assumptions that are made in perfect prognosis downscaling: (1) the predictors chosen to build the statistical model should be well reproduced by GCMs and (2) the statistical model should be able to reliably extrapolate out of sample (climate change) conditions. As a first step to test the suitability of these models, the latter assumption is assessed here by analyzing how the CNNs affect the raw GCM climate change signal (defined as the difference, or delta, between future and historical climate). Our results show that, as compared to well-established generalized linear models (GLMs), CNNs yield smaller departures from the raw GCM outputs for the end of century, resulting in more plausible downscaling results for climate change applications. Moreover, as a consequence of the automatic treatment of spatial features, CNNs are also found to provide more spatially homogeneous downscaled patterns than GLMs.


Introduction
As a result of several decades of intense research, a variety of statistical downscaling approaches and techniques are nowadays available to fill the gap between the coarse resolution outputs provided by Global Climate Models (GCMs) and the local or regional information required for impact studies-typically at individual locations or over regional interpolated grids covering the area of study- (Maraun and Widmann 2018). The local predictand(s) of interest (e.g. precipitation and temperature) are inferred from GCM outputs using statistical models which build on a set of largescale atmospheric variables (predictors such as geopotential, temperature or humidity) explaining a large fraction of the local climate variability. Under the perfect prognosis (PP) approach, the different statistical downscaling models (hereafter SDMs) available (linear and generalized linear models, analogs, machine learning techniques) are trained based on historical observations for both predictands and predictors (the latter coming from a reanalysis). A number of intercomparison studies have reported the relative merits and limitation of state-of-the-art SDMs when 'perfect' reanalysis predictors are used, based on cross-validation. In particular, the VALUE initiative (pan-European network to validate downscaling methods, Maraun et al 2015) conducted the largest-to-date intercomparison with over 50 contributing SDMs (VALUE Experiment 1, Gutiérrez et al 2019).
In a recent paper,  assessed the suitability of deep convolutional neural networks (CNNs) for the downscaling of temperature and precipitation over Europe using large-scale 'perfect' reanalysis predictors. To do so, they used the experimental framework defined in VALUE (Experiment 1) to compare the results provided by CNNs with those obtained from a set of other more classical, standard techniques i.e., generalized linear models, concluding that CNNs are well suited for continental-wide applications. There have been similar studies over North America (Pan et al 2019) and China (Sun and Lan 2020), all showing that CNNs achieve similar or better performance than standard SDMs. Moreover, CNNs circumvent the problem of feature selection/extraction-which is highly case-dependent and becomes a very complex task to accomplish in classical downscaling methods-by performing an implicit manipulation of the input space in the internal structure of the network (Baño-Medina 2020).
However, these previous studies do not test the suitability and potential limitations of CNNs for climate change applications. This is particularly relevant due to the limited interpretability offered by these "black-box" models (Baño-Medina 2020) that may hinder extrapolation analysis. As a result, users are reluctant to use this kind of techniques and standard SDMs are still the preferred choice to downscale climate change scenarios . Moreover, only a few studies have focused on the impact that the two key assumptions of perfect prognosis downscaling may have for climate change applications: (1) the predictors should be well reproduced by GCMs and (2) the statistical model should be able to generalize and extrapolate out-of-sample (e.g. climate change) conditions. This is crucial to assess the credibility of future climate information and avoid misadaptation (Pryor and Schoof 2020). Furthermore, these studies are mostly available at national or subnational levels (see, e.g., Gutiérrez et al 2013;San-Martín et al 2016;Manzanas et al 2015Manzanas et al , 2020. At a continental level, there is an ongoing experiment aimed at analyzing these assumptions in the framework of the VALUE European initiative-now encompassed in the EURO-CORDEX collaboration (Jacob et al 2020),-but results are not available yet. Details of the experimental framework (VALUE Experiment 2, which we use in this study) are available at http:// www. value-cost. eu/ valid ation.
The present work is a first step to fill this knowledge gap by testing for the first time the suitability of CNNs for downscaling future climate change projections, extending the analysis done in  in the framework of VALUE (Experiment 2). We analyze the two aforementioned perfect prognosis assumptions, which are particularly relevant in the case of "black-box" models such as CNNs, whose results cannot be easily explained based on physical reasons, and could potentially lead to implausible downscaled projections due to uncontrolled extrapolation artifacts. As a first step to test the suitability of these models we test potential extrapolation problems by comparing the downscaled and raw model signals. Here we explore this open problem focusing on both temperature and precipitation over Europe and compare the downscaled signals provided by CNNs with those obtained from a set of standard, benchmarking SDMs (in particular different variants of GLMs). Our results show that, as compared to GLMs, CNNs provide more spatially homogeneous downscaled patterns which exhibit smaller departures from the raw GCM outputs for the end of century.
The paper is organized as follows: In Sect. 2 we describe the data, SDMs and indicators used. The results obtained are discussed throughout Sect. 3, with a special focus on the analysis of the climate change signals projected for the end of the century (Sect. 3.3). Finally, the main conclusions are presented in Sect. 4.

Data
We use the reference datasets and periods of analysis proposed in VALUE Experiment 2 -an extension of Experiment 1 used in Baño-Medina et al (2019),-which defines a comprehensive framework in the context of climate change.
In particular, we use as predictors the variables shown in Table 1 to downscale daily surface temperature and precipitation over Europe, using E-OBS (version 14, at a 0.5 • spatial resolution, Cornes et al (2018)) as observational dataset.
Both ERA-Interim and EC-Earth predictor data are available from the VALUE website (http:// www. value-cost. eu/ valid ation) and can be downloaded as netCDF files. E-OBS data is available at the ECA&D webpage (https:// www. ecad. eu/ downl oad/ ensem bles/ downl oad. php).

Statistical downscaling methods
We selected for this work the CNN1 and CNN10 models presented in Baño-Medina et al (2019) to downscale daily precipitation and temperature, respectively. These models consist of an input layer (with stacked spatial predictors) feeding three layers of convolutions (50:25:1 for precipitation and 50:25:10 for temperature), each formed by 3 × 3 spatial kernels. The last convolution is fully-connected to the output layer (the E-OBS land-gridpoints), using linear transformations, or sigmoidal ones for probability of rain (i.e. the parameter p of the Bernouilli distributions). The networks are trained to learn daily conditional distributions of precipitation (maximizing the loglikelihood of a mixed Bernouilli-Gamma distribution) and temperature (minimizing the mean square error, MSE), given the predictors; i.e., the network is forced to estimate the associated parameters to the mentioned distributions. For the case of precipitation, we follow a 3-parameter estimation approach which was introduced by Williams (1998) for feedforward neural networks, later adopted by Cannon (2008) and recently extended to CNNs in . In this approach, given a set of large-scale atmospheric predictors, the probability of rain, p, and the shape and scale parameters of a Gamma distribution ( and , respectively) are simultaneously estimated at each site by minimizing the negative log-likelihood of a Bernoulli-Gamma distribution. For temperature, we minimize the MSE, which is equivalent to minimize the negative log-likelihood of a Gaussian distribution for the conditional mean.
The potential of CNN topologies resides in the efficient treatment of complex spatial features. In climate downscaling, these models have the capacity to handle high-dimensional predictor spaces, automatically selecting the most relevant variables and geographical domains affecting each particular site (see Baño-Medina et al 2019; Baño-Medina 2020, for more details on CNN model interpretability). This aspect is crucial since state-of-the-art SD techniques such as GLMs are unable to treat this high-dimensionality without leading to overfitting, requiring thus some kind of humanguided feature selection (with the consequent loss of relevant information) in most of cases. In fact, to some extent, CNNs can be viewed as a natural extension to GLMs entailing an intelligent feature selection which allows to maximize the predictive capacity of the available predictor fields.
For these reasons, we only consider as benchmark three GLMs which have been previously used in Bedia et al (2018) and rigorously intercompared in VALUE experiment 1 (Gutiérrez et al 2019). As noted, regression-based SDMs typically undergo feature selection/extraction techniques to select a reduced set of optimal predictors. The GLMs used in this work differ only in the spatial character of the predictors considered. On the one hand, GLM1 (GLM4) uses as predictor local information at the closest (four closest) gridpoint(s) to the site's location. On the other hand, GLMPC builds on the leading principal components (PCs, (Preisendorfer 1988)) explaining the 95% of the variance of the predictor space. The PCs are calculated over the eight PRUDENCE regions as described in Gutiérrez et al (2019). Based on these three predictor configurations, we build three different GLM models at each location; a logistic regression to estimate precipitation occurrence, a Gamma-like regression with logarithmic link to estimate rainfall amount and a Gaussian-like regression to estimate surface temperature.
Though regression-based methods have proved successful to provide unbiased estimates of the mean, they are known to underestimate the local variability, especially for precipitation (Pryor and Schoof 2020). Several approaches have been proposed to alleviate this issue, from variance inflation (Von Storch 1999) to the inference of daily probability distributions conditioned to the given large-scale predictors (see, e.g., Williams 1998; Cannon 2008; Baño-Medina et al 2019). Despite challenges still exist (e.g., preservation of temporal autocorrelation and/or the spatial fields), these approaches have been shown to provide overall good results in terms of local variability. For this reason, we design all our models, both CNNs and GLMs, such that they estimate conditional gaussian (for temperature) and Bernouilli-Gamma (for precipitation) daily probability distributions. As a consequence, both deterministic and stochastic downscaled series can be obtained for all the SDMs developed in this study, depending on whether we limit the prediction to the estimated mean or we perform a sampling from the given daily PDF, respectively. In the case of deterministic precipitation, the binary series (0/1: no rain/rain) is obtained according to the threshold that matches the rainfall frequency in the train period to the observed value, whereas rainfall amount is directly given as the Gamma's expectance. For both deterministic and stochastic versions, the final predicted precipitation is obtained by multiplying the binary occurrence series by the continous amount series, the latter according to the conditional Gamma PDFs. For the case of temperature, only deterministic implementations of both CNNs and GLMs are considered, since the local variability of this variable is lower and stochasticity is not really needed. Table 2 summarizes the dimensionality of the models used in this work. Whereas CNN1 and CNN10 models are designed over a predictor space of 19 × 22 (latitude-longitude) with the 20 variables listed in Table 1 stacked as inputs (or channels, similar to RGB channels in computer vision images), GLMs take as inputs the predictor variables at the closest gridpoints to each site (20 and 80 for GLM1 and GLM4, respectively) or the PCs that explain the 95% of the total variance of the predictor space (this depends on the PRUDENCE region so we marked it with an 'x' in Table 2). In CNN1/10, the input layer is convolutionally connected to a set of 3 hidden layers with 50, 25 and 1/10 feature maps in a sequential manner. Padding is applied in the CNN1 model, keeping the input resolution constant, whilst CNN10 lack of this property, diminishing the spatial resolution throughout the hidden layers as a function of the kernel size ( 3 × 3 ). The output layer of the CNNs matches the E-OBS landgridpoints resulting into 3259 predictand sites for the precipitation/ temperature model. Due to the nature of the GLM optimization and since the predictor spaces do not overlap for the sites of interest, GLMs are formulated in single-site mode-one model per site,-in contrast with CNNs that operate in multi-site mode-all sites are simultaneously downscaled from a single precipitation (CNN1) or temperature (CNN10) model. The capacity of DL topologies to downscale to multiple sites at a time is commonly referred to as multi-task learning, and its benefits include: computational efficiency, implicit regularization and ability to estimate multivariate distributions (see Ruder 2017, for a review in multi-task deep learning). On the one hand, computational efficiency comes from the fact that a single model is trained and used to predict over the test samples. On the other hand, multi-task topologies forces the network to learn patterns that are useful to downscale at multiple sites, sharing knowledge in the hidden layers, and acting as a form of inductive bias that prevents the network from overfitting. In particular, this regularization property was tested for the topologies used in this study in Baño-Medina and Gutiérrez (2019), where multi-site CNNs attained better results than their equivalent single-site CNNs in terms of local reproducibility. However, it has to be noted that none of these benefits imply an improvement on the spatial consistency of the downscaled fields, and explicit modeling of multivariate distributions-e.g., daily multivariate Gaussian distributions-in addition to multi-site networks is needed (Cannon 2008).

Validation indices
We use in this work some of the indicators that have been consolidated in the VALUE validation framework (Maraun et al 2015). In particular, we consider P02, Mean and P98 for temperature and R01, SDII and P98 for precipitation (see Table 3 for details).
In particular, to assess the performance of the downscaled results obtained from EC-Earth in its historical scenario , we compute the biases (with respect to    scenario-for each temperature (precipitation) indicator, absolute (relative, in %) differences are given. In this regard, note that, although downscaling may be expected to modify the raw GCM projections at the local scale (due, e.g., to a better modeling of local phenomena such as orography), neither the spatial structure nor the magnitude of the raw climate change signal should be significantly altered over a sufficiently large region (see Manzanas et al 2020, for details). With this in mind, we also compute the existing differences in the delta changes as projected by the downscaled version and by the raw GCM ( delta down − delta GCM raw ). As such, for a particular SDM, large differences would indicate a bad extrapolation capability-understood as compatibility with the raw GCM projections.

Results
In this section we discuss the suitability of the different statistical methods presented in Sect. 2.2 (with a special interest in the CNNs) to generate plausible regional climate change scenarios.

Testing the perfect-prognosis assumption
We first asess if the predictors used in our SDMs are realistically simulated by the EC-Earth. Note that this model has been shown to consistently reproduce key large scale processes affecting the European climate, in particular storm tracks (Lee 2015), which makes it a suitable option for downscaling, as proposed in VALUE.
To avoid the potential issues related to the presence of systematic biases in the first and second moments of the GCM (see, e.g., Vrac and Vaittinada Ayar 2016; Nikulin et al 2018; Manzanas et al 2019), we used standardized anomalies (at a gridbox level) for both EC-Earth and ERA-Interim. Moreover, in order to avoid also the possible misrepresentation of the annual cycle, we have also assessed the effect of applying a simple monthly mean bias adjustment (BA) prior to standardization. BA consists in adjusting the EC-Earth ( x GCM ) monthly means towards the corresponding reanalysis values ( x REA ), gridbox by gridbox (Eqs. 1, where j refers to a particular variable, i = 1, 2, ..., 12 to the month of the year, and h and f denote historical and RCP periods, respectively). The reader is referred to Gutiérrez et al (2019) for further details. Figure 1 allows to assess the distributional similarity between the ERA-Interim's and EC-Earth's temporal series over the historical period 1979-2008. In particular, it shows the results from a Kolmogorov-Smirnov test for standardized (left) and bias adjusted+standardized (right) daily predictors over the entire year, winter and summer (in columns). Red crosses identify those gridpoints where the null hypothesis of the test -ERA-Interim and EC-Earth distributions are indistinguishable-can be rejected at a 5% significance level. In all cases, colors show the p-values (in the range 0 − 0.3 ) corresponding to the ERA-Interim vs. raw (with no transformation) EC-Earth comparison. For brevity, results are only shown for two illustrative variables, temperature at 1000 hPa (T1000) and specific humidity at 700 hPa (Q700), top and bottom row, respectively.
Both T1000 and Q700 present in general low p-values (below the significance level of 0.05), reflecting that EC-Earth and ERA-Interim distributions are significantly different over many regions. This is mainly due to the presence of systematic biases in EC-Earth, since the situation is substantially improved once standardization is carried out, regardless BA is applied or not (see the red crosses in both panels). If reanalysis and GCM predictors are compared over the entire year (annual distributions, left column in each panel) there is in general good distributional agreement for T1000 and Q700 over the domain (with a few exceptions in the Mediterranean for the case of Q700). However, when the comparison is undertaken for winter and summer (middle and right column in each panel, respectively), better results are found when BA is applied. These results prove that monthly bias adjustment helps to meet the perfect prognosis assumption, yielding better predictors for downscaling. Moreover, though not showed here for brevity, for other predictor variables-especially wind velocity components in southern Europe and specific humidity at other height levels,-BA is crucial to make reanalysis and GCM predictors compatible.

Downscaling performance in the historical period
As explained in the previous section, the SDMs introduced in Sect. 2.2 are first trained using ERA-Interim standardized predictors and subsequently applied to EC-Earth predictors, after bias adjustment and standarization. Figure 2 shows the validation results obtained for the indices listed in Table 3 for the historical period 1979-2008, calculated as relative (absolute) biases for precipitation (temperature). Results are shown for the raw EC-Earth outputs (first row) and for the different statistical downscaling methods (GLMs: rows 2-4 and CNN: row 5), considering E-OBS as the observational reference in all cases. This figure shows that EC-Earth exhibits moderate to large biases for both precipitation and temperature over vast parts of Europe, with a tendency to overestimate precipitation occurrence (the well-known 'drizzle effect') and underestimate precipitation intensity and extremes (indicating a systematic shrinkage of the distribution). For temperature, EC-Earth underestimates the mean and extremes in the Mediterranean (indicating a systematic shift of the distribution) and under/over-estimates the warm/ cold extremes in regions of central and Northern Europe (indicating a systematic shrinkage).
All the SDMs considered, largely reduce the biases encountered for centered statistics such as R01 (for precipitation) and the mean (for temperature). This is not surprising as they are designed to minimize the mean errors (w.r.t. the E-OBS observations) during the training process. An exception to this is precipitation intensity for those methods relying on local predictors, which overestimate intensity (particularly GLM4). Nevertheless, this problem is alleviated when spatial predictors are used, either PCs in the case of GLMPC or convolutions in the CNN.
In the case of precipitation, all SDMs underestimate extreme values (P98), which is due to the reduced local variability explained by large-scale predictors-smaller underestimation corresponds to those methods with a presence of more informative variables in the predictor set i.e., GLM4 and CNN (Baño-Medina et al 2019). Figure 2 shows two columns for P98, corresponding to the deterministic (DET.) and stochastic (STOCHASTIC) versions described in Sect. 2.2. With the exception of the GLMs using local predictors (in particular GLM4), a clear improvement is found for the stochastic versions, which yield substantially lower biases.
For temperature, all SDMs yield nearly negligible biases in most cases (especially for the mean) and extreme warm temperatures (P98). However, the three GLM-based implementations overestimate P2 in Scandinavia, where the EC-Earth model exhibits the largest biases. To a great extent, this is corrected by the CNN, which points out again the benefit of convolutional networks for climate downscaling purposes.
In agreement with the results previously found in Baño-Medina et al (2019), we confirm here that CNNs provide overall better results than GLMs. This is due to the inherent capacity of CNNs to automatically extract the important spatial features determining the local climate, which allows to properly model the complex relationships (both in space and in time) that are established between the local-and the large-scale and improves the out-of-sample generalization capacity, especially for precipitation.

Future climate projections: raw and downscaled signals
A key assumption for the secure application of statistical downscaling to produce climate change projections is that SDMs should be able to generalize and extrapolate to previously unseen (e.g. climate change) conditions-stationarity assumption. Following the recommendations done by Gutiérrez et al (2013), this was partially analyzed in Baño-Medina et al (2019) using an anomalous warm test period in comparison to that observed during training, obtaining consistent and unbiased downscaled predictions. However, a more robust analysis is needed to assess potential problems that may arise in the downscaled climate change signal produced by the different SDMs (as compared to the one given by the raw GCMs). The first row in Fig. 3 shows the delta changes projected by the EC-Earth for 2071-2100 (with respect to the baseline period 1979-2008), considering its raw outputs under the RCP8.5 scenario. For the precipitation (temperature) indicators, shown in the left (right) panel, relative (absolute) values are displayed.
According to this GCM, a decrease in rainfall frequency (R01) might be expected over the Mediterranean, whereas the intensity (SDII) would increase in mid and northern Europe. Extreme precipitation (as represented by P98) would increase all over the area of study. Temperature is projected to rise significantly all over Europe, up to 5 • C for the mean, but reaching even higher increases for extreme temperatures (P2 and P98) in northern and southern Europe, respectively. We want to remark that the goal here is not providing comprehensive climate change scenarios over Europe (which should build on multi-model ensembles), but to assess the impact that different techniques may have on the downscaled projections.
Note that, unless it can be justified by process understanding, significant deviations from the the global model's climate change signal over large regions could be an indicator of physically inconsistent and implausible downscaled results (Manzanas et al 2020). With this in mind, rows two to five in Fig. 3 show the differences between the downscaled (not shown) and EC-Earth (first row) delta changes, as given by the different SDMs considered (GLM: rows 2-4, CNN: row 5). Absolute (relative, in % ) differences are shown for the case of temperature (precipitation). White colors represent regions where the SDMs preserve the climate change signal given by EC-Earth, whereas brown/blue or green/red  Table 3, as obtained from the raw EC-Earth simulations (row 1) and the different SDMs considered (GLMs: rows 2-4 and CNN: row 5) for the historical period 1979-2008. In all cases, the observational reference used is E-OBS 1 3 colors indicate regions where downscaling reduce or enlarge it, respectively. For precipitation, both GLMs and CNN preserve the climate change signal for frequency (R01). However, GLMs (particularly GLM1 and GLMPC) enlarge the mean and extreme precipitation signals up to 20% and 40% for SDII and P98, respectively. In principle, there is no known physical mechanisms supporting such changes and could, therefore, be attributed to the overestimation of both indicators found for GLMs in Sect. 3.2. Nevertheless, the pattern found for biases in the historical period (Fig. 2) does not match with the climate change signal shown in Fig. 3. Differently, the CNN largely preserves the climate change signal given by the global model for all the indices considered, posing no challenges on the interpretation of the patterns obtained. A similar situation is found for temperature, for which CNN preserves to a great extent the global model's climate change signal both for the mean and extremes, with GLMs exhibiting some reduction in small regions over central and northern Europe (especially for P2).
In contast to what happens for standard GLMs, the results from this work evidence that CNNs provide plausible downscaled information (when fed by GCM predictors) for the provision of regional-to-local climate change scenarios.

Conclusions
Recently, Baño-Medina et al (2019) assessed the performance of convolutional neural networks (CNNs) for perfect prog statistical downscaling over Europe using "perfect" reanalysis data as predictors. Their results showed that CNNs can efficiently work with continental-sized domains, outperforming other well-established statistical models for particular forecast aspects. We extend here this work by analyzing the suitability of CNNs for downscaling climate change, applying the models to predictors from future GCM climate projections. As a first step we assess the performance of CNNs to downscale temperature and precipitation from the historical scenario of the EC-Earth model. For completeness, we also include in the analysis three different implementations of generalized linear models (GLMs), which ranked amongst the best ones in the VALUE intercomparison experiment . Our results indicate that statistical downscaling (and in particular CNNs) allows to reduce the systematic errors that are usually present in GCMs for the mean and extremes, providing more realistic climate information. We found that, as compared to methods based on spatial predictors, GLMs based on local predictors are more sensitive to the possible inconsistencies that may arise between reanalysis and GCM predictor data, yielding higher biases, particularly for precipitation amount metrics i.e., SDII and P98. This seems reasonable since these inconsistencies among datasets are directly fed to the models if no manipulation of the predictor space is carried out in the form of e.g., convolutional layers, resulting in deficiencies of the "perfect-prognosis" condition.
In a second step, we study whether or not CNNs provide a suitable alternative for the generation of reliable local to regional downscaled climate change scenarios, which, to the author's knowledge, has not yet been explored. We compare the downscaled climate change signals produced by CNNs with those obtained from the benchmarking GLMs. The suitability of the different methods tested for climate change applications is quantified based on the similarity with the raw projections given by the EC-Earth (under the RCP8.5 scenario). GLMs are found to yield local scenarios which are not fully consistent with the signals produced by the GCM, especially for the case of precipitation. Differently, the projections given by CNNs are comparable (to a great extent) to the change signals provided by EC-Earth. This suggests the adequacy of CNNs for the downscaling of localto-regional climate change scenarios building on the good generalization properties and stable behaviors under climate change conditions.
The results from this work may foster the use of CNNs for the generation of realiable climate change information on continental-sized domains, which is crucial for the implementation of adequate mitigation policies. To further corroborate these conclusions, we plan to extend the present work to other geographical regions and variables with different climatological properties under the umbrella of the international initiative CORDEX-ESD, whose objective is to produce high-resolution climate change information worldwide.

Appendix: Reproducibility of results
All the data used in this work (E-OBS observations, ERA-Interim and EC-Earth projections) are publicly available and accessible from the User Data Gateway (UDG), a THREDDS-based service from the Santander Climate Data Service which provides access to a wide catalog of popular climate datasets. These datasets can be remotely accessed using the open climate4R R framework . See also https:// github. com/ Santa nderM etGro up/ clima te4R for a complete description of the different packages forming this framework.
The standard SDMs considered here (different implementations of GLMs) are built with the downscaleR package , and the convolutional deep models are build using downscaleR.keras (https:// github. com/ Santa nderM etGro up/ downs caleR. keras), a wrapper that integrates keras-the state-of-the-art library in deep learning-within climate4R. To validate the predictions/projections, we use the set of indices defined in VALUE (see http:// www. valuecost. eu for details), which are available in the climate4R. value package.
The companion Jupyter notebook, accessible from the deepDownscaling GitHub repository of the Santander Meteorology Group (https:// github. com/ Santa nderM etGro up/ DeepD ownsc aling), describes all the steps necessary to fully reproduce the results presented in this manuscript, which were produced on a machine running under Ubuntu 18.04.3 LTS (64 bits), with 60 GiB memory and a multicore CPU composed of 16 processing units and 32 threads Intel(R) Xeon(R) CPU E5-2670 of 2.60 GHz. The computational times needed to train and predict in both historical and RCP8.5 scenarios are described in Table 4 for the statistical models tested. Note that these times do not entirely Table 4 Computational times needed to train and predict on both historical and RCP8.5 scenarios by the GLM and CNN models (in columns) for the case of precipitation (P) and temperature (T)

GLM1
GLM4 GLMPC CNN P 4h 15min 8h 21min 2h 30min 1h 11min T 2h 32min 2h 20min 1h 20min 50min depend on the method's nature-e.g., GLMs are solved analytically while CNNs required an iterative optimization procedure,-but also on the internal R libraries used to build the models, and on its single-or multi-site nature-e.g., only one model is trained for the CNNs.-Overall we observe that GLMs consume more resources and timing than CNNs, in particular for precipitation downscaling. Whilst CNNs estimate simultaneously the occurrence and quantity of rain through the Bernoulli-Gamma conditional PDFs, two independent models are needed for the GLMs, what increases considerably the computational times. This is ratified by comparing the results obtained for precipitation and temperature GLMs; the latter requiring half of the time used by the former (altough still overpassing the CNN). In addition, since for the GLMs each site is downscaled with different predictor spaces, all data processing, training and prediction steps need to be done by chunks of latitudes, leading thus to notably longer times than CNNs.