Harmonized evaluation of daily precipitation downscaled using SDSM and WRF+WRFDA models over the Iberian Peninsula

There have been numerous statistical and dynamical downscaling model comparisons. However, differences in model skill can be distorted by inconsistencies in experimental set-up, inputs and output format. This paper harmonizes such factors when evaluating daily precipitation downscaled over the Iberian Peninsula by the Statistical DownScaling Model (SDSM) and two configurations of the dynamical Weather Research and Forecasting Model (WRF) (one with data assimilation (D) and one without (N)). The ERA-Interim reanalysis at 0.75∘\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.75^{\circ }$$\end{document} resolution provides common inputs for spinning-up and driving the WRF model and calibrating SDSM. WRF runs and SDSM output were evaluated against ECA&D stations, TRMM, GPCP and EOBS gridded precipitation for 2010–2014 using the same suite of diagnostics. Differences between WRF and SDSM are comparable to observational uncertainty, but the relative skill of the downscaling techniques varies with diagnostic. The SDSM ensemble mean, WRF-D and ERAI have similar correlation scores (r=0.45\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hbox {r}= 0.45$$\end{document}–0.7), but there were large variations amongst SDSM ensemble members (r=0.3\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hbox {r}=0.3$$\end{document}–0.6). The best Linear Error in Probability Space (LEPS=0.001\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hbox {LEPS}= 0.001$$\end{document}–0.007) and simulations of precipitation amount were achieved by individual members of the SDSM ensemble. However, the Brier Skill Score shows these members do not improve the prediction by ERA-Interim, whereas precipitation occurrence is reproduced best by WRF-D. Similar skill was achieved by SDSM when applied to station or gridded precipitation data. Given the greater computational demands of WRF compared with SDSM, clear statements of expected value-added are needed when applying the former to climate impacts and adaptation research.


Introduction
Downscaling methods were developed in the 1990s to improve the spatial resolution of simulations by Global Climate Models (GCMs). The spatial resolution of GCMs is now typically 100-200 km, but some run at 16-40 km resolution thanks to advances in computing power (Davini et al. 2017). Nonetheless, most GCMs are still unable to simulate correctly the leading modes of climate variability (Delworth et al. 2012) or sub-grid scale weather phenomena (Zappa et al. 2013). In order to obtain more accurate local simulations, statistical and dynamical downscaling techniques were developed. The former applies empirical relationships between large-scale atmospheric circulation and local weather variables (von Storch et al. 1993). The latter is based on the numerical integration of equations describing mass, momentum and energy balances in the atmosphere at finer scales than the host GCM (Giorgi 2006;von Storch 2006).
Numerous statistical downscaling techniques have been developed, but most fall into one of three main categories as defined by Maraun et al. (2010). Perfect Prognosis methods apply statistical relationships between the large and local scales to adjust directly the outputs of climate models. Model Output Statistics use direct transfer functions between model and observed data to post-process model output. Weather Generators are stochastic models that modulate the spatial distribution and temporal dependence of local variables depending on large scale atmospheric conditions. Early comparisons of statistical downscaling methods for Europe ) and North America (Wilby et al. 1998), showed greater skill for mean temperature and precipitation than for their extremes. Those statistical methods applied to precipitation were more skilful at reproducing persistence of wet-spells than the distribution or frequency of rainfall. Additionally, performance tends to be better in winter (Timbal and Jones 2008;Yang et al. 2010) and for mid latitudes (Cavazos and Hewitson 2005), with the correlation coefficients between predictions and observations reaching values around 0.7 for monthly averages and about 0.5 for daily precipitation. For the Iberian Peninsula (henceforth IP), a comparison of different techniques for precipitation downscaling in the Ebro Valley, including machine learning algorithm Random Forest (RF), a Multiple Linear Regression (MLR) model and analogues applied according to a Perfect Prognosis approach, concluded that RF and MLR did not represent any significant improvement over simpler analogues (Ibarra-Berastegi et al. 2011). Improved results were also found by  by the search of analogues in the space of the canonical correlation coefficients.
Statistical downscaling methods are known to have limitations. Long and accurate meteorological records are needed to calibrate these models and the relationship between large and local scale must be assumed stationary. In addition, downscaled results are ultimately dependent on the quality of inputs from GCMs (Vrac and Vaittinada Ayar 2017). Previous studies show that the Statistical DownScaling Model (SDSM) (Wilby et al. 2014) provides good estimates of daily temperature , total precipitation (Wetterhall et al. 2006(Wetterhall et al. , 2007 and areal rainfall (Hashmi et al. 2011). However, estimates of extreme precipitation may be less reliable for arid climate regimes and/or during dry seasons (Wilby and Dawson 2013).
Dynamical downscaling models nested within a GCM can provide more accurate climate simulations than the coarse global model itself (Jones et al. 1995;Foley 2010;Rummukainen 2010;Feser et al. 2011;Önol 2012). In these cases, the GCM provides the boundary conditions that drive the regional climate model (RCM). Importantly, even if only boundary conditions are provided to the model, RCMs are able to reproduce small-scale features dependent on the surface boundary of the domain such as orographic precipitation or large water bodies (Rockel et al. 2008;Leung and Qian 2009). That is mainly achieved by the finer spatial resolution of these models, which is typically less than 25 km.
Nevertheless, RCMs are also known to have limitations, foremost of which is the rising computational cost as the resolution and/or area of the domain increases. This can translate into simulations running for months or even years (depending on processor speeds and storage). The spatial resolution of the RCM domain also determines the weather features that can be simulated by the models, and those that cannot must be parameterized. Such outputs contain significant biases in temperature and precipitation (Varis et al. 2004;Ines and Hansen 2006;Christensen et al. 2008;Teutschbein and Seibert 2012;Turco et al. 2013). For example the Weather Research and Forecasting Model (WRF, ARW) (Skamarock et al. 2008), is known to have a cold bias in summer temperature over the IP (Fernández et al. 2007;Argüeso et al. 2011;Jerez et al. 2012) which can be reduced by the 3DVAR data assimilation scheme (González-Rojí et al. 2018). The spatial variability of precipitation over the IP is well captured by WRF (Cardoso et al. 2013), but there are recognized limitations in some monthly totals (Argüeso et al. 2012).
Comparisons of statistical and dynamical techniques are not straightforward to perform because of their different inputs, spatial (and temporal) resolution, and methods of calibration. Moreover, the outputs of the former are typically point-scale and the latter grid-scale which may conceal true differences in model performance (Tustison et al. 2001). In the past, nearest neighbour or bilinear interpolation techniques have been used to improve comparability. In order to compare highly anisotropic fields such as precipitation, the nearest neighbour technique produces more accurate results than the bilinear interpolation as it does not smooth the fields (Accadia et al. 2003;Casati et al. 2008;Moseley 2011). Ideally, configuration of both downscaling techniques would be as similar as possible taking into account the restrictions associated with each model, including inputs from the same GCM or Reanalysis.
There have been many comparisons between dynamical and statistical downscaling focusing on hydrological applications (Fowler and Wilby 2007). These include assessments of downscaled future projections (Mearns et al. 1999;Hundecha et al. 2016;Onyutha et al. 2016) or current weather conditions Haylock et al. 2006;Huth et al. 2015;Casanueva et al. 2016;Vaittinada Ayar et al. 2016;Roux et al. 2018). These studies show that statistical and dynamical methods have comparable skill at simulating the present climate and should be regarded as complementary tools Haylock et al. 2006;Osma et al. 2015). However, there remains considerable scope for method refinement around the many decisions involved in downscaling model set-up and simulation. These choices include the source and spatial resolution of the driving boundary conditions used in the dynamical downscaling model; the optimum suite of large-scale variables for statistically downscaling (on a site-by-site basis); the period of record used for model calibration or spin-up; whether or not to use data assimilation (for dynamical downscaling of past climate); and the choice of diagnostics for assessing downscaling model skill and value-added to coarser resolution GCM inputs.
Our objective is to minimize the factors that can distort model-dependent differences in skill between statistical and dynamical downscaling methods. We do this by analysing daily precipitation downscaled over the IP via the Statistical DownScaling Model (SDSM) (https ://www.sdsm.org.uk/) and two configurations of the dynamical WRF model (one with and one without 3DVAR data assimilation). In order to fairly compare these models, we harmonize the experimental inputs, outputs and diagnostics. In all our experiments, the large-scale variables fed into the downscaling originate from the ERA-Interim reanalysis (Dee et al. 2011), and the nearest neighbour technique is used to identify grid-cells closest to station data. There have been other comparisons between WRF and various statistical downscaling methods Gutmann et al. 2012;Casanueva et al. 2016) but, as far as the authors are aware, there has been only one previous direct comparison between SDSM and WRF (for China) and, in this case, WRF did not include data assimilation (Tang et al. 2016). The present study assesses the strengths and weaknesses of both downscaling techniques using a range of daily precipitation diagnostics for the IP. Furthermore, we determine whether differences between downscaling techniques exceed differences between observational datasets. Our downscaled precipitation series are either point measurements or have a 15 km resolution, which are finer scales than the reanalysis (0.75 • ) and are comparable to other gridded precipitation datasets such as Spain02  or SPREAD (Serrano-Notivoli et al. 2017), with around 10 km and 5 km of spatial resolution respectively. This paper is organised in five main parts. In Sect. 2, we introduce the study area then describe the techniques for comparing and validating our downscaling models. Section 3 presents our evaluation of precipitation downscaled by SDSM and WRF for 21 stations using a common validation period (2010)(2011)(2012)(2013)(2014). In Sect. 4 we discuss the key insights emerging from the analysis, and in Sect. 5 we conclude with some remarks about the wider implications of the research.

Study area
The precipitation regimes of the IP are influenced by sea level pressure patterns over the Atlantic Ocean and by convective storms that mostly develop in the south-eastern part of Spain (Zorita et al. 1992;Rodríguez-Puebla et al. 1998;. The IP is a challenging area to undertake regional climate downscaling experiments because of these contrasting rainfall mechanisms and marked topographic gradients which broadly delimit four regimes (Kottek et al. 2006;Peel et al. 2007;Lionello et al. 2012;Rubel et al. 2017): (1) (Semi)Arid climate, in some locations of the Ebro basin and southeastern IP; (2) Mediterranean, in the southwestern IP; (3) Oceanic, located mainly in the northern IP; and (4) Alpine climate, in the mountain ranges. However, inter-annual variability of precipitation cannot be explained entirely by changes in atmospheric circulation; other factors need to be taken into account, including air temperature and humidity (Goodess and Jones 2002). Particularly, European precipitation extremes are influenced by the North Atlantic Oscillation (Haylock and Goodess 2004;Zveryaev et al. 2008) and East Atlantic oscillation (Rodríguez-Puebla et al. 1998;Sáenz et al. 2001;Zveryaev et al. 2008) which modulate the location of Atlantic storm tracks.

Data
Several datasets were used to compare the downscaling techniques: • ERA-Interim (ERAI) reanalysis atmospheric data at 0.75 • horizontal resolution ( ∼ 80 km ) were obtained from the Meteorological Archival and Retrieval System (MARS) repository at ECMWF. These data were used as the boundary conditions for WRF, and for creating the large-scale input variables for SDSM. These data were also used to drive the validation experiments of both downscaling techniques. • The European Climate Assessment and Dataset project (ECA&D) provides daily precipitation data from land stations (Klein Tank et al. 2002). Daily precipitation occurrence and amount at these sites were used as the local-scale data for calibrating SDSM, as well as for validating model experiments. Twenty-one stations were chosen (red dots in Fig. 1), evenly spaced over the IP and without oversampling any region. Fourteen stations were available for Portugal, but only Lisbon had data during our validation period. Further informa-tion about the chosen stations can be found in Online Resources 1. Four regions were also defined according to their climate regime: Northern, Central, Mediterranean and SouthWestern regions (in blue, green, pink and orange respectively in Fig. 1). These regions are similar to those defined by Serrano et al. (1999) and reflect the spatial patterns of long-term annual precipitation obtained by Rodríguez-Puebla et al. (1998). The Northern region encompasses the Oceanic climate, the Mediterranean region is related to the (Semi)Arid climate and the SouthWestern region encloses the Mediterranean climate. The Central region is a mix of the Oceanic and Mediterranean climates. • The ensembles observations (EOBS) Version 12.0 dataset (Haylock et al. 2008;van den Besselaar et al. 2011) was included as a validation dataset. This provides daily precipitation at 0.25° longitude and latitude grid-resolution ( ∼ 20 km in mid-latitudes). Note that EOBS was built using ECA&D thus, the two datasets are not independent. Additionally, some studies suggest that EOBS is biased toward lower values of precipitation even if the correlations overall are high (Hofstra et al. 2009;Kjellström et al. 2010). For the IP, there are known deficiencies in the south because of data scarcity ) and near mountainous regions (González-Rojí et al. 2018). • Tropical Rainfall Measuring Mission (TRMM) data (Wang et al. 2014) were used to evaluate both downscaling techniques. These data are available every 3-h at 0.25 • longitude and latitude ( ∼ 20 km ) resolution. In order to compare with other observational datasets and experiments, TRMM data were aggregated to daily values. Previous studies suggest that TRMM estimates differ from ground-based observations (Nesbitt and Anders 2009;Condom et al. 2011;Hunink et al. 2014), with a precipitation rate-dependent low bias (Huffman et al. 2007;Hashemi et al. 2017). • Version 2.2 of the Global Precipitation Climatology Project (GPCP) (Huffman et al. 2001) dataset was also included for downscaling model evaluation. Daily precipitation data are available at 1° longitude and latitude grid resolution ( ∼ 100 km ). GPCP reproduces largescale precipitation patterns, but some errors in the estimates are observed at regional scales (Janowiak et al. 1998), particularly in areas with sparse gauges (Beck et al. 2017).

SDSM set-up
SDSM (version 5.2) (Wilby et al. 2014) is defined as a conditional weather generator, as large-scale atmospheric variables (called predictors) are used to simulate time-varying parameters describing local variables (predictands) specified by the user. Downscaling relationships are hence established between regional predictors from reanalyses or GCMs and local predictands such as daily precipitation (used here) or temperature. Several steps must be followed to run SDSM. Model calibration begins by selecting locations with observational data, in this case, the 21 stations (in the first experiment) or nearest EOBS grid cells (in the second experiment). Station latitude and longitude determines the closest reanalysis grid cell (herein ERAI), for which a predictor set is extracted. Available predictors include: downward shortwave radiation flux, mean sea level pressure, precipitation, near-surface specific humidity, mean temperature at 2 m and both wind components, wind strength, geopotential height, vorticity, divergence and relative humidity (all at 500 and 850 hPa) (Cavazos 2000;Wilby and Wigley 2000;Schoof and Pryor 2001). Others also use moisture flux (Yang et al. 2010) or total precipitable water (Timbal and Jones 2008) as predictors.
Once the predictand and predictor sets are assembled, relationships between them are explored within the software. A three-step calibration strategy was implemented to minimize subjectivity in the choice of predictor variables. First, any candidate predictor variable with explained variance (R-squared) greater than 0.1 (for each month of the monthly analysis) was short-listed. Second, those with statistically insignificant partial correlation ( p > 0.05 ) were eliminated. Finally, the predictor with weakest partial correlation was removed until the point at which only significant variables remain. This implementation of SDSM follows the procedures customarily applied before (Wilby et al. 1998Gulacha and Mulungu 2017). Examples of more specific applications of SDSM can be found elsewhere (Hanssen-Bauer et al. 2005;Huth 2005;Crawford et al. 2007;Babel 2013, 2014;Wilby and Dawson 2013).
Having selected the predictors, SDSM is calibrated by ordinary least squares with a fourth-root transformation applied to daily rainfall amounts (to approximate a normal distribution). During simulations, white noise is added to replicate missing variance and thereby generate an ensemble that represents model uncertainty. Here, SDSM was set up to generate a 20-member ensemble for each station, having calibrated the model using data for the period 1979-2009 and validated against 2010-2014.
A second experiment was performed with SDSM to check whether model skill changes when area-average (gridded) precipitation is used for calibration instead of point (station) precipitation. This experiment (henceforth described as SDSM-EOBS) uses data from the closest EOBS cell to the station. Under this arrangement, SDSM-EOBS and WRF now have the same source of downscaling inputs (ERAI), validation period (2010-2014) and similar horizontal scale (gridded data).

WRF model set-up
Two simulations were run using version 3.6.1 of WRF, forced by ERAI. Both begin on 1st of January 2009, but the whole year 2009 was used to spin-up the model and establish correct land-atmosphere fluxes. Hence, only data covering the period 2010-2014 are analysed below.
In the first experiment (WRF-N), boundary conditions drive the model after the initialization. The second experiment (WRF-D) is configured as WRF-N from the point of view of the physical parameterizations. These are: WRF five-class microphysics; RRTMG scheme for both shortwave and long-wave radiation; MYNN2 Planetary Boundary Layer scheme; the Tiedtke scheme for cumulus convection; and the NOAH Land Surface Model. However, 3DVAR data assimilation (Barker et al. 2012) is run every six hours (00, 06, 12 and 18 UTC) using observations from NCEP ADP Global Upper Air and Surface Weather Observations (PREPBUFR dataset) inside a 120-minute assimilation window centered on those times. Both experiments use highresolution sea surface temperature (SST) field NOAA OI SST v2 (Reynolds et al. 2007), which is updated daily.
The background error covariance matrices used in the 3DVAR step of WRF-D were prepared such that they change from month to month. Ninety days computed from 13-months (from January 2007 until February 2008) of WRF 12-h runs (initialized at 00 and 12 UTC) are used to prepare background error covariances for every month. These 12-h runs are centred around the month that the background error covariance is prepared for, such that 90 days in December, January and February are used to produce the background error covariance for January and so on. The background error covariances are adjusted to the domain and the physical parameterizations used in this study following Parrish and Derber (1992), namely, the cv5 method of WRFDA (Barker et al. 2012).
The WRF-Domain was centred on the IP but covers much of Western Europe and north-west Africa ( 20 Fig. 2). The horizontal resolution is 15 × 15 km 2 with 51 vertical levels. Due to the distance between the margins of the domain and the IP, boundary effects from nudging (magenta region in Fig. 2) are discounted (Rummukainen 2010). The mountainous regions of the IP (showed by the GLOBE dataset in Fig. 2) are recognizable in a domain with spatial resolution 15 km, even though topographic effects are still underestimated.

Analytical methods and diagnostics of model skill
The analyses are organised in three parts. First, we examine the calibration of SDSM by evaluating the predictors selected for each station. We assess whether any distinct regional patterns emerge in, for example, the spatial distribution of explained variance in daily precipitation totals. That way, we will be able to check if some stations with the same characteristics can be calibrated in the same way. Second, precipitation downscaled from ERAI by both SDSM and WRF was compared with the observational datasets described above (i.e. EOBS, TRMM and GPCP). Different sources of precipitation were used to explore downscaling model uncertainty relative to observational uncertainty. This will help us to check how well the downscaling techniques perform in the IP. As these validation datasets are gridded, the nearest cell to the station was selected. Evaluation of time-series against station data was performed using Taylor diagrams (2001), which enable simultaneous, graphical comparison of downscaling model and observational root mean squared error (RSME), Pearson's correlation (r) and standard deviation (SD). The statistical reliability of the differences shown by the Taylor diagrams was assessed via bootstrap estimation (not shown). For brevity, Taylor diagrams are shown for representative stations in each climatic region, but all others can be found in the supporting Online Resources. The statistical confidence of the correlations was examined using 1000 bootstrap time series created with replacement and compared with observational data. Distributions of correlations between each bootstrap and observed series are presented as box and whisker plots, again organized by climatic region.
Third, in order to see which downscaling experiments are the best at reproducing the observed precipitation, the outputs were compared from the three of them: WRF-D, SDSM-stations and SDSM-EOBS (to assess sensitivity of results to the resolution of the predictand). Only WRF-D was included in this comparison as results from Sect. 3.2 will show that it consistently outperforms the experiment without data assimilation (WRF-N) at reproducing the observed weather. The same result was found by González-Rojí et al. (2018). Diagnostics are derived for each SDSM ensemble member and the ensemble mean in order to compare the deterministic output of WRF (single realization) with the probabilistic output of SDSM (20-member ensemble). The following statistical tests were applied in each case: • The Linear Error in Probability Space (LEPS) (Ward and Folland 1991) measures how well WRF and SDSM predict the distribution of observed precipitation at each station. The LEPS score varies betwen 0 and 1, with 0 representing a perfect model. • The Brier Skill Score (BSS) shows the value-added by the various downscaling models to precipitation derived from ERAI (our reference model) García-Díez et al. 2015). The BSS was calculated following von Storch and Zwiers (1999) using the error variances of the experiments (i.e. WRF-D simulation and SDSM, the forecast F on each case) relative to ERAI (the reference R): w h e r e t h e e r ro r va r i a n c e i s d e f i n e d a s x i are forecast data and O i are observed data. The BSS varies between − 1 and 1 where positive values indicate that the experiment improves the reference forecast; conversely, negative values signal that the reference performs better than the experiment (i.e. the downscaling does not add value to the forcing model). • Several diagnostics were derived noting that, because there are only five years of simulations, extreme rainfall can not be assessed. Instead, we have characterized how well WRF and SDSM simulate the following widely used indices (Haylock et al. 2006;Wilby and Yu 2013;Nicholls and Murray 1999): • Absolute mean daily precipitation (pav)-the average precipitation of all days; • Wet-day intensity (pint)-the average precipitation on days with more than 1 mm; • 90th percentile wet-day total (pq90)-the 90th percentile of precipitation on days with more than 1 mm; • Maximum consecutive dry days (pxcdd)-the number of consecutive days with precipitation less than 1 mm; • Wet-day probability (pwet)-the number of days with precipitation exceeding 1 mm divided by the number of days of the analysed period. • Maximum 5-day precipitation total (px5d)-the maximum precipitation total in any five consecutive days.

SDSM calibration
We begin this section by presenting the predictor suites created for SDSM at each station (Fig. 3). The most frequently selected predictor variables were: precipitation (PREC, 21 sites), downward shortwave radiation flux (DSWR, 20 sites), 850 and 500 hPa geopotential height (H850, 16 and H500, 10 sites respectively), mean sea level pressure (MSLP) and relative humidity at 500 hPa (R500, 9 sites). Other predictors were selected less frequently: zonal wind at 850 hPa (U850, 5 sites), meridional wind at 500 hPa (V500, 5 sites), zonal wind at 500 hPa and meridional wind at the surface (U500  and VSUR, 4 sites), and zonal wind at the surface (USUR, 3 sites). Hence, SDSM uses information about the state of the atmosphere as well as the incident radiation during calibration; variables related to atmospheric moisture are also important. The most frequently selected predictors vary by region. According to Table 1, downward shortwave radiation flux, precipitation, 850 hPa geopotential height and zonal wind, relative humidity and 500 hPa geopotential height are favoured for the northern region. Downward shortwave radiation, precipitation and relative humidity at 500 hPa are important predictors in the centre of the IP. Radiation, precipitation, zonal wind at 500 hPa and geopotential height at 500 and 850 hPa emerge for the Mediterranean region. Finally, radiation, geopotential height at 850 hPa, precipitation, meridional wind at surface, geopotential height and relative humidity at 500 hPa, and zonal wind at 850 hPa are prominent in the southwest.

SDSM Experiment − Predictor Suite
The -squared test was used to check for dependency of the frequency of predictor variable selection by site latitude, longitude, elevation and annual precipitation during the calibration period 1979-2009. No statistically significant, dependencies of predictor suite on site were found. This suggests that the optimum predictor set cannot be inferred from physiographic properties and that each site has to be calibrated on a case-by-case basis. Given the large area of the IP and known spatial variability in precipitation, a single multi-site version of SDSM (Wilby et al. 2003) was also deemed unsuitable.
Having established the predictor suites, the explained variance ( R 2 ) during the calibration period is presented in Fig. 4. R 2 varied between 15 and 40 percent with mean value 22%. Vigo and Córdoba achieved 39% and 32% respectively which are comparable to Gulacha and Mulungu (2017) for different periods and regions. Relatively high R 2 were also obtained in the Cantabrian region. Conversely, the lowest R 2 were observed at the Mediterranean coast, particularly near the Ebro basin and Barcelona. Overall, a northwest-southeast gradient in R 2 is evident across the IP which reflects dominance of large scale weather systems near the Atlantic (higher R 2 ) to local convective precipitation near the Mediterranean (lower R 2 ). These results are consistent with those reported by Goodess and Palutikof (1998).
The predictor suites are similar for SDSM-EOBS except at five stations: Lisbon, Ciudad Real, Lleida, Almería and Vigo. The main differences relate to changes in the height of the variables (near surface, 500 or 850 hPa) or with a reduction in the number of predictors used. Ciudad Real also included the additional predictor VSUR. The R 2 values improve at all stations (except for Vigo) and has mean value 28%. The predictor suites and R 2 values are provided for all sites in Online Resources 3.

Comparison of observational datasets and downscaling experiments
The Northern region has four stations, namely: A Coruna, Vigo, Gijón and Santander. Here, we present only the Taylor diagram for Gijón (Fig. 5); others are available in Online Resource 4. The correlation values for SDSM mean and WRF-D are in the range 0.6 and 0.8. However, even if the SD is quite similar to that observed, WRF-D outperforms the SDSM mean for most sites. Similar results are observed for ERAI and WRF-N for the stations included in this region, but the correlation and the SD are not as good for SDSM and WRF-D. In this case, the correlation ranges between 0.4 and 0.65 and the SD is worse. The EOBS dataset has the best correlation (0.95), compared with TRMM and GPCP (which   range between 0.3 and 0.4, and with much worse RMSE than the downscaling). SDSM ensemble members overestimate the SD in every station, and have correlation values in the range 0.3-0.5; the SDSM ensemble mean underestimates the SD but presents better correlation and RMSE scores than the ensemble members. The Central Region has five stations, namely: Pamplona, Soria, Madrid, Valladolid and Daroca. Here we present the results for Soria and Madrid stations (Fig. 6) as representatives of the region; other Taylor diagrams are available via Online Resource 5. Similar correlations emerge for WRF-D, ERA and SDSM mean, ranging between 0.5-0.7, 0.4-0.6 and 0.5-0.6, respectively. The SD is underestimated by WRF-D, ERAI and SDSM mean in Soria (and Pamplona), but it is accurately simulated by WRF-D and ERAI in Madrid (and Valladolid and Daroca). The correlation, SD and RMSE obtained by TRMM outperformed GPCP, but those scores are not comparable to those obtained by the downscaling experiments. Again, the EOBS dataset has the highest correlations (above 0.9 for each station). Most members of the SDSM ensemble overestimate the SD and have weaker correlations than the SDSM mean.
The Mediterranean has five stations, namely: Lleida, Barcelona, Murcia, Almería, Castellón de la Plana. Here we present the results for Lleida and Murcia stations (Fig. 7; other Taylor diagrams are available in Online Resource 6). Similar correlations are observed for WRF-D, ERAI and the SDSM mean, with values ranging between 0.4 and 0.6. However, the RMSE is better for WRF-D and the SDSM ensemble mean compared with other datasets. TRMM and GPCP datasets achieved the lowest correlations. The downscaling experiments are always located between these datasets and EOBS in the Tailor diagram. The members of the ensemble reproduced the SD for some stations (Lleida and  other regions. However, these datasets still obtain the worst results in this region. The EOBS dataset again outperforms the others. Significance of all the above results was assessed via bootstrap with resampling. Correlation values obtained for the 1000 time-series created for each experiment are shown in Fig. 9. The EOBS dataset achieved the best results for all regions and the other observational datasets (GPCP and TRMM) the worst. Correlations for both WRF experiments, ERAI and the SDSM ensemble mean fall within the range of the observational datasets. The WRF-D experiment achieved higher correlations than WRF-N in each region, but the difference was greatest in the Mediterranean region.

Comparison of downscaling techniques
Correlation values for ERAI tend to be lower than those for WRF-D and the SDSM ensemble mean, particularly in the North. However, we cannot differentiate which experiment (WRF-D or SDSM ensemble mean) is best at simulating precipitation over the IP based on these results alone. Thus, other diagnostics are applied to these experiments, and each member of the SDSM ensemble. Furthermore, we also assess the extent to which the skill of the SDSM mean and ensemble members depends on station or gridded precipitation.
The upper panel of Fig. 10 presents the LEPS for WRF-D, the mean of the SDSM experiments and individual members of both ensembles compared with observed precipitation at each station (D-STAT, SDSM-STAT and EnsSDSM-STAT, SDSM_EOBS-STAT and EnsSDSM_ EOBS-STAT respectively). Figure 10 shows that WRF-D has superior LEPS scores to the SDSM mean except for Vigo. Conversely, WRF-D is outperformed by individual SDSM members at all stations except Santander, Castellón de la Plana, Tarifa, Córdoba and Lisbon.
The lower panel of Fig. 10 shows similar results for SDSM-EOBS. Again, WRF-D outperforms the SDSM ensemble mean (except for cells overlying Vigo and Ciudad Real). As before, individual SDSM-EOBS members are superior to WRF-D at the majority of sites (with the exception of the same stations before plus A Coruna and Pamplona).  Additionally, the relationship of BSS with the elevation of EOBS and WRF grids was explored, but no significant connection was found between them. Finally, seasonal precipitation diagnostics were derived for each station, namely: mean precipitation (pav), precipitation intensity (pint), precipitation 90th quantile (pq90), maximum consecutive dry days (pxcdd), wet-day probability (pwet) and maximum five-day precipitation total (px5d). Figure 12 shows how these compare when based on observed data, WRF-D, SDSM-stations mean, SDSM-EOBS mean and the ensemble members of both SDSM experiments. For brevity, we limit our findings to winter and summer, however, results for spring and autumn are provided in Online Resources 8 and demonstrate similar behaviour to winter and summer respectively.
WRF-D outperforms SDSM when simulating mean precipitation (pav) in winter (Fig. 12, top panel). The median value for stations is 1.82 mm, compared with 1.70 mm for WRF-D, 1.52 mm for the SDSM-stations mean, 1.59 mm for In summer, observed pav is lower as expected (0.63 mm for stations) (Fig. 12, lower  panel). SDSM still has a dry bias but is now closer (0.44 and 0.59 mm) than WRF-D (0.23 mm). Precipitation intensity (pint) in winter is simulated well by the members of both SDSM ensembles (with median 7.59 mm for SDSM-stations, 7.27 mm for SDSM-EOBS, compared with 7.57 mm for stations). WRF-D is too dry (5.88 mm), but less biased than the mean of both SDSM experiments (4.94 mm for SDSM-stations, and 4.83 mm for SDSM-EOBS). In summer, the individual members of both SDSM ensembles obtained similar results to the observations.
The winter 90th percentile of precipitation (pq90) is underestimated by the mean of both SDSM experiments (with median 11.0 mm for SDSM-stations and 10.2 mm for SDSM-EOBS, compared with 17.2 mm for stations). However, the spread of the members of the SDSM ensembles is similar to observed precipitation (16.7 mm and 14.9 mm respectively). WRF-D also underestimates pq90 (13.4 mm). In summer, pq90 is significantly underestimated by WRF-D and the means of both SDSM experiments. In this case, the SDSM-stations ensemble (14.8 mm) is closer to observations (16.6 mm).
The median maximum consecutive dry days (pxcdd) for the stations is 45 days. All of the experiments underestimate the number of days: WRF-D (34 days), the means of both SDSM experiments (34 and 33 days) and the members of both SDSM ensembles (30 days in both cases). In summer, as expected, pxcdd is larger (55 days) and the behaviour of downscaling models differ from winter. Now the members of both SDSM ensembles overestimate the number of days (65 and 67 days respectively), but not as much as WRF-D (81 days). The SDSM ensemble means match observations.
The mean probability of occurrence of a wet-day (pwet) during winter is 0.24 for stations. WRF-D and the ensemble members of both SDSM experiments agree with observations (0.24 for WRF-D and SDSM-EOBS, 0.23 for SDSMstations). However, both the ensemble means overestimate pwet (0.35 for SDSM-stations and 0.34 for SDSM-EOBS). In summer, pwet is lower as expected (0.08). Both SDSM ensemble means slightly overestimate the probability (0.10), while WRF-D (0.04) and members of both ensembles (0.07) underestimate pwet.
Finally, the median maximum 5-day precipitation total (px5d) for winter at stations is 79.0 mm. This diagnostic is understimated by all downscaling models as follows: 76.6 mm (SDSM-stations members) and 68.3 mm (SDSM-EOBS members), 64.2 mm (SDSM-stations mean), 55.1 mm (WRF-D) and 53 mm (SDSM-EOBS mean). In summer, observed px5d is lower (47.3 mm), and the downscaling models exhibit the same pattern of low bias, except for SDSM-stations ensemble members, in which case the bias is positive.
Differences in downscaling skill between winter and summer reflect the contrasting precipitation mechanisms across the IP. Winter precipitation is associated with large frontal systems originating from the Atlantic Ocean Gimeno et al. 2010;Gómez-Hernández et al. 2013), so the probability of precipitation is relatively high (and dry-spells lengths low). Conversely, in summer, there is more convective precipitation (Zorita et al. 1992;Rodríguez-Puebla et al. 1998;) and the number of consecutive dry days is higher.
The above seasonal variations are reflected in downscaling model skill. WRF-D performs better in winter, when large-scale precipitation is predominant as previously shown by Cardoso et al. (2013). Similarly, SDSM is able to reproduce the persistence of rain or dry conditions as reported by Goodess et al. (2007). Furthermore, SDSM simulates total precipitation and areal rainfall well as shown by Wetterhall et al. (2007) and Hashmi et al. (2011).

Discussion
Use of precipitation as a predictor variable for SDSM calibration departs somewhat from convention but is defensible. Dynamical models simulate precipitation based on largescale and convective processes (such as precipitation from fronts or cumulonimbus clouds). However, neither convective nor microphysical processes are taken into account by other grid-scale predictors, so ERAI precipitation adds important independent information. Additionally, since the predictor suite used in the calibration of SDSM does not suffer from multicollinearity, precipitation is adding explanatory power. Other studies have also shown that precipitation produced by a numerical model can be helpful for statistical downscaling (Schmidli et al. 2006) and that the correlation skill of statistical techniques using precipitation as a predictor yields is improved over conventional downscaling (Widmann et al. 2003). Moreover, Wilks (1992) conditioned the local parameters of a stochastic daily weather generator using monthly precipitation from coarse resolution models, whereas Fealy and Sweeney (2007) designed a statistical technique to predict rainfall occurrence and amount, basing their selection of predictors on strength of correlation with precipitation.
The correlation, SD and RMSE achieved by the statistical and dynamical downscaling models against observed station data lie between those obtained for different observational datasets (EOBS, TRMM and GPCP). This not only happens on a site by site basis, but more generally at the regional level. This shows that, even though dynamically or statistically downscaled precipitation might seem far from perfect according to the correlation or RMSE metrics, they have comparable levels of agreement as between different sources of observational data. This further highlights the uncertainty in precipitation products used to evaluate downscaling methods. Such discrepancies can not simply be attributed to representativeness error because of the horizontal resolution of different datasets. For instance, GPCP and TRMM precipitation estimates involve different satellites, spatial coverage and merging with rain-gauge data, whereas EOBS precipitation estimates are based on gridding of point-based station data.
Dynamical downscaling with data assimilation outperforms the experiment without. Hence, 3DVAR data assimilation improves the quality of the simulations made with WRF, as shown by the Taylor diagrams and the bootstrap analysis (see Figs. 5,6,7,8,9). This is consistent with Navascués et al. (2013), Ulazia et al. (2016Ulazia et al. ( , 2017 and González-Rojí et al. (2018). However, such simulations are 55% more computationally demanding to perform, so our present set of WRF experiments were limited to five years. Longer simulations would be needed to reliably estimate the skill of WRF at downscaling extreme precipitation events.
The scores of the precipitation diagnostics obtained by both SDSM experiments are similar. Hence, this model is able to simulate realistic estimates of precipitation independently of the resolution of the predictands (i.e. whether point-or grid-based). Our results show that the correlation with observations and RMSE are much improved by the SDSM ensemble mean compared with individual members. Conversely, the SD deteriorates when the ensemble mean is calculated. Average precipitation does not seem to be affected, but there are important differences in the behaviour of the individual members and the mean of both SDSM ensembles for precipitation intensity, 90th percentile, wetday probability and maximum five-day precipitation.
Since the same source of predictors was used in all our downscaling experiments (ERAI) and gridded output was used to calculate diagnostics, we have undertaken a harmonized evaluation of the dynamical and statistical techniques. According to the results, the most skillful downscaling technique varies according to the choice of diagnostic. Overall, WRF-D produces more consistent results (better scores in the Taylor diagrams and some precipitation diagnostics), but it is clear that SDSM is able to produce comparable (or even better) results to WRF if the ensemble is taken into account. This result agrees with previous research. For example, Schmidli et al. (2007) reported that over flat terrain, both downscaling techniques showed similar results. Gutmann et al. (2012) compared a statistical model with WRF in the mountainous regions of Colorado, and found that statistical downscaling was able to improve the results of the original model. Casanueva et al. (2016) compared eight RCMs with five statistical downscaling methods over continental Spain, and found that statistical methods outperformed RCMs in terms of seasonal mean precipitation.
Data assimilation is not an option (due to the lack of observations) for climate change or seasonal forecast applications of WRF. Rather, only N-type WRF simulations can be performed. Additionally, in these applications, the correlation coefficient is not a fair verification index, and, as shown by Figs. 10 and 12, if the temporal occurrence of precipitation is not taken into account members of the SDSM ensemble perform better than WRF-D.
Another key-aspect to consider when one of the downscaling techniques must be chosen is the time and computational cost. The calibration step of SDSM is the most demanding and time-consuming part of this downscaling technique. Even so, the time expended on this task is modest compared with the investment require to run WRF. Moreover, the storage required for all the sites downscaled by SDSM was tiny ( < 100 Mb ) when compared with each run of WRF for the whole domain of the IP. The storage needed by the raw data for WRF-N is 2 Tb and 4.7 Tb for WRF-D. On the other hand, WRF provides the spatial distribution of simulated precipitation (and many other variables) over the whole domain compared with just 21 stations (or EOBS cells) from SDSM.
Finally, it is important to note that the selection of SDSM predictors (including precipitation) and high resolution gridded inputs means that the statistical downscaling is now more closely approximating the ingredients of the dynamical model. We have statistically downscaled ERAI data using local information from EOBS gridded data with SDSM and found that the results are comparable to those obtained with local information from station data. We see a growing opportunity for deploying both downscaling techniques in combination (Díez et al. 2005;Fernández-Ferrero et al. 2009). For example, dynamical modelling could give the spatial coverage and persistence, whereas statistical downscaling could improve the precision of highly local weather metrics (Roux et al. 2018).

Conclusions
This study shows how more consistent evaluation of statistical and dynamical downscaling techniques can be achieved by minimizing differences in model set-up, inputs and outputs. Moreover, the observational datasets against which downscaling techniques are routinely compared should not be considered as absolute truth.
We evaluated daily precipitation simulated by two downscaling techniques: WRF (boundary forcing by ERAI, with and without 3DVAR data assimilation) and SDSM (using predictors also from ERAI fit to station and gridded precipitation). By harmonizing the inputs (ERAI), validation period (2010-2014), output format (EOBS gridded precipitation) and diagnostics, the remaining differences can be more confidently attributed to the downscaling techniques themselves rather than experimental set-up. The complex landscapes and climate regimes of Iberia provide a diverse and challenging set of conditions for testing both models.
We confirm that SDSM must be calibrated on a site by site basis -exploration of the predictor sets in relation to station elevation, latitude or longitude found no coherent regional patterns or gradients. Having followed consistent calibration procedures, the statistically downscaled series were no more different to WRF series, than variations between observational datasets (EOBS, TRMM and GPCP). Our results also showed that WRF-D (with data assimilation) yields more skillful precipitation than WRF-N (without assimilation). The comparison of our experiments showed that no downscaling technique was superior across all verification metrics. According to our results, comparable correlations are obtained for the SDSM ensemble means, WRF-D and ERAI for the regions studied. However, individual members of the SDSM ensembles, were generally less skillful than WRF-D. Focusing on the precipitation metrics, the skill of SDSM was similar whether station or gridded precipitation was used for calibration. Thus, the different spatial scales involved in the station versus gridded precipitation data problem do not apparently play an important role, at least at the 20 km range checked in this study.
The best LEPS values were achieved by SDSM ensemble members at most stations (16 for SDSM-stations and 13 for SDSM-EOBS). Both WRF-D and the members of the SDSM ensemble outperformed the SDSM mean in all cases. In contrast, the BSS showed that SDSM ensemble means and WRF-D added value to the prediction of precipitation when compared with unadjusted ERAI. This was not the case for the SDSM ensemble members. This is because the BSS index accounts for the temporal occurrence of precipitation, whilst SDSM stochastically generates individual series that are not expected to match observed series.
WRF-D outperformed SDSM when simulating winter daily mean precipitation, whereas SDSM ensemble members were most skilful for precipitation intensity and the 90th percentile. This was also the case in summer. WRF-D and SDSM ensemble members reproduced observed maximum consecutive dry-days and the probability of a wet-day in winter. Maximum 5-days precipitation totals were underestimated by both downscaling techniques. In summer, WRF-D overestimated maximum consecutive dry-days and underestimated the probability of a wet-day. Both SDSM experiments behaved similarly to winter. According to this seasonal analysis, we conclude that SDSM can outperform WRF-D when simulating these indices, but WRF-D presents more consistent results between seasons.
Ultimately, these downscaled products could be useful for analysis of the long-term water balance of western Europe. SDSM generates similar results to WRF at grid scales, but can simulate weather variables for much longer periods, at lower computational costs and in considerably less time.
Hence, clear statements about the expected value-added are needed when applying WRF to climate impacts and adaptation research. However, further research is needed to explore the extent to which different types of precipitation mechanism (e.g. intense local convection or widespread frontal system) are reproduced as well by SDSM as by WRF, rather than pooling all rain days as in the present study. There is also scope for detailed analysis of multi-season extremes, or simulations of inter-annual variability.