The predictive capacity of the high resolution weather research and forecasting model: a year-long verification over Italy

Numerical models are used operationally in weather forecasting to reduce the risks of several hydro-meteorological disasters. The overarching goal of this work is to evaluate the predictive capabilities of the Weather Research and Forecasting (WRF) model over the Italian national territory in the year 2018, in two specific cloud-resolving configurations. The validation is carried out with a fuzzy logic approach, by comparing the precipitation predicted by the WRF model with the precipitation observed by the national network. By considering different intensity thresholds, the fuzzy logic technique makes it possible to identify the spatial scales at which the forecasts are reliable. The same approach is applied to evaluate the performances of the COSMO-2I model, a state-of-the-art numerical model configuration used for operational activities. For the entire year, except for summer, the model predictive capabilities are high, with useful forecasts for structures of medium intensity down to O(10 km) length scales. In summer the skills decrease, mainly because of localization errors. The work aims to provide a robust evaluation of the forecast performances of another convection-permitting operational meteorological model currently available in Italy.

has been exploited, using cutting edge observations and techniques. For example, a new operator for the assimilation of radar data was developed by Lagasio et al. (2019c), and high resolution satellite-derived maps were assimilated by Lagasio et al. (2019b) and Lagasio et al. (2019a), showing a positive impact on the forecast of heavy rainfall events in Italy. CIMA is also actively involved in research on the impact that the assimilation of high resolution water vapor maps has on the forecast of heavy rains. These activities also contribute to the development of a future geosynchronous SAR (Synthetic Aperture Radar) satellite mission (Ruiz Rodon et al. 2013; Monti Guarnieri and Rocca 2017; Monti Guarnieri et al. 2018; Lagasio et al. 2020), which would provide unprecedented observations of the water cycle on Earth. Since 2018 CIMA has provided the outputs of two WRF modelling setups to ARPA Liguria (the Regional Agency for Environmental Protection of Liguria) and to the Italian Civil Protection Department for operational purposes: an open loop setup covering the entire country (see next section for the details), and a setup with DA covering the northern part of Italy (not shown).
The overarching goal of this work is to evaluate the predictive capability of the WRF model operated by CIMA at 1.5 km horizontal grid-spacing over the Italian peninsula during the year 2018, with a focus on precipitation. Its performances are compared with a state-of-the-art operational configuration of the COSMO model, operated by ARPA Emilia Romagna at roughly 2.8 km grid-spacing (ARPAE 2020). The forecasts are validated against rainfall observations obtained by merging rain gauge measurements and meteorological radar rainfall estimates. Statistics concerning this comparison are presented through a series of indices (POD, FAR, FSS) based on a fuzzy logic approach (Ebert 2008).
The verification of numerical models is in high demand, especially for civil protection applications. This work provides the first predictive capability assessment and model comparison between the WRF and COSMO models at convection-permitting scales. A similar work was carried out by Oberto et al. (2012), who compared COSMO-I7 and WRF-NMM (Nonhydrostatic Mesoscale Model) for 2007 and 2008 over the Italian territory. They compared the model outputs at a grid-spacing of 7 km, showing that the two models have comparable biases across different precipitation intensities. This paper is structured as follows: Section 2 describes the numerical model setups, the observational data sets and the verification method. Results are presented in Section 3, while discussions and conclusions are given in Section 4.

Model configurations
The WRF model is an open source code conceived and developed since the mid-1990s by NCAR (National Center for Atmospheric Research), NOAA (National Oceanic and Atmospheric Administration), U.S. Air Force, Naval Research Laboratory, University of Oklahoma, and the Federal Aviation Administration. WRF is a mesoscale forecasting system that solves the non-hydrostatic, fully compressible Euler equations on an Arakawa-C grid with mass-based terrain-following coordinates. It is designed for both research and operational applications, and is capable of operating at spatial resolutions from hundreds of meters to hundreds of kilometers (Powers et al. 2017).
This work focuses on an operational setup currently running at CIMA on behalf of the Italian Civil Protection Department, namely the WRF-OL configuration. It is the Open Loop (OL) configuration, without data assimilation. It has three two-way nested domains with 13.5, 4.5, and 1.5 km grid-spacing and 50 vertical levels (Fig. 1). The innermost domain grid is composed of 943 × 883 points. Initial and hourly boundary conditions are taken from the NCEP-GFS with 0.25° grid-spacing. The model runs a 48-hour forecast every day, starting from 00 UTC.
WRF-OL is configured with the following physical parameterizations. The Yonsei University scheme (Hong et al. 2006) is chosen for the planetary boundary layer turbulence closure; the RRTMG shortwave and longwave schemes (Iacono et al. 2008; Mlawer et al. 1997; Iacono et al. 2000) are used for radiation; and the Rapid Update Cycle (RUC) scheme is chosen as a multi-level soil model (6 levels) with higher resolution in the upper soil layer (0, 5, 20, 40, 160, 300 cm) (Smirnova et al. 1997; Smirnova et al. 2000). For consistency with the GFS initial and boundary conditions, the New Simplified Arakawa-Schubert scheme (Han and Pan 2011) is used for cumulus convection, but only where it is needed. No cumulus scheme is activated in the two innermost domains (4.5 and 1.5 km grid-spacing), because their grid-spacing makes it possible to resolve the convection dynamics, while the scheme is activated in the outermost domain (13.5 km grid-spacing).
In the present work, a comparison between WRF-OL and COSMO-2I (ARPAE 2020) is performed. The COSMO-2I domain covers the entire Italian territory. The boundary conditions are provided by COSMO 5M, which is nested in the ECMWF-IFS global model at 0.125° grid-spacing, while initial conditions are provided by the high resolution KENDA-LETKF deterministic analysis (Schraff et al. 2016). The grid is set up with 2.8 km horizontal grid-spacing and 65 vertical levels. Forty-eight-hour forecasts are provided with two daily runs at 00 and 12 UTC. All details on the setup of the COSMO-2I and COSMO 5M models can be found at ARPAE (2020).

Data description
This work aims to evaluate the WRF-OL precipitation predictive capability by comparing the quantitative precipitation forecast (QPF) fields obtained from the model runs with the observational quantitative precipitation estimates (QPEs).
QPF fields are directly obtained from the model outputs on their native grid. Since WRF forecasts are 48 hours long, we refer to the first 24 hours of the forecast as RUN1 and to the subsequent 24 hours as RUN2.
Two observational rainfall datasets are considered: the ground-based meteorological station networks and the Italian Radar Network (IRN), both managed by the Italian Civil Protection Department. In particular, the ground-based sensors, namely thermometers, rain gauges, hygrometers and anemometers, belong to two networks: the Functional Centers Network (FCN) and the Public Administrations Unique Network (PAUN). In total, there are 5222 ground-based sensors, among which 3551 are from PAUN and 4881 are from FCN (some stations belong to both networks). They have been collecting data since 2004 and 2006, respectively. Data from 4366 rain gauges from the two ground-based networks are used in the present work.
Radar data come from the IRN mosaic operated by the Italian Civil Protection (Vulpiani et al. 2008), which covers the whole Italian territory. In particular, hourly SRT (Surface Rainfall Total) maps are obtained by suitably summing the SRI (Surface Rainfall Intensity) maps, which are provided every 10 minutes.
The QPEs used in this study are obtained by merging the ground-based precipitation data provided by both aforementioned rain gauge networks (stations belonging to both networks are counted only once) with the SRT maps from the IRN mosaic. The merging is performed with the Rainfusion method (Pignone et al. 2013; Sinclair and Pegram 2005; Silvestro et al. 2016), producing hourly maps at 1.5 km × 1.5 km horizontal resolution.
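As an illustration of this convention, the following minimal sketch splits a 48-hour forecast into its RUN1 and RUN2 parts and builds the three-hourly accumulations used later in the verification. It assumes that the QPF is available as a stack of hourly accumulations (one field per forecast hour, in mm); the function names `split_runs` and `accumulate_3h` are hypothetical and do not belong to the operational chain.

```python
import numpy as np

def split_runs(hourly_precip):
    """Split 48 hourly fields into RUN1 (first 24 h) and RUN2 (last 24 h).

    Assumption: axis 0 is the forecast hour and each field is the
    precipitation accumulated in that single hour (mm).
    """
    hourly = np.asarray(hourly_precip, dtype=float)
    if hourly.shape[0] != 48:
        raise ValueError("expected 48 hourly fields")
    return hourly[:24], hourly[24:]

def accumulate_3h(hourly_precip):
    """Sum consecutive groups of three hourly fields (mm/3 h)."""
    hourly = np.asarray(hourly_precip, dtype=float)
    n_hours = hourly.shape[0]
    if n_hours % 3 != 0:
        raise ValueError("number of hourly fields must be a multiple of 3")
    # Group the time axis into blocks of three hours and sum each block.
    return hourly.reshape(n_hours // 3, 3, *hourly.shape[1:]).sum(axis=1)

# Synthetic example: 1 mm/h everywhere on a 2 x 2 grid.
forecast = np.ones((48, 2, 2))
run1, run2 = split_runs(forecast)
run1_3h = accumulate_3h(run1)   # 8 fields of 3 mm/3 h each
```

Model outputs that store precipitation as a running total since the start of the run would first need to be differenced in time; that step is not shown here.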
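The exact summation used for the SRT product is not detailed here. A minimal sketch, assuming that each SRI map is an intensity in mm/h that is held constant over its 10-minute interval, could look as follows (the function name `hourly_srt_from_sri` is hypothetical):

```python
import numpy as np

def hourly_srt_from_sri(sri_maps_mm_per_h, step_minutes=10.0):
    """Accumulate rainfall intensity maps (mm/h) into a rainfall total (mm).

    Assumption: each map is representative of one full time step, so its
    contribution to the total is intensity * (step_minutes / 60).
    """
    sri = np.asarray(sri_maps_mm_per_h, dtype=float)
    return sri.sum(axis=0) * (step_minutes / 60.0)

# Six 10-minute maps with a constant intensity of 6 mm/h give 6 mm in one hour.
sri_stack = np.full((6, 2, 2), 6.0)
srt = hourly_srt_from_sri(sri_stack)
```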

Fuzzy logic analysis
A fuzzy logic analysis allows a comparison between observed and forecast data, with the advantage that the error is not calculated pointwise but within a spatial window in which the comparison is performed (Ebert 2008). In this study, we exploit the anywhere-in-the-window fuzzy logic approach, as described by Ebert (2008). This is a special case of the "minimum coverage" technique, which is an example of the Neighbourhood Observation-Neighbourhood Forecast strategy. Ten thresholds of precipitation intensity, namely 0.1, 0.2, 0.5, 1, 2.5, 5, 7.5, 10, 12 and 15 mm/3 h, are considered, and the agreement between QPE and QPF is evaluated starting from a single COSMO-2I pixel (2.8 km × 2.8 km = 7.84 km²) up to squares with 65 pixels per side (182 km × 182 km ≈ 33000 km²). The COSMO-2I grid is chosen as the common grid for the validation (for both modelled and observational products), because it is the coarsest. Recall that, in the fuzzy logic approach, an event is defined when the rainfall intensity (either observed or predicted) exceeds the given threshold (Ebert 2008). Three scores are calculated: the Fractions Skill Score (FSS), the Probability Of Detection (POD), and the False Alarm Ratio (FAR) (Ebert 2008).
The FSS is the main index summarizing the potential of a fuzzy logic verification. It directly compares the forecast and observation fields on a certain area affected by an event (defined when the precipitation exceeds a certain threshold in the unit time), gradually increasing the spatial dimension of the box on which the verification is carried out. It is given by

$$\mathrm{FSS} = 1 - \frac{\frac{1}{N}\sum_{i=1}^{N}\left(P_{fcs,i} - P_{obs,i}\right)^{2}}{\frac{1}{N}\left[\sum_{i=1}^{N} P_{fcs,i}^{2} + \sum_{i=1}^{N} P_{obs,i}^{2}\right]}$$

where $N$ is the number of verification boxes in the domain under study and $P$ is the fraction of each single box in which the event occurs (the subscripts fcs and obs stand for "forecast" and "observed", respectively). The FSS ranges from 0 (complete disagreement) to 1 (perfect agreement). The FSS is equal to 0 if events occur but none are forecast or, vice versa, if events are forecast but none occur. The FSS value above which the forecast is considered useful (better than the random data) is given by $\mathrm{FSS}_{useful} = 0.5 + f_0/2$, where $f_0$ is the fraction of the domain covered by the observed event (Roberts and Lean 2008). The smallest spatial window for which $\mathrm{FSS} \geq \mathrm{FSS}_{useful}$ is considered to be the useful scale.
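The FSS computation and the usefulness criterion of Roberts and Lean (2008) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the verification code used in this work: it assumes that forecast and observation are already on the same grid, that the window side is an odd number of pixels, and that the event fractions are computed on a moving window centred on each pixel, with pixels outside the domain counted as non-events. The function names are hypothetical.

```python
import numpy as np

def neighbourhood_fraction(binary, window):
    """Fraction of event pixels in a window x window box around each pixel.

    Assumptions: `window` is odd; pixels outside the domain are non-events.
    """
    events = np.asarray(binary, dtype=float)
    pad = window // 2
    padded = np.pad(events, pad, mode="constant")
    # Summed-area table with a leading row and column of zeros.
    sat = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    sat[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    ny, nx = events.shape
    counts = (sat[window:window + ny, window:window + nx]
              - sat[:ny, window:window + nx]
              - sat[window:window + ny, :nx]
              + sat[:ny, :nx])
    return counts / window**2

def fss(forecast, observed, threshold, window):
    """Fractions Skill Score for one threshold and one window size."""
    p_fcs = neighbourhood_fraction(np.asarray(forecast) > threshold, window)
    p_obs = neighbourhood_fraction(np.asarray(observed) > threshold, window)
    mse = np.mean((p_fcs - p_obs) ** 2)
    reference = np.mean(p_fcs ** 2) + np.mean(p_obs ** 2)
    # The score is undefined when no event is forecast or observed.
    return np.nan if reference == 0 else 1.0 - mse / reference

def fss_useful(observed, threshold):
    """Usefulness threshold 0.5 + f0 / 2, with f0 the observed event fraction."""
    f0 = np.mean(np.asarray(observed) > threshold)
    return 0.5 + f0 / 2.0

# Synthetic example: one observed rain cell, forecast displaced by one pixel.
obs = np.zeros((5, 5)); obs[2, 2] = 5.0
fcs = np.zeros((5, 5)); fcs[2, 3] = 5.0
fss_1px = fss(fcs, obs, threshold=1.0, window=1)
fss_3px = fss(fcs, obs, threshold=1.0, window=3)
```

In this toy case the pointwise comparison gives no credit to the displaced cell, while the 3-pixel window recovers part of the skill; the useful scale would be the smallest window at which the score reaches `fss_useful(obs, 1.0)`.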
As the dimensions of the spatial windows increase, the index tends asymptotically to a value between 0 and 1. The closer this value is to 1, the less the forecast is biased. The FSS is sensitive to rare events, such as intense precipitation peaks over limited areas. The POD is the ratio of the correctly forecast events to the events that actually occurred (range: 0-1, perfect value: 1), namely $\mathrm{POD} = \frac{\text{hits}}{\text{hits} + \text{misses}}$.
The FAR is the proportion of forecasts of the event that did not occur (range: 0-1, perfect value: 0). It is calculated as $\mathrm{FAR} = \frac{\text{false alarms}}{\text{hits} + \text{false alarms}}$, that is, it measures the fraction of forecast events that turn out to be false alarms. For both POD and FAR computation, contingency tables are considered following the "minimum coverage" technique introduced above, for different box dimensions and rainfall intensity thresholds. These three scores have different meanings: the FSS measures how the skill of precipitation forecasts varies with spatial scale (Roberts and Lean 2008); POD and FAR indicate, respectively, the probability of detecting a certain kind of event and the rate at which an event is forecast but does not actually occur.
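A possible way to build the neighbourhood contingency table and derive POD and FAR is sketched below. It assumes the anywhere-in-the-window rule, that is, an event is counted in a box as soon as at least one pixel of the box exceeds the threshold, with a moving window of odd side centred on each pixel and no events outside the domain. The function names are hypothetical and the code is only meant to illustrate the definitions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def anywhere_in_window(binary, window):
    """True where at least one event pixel falls in the surrounding box.

    Assumptions: `window` is odd; pixels outside the domain are non-events.
    """
    events = np.asarray(binary, dtype=bool)
    pad = window // 2
    padded = np.pad(events, pad, mode="constant", constant_values=False)
    return sliding_window_view(padded, (window, window)).any(axis=(2, 3))

def pod_far(forecast, observed, threshold, window):
    """POD and FAR from the neighbourhood contingency table."""
    f = anywhere_in_window(np.asarray(forecast) > threshold, window)
    o = anywhere_in_window(np.asarray(observed) > threshold, window)
    hits = np.sum(f & o)
    misses = np.sum(~f & o)
    false_alarms = np.sum(f & ~o)
    pod = hits / (hits + misses) if hits + misses > 0 else np.nan
    far = (false_alarms / (hits + false_alarms)
           if hits + false_alarms > 0 else np.nan)
    return pod, far

# Synthetic example: one observed rain cell, forecast displaced by one pixel.
obs = np.zeros((5, 5)); obs[2, 2] = 5.0
fcs = np.zeros((5, 5)); fcs[2, 3] = 5.0
pod_1px, far_1px = pod_far(fcs, obs, threshold=1.0, window=1)
pod_3px, far_3px = pod_far(fcs, obs, threshold=1.0, window=3)
```

With a single-pixel window the displaced cell counts as one miss plus one false alarm (the double penalty); enlarging the window raises the POD and lowers the FAR.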
This analysis is carried out with three-hourly QPE and QPF in the period between March 2018 and February 2019, on a seasonal basis, following the conventional definitions of spring (MAM), summer (JJA), fall (SON) and winter (DJF). Considering seasons instead of months provides larger numbers of events and, thus, robust statistics. The whole Italian territory is considered, producing charts with the performances of the first 24 hours (RUN1) and second 24 hours (RUN2) of the forecasts in terms of FSS, POD and FAR. Both QPE and QPF are interpolated on the COSMO-2I grid at 0.025° (roughly 2.8 km), excluding the sea, as discussed above.
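The seasonal grouping can be expressed as a simple month-to-season lookup. The sketch below uses hypothetical names (`SEASON_OF_MONTH`, `season_of`, `group_by_season`) and only illustrates how verification dates between March 2018 and February 2019 would be assigned to the four seasons.

```python
from collections import defaultdict
from datetime import date

# Conventional meteorological seasons, keyed by calendar month.
SEASON_OF_MONTH = {
    3: "MAM", 4: "MAM", 5: "MAM",
    6: "JJA", 7: "JJA", 8: "JJA",
    9: "SON", 10: "SON", 11: "SON",
    12: "DJF", 1: "DJF", 2: "DJF",
}

def season_of(day):
    """Return the season label (MAM, JJA, SON or DJF) of a date."""
    return SEASON_OF_MONTH[day.month]

def group_by_season(days):
    """Group an iterable of dates by season label."""
    groups = defaultdict(list)
    for day in days:
        groups[season_of(day)].append(day)
    return dict(groups)

sample = [date(2018, 3, 1), date(2018, 7, 15), date(2018, 12, 1), date(2019, 2, 28)]
by_season = group_by_season(sample)
```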

Results
From an operational forecast user perspective, fuzzy verification gives important information on the scales and intensities at which the forecasts should be trusted. The analysis of the FSS, FAR and POD scores on a seasonal basis provides a deep insight into the model predictive capability for the QPF field: a good QPF performance corresponds to high values of FSS, high POD values and low FAR values. In particular, the FSS, FAR and POD indices are shown as functions of precipitation threshold (abscissas) and spatial scale (ordinates), namely the dimension of the side of the verification box. For example, by looking at the lower right corner of these charts, it is possible to assess the model performances for events characterized by small spatial scales and high intensity, which are of primary importance for civil protection purposes.
Fig. 2 Seasonal FSS scores for WRF-OL RUN1 (panels a-d). Empty circles denote useful FSS values.

A fuzzy logic year-long comparison between WRF-OL and COSMO-2I
Figures 2 and 3 show the RUN1 FSS indices for WRF-OL and COSMO-2I, indicating that the performances of the two models are comparable. A distinct seasonal behaviour is visible in both figures, with useful forecasts covering a wider range of spatial scale and threshold combinations in spring, autumn and winter. In more detail, consider the following classes of rainfall: light rain for intensities lower than 1.5 mm/3 h; moderate rain between 1.5 and 12 mm/3 h; and heavy rain higher than 12 mm/3 h (UK Met Office 2102). For light rain, the spatial scale at which the forecast is reliable is of the order of 10 km in spring, 50 km in summer, 15 km in fall and roughly 20 km in winter. For heavy rain, instead, the reliable spatial scale is of the order of 200 km or higher (especially in summer, when no reliable scale is found for WRF-OL, and in winter, for both models). Within the range of moderate rainfall intensities, the reliable spatial scale, defined on the basis of the useful FSS values discussed above, varies significantly and connects the light and the heavy rainfall values.
The most noticeable difference between WRF-OL and COSMO-2I concerns the heavy rainfall intensities, for which WRF-OL generally appears to be statistically less reliable than COSMO-2I. However, the usefulness of a forecast on scales of the order of 200 km, reached by COSMO-2I, is questionable. On the other hand, WRF-OL shows better performances during winter, for both light and moderate rainfall intensities. Considering the WRF-OL configuration, the fact that the seasonal FSS skill variations are particularly strong for the low precipitation accumulation thresholds can be interpreted in another way. The fact that the set of useful combinations of spatial scales and threshold intensities visibly decreases in summer by shrinking in the upper part of the chart means that the performances at high precipitation thresholds do not change much over the course of the year. From a forecaster's point of view, thus, in summer it becomes more difficult to predict the location of the precipitation event, independently of its intensity. For COSMO-2I, an additional performance degradation is visible in winter.
To interpret the FSS skill behaviour, it is insightful to look at the FAR and POD indices, shown in Figs. 4 and 6 for WRF-OL and in Figs. 5 and 7 for COSMO-2I, respectively. At first glance, these figures indicate that WRF-OL has generally higher FAR and POD indices than COSMO-2I for all scales and intensities throughout the year. This indicates that WRF-OL tends to produce rainfall events more frequently than COSMO-2I, resulting in a higher probability of properly forecasting a rainfall event (Figs. 6 and 7), but also of forecasting events that do not actually occur (Figs. 4 and 5).
The seasonal behavior of the FAR and POD indices also confirms that the wrong localization of the precipitation event is responsible for the poor summer performances. In fact, similarly to what is observed for FSS, a significant difference is observed between summer and winter skills for small spatial scales, while smaller or absent seasonal variations are found at large spatial scales, even at high threshold intensities. This suggests that the match between observed and simulated rainfall objects is accomplished only when large verification areas are considered. Once again, a degradation of the COSMO-2I performances in winter is detectable. Figures 8 and 9 show the FSS indices (and the corresponding reliable scales) for the RUN2 of both models, on a seasonal basis. Other than the seasonal behavior observed for the RUN1 fuzzy logic analysis (poorer performances in summer for both WRF-OL and COSMO-2I, and poor performances in winter for COSMO-2I), the first thing to notice is that WRF-OL has a wider range of reliable spatial scales and rainfall thresholds than COSMO-2I. This happens in all seasons, except for the moderate rainfall intensities at very large spatial scales in fall. In particular, in winter and spring, WRF-OL is found to produce reliable forecasts for moderate rainfall intensities at O(50 km) scales, whereas COSMO-2I is reliable at O(100 km) scales or beyond, in the same range of rainfall intensities. As expected, all performances for RUN2 are worse than those for RUN1.
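For reference, the rainfall classes quoted above can be applied to the ten verification thresholds with a short helper. This is a sketch with hypothetical names; the treatment of the class boundaries (1.5 and 12 mm/3 h both assigned to moderate rain) is an assumption, since the text only gives the ranges.

```python
def rain_class(intensity_mm_per_3h):
    """Classify a 3-hourly rainfall intensity as light, moderate or heavy.

    Assumption: the boundary values 1.5 and 12 mm/3 h count as moderate.
    """
    if intensity_mm_per_3h < 1.5:
        return "light"
    if intensity_mm_per_3h <= 12.0:
        return "moderate"
    return "heavy"

# The ten verification thresholds (mm/3 h) used in the fuzzy logic analysis.
THRESHOLDS = [0.1, 0.2, 0.5, 1, 2.5, 5, 7.5, 10, 12, 15]
classes = {t: rain_class(t) for t in THRESHOLDS}
```

Under this assumption, the four lowest thresholds fall in the light class, the thresholds from 2.5 to 12 mm/3 h in the moderate class, and only the 15 mm/3 h threshold in the heavy class.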

Conclusions
CIMA runs the WRF model in the framework of the institutional cooperation with the Regional Agency for Environmental Protection of Liguria and the Italian Civil Protection Department. In particular, WRF is run at CIMA both for hydro-meteorological research and for operational activities. Using a fuzzy logic approach, the performances of a cloud-resolving WRF configuration, named WRF-OL, are evaluated and compared with another state-of-the-art operational model, COSMO-2I. Such an approach avoids the well-known double penalty issue in forecast verification, where small spatial mismatches between observations and forecasts produce large pointwise errors.
The fuzzy logic analysis shows that the convection-permitting WRF-OL setup is able to represent, with good accuracy, the weather conditions throughout the year. In fact, it produces reliable forecasts over the same ranges of combinations of spatial scales and rainfall intensities as the state-of-the-art operational COSMO-2I model (with slightly better performances in winter). This means that it can be fruitfully used as an operational tool, with a view to reducing the risk linked to natural phenomena involving harsh weather conditions. The skills of the model are tested in a range of spatial scales going from a few kilometers (namely, 2.8 × 2.8 km²) to roughly 200 × 200 km², and accumulated precipitation thresholds from 0.1 mm/3 h to 15 mm/3 h. Except for the combinations of high precipitation threshold and very small spatial scales, the model is able to reliably reproduce the weather phenomena for most of the year, with reliable scales that get as fine as 10 km for light rains in spring, fall and winter, and as coarse as 100 or 200 km for heavy rains in summer. It is during summer that the predictive capabilities reach a minimum, especially in terms of rainfall localization. The low skills in summer, which are due to the nature of precipitation, i.e. mostly localized and convective, are a common feature of many NWP models (Oberto et al. 2012).
It emerges that WRF-OL tends to produce more false alarms than COSMO-2I, but it also misses fewer events, having a higher probability of detection. COSMO-2I appears to underestimate the light precipitation events. This is in line with previous results that found similar behaviors with coarser grid-spacing of the same models (Oberto et al. 2012). Thus, it indicates that higher resolution alone does not solve this kind of systematic error, and that the cause has to be sought in the numerical procedures of the codes, either in the equation formulation and approximations, or among the numerous physical schemes.
A degradation of the forecast skills is found from the first day of forecast to the second one. In particular, two runs of the model (with two different initializations) referring to the same verification time show remarkable differences: RUN1, whose initialization is nearest to the time instant of interest, is more skilful than RUN2. In particular, the reliable scales of the two models are coarser in the RUN2 forecasts. WRF-OL, however, shows better skills than COSMO-2I in all seasons, and especially in winter and spring for moderate rainfall intensities.
Precipitation forecasts are affected by relatively high uncertainties, due to the inherently unpredictable nature of precipitation. This is particularly evident over the Italian territory, where the orographic features close to the coastlines may generate very complex local weather conditions. Future works will be devoted to validating the performances of a WRF operational configuration where radar reflectivity observations are assimilated, by means of 3DVAR data assimilation, to improve the model forecast capability.
Fig. 9 Seasonal FSS scores for COSMO-2I RUN2 (panels a-d). "x" symbols denote useful FSS values.