1 Introduction

Cloud cover plays an important role in many atmospheric processes. It regulates not only Earth’s water cycle but also its energy budget, and therefore radiative processes at the surface and atmospheric chemistry; it also interacts with aerosols in the atmosphere. Cloudiness affects the formation of ozone and other secondary pollutants by limiting incoming radiative fluxes to the surface layer. In meteorological and chemical transport models, e.g. WRF-Chem (Grell et al. 2005; Madronich 1987; Tie et al. 2003; Wild et al. 2000), cloud cover information is passed on to photolysis schemes, thus influencing nitrogen dioxide (NO2) oxidation rates.

Cloud amount and cloud type are among the most difficult meteorological parameters to predict. Cloud formation and dynamics depend on a wide variety of factors and processes that are not explicitly accounted for in the model, because the atmospheric system is too complex and current computational power is insufficient to resolve them. For these reasons, approximations have to be applied, which increases the uncertainty of cloud cover prediction (Johnson et al. 2015; Van Lier-Walqui et al. 2012). Since cloud microphysics interacts with many other elements of the weather system resolved by the model, those uncertainties propagate and have an adverse effect on the overall forecast quality. In air quality modeling, cloud cover also affects the estimation of pollutant concentrations, particularly ozone and other photochemical smog compounds, by regulating the amount of solar energy transferred to the surface.

There are many data types on which cloud cover forecast verification can be based (Bretherton et al. 1995). The most commonly used data, and the longest available series, are cloud fraction reports from ground-based weather stations (e.g. Qian et al. 2012). Surface data are easily accessible in real time and widely used for verification of many other meteorological parameters, such as temperature, pressure or wind speed, but for cloud cover they have some drawbacks. Although the density of stations may be sufficient for other meteorological variables, the network measuring cloudiness is very irregular and stations are located predominantly on land, so data density differs greatly between land and marine areas. Stations are also either manual or automated, and the two methods of gathering cloud fraction information may provide different outcomes (WMO 2008). Additionally, the number of synoptic stations worldwide has been decreasing (Peterson and Vose 1997; Vose et al. 1992). Another issue is the frequency of the provided data: surface stations usually report at synoptic times, whereas regional meteorological models provide data at finer temporal resolution (1 h or less). Finally, there is more than one definition of cloud fraction, and it is difficult to transform it into a variable suitable for model verification.

One data source that solves the problem of irregular and sparse coverage of surface data is meteorological radar; however, radars are designed to detect precipitation rather than cloud cover and are not commonly used for that purpose. Finally, there are satellite images, which not only have a very large spatial extent but also high spatial and temporal resolution, and the data are homogeneous across the globe. Satellite data provide images in over a hundred spectral bands, which allow the diagnosis of a variety of cloud products, from an unprocessed visible image to cloud mask, cloud top height, liquid water content, or brightness temperatures. Although these data are not always available in real time and go back only a few decades, they may serve a variety of applications related to model verification. There are two main types of satellites providing data for meteorological purposes: geostationary (e.g., the Meteosat series; Fensholt et al. 2011) and polar-orbiting (e.g. NASA’s Terra and Aqua; King et al. 2003). The main advantages of low Earth orbit satellites are their high spatial resolution, which may be finer than 1 km (down to 250 m at the sub-satellite point in the case of MODIS), and small image distortions. However, their orbit characteristics result in the data being available at irregular times, approximately 3–4 times a day. Geostationary satellites, on the other hand, which stay above a fixed point on the equator, have high temporal resolution (15 min for Meteosat Second Generation, MSG), but their spatial resolution is much lower than that of polar-orbiting satellites. MSG has 1 and 3 km resolution at the sub-satellite point for the High Resolution Visible (HRV) and infrared channels, respectively, and the resolution decreases toward the edges of the image. A further downside is that their coverage is limited by the satellite’s field of view, so polar regions are either invisible or excluded because of large distortions.

Satellite imagery can be processed into a variety of products and therefore enables various approaches to meteorological model verification (Tuinder et al. 2004). One of them is the comparison of brightness temperatures (Zingerle and Nurmi 2008; Söhne et al. 2008). Brightness temperature is usually not produced directly by meteorological models and requires additional post-processing of other model output variables. A much more straightforward approach is to use a cloud mask, which can be easily derived from cloud fractions at model levels (Crocker and Mittermaier 2013). The satellite cloud mask is derived from multiple spectral channels, usually based on visible light and supported by infrared wavelengths, through a series of cloud detection tests. These data can then be compared with the modeled cloud mask to evaluate the model results.

Meteorological model evaluation can also be based on various methods. One of them, referred to as categorical verification, uses grid-to-grid comparison; another, object-based verification, treats the features being verified as objects. In this study, we use both approaches to compare and quantify the differences between the cloud mask derived from the Weather Research and Forecasting (WRF) meteorological model simulation and satellite data. Four different microphysics parameterizations are tested for a selected period favorable to the formation of tropospheric ozone. Finally, for two microphysics parameterizations, ozone concentrations are calculated with the WRF-Chem model, and the influence of the microphysics scheme on modeled O3 is described using the example of an episode of high ozone concentrations observed in central Europe.

There are two main aims of this study. The first is to evaluate the WRF model performance for cloud cover, using satellite data and an object-based verification approach, and to test the model sensitivity to various microphysics schemes. The second is to examine the sensitivity of the WRF-Chem modeled ozone to the selected microphysics schemes. The simulation providing the highest model-measurement agreement will be used in further studies of tropospheric ozone in Poland.

2 Data and Methods

2.1 Study Area and Period

The analysis is performed for the area of Poland, which is characterized by a transitional type of climate, with polar continental and polar maritime air masses being the two main drivers of weather conditions. This makes weather in Poland very changeable and difficult to predict. Episodes with a stagnant anticyclone, providing many sunshine hours, high temperatures and low wind speeds, are not uncommon. This type of weather is very favorable for ground-level ozone formation, which is a major issue particularly for large cities and their peripheries. The EU Directive 2008/50/EC goal for 2010 has not been met and threshold values are still being exceeded (Krzyścin et al. 2013; Staszewski et al. 2012). Because one of the main aims of this study is to quantify the impact of selected microphysics parameterizations on air quality modeling, the test period is a high ozone episode of June 17th–July 4th, 2008. At that time, a vast anticyclone prevailed over Poland (Fig. 1), with low wind speeds and high temperatures, which allowed photochemical smog to form in large cities, and high concentrations of ground-level ozone were observed. The 1-h average threshold value of 180 μg m−3 set by the aforementioned EU Directive was exceeded at least once at four stations in Poland.

Fig. 1

Synoptic situation for the first day of the study period (17.06.2008). Similar conditions prevailed throughout the whole period (17.06-04.07.2008)

2.2 The WRF Model

In this study, a multi-scale meteorological model, the Weather Research and Forecasting (WRF) version 3.5 (Skamarock and Klemp 2008) is used for the area of Poland. Simulations are performed for three one-way nested domains with grid size of 45 km × 45 km for the outermost, 15 km × 15 km for the intermediate, and 5 km × 5 km for the innermost domain, covering the area of interest. The model has 38 vertical layers with model top at 50 hPa. The domain configuration is presented in Fig. 2. Four simulations were run, each with a different microphysics parameterization—Purdue Lin (Lin et al. 1983), Eta Ferrier (Rogers et al. 2005), WRF Single-Moment 6-class (Hong and Lim 2006), and Morrison 2-Moment (Morrison et al. 2009), referred to as SIM1, SIM2, SIM3 and SIM4, respectively. Purdue Lin and Morrison schemes are currently the only two microphysics options that account for aerosol direct effects and are both widely used in WRF-Chem simulations (Forkel et al. 2015; Saide et al. 2012; Zhang et al. 2012). Eta Ferrier and WSM 6-class are also used in many applications, including model evaluation based on satellite data (Grasso et al. 2014; Otkin and Greenwald 2008), studies of model sensitivity to microphysics for convective conditions (Hong et al. 2009) and heavy precipitation episodes (Segele et al. 2013). Other physics options remained the same for all model runs and include the Kain-Fritsch cumulus scheme, Yonsei University PBL scheme, unified Noah land-surface model, and RRTMG (Iacono et al. 2008) and RRTM (Mlawer et al. 1997) shortwave and longwave radiation, respectively. The model was initialized by the ERA-Interim data, available every 6 h with 0.7° × 0.7° horizontal resolution.

Fig. 2

WRF model domain configuration. Domains d01, d02 and d03 have spatial resolutions of 45 km × 45 km, 15 km × 15 km, and 5 km × 5 km, respectively. Results from domain d03 are analyzed

After evaluation of the cloud cover mask for the four WRF model simulations, the best and the worst configurations, in terms of agreement with the satellite data, were used for the WRF-Chem model runs for the end of the study period, June 30th to July 4th. Details of the WRF and WRF-Chem model configurations are provided in Table 1. Because the differences between the two model runs are of interest here, a simple approach was applied, including restriction of the temporal variations in emissions from nature, while the TNO MACC II emissions (Kuenen et al. 2014) are assumed constant during the entire simulation. The chemical boundary conditions of trace gases consist of idealized, northern hemispheric, mid-latitude, clean environmental profiles based upon the results from the NOAA Aeronomy Lab Regional Oxidant Model (Liu et al. 1996). These simplifications made it computationally efficient to study the impact of microphysics parameterization on ozone concentrations, but they also affected the agreement of the chemistry model with the measurements.

Table 1 Physics and chemistry parameterizations used in WRF and WRF-Chem model runs

2.3 Measurements for Model Evaluation

The dataset used for evaluation of the model results is the cloud mask product derived from the Meteosat Second Generation (MSG) SEVIRI instrument satellite imagery (Derrien and Raoul 2010). This geostationary satellite offers high and constant temporal resolution, consistent with the WRF model output times (1 h), which is why it was chosen over MODIS despite its lower spatial resolution.

To generate this product, a High Resolution Visible (HRV) channel and 11 infrared channels were used; the infrared channels are particularly useful for nighttime hours and necessary for distinguishing clouds from e.g. snow cover. Data are available every 15 min, but here the images at full hours were used to match the WRF model output. The final cloud mask product is obtained from Eumetsat, after a series of tests determining whether each grid cell is clear or cloudy. The cloud mask is a pessimistic field, which means that a grid cell can be classified as clear of clouds only if it passes every test. The full methodology of generation of the cloud mask product is described by Derrien and Raoul (2010). The HRV channel has a 1 km × 1 km resolution at the sub-satellite point, whereas the remaining channels have a 3 km × 3 km grid. The final product resolution is reduced to the coarser grid resolution. Because of the curvature of the Earth, resolution decreases with distance from the sub-satellite point, and for Poland it drops to approximately 6–7 km. This is close to the spatial resolution of the innermost domain (d03) of the WRF model (5 km × 5 km), and the satellite data are resampled to the WRF grid for the spatial comparison.
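The "pessimistic field" rule described above can be illustrated with a minimal Python sketch. The two test grids and the function name `pessimistic_cloud_mask` are hypothetical and serve only as an illustration; the actual detection tests are those documented by Derrien and Raoul (2010).

```python
def pessimistic_cloud_mask(test_results):
    """Combine per-test clear-sky flags into one binary cloud mask.

    test_results: list of 2-D grids (lists of rows), one grid per detection
    test, where True means "this cell passed the test as clear".
    Returns a grid with 1 = cloudy and 0 = clear.
    """
    n_rows = len(test_results[0])
    n_cols = len(test_results[0][0])
    mask = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            # A cell is clear only if every test reports clear;
            # a single failed test is enough to flag it as cloudy.
            clear = all(test[i][j] for test in test_results)
            row.append(0 if clear else 1)
        mask.append(row)
    return mask


# Hypothetical outcomes of two tests on a 2 x 3 grid (illustration only)
visible_test = [[True, True, False], [True, False, True]]
infrared_test = [[True, False, False], [True, True, True]]
cloud_mask = pessimistic_cloud_mask([visible_test, infrared_test])
print(cloud_mask)
```

Because any failed test marks a cell as cloudy, such a mask tends to show more cloud than a majority-vote combination of the same tests would.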

For evaluation of other meteorological parameters, data from 57 synoptic stations in Poland were used. Ozone concentrations modeled with WRF-Chem were compared with hourly data derived from AirBase, from urban (Wrocław—Korzeniowskiego, WRK), suburban (Wrocław—Bartnicza, WRB), and regional background station (Śnieżka, SNI) in SW Poland.

2.4 Evaluation of the Model Results Using the Cloud Cover Mask

There are multiple approaches that can be adopted for verification of cloud cover modeling. Here, two methods are used to evaluate the simulation results. The first is categorical verification, which is probably the most widely used method. It is based on grid-to-grid comparison of measured and modeled values, from which a contingency table is built and various skill scores may be calculated. The main weakness of this method is that it underestimates model skill when the analyzed phenomena are shifted in space. To overcome that weakness, an object-based (objective) verification method can be used. This approach was initially developed for rainfall data, which remains its most common use, but it can be adapted to other applications, including cloudiness (Crocker and Mittermaier 2013). In this approach, objects rather than grid cells are analyzed. An object is a continuous area that fulfills a certain criterion, e.g. occurrence of precipitation or cloud cover. In this paper both approaches are used for evaluation of cloud cover simulations and the results are compared. A comparison of example maps, including the percentage of area covered by clouds and the number of cloud patches for the satellite product and the WRF simulations, is also made.

2.5 Categorical Verification

Categorical verification involves a simple and intuitive approach that compares corresponding grid cells of observation and forecast. It can be applied to any phenomenon with values broken into categories; however, the most common use is for binary forecasts, e.g. occurrence of rainfall or cloud cover. In this case, a 2 × 2 contingency table is built, presenting the count of grid cells falling into each of four categories: hits, misses, false alarms, and correct negatives (Table 2). A number of error measures can be calculated based on these data, four of which were selected for this study: Threat Score (TS), Probability of Detection (POD), False Alarm Ratio (FAR), and Frequency Bias Index (FBI; Table 3). Threat Score, also known as the Critical Success Index, measures the fraction of observed and forecast events that were correctly forecast (Gilbert 1884). It ranges from 0 (no skill) to 1 (perfect score). It is sensitive to the climatological frequency of the event and produces lower scores for rare events (Schaefer 1990). However, it allows different model runs to be compared for the same domain and period of time, which is one of the aims of this study. Probability of Detection, also known as Hit Rate, measures the fraction of observed events that were correctly forecast. It also ranges from 0 (no skill) to 1 (all observed events were predicted). It is sensitive only to misses and hits and can be improved by overforecasting (Jolliffe and Stephenson 2003). Probability of Detection is usually used together with False Alarm Ratio, which measures the fraction of “yes” forecasts that were false alarms. It ranges from 0 (no false alarms) to 1 (all “yes” forecasts were incorrect). In contrast to POD, it can be improved by underforecasting (Wilks 2006). Frequency Bias Index determines whether the model is under- or overforecasting the analyzed phenomenon. It ranges from 0 to infinity, with 1 as the perfect score. It should be noted that FBI is not a measure of model accuracy, since it does not provide information on the magnitude of forecast errors (Jolliffe and Stephenson 2003). A summary of the skill scores used in this study is provided in Table 3. Because only corresponding grid cells are compared in categorical verification, the so-called double penalty problem is an important issue. For example, when the forecast is even slightly shifted in space, the error may be counted twice: once as a miss and once as a false alarm. This may falsely reduce the model skill score even though the event was, in fact, forecast. In object-based verification methods this issue is eliminated, because objects rather than individual grid cells are analyzed, and the horizontal shift is accounted for as part of the SAL measure.

Table 2 Contingency table used for categorical verification
Table 3 Skill scores calculated for cloud cover based on contingency table (above; a hit, b false alarm, c miss, d correct negative)
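As an illustration of the scores summarized in Table 3, the following Python sketch counts the contingency table entries (a hit, b false alarm, c miss, d correct negative) for two binary masks and computes TS, POD, FAR and FBI. The formulas are the standard definitions that match the verbal descriptions in the text; the function names and the toy grids are assumptions made for this example. The example forecast has the correct cloud size but is shifted by one cell, which reproduces the double penalty problem.

```python
def contingency_counts(observed, forecast):
    """Count hits (a), false alarms (b), misses (c), correct negatives (d)."""
    a = b = c = d = 0
    for obs_row, fc_row in zip(observed, forecast):
        for obs, fc in zip(obs_row, fc_row):
            if fc and obs:
                a += 1          # cloud forecast and observed
            elif fc and not obs:
                b += 1          # cloud forecast but not observed
            elif obs:
                c += 1          # cloud observed but not forecast
            else:
                d += 1          # clear in both fields
    return a, b, c, d


def skill_scores(a, b, c, d):
    """Return TS, POD, FAR and FBI; NaN where a denominator is zero."""
    nan = float("nan")
    return {
        "TS": a / (a + b + c) if (a + b + c) else nan,
        "POD": a / (a + c) if (a + c) else nan,
        "FAR": b / (a + b) if (a + b) else nan,
        "FBI": (a + b) / (a + c) if (a + c) else nan,
    }


# Toy example: one observed cloud patch, forecast one cell too far east
observed = [[0, 1, 1, 0, 0],
            [0, 1, 1, 0, 0],
            [0, 0, 0, 0, 0]]
shifted = [[0, 0, 1, 1, 0],
           [0, 0, 1, 1, 0],
           [0, 0, 0, 0, 0]]
counts = contingency_counts(observed, shifted)
scores = skill_scores(*counts)
print(counts, scores)
```

In this toy case FBI equals 1 (the cloud amount is right), yet the one-cell shift produces two misses and two false alarms, so TS drops to one third.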

2.6 Objective Verification

The Structure–Amplitude–Location (SAL) method was originally developed as a tool for verification of precipitation field forecasts (Wernli et al. 2008). After simplification, the approach can be successfully applied also for other binary variables, such as cloud mask, which has been done previously, for example, by Crocker and Mittermaier (2013) or Zingerle and Nurmi (2008).

First, separate event fields (objects) need to be identified within a given domain. These objects are then compared to the respective observed fields, e.g. from Doppler radars or, in this case, satellite images. Afterward, geometric features of the objects, in this case cloud cover (C_obs: cloud cover from the satellite image, C_mod: cloud cover from the model), are compared. The first parameter is structure (S), which is defined in terms of the average volume of objects; because the cloud mask field is uniform, each object can be treated as flat and the structure component describes only its size (denoted as V in Eq. 1). S takes values from −2 to 2, where negative values mean that the model underestimates the average size of objects and positive values mean overestimation

$$S = \frac{{V({C_{\rm mod }}) - V({C_{{\rm{obs}}}})}}{{0.5[V({C_{\rm mod }}) + V({C_{{\rm{obs}}}})]}}.$$
(1)
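A minimal Python sketch of Eq. 1 for binary masks is given below. It rests on two assumptions that are not specified in the text: objects are formed from cells sharing an edge (4-connectivity), and V is taken as the plain mean object area in grid cells, whereas the original SAL formulation of Wernli et al. (2008) uses a weighted mean. The function names and toy grids are hypothetical.

```python
def label_objects(mask):
    """Return a list of objects; each object is a list of (row, col) cells.

    Cloudy cells are joined when they share an edge (assumed 4-connectivity).
    """
    n_rows, n_cols = len(mask), len(mask[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    objects = []
    for i in range(n_rows):
        for j in range(n_cols):
            if mask[i][j] and not seen[i][j]:
                seen[i][j] = True
                stack, cells = [(i, j)], []
                while stack:  # flood fill from the seed cell
                    r, c = stack.pop()
                    cells.append((r, c))
                    for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                        inside = 0 <= nr < n_rows and 0 <= nc < n_cols
                        if inside and mask[nr][nc] and not seen[nr][nc]:
                            seen[nr][nc] = True
                            stack.append((nr, nc))
                objects.append(cells)
    return objects


def mean_object_size(mask):
    """V in Eq. 1: mean object area in grid cells (0.0 for an empty mask)."""
    objects = label_objects(mask)
    if not objects:
        return 0.0
    return sum(len(obj) for obj in objects) / len(objects)


def structure(mask_mod, mask_obs):
    """S component (Eq. 1); returns 0.0 when both masks are empty (assumed)."""
    v_mod, v_obs = mean_object_size(mask_mod), mean_object_size(mask_obs)
    if v_mod + v_obs == 0:
        return 0.0
    return (v_mod - v_obs) / (0.5 * (v_mod + v_obs))


# Toy example: one observed 6-cell patch, two small modeled patches
obs = [[1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0],
       [0, 0, 0, 0, 0]]
mod = [[1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0]]
s_value = structure(mod, obs)
print(s_value)
```

Here the modeled objects have a mean size of 1.5 cells against 6 cells observed, giving a strongly negative S.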

The second component of the SAL measure is amplitude (A), which compares the domain-average cloud field. It can be interpreted as the degree to which the model over- or underestimates the total amount of clouds in the domain. For data with continuous values, the size is understood as the total volume of objects, whereas for binary data it is the total area (D in Eq. 2). A also takes values from −2 to 2, with negative values meaning underestimation of the total cloud amount within the domain and positive values meaning overestimation. Note that the structure and amplitude components of SAL are nonlinear: for example, S = −1 means that the model underestimates the average cloud size by a factor of three, and a similar statement is true for amplitude. In general, S and A values depend on the observed total cloud amount and cloud size and therefore cannot be directly compared to studies for another region or episode. However, they allow the performance of different models to be assessed for a fixed domain

$$A = \frac{{D({C_{{\rm{mod}}}}) - D({C_{{\rm{obs}}}})}}{{0.5[D({C_{{\rm{mod}}}}) + D({C_{{\rm{obs}}}})]}}$$
(2)
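Eq. 2 and the nonlinearity of the S and A scales can be sketched in Python as follows. Solving Eq. 1 or Eq. 2 for the model-to-observation ratio gives (2 + A)/(2 − A), which is how a value of −1 translates into a factor of three. The function names are hypothetical, and the example reuses the 39 % and 59 % cloud fractions quoted in the Results for one example scene.

```python
def count_cloudy(mask):
    """D in Eq. 2 for a binary mask: total number of cloudy grid cells."""
    return sum(1 for row in mask for cell in row if cell)


def amplitude_from_totals(d_mod, d_obs):
    """A component (Eq. 2) from total cloudy areas; 0.0 if both are zero (assumed)."""
    if d_mod + d_obs == 0:
        return 0.0
    return (d_mod - d_obs) / (0.5 * (d_mod + d_obs))


def amplitude(mask_mod, mask_obs):
    """A component (Eq. 2) computed directly from two binary masks."""
    return amplitude_from_totals(count_cloudy(mask_mod), count_cloudy(mask_obs))


def ratio_from_component(value):
    """Model-to-observation ratio implied by an S or A value.

    Obtained by solving Eq. 1 or Eq. 2 for the ratio; undefined for value = 2,
    which corresponds to an empty observed field.
    """
    return (2 + value) / (2 - value)


# Cloud fractions of 39 % (model) and 59 % (satellite) on a common grid
a_example = amplitude_from_totals(39, 59)
print(round(a_example, 3))

# A value of -1 corresponds to the model giving one third of the observed amount
print(ratio_from_component(-1.0))
```

The inverse mapping is convenient when reading a SAL diagram, since equal distances along the S or A axis do not correspond to equal ratios.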

For the location component, two parts of the measure are calculated. The first parameter (L1) is the distance between the observed and predicted domain-wide centers of mass (X in Eq. 3), normalized by the diagonal length of the domain (d in Eqs. 3 and 4). The second parameter (L2) compares the observed and predicted average distance between the objects' centers of mass and the overall center of mass of the domain (denoted as r in Eq. 4). For binary data, the center of mass is simply the geometrical center. The L component is defined as the sum of L1 and L2 (Wernli et al. 2008)

$${L_1} = \frac{{\left| {X({C_{{\rm{mod}}}}) - X({C_{{\rm{obs}}}})} \right|}}{d}$$
(3)
$${L_2} = 2\left[ {\frac{{\left| {r({C_{{\rm{mod}}}}) - r({C_{{\rm{obs}}}})} \right|}}{d}} \right]$$
(4)
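Eqs. 3 and 4 can be sketched in Python for binary objects as below. Several choices here are assumptions rather than details given in the text: objects are passed as lists of (row, col) cells, distances are measured in grid cells, d is taken as the distance between opposite corner cell centers, and r is the area-weighted mean distance of object centers from the domain-wide center, following the weighting idea of Wernli et al. (2008).

```python
import math


def centroid(cells):
    """Geometrical center of a list of (row, col) cells."""
    n = len(cells)
    return (sum(r for r, _ in cells) / n, sum(c for _, c in cells) / n)


def center_and_spread(objects):
    """Return X (domain-wide center) and r (weighted mean object distance)."""
    all_cells = [cell for obj in objects for cell in obj]
    if not all_cells:
        raise ValueError("location is undefined for a field without objects")
    x = centroid(all_cells)
    # Each object's distance from X is weighted by its area (assumption)
    r = sum(len(obj) * math.dist(centroid(obj), x) for obj in objects) / len(all_cells)
    return x, r


def location_components(objects_mod, objects_obs, n_rows, n_cols):
    """Return (L1, L2, L) following Eqs. 3 and 4."""
    d = math.hypot(n_rows - 1, n_cols - 1)  # domain diagonal in grid cells
    x_mod, r_mod = center_and_spread(objects_mod)
    x_obs, r_obs = center_and_spread(objects_obs)
    l1 = math.dist(x_mod, x_obs) / d
    l2 = 2 * abs(r_mod - r_obs) / d
    return l1, l2, l1 + l2


# Case 1: one compact cloud, modeled three cells too far east (4 x 5 domain)
shifted = location_components([[(0, 3), (0, 4)]], [[(0, 0), (0, 1)]], 4, 5)

# Case 2: same overall center, but the model splits the cloud into two
# distant cells, which only the L2 part can detect
scattered = location_components([[(0, 0)], [(0, 4)]], [[(0, 1), (0, 2), (0, 3)]], 4, 5)
print(shifted, scattered)
```

The two cases show why both parts are needed: a pure displacement affects only L1, whereas a different scatter of objects around a common center affects only L2.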

The results of object-based verification are then presented on SAL diagrams, which show the values of all components and the relationship between them (Fig. 5). Because the S and A components have the same range of values, they are represented on the axes, whereas the value of L is represented by the color of the data points. Dotted lines denote the median values of S and A, and the sides of the rectangle are the first and third quartiles. These elements facilitate interpretation of the diagram: the closer the dotted lines are to the center of the diagram and the smaller the rectangle, the more accurate the forecast.
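The summary elements of a SAL diagram (the central lines and the quartile rectangle, with medians as stated in the caption of Fig. 5) can be computed with the Python standard library, as in the sketch below. The sample values are invented for illustration, and the "inclusive" quantile method is an assumption, since the text does not state which quartile definition was used.

```python
import statistics


def sal_summary(values):
    """Return the first quartile, median and third quartile of S or A values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3}


# Invented S values for five time steps (illustration only)
s_values = [-1.2, -0.8, -0.4, 0.0, 0.4]
summary = sal_summary(s_values)
print(summary)
```

Applying the same function to the A values gives the second pair of rectangle sides.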

2.7 Evaluation of Meteorological Variables and Ozone Concentration

Besides cloud cover, the impact of the microphysics scheme on three surface meteorological variables was analyzed: air temperature and relative humidity at 2 m, and wind speed at 10 m. Three statistical metrics were calculated for each parameter for all model runs based on observational data from synoptic stations: Mean Error (ME), Mean Absolute Error (MAE), and Index of Agreement (IOA). Mean Error was selected to show how much the model under- or overestimates measured values, whereas Mean Absolute Error shows the absolute value of errors. Index of Agreement is a standardized measure of the overall model-measurement agreement (Willmott 1981). The formulas and value range of the above statistics are presented in Table 4.

Table 4 Error statistics calculated for temperature, relative humidity, wind speed and ozone concentrations
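The three statistics can be sketched in Python as follows. The sign convention for Mean Error (model minus observation, so that positive values mean overestimation) is an assumption consistent with the description above, and the Index of Agreement follows the original definition of Willmott (1981). Table 4 remains the reference for the formulas used in the study; the sample values are invented.

```python
def mean_error(model, obs):
    """ME: average of (model - observation); positive means overestimation."""
    return sum(m - o for m, o in zip(model, obs)) / len(obs)


def mean_absolute_error(model, obs):
    """MAE: average magnitude of the errors, regardless of sign."""
    return sum(abs(m - o) for m, o in zip(model, obs)) / len(obs)


def index_of_agreement(model, obs):
    """IOA (Willmott 1981): 1 is perfect agreement, 0 is no agreement.

    Undefined (division by zero) if all values equal the observed mean.
    """
    obs_mean = sum(obs) / len(obs)
    squared_error = sum((m - o) ** 2 for m, o in zip(model, obs))
    potential_error = sum(
        (abs(m - obs_mean) + abs(o - obs_mean)) ** 2 for m, o in zip(model, obs)
    )
    return 1 - squared_error / potential_error


# Invented 2-m temperatures (deg C): the model is 1 degree too warm throughout
observed_t2 = [10.0, 12.0, 14.0]
modeled_t2 = [11.0, 13.0, 15.0]
me = mean_error(modeled_t2, observed_t2)
mae = mean_absolute_error(modeled_t2, observed_t2)
ioa = index_of_agreement(modeled_t2, observed_t2)
print(me, mae, round(ioa, 3))
```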

After the analysis of meteorological model simulations, the best and the worst simulations were selected for the WRF-Chem model runs. For these simulations, the spatial distribution of mean O3 concentration and the differences between model runs are presented. For three air quality measurement stations representing different environments, the temporal variability of measured and modeled 1-h average concentrations was compared.

3 Results and Discussion

3.1 Cloud Cover

Figures 3 and 4 present example cloud mask images from the satellite product and the four WRF simulations for morning (9 AM UTC, 11 AM local time) and afternoon (3 PM UTC, 5 PM local time) hours. In both cases, the locations of modeled cloudy areas correspond to the satellite-derived image, but the total cloud amount in the domain is smaller (39 % for SIM4 compared to 59 % on the satellite image), particularly in the afternoon. Differences between simulations are much less pronounced than those between the model and the satellite product, which suggests that the selection of the microphysics scheme has limited impact on the cloud mask results. The modeled clouds form patches of small cells rather than one vast cloudy area as in the satellite image: every simulation gives at least twice as many cloud cells as the satellite. There are two reasons for this. The first is that the cloud mask product generated from Meteosat images has coarser spatial resolution over Poland than the innermost WRF model domain. After resampling to the spatial resolution of the WRF model, the number of cells marked as cloudy might increase. This is also possibly the main reason why, for all model simulations, a set of orographic clouds is visible in the Carpathians in the morning, whereas the satellite image shows a single cloud patch. The second reason is that the entire WRF model grid cell has to reach saturation before it is marked as cloudy. Under summer convective conditions this might be unlikely, and therefore the WRF model provides a lower number of cloudy grid cells than the satellite data. This is also supported by the larger differences between the WRF and satellite cloud masks for the afternoon hours than for the morning (Figs. 3, 4).

Fig. 3

An example of MSG satellite cloud mask product and WRF simulation results for 20 June 2008, 9 AM UTC

Fig. 4

An example of MSG satellite cloud mask product and WRF simulation results for 20 June 2008, 3 PM UTC

The differences between simulations are much smaller than the differences between any of the simulations and the satellite cloud mask. SIM4 produces the largest cloud amount and SIM1 the smallest. There is also a distinct quantitative difference between SIM4 and the other simulations: cloud cells are larger and cover more area, which is supported by the value of FBI (Table 5).

Table 5 Categorical verification measures calculated for all WRF runs (TS Threat score, POD Probability of Detection, FAR false alarm ratio, FBI Frequency Bias Index)

3.2 The SAL Method

The results of the simulations evaluated with the SAL method are shown in Fig. 5. For all simulations, both cloud size and total cloud amount, represented by the S and A components, are underestimated by the model, as the majority of data points lie in the lower left quadrants of the plots. The main cause is that WRF does not account for subgrid-scale cumulus clouds in the cloud fraction output, which leads to underestimation of modeled cloud cover, as the whole grid cell needs to be saturated to produce cloud. The satellite cloud mask, on the other hand, is a pessimistic field, which means that only the cells that pass all tests can be flagged as cloud-free; this increases the discrepancy between modeled and satellite-derived cloud cover. The best S and A values are obtained for SIM4, as the rectangle bounded by the first and third quartiles of S and A is small and located closest to the center of the diagram. This may be explained by the fact that Morrison Double-Moment is the most sophisticated of the selected microphysics options and the only double-moment scheme. SIM3 and SIM2 present similar performance, whereas SIM1 underestimates both cloud amount and size the most. For all simulations, the points with S and A components close to zero generally also have small L values; however, there are some exceptions, particularly in the lower right quadrant. There is a high density of data points for which the location component is large while structure is significantly underestimated and amplitude is close to the median value. There are very few points with overestimated cloud amount and size, and most of them have small to moderate L values. There are almost no data points with underestimated amount and overestimated cloud size at the same time. This is expected because grid cells on the edges of clouds are less likely to reach saturation, which decreases both cloud size and total cloud cover. In contrast, a study conducted by Crocker and Mittermaier (2013) for the United Kingdom shows that the UK4 and UKV models tend to overestimate cloud cover.

Fig. 5

SAL diagrams for all WRF simulations. Structure (Eq. 1) and Amplitude (Eq. 2) values are given by the position of the point on the diagram, and the Location (Eqs. 3 and 4) value is given by its color. Dotted lines indicate median values and the rectangles enclose points within the 1st and 3rd quartiles of Structure and Amplitude

3.3 Categorical Verification

Table 5 shows four categorical verification measures. The results indicate poor model performance, with TS not exceeding 0.5. As this measure is sensitive to both misses and false alarms, it is essential to examine which element had the most influence on the results. The values of POD are very low, which indicates a large fraction of missed events. It can therefore be concluded that forecasting clear sky when cloud cover is present is a major issue, which is caused by subgrid-scale cloudiness not being resolved by the microphysics schemes in WRF. It also shows that nearly half of the observed cloudy grid points are not resolved by the model. SIM4 gives the best result in terms of TS and FBI, the latter being very close to unity, but its False Alarm Ratio is also higher than for the remaining simulations. This suggests that the high threat score arises because this model run forecasts more cloud than the other simulations, and it cannot necessarily be attributed to model skill.

The values of POD averaged for each hour of the day are presented in Fig. 6. All simulations present a similar trend, with the lowest value shortly after sunrise and the highest in the late afternoon (above 0.6 for SIM4). This indicates that the WRF model is more skilled in resolving afternoon than morning cloudiness. However, one has to be careful in drawing direct conclusions, since most skill scores depend on the total observed or modeled cloud amount. SIM4 has the highest values of all simulations for all but one hour, and the differences are the largest for 17:00–19:00 (up to 0.04). The results are much poorer for SIM1 and SIM2, where this parameter falls below 0.4. However, a better POD score is usually associated with a larger FAR, because POD may be improved by overforecasting: the number of hits (to which POD is sensitive) becomes larger, but the number of false alarms, to which FAR is sensitive, also rises (Fig. 7).

Fig. 6

Hourly values of Probability of Detection (POD) averaged for the study period for each of the four simulations

Fig. 7

Hourly values of False Alarm Ratio (FAR) averaged for the study period for each of the four simulations
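The hourly averaging behind Figs. 6 and 7 can be sketched in Python as follows, here for POD. The record layout (hour of day, hits, misses per model output time) and the sample counts are invented for illustration. Averaging the per-time-step scores within each hour is an assumption; pooling the counts before computing the score is an equally plausible reading of the text.

```python
from collections import defaultdict


def hourly_pod(records):
    """Average POD per hour of day.

    records: iterable of (hour_of_day, hits, misses), one per output time.
    Time steps without any observed cloud are skipped (POD undefined).
    """
    by_hour = defaultdict(list)
    for hour, hits, misses in records:
        if hits + misses > 0:
            by_hour[hour].append(hits / (hits + misses))
    return {hour: sum(pods) / len(pods) for hour, pods in sorted(by_hour.items())}


# Invented counts for two days at 09 and 15 UTC (illustration only)
records = [(9, 40, 60), (9, 50, 50), (15, 60, 40), (15, 70, 30)]
pod_by_hour = hourly_pod(records)
print(pod_by_hour)
```

The same grouping works for FAR by passing hits and false alarms and adjusting the ratio accordingly.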

3.4 Meteorological Variables

Modeled temperature, relative humidity and wind speed are evaluated based on hourly data from synoptic stations located in Poland. The results are summarized in Table 6. Temperature and humidity are overestimated and wind speed is underestimated by all model runs, as shown by the Mean Error. The differences in Mean Absolute Error between simulations are small. Model-measurement agreement for wind speed, represented by IOA, shows a significant advantage of SIM2 over the other simulations. Generally, there are small differences between the WRF model runs with different microphysics schemes, with SIM2 showing slightly better performance. Because the study period is dominated by a stagnant anticyclone with low wind speed and no precipitation, the differences between model runs with different microphysics schemes are not pronounced. However, studies conducted for longer and more diverse periods show that the Morrison Double-Moment scheme provides the most consistency with observations for meteorological variables and aerosol concentrations (Baró et al. 2015).

Table 6 Error statistics calculated for WRF simulations of three meteorological variables: temperature (T2), relative humidity (RH2), and wind speed (WSPD)

3.5 Ozone Concentrations

Figure 8 presents average 1-h ozone concentrations in the innermost WRF model domain for SIM1 and Fig. 9 for SIM4. Both maps show a similar spatial pattern, with O3 increasing toward the south-west and reaching 90 µg m−3 in the Czech Republic. The concentrations modeled with SIM4 are generally higher than with SIM1, particularly for areas with higher O3 levels and over the Baltic Sea in the north, where the differences between SIM4 and SIM1 exceed 7 µg m−3 (Fig. 10). The differences between model runs are confirmed by the time series charts in Fig. 11, which show that SIM4 produces higher O3 levels for all sites. The better performance of the simulation running with Morrison microphysics may be a result of the fact that it is a double-moment scheme that takes into account aerosol direct effects. Both simulations capture the daily ozone cycle in the urban environment, although the amplitude of changes is much lower than observed. This could be linked to the constant temporal emission profile applied, since it does not account for diurnal or weekly changes in anthropogenic emissions, mainly from transport (e.g. morning and afternoon peaks in NOx emission). Another possible source of errors may be an inadequate chemistry scheme that underestimates the rates of O3 formation and destruction processes. Both of these explanations may be verified by changing the emission input data or applying a different chemical mechanism. Model errors are at a similar level to those in the study by Forkel et al. (2015); however, it should be noted that the study period here is shorter. For the rural station, the O3 concentration is underestimated for the entire period by both simulations, which may be explained by underestimated background concentrations (default values used with WRF-Chem).

Fig. 8 Mean O3 concentration for the episode of 30 June–4 July 2008 (SIM1, Purdue Lin)

Fig. 9 Mean O3 concentration for the episode of 30 June–4 July 2008 (SIM4, Morrison 2-Moment)

Fig. 10 Differences in mean O3 concentration between SIM4 (Morrison 2-Moment) and SIM1 (Purdue Lin)

Fig. 11 Temporal variability of modeled and measured O3 concentrations at the WRK, WRB, and SNI stations

4 Conclusions

Although categorical verification of cloud cover forecasts provides valuable information about model performance, it may falsely understate model skill when clouds are even slightly displaced. However, this type of verification can capture the model tendency to underestimate total cloud amount within the domain and enables the identification of possible sources of uncertainty. Objective verification methods may serve as a supplement to the categorical approach, as they provide additional information on the structure of model-measurement discrepancies. The objective approach shows directly whether the total cloudiness in the domain is over- or underestimated and to what extent, and it gives more detailed information on the size and location of modeled cloud patches compared to the observed ones. By analyzing objects (i.e. cloudy areas) instead of individual grid points, it also eliminates the double penalty problem, which becomes a large issue at the high spatial resolution of meteorological models; therefore, model performance is not underestimated, as it is with the categorical verification method.
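
The double penalty problem mentioned above can be illustrated with a toy example. This is a deliberately simplified sketch, not the verification procedure used in the study: it assumes a one-dimensional transect of binary cloud masks instead of a 2-D field, treats each contiguous run of cloudy cells as an object, and uses made-up masks in which the forecast cloud has the right size but is shifted by two grid cells.

```python
def contingency(fcst, obs):
    """Grid-point counts: (hits, misses, false alarms, correct negatives)."""
    pairs = list(zip(fcst, obs))
    hits = sum(1 for f, o in pairs if f and o)
    misses = sum(1 for f, o in pairs if not f and o)
    false_alarms = sum(1 for f, o in pairs if f and not o)
    correct_negatives = sum(1 for f, o in pairs if not f and not o)
    return hits, misses, false_alarms, correct_negatives


def cloud_objects(mask):
    """Return (start, length) for each contiguous run of cloudy cells."""
    runs, start = [], None
    for i, cloudy in enumerate(mask):
        if cloudy and start is None:
            start = i
        elif not cloudy and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(mask) - start))
    return runs


def centroid(run):
    """Centre position of a (start, length) run, in grid-cell units."""
    start, length = run
    return start + (length - 1) / 2


# Hypothetical masks: 1 = cloudy, 0 = clear
obs = [0, 0, 1, 1, 0, 0, 0, 0]
fcst = [0, 0, 0, 0, 1, 1, 0, 0]

print(contingency(fcst, obs))  # penalized twice: misses and false alarms
obs_obj, fcst_obj = cloud_objects(obs)[0], cloud_objects(fcst)[0]
print("size difference:", fcst_obj[1] - obs_obj[1])
print("centroid shift :", centroid(fcst_obj) - centroid(obs_obj))
```

At the grid-point level the displaced cloud scores no hits and is counted both as a miss and as a false alarm, whereas the object comparison reports a cloud of the correct size that is simply shifted, which is the distinction the paragraph above draws between the two verification approaches.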

Both methods agree that all WRF simulations underestimate the amount of cloud cover. This may have further consequences, e.g. the overestimation of summer air temperature by the WRF model, which was shown by Kryza et al. (2015, this issue) for Central and Eastern Europe. One important factor is that the satellite cloud mask is a pessimistic field, meaning that only a grid point that passed all cloud detection tests can be classified as cloud-free. Although these data are consistent with MODIS and point surface observations, they tend to show more rather than fewer clouds (Crocker and Mittermaier 2013). Another issue is the resolution of the data: the satellite cloud mask has a similar, but not identical, grid size to the model. Coarser resolution results in presenting a set of small cloud cells (e.g. Altocumulus floccus) as one wide patch, whereas the model resolves them differently. This may result in a false underestimation of cloud cover and of the average size of cloud cells, which may be the case here. Additionally, both methods agree that SIM4 provides the best cloud cover results and that SIM1 performs significantly worse. This applies to all analyzed cloud properties: SIM4 shows the smallest underestimation of cloud size and total cloud amount, and the best agreement in cloud location within the domain. The difference is not as significant for surface meteorological variables, as only one performance measure for wind speed responds to the change in microphysics parameterization. However, the change of microphysics scheme has a significant impact on WRF-Chem modeled ozone concentrations, particularly for high ozone conditions. This could be attributed to the fact that cloud cover is used as input for photolysis schemes. This matters for predicting exceedances of critical ozone levels and for the associated risk assessment.
Therefore, the Morrison Double-Moment microphysics parameterization will be used in further research regarding the modeling of ozone concentrations during summer episodes in Poland and Central Europe.