Complex networks for climate model evaluation with application to statistical versus dynamical modeling of South American climate

In this study we introduce two new node-weighted difference measures on complex networks as a tool for climate model evaluation. The approach facilitates the quantification of a model’s ability to reproduce the spatial covariability structure of climatological time series. We apply our methodology to compare the performance of a statistical and a dynamical regional climate model simulating the South American climate, as represented by the variables 2 m temperature, precipitation, sea level pressure, and geopotential height field at 500 hPa. For each variable, networks are constructed from the model outputs and evaluated against a reference network, derived from the ERA-Interim reanalysis, which also drives the models. We compare two network characteristics, the (linear) adjacency structure and the (nonlinear) clustering structure, and relate our findings to conventional methods of model evaluation. To set a benchmark, we construct different types of random networks and compare them alongside the climate model networks. Our main findings are: (1) The linear network structure is better reproduced by the statistical model statistical analogue resampling scheme (STARS) in summer and winter for all variables except the geopotential height field, where the dynamical model CCLM prevails. (2) For the nonlinear comparison, the seasonal differences are more pronounced and CCLM performs almost as well as STARS in summer (except for sea level pressure), while STARS performs better in winter for all variables.


Introduction
Almost a decade after the introduction of complex network methods into climate science (Tsonis and Roebber 2004), network-based model validation techniques are still few and far between. Climate networks associate geographic locations with nodes (also called vertices) of a mathematical object called network or graph (Newman 2009). The connections between nodes (called links or edges) represent similarities between climatological time series at those locations, derived mostly from reanalyses or remote sensing data. The mathematical field of complex network theory has thrived over the past decades (Strogatz 2001) and now offers a variety of methods to uncover different aspects of the topological structure of networks (Newman 2003;Cohen and Havlin 2008).
Applied to the climate system, these methods have already lead to several substantial new insights: From the identification of dynamical transitions (Tsonis et al. 2007) and teleconnections (Donges et al. 2009b), via the study of El Niño (Yamasaki et al. 2008;Gozolchiani et al. 2011) and monsoon systems (Malik et al. 2011;Boers et al. 2013), to actual predictive power (Ludescher et al. 2013). It has been demonstrated that data-based climate networks show remarkable versatility. The model-based branch of the framework is far less developed, although recent attempts by Steinhaeuser and Tsonis (2013) and Fountalis et al. (2013) show the growing interest in the use of network methods for climate model intercomparison and climate model analysis (van der Mheen et al. 2013).
From a theoretical point of view, evaluating the network structure of modeled climate data constitutes a promising extension to conventional evaluation techniques like the comparison of annual cycles or seasonal means. While the latter approaches investigate properties of time series at each geographical location individually, climate networks describe their covariability and thus represent an essentially different aspect of spatio-temporal climate variability.
The intention of this paper is to propose a new method to evaluate climate models by means of complex networks. For this purpose, we compare the structural similarity of climate networks obtained from models to those obtained from reanalysis data (the reference networks). The differences between them are considered proxies for the quality of the underlying modeling, here the simulation of the South American climate as performed by a dynamical (CCLM) and a statistical (STARS) regional climate model (RCM). This work is to be seen as both an introduction of the methodology, and as a case study on the feasibility of the application of CCLM and STARS to South American climate.
With our methodology, we aim at a direct comparison of the network structure, which has the advantage of including all available information about the complex networks under study. Any kind of preprocessing of the networks, e.g. a clustering of nodes, bears the inherent danger of adding spurious information or diminishing the complexity of the network structure, possibly stripping it of relevant features.
One of the applied network difference measures is a modification of the Hamming distance, which is rooted in information theory (Hamming 1950) and has found plenty of applications, often in combination with complex network analysis (Donges et al. 2009a;Zhou et al. 2006;Ciliberti et al. 2007). We also compare the clustering structures (Watts and Strogatz 1998) of the observed and modeled networks by computing the root-mean-square distance of their respective fields of local clustering coefficients in order to evaluate the recreation of nonlinear dependencies by the models.
It should be noted that, comparing the spatial statistical interdependency structure within climatological fields, our method is related to approaches based on empirical orthogonal functions and teleconnection patterns (Handorf and Dethloff 2012;Stoner et al. 2009), yet distinct due to the inclusion of information about nonlinear interrelations (Donges et al. 2013b).
In the next section, we outline the key features of the two particular RCMs under study and describe the simulation setup (Sect. 2). The methodology of network-based model evaluation is presented in Sect. 3 and its results are given in Sect. 4, where we compare the output of simulations of the regional climate of South America, followed by a discussion on the robustness of the method. Finally, we draw conclusions and give an outlook on possible further applications in Sect. 5.

Regional climate modeling
Simulating meteorological processes on the mesoscale and below, regional climate models bridge the gap between general circulation models (GCMs), which operate on a global scale at rather coarse horizontal resolution (*100 km), and climate impact models, which focus on specific processes or features in a confined region such as hydrology, agricultural production, forestry, etc. (Gutsch et al. 2011;Reyer et al. 2013). For climate projections, impact models are typically driven by RCMs, which in turn downscale GCM data (Stocker et al. 2013). Although, with ever growing computational power, the border between global and regional modeling might become blurry and eventually disappear, for the time being, RCMs are still frequently found in the impact modeling chain. To illustrate our evaluation method we examine RCM simulations. Since this work is about the evaluation procedure rather than climate projections, we only produce hindcasts driven by the ERA-Interim reanalysis data (Dee et al. 2011).

The statistical approach: STARS
The statistical analogue resampling scheme (STARS) was originally developed in order to provide climate realizations for impact models (Werner and Gerstengarbe 1997), but has since been successfully applied for regional climate projections (Orlowsky and Fraedrich 2008;Orlowsky et al. , 2010Lutz et al. 2013). The general idea is to stochastically resample meteorological data according to a given trend of some meteorological variable. Typically, temperature is chosen as the trend variable since this is the natural choice in the context of global warming and since trends of other variables like precipitation are often less robust.
A sketch of the model's workflow is shown in Fig. 1. At first, as a form of biased bootstrapping, observations (or, in this case, reanalysis data) are resampled as entire years in order to approximately match a prescribed trend line. The match is further improved by replacing blocks of 12 days in an iterative process. The resulting date-to-date mapping is then applied to all variables and all locations. Thereby, the simulation output is guaranteed to be physically consistent, within the limits of the input data's consistency.
STARS can only produce output which has been observed rather than create entirely new situations like a dynamical model. For example, no new extreme values can be simulated. It should also be noted that uncertainties in the input data will propagate into the simulation. Apart from the quality, the availability of data is also a key constraint on the applicability of STARS. As a rule of thumb, the length of the simulation period should not be longer than the observation period in order to prevent unnaturally low variability in the model output. Thus, since the ERA-Interim dataset starts in 1979, we used the input data from 1979 to 1995 to simulate the period from 1996 to 2011, prescribing the temperature trend of the reanalysis during the simulation period.
Due to the statistical nature and low computational demands of STARS, it is possible to obtain ensembles of climate realizations in very short time, which makes this model ideal for studying a whole range of possible scenarios. For this study, we generated an ensemble of 200 realizations using STARS version 2.4.

The dynamical approach: CCLM
The COSMO-CLM (CCLM, ) is the climate version of the COSMO-Model (Baldauf et al. Complex networks for climate model evaluation 1569 2011), which is the operational numerical weather prediction model of the German Weather Service and other members of the COSMO consortium. The development of CCLM is steered by the CLM Community which has more than 50 member institutions from Europe, Asia, Africa and America. The model has been extensively applied to European domains (e.g. Jaeger et al. 2008;Zahn and Storch 2008;Hohenegger et al. 2009;Davin and Seneviratne 2011) but also to the Indian subcontinent (Dobler and Ahrens 2010), to CORDEX-East-Asia (Fischer et al. 2013), and to CORDEX-Africa (Nikulin et al. 2012). One of the very first applications was to South America (Böhm et al. 2003) but it has been run there rarely afterwards (Rockel and Geyer 2008;Wagner et al. 2011). CCLM is dynamical in the sense that it solves thermohydrodynamical equations describing the atmospheric circulation. The equations are discretized on a three-dimensional grid based on a rotated geographical coordinate system. In this study the CCLM version 4.25.3 was used. Deviating from its default configuration, the model was run with 40 vertical levels, reaching up to 30 km above sea level and a Rayleigh damping height of 18 km, as has been suggested for tropical regions (Panitz et al. 2013). We set the bottom of the deepest hydrologically active soil layer to 8 m, since rain forest roots go down to such depths (Baker et al. 2008). The numerical integration was performed with a total variation diminishing Runge-Kutta scheme (Liu et al. 1994) and a Bott advection scheme (Bott 1989), since both are supposedly more accurate than their default alternatives. We employ an implementation of the EC-MWF IFS Cy33r1 convection scheme (Bechtold et al. 2008) and diagnose subgrid-scale clouds by a normalized saturation deficit criterion (Sommeria and Deardorff 1977). Additionally, a few tuning parameters were adjusted during preceding sensitivity experiments. Particularly, changing the convective parametrization and the subgrid-scale cloud scheme led to major improvements of the model performance over South America. These findings are presented in detail in a separate paper (Lange et al. 2014).
We run the model on the CORDEX-South-America domain (Giorgi et al. 2009) as displayed in Fig. 2. This implies a horizontal resolution 0.44°and 166 9 187 grid points including a 10 grid points wide sponge frame. The simulation covers the years 1979-2011 where the first 17 years serve as spinup time, since the STARS output is only available from 1996.

Common domain of evaluation
In order to construct climate networks of the same spatial embedding, both model outputs were to match in resolution and geographical boundaries. We chose a section of the native ERA-Interim grid, encompassing the South American mainland (Fig. 2): 82.3°W-33.8°W and 13.7°N-55.8°S. The resolution is approximately 0.7°in both latitude / and longitude k. This makes for a bounding box of N ¼ N / Â N k ¼ 100 Â 70 ¼ 7;000 grid cells, which will be represented by nodes in the subsequently constructed climate networks. Since this is a regular Gaussian grid of considerable latitudinal extent, grid cells at different latitudes represent differently sized areas (about 78 km Â 78 km ¼ 6;084 km 2 at the equator and 44 km Â 78 km ¼ 3;432 km 2 at the southern boundary)-an effect we take into account by introducing area-proportional node weights (cf. Sect. 3.3).
Since STARS only resamples the input data, its output is already on the native ERA-Interim grid. CCLM output was remapped, conservatively (Jones 1999) in case of precipitation and bilinearly otherwise.

Methodology
After some introductory definitions from complex network theory (Newman 2003;Boccaletti et al. 2006), we move on to present the two essential parts of our methodology: The construction of spatially embedded networks and their comparison.

Basic definitions
A network (Newman 2009) or graph G ¼ ðN; EÞ consists of a node set N ¼ f1; . . .; Ng of nodes or vertices, potentially pairwise connected by links or edges, constituting an edge set E ffi; jg : i 6 ¼ j 2 Ng. Connected nodes are called neighbors. The full information about G is contained in its binary adjacency matrix ða ij Þ i;j2N , with a ij ¼ 1, if the nodes i and j are connected and a ij ¼ 0, otherwise.
These definitions imply that, within the scope of this article, we only work with undirected, unweighted, simple graphs (i.e. no directed links, no link weights, and no selfloops). We do, however, apply node weights w i to enable the use of area-weighted measures (Sect. 3.3).
The number of connections of any node i is called the degree k i of node i, with It is, in other words, the number of neighbors of that particular node, or the cardinality of its neighborhood N i ¼ fj 2 N : a ij ¼ 1g, the set of nodes, connected to i. A measure of higher order is the (local) clustering coefficient c i , which estimates the likelihood of any two neighbors of node i also being connected (Watts and Strogatz 1998): Apart from k i and c i , there are many more measures describing different aspects of network topology, but in this study we are going to restrict our analyses to the aforementioned ones.

Network construction
We demonstrate two methods of constructing networks: data-based climate networks from the model outputs and reanalysis data, and three types of random-based surrogate networks which inherit different features from the reference network in order to function as null models. The latter shall provide a useful benchmark for the results by imitating climate models which only reproduce certain features of the reference network structure.

Climate networks: model output and reanalysis data
The data-based climate networks are constructed from three sources: CCLM model output, STARS model output, and ERA-Interim reanalysis data, in each case for the daily means of four different variables: 2 m temperature (T2M), total precipitation (PREC), geopotential height at 500 hPa (GEO500), and sea level pressure (SLP). The temperature and precipitation variables represent major features of the climate system with high impact on biosphere and human society, as well as great importance for impact modeling. The sea level pressure and geopotential variables are representatives of the circulation system on the surface and in higher altitudes, respectively, and their faithful reproduction is equally essential for accurate climate modeling. In all cases the general procedure of network construction is the same, following (Donges et al. 2009a, b): 1. Choose a time frame, such as austral summer (DJF), austral winter (JJA), or any other interesting time span within the evaluation period, to get N time series, one for each grid cell. 2. For each time series apply a moving average filter with a sliding window of length l. The default value used here was l ¼ 7 days. 3. Produce climatological anomalies from each time series, i.e. remove the seasonality such that the resulting time series are approximately stationary in mean and variance. 4. Calculate similarities for all pairs of time series using, e.g. Pearson correlation or rank correlation, to obtain an ðN Â NÞ-similarity matrix. 5. In the similarity matrix, set those values to 1 which are greater than a chosen threshold and set those below to 0. 6. Use the thresholded similarity matrix as adjacency matrix of the climate network. Set the main diagonal to zero to exclude self-loops. 7. Finally, assign to each node i a node weight w i proportional to the geographic area it represents, i.e. w i / cosð/ i Þ (Heitzig et al. 2012;Wiedermann et al. 2013).
Steps 2-4 in the construction procedure need some special attention. To obtain the similarity matrix, we calculate all correlations at lag zero. Usually, weather phenomena distribute over a larger area in a matter of several days, so a time lag might be considered. We instead chose to apply a moving average to the daily values to account for this fact, avoiding the application of a fixed time lag. Unless stated otherwise, we average across of l ¼ 7 days and discuss the sensitivity of our results with respect to l in Sect. 4.4. Other similarity measures, such as event synchronization (Malik  Boers et al. 2013) also allow for dynamical time lags but are not subject of this work. In order to conduct step 3 we need an approximation of the seasonal cycle in daily resolution, which we construct by calculating the long-term mean for each day and smoothing the resulting time series by a Gaussian filter to account for the rather short evaluation time of 16 years.
For the unbound variables T2M, SLP, and GEO500 we obtained approximately Gaussian-distributed anomaly time series by subtracting the seasonal cycle from the smoothed daily values. Approximate homoscedasticity is assumed for these variables given that our analysis will be constrained to seasonal time series. The similarities are estimated by Pearson correlation.
Other variables, in our case PREC, have a natural lower bound of zero, and thus their probability distribution is more complicated, clearly non-Gaussian (frequently modeled as a Weibull-, Gamma-, or mixed-exponential distribution, (Li et al. 2012). In this case, we apply Spearman's rank correlation, which also works for non-Gaussian variables, to estimate the similarities. Also, to compute the anomaly values and approach homoscedasticity, we divide the smoothed daily values by the seasonal cycle instead of subtracting it. This yields an expectation value of 1 mm/ day at all times and leaves zeros at zero. Simply subtracting the daily means would transform zeros into values which rise as the climatology falls. To avoid dividing by zero we define a minimal value of 0.1 mm/day which we divide by whenever it is underrun. We chose this value since it is usually referred to as the minimally measurable daily rain amount. In our data, this case actually occurs only in northern Chile (Atacama desert) and the adjacent part of the Pacific ocean.
The threshold for the similarity matrix (step 5) was applied adaptively such that a desired link density q in the resulting networks was achieved. Unless stated otherwise, we used q ¼ 0:01, meaning that we included only the 1 % strongest correlations in our analysis, which is considered an effectual trade-off between structural richness and statistical significance (Donges et al. 2009a).
It should be noted that climate networks are often constructed by thresholding the matrix of the absolute values of correlation coefficients (Tsonis and Roebber 2004;Donges et al. 2009a, b). In the context of network comparison however, this could lead to the problematic situation, in which, for two networks with adjacency matrices ða A ij Þ and ða B ij Þ, a A ij ¼ 1 is due to a positive correlation while a B ij ¼ 1 is due to a negative correlation. Hence, although the relation of i to j is of wholly different nature in the two networks, a comparison of them would yield agreement. In order to prevent this case we focus on positive correlations here.
Using the above recipe, we construct climate networks from each of the 200 STARS realizations, as well as one from the CCLM simulation, and one from the ERA-Interim data. The latter network is assumed to be a close approximation of the real-world network structure and will thus be used as the reference network for all others to be compared to (Fig. 3).

Surrogate networks: bootstraps and random models
The large ensemble of networks from STARS output allows for assumptions being made on variations in the quality of the statistical modeling. Due to the considerable computational demand of the dynamical modeling, we have only one CCLM simulation available. In order to still be able to estimate the uncertainty of the dynamical modeling, a technique from the bootstrapping family of methods (Efron 1979), also known as case resampling, is applied: We bootstrap the CCLM output by randomly drawing entire seasons with replacement, such that the length of the bootstrapped time series equals the original, and apply this reordering synchronously to the whole output time series field to preserve spatial patterns and correlations. Repeating this procedure 200 times, we get a set of realizations, each of which we create a climate network from.
The same technique is applied to the reanalysis dataset, yielding 200 surrogate networks, which are supposed to closely resemble the reference network and thus form an upper bound for the performance of the climate models.
To also create lower bounds and thus add a sense of scale to our comparison, we further extend our comparison to include three different kinds of random models as surrogates to the reference network. Each type of random model demonstrates what performance we could expect if our climate models would only reproduce a specific feature of the reference network: • Erd} os-Rényi model (ER) This most basic type of random network (Erdös and Rényi 1959) only conserves the total number of links (or equivalently, the link density q) of the reference network. Its edges are rewired completely at random. This can be seen as a worst-case model.

• Configuration model (CM)
Here the degree k i of each node i is the same as in the reference network, while its neighborhood N i is randomized (Newman 2003). While more sophisticated than the ER approach, this model should still perform worse than the RCMs. • Spatially embedded random network model (SERN) This model was introduced to estimate the effects of spatial embedding on connectivity in random networks (Barnett et al. 2007). It was also recently used to study the effect of boundaries on measures in regional climate networks (Rheinwalt et al. 2012). The algorithm constructs random networks with approximately the same distribution of geographic link lengths as the reference network and the same link density.
For each random network type we generated ensembles of 200 realizations. We now have six sets of networks: Two from climate models (including 200 bootstraps from the CCLM simulation), three from random network models, and one from reanalysis bootstraps, all sharing the same spatial embedding, link density and node weights. By assessing their similarity to the reference network we can estimate the quality of the underlying modeling processes from the climate network perspective (Fig. 3).

Comparison of spatially embedded networks
One way to quantify the dissimilarity of graphs is the Hamming distance (Hamming 1950). For two unweighted simple graphs A and B with adjacency matrices ða A ij Þ and ða B ij Þ and a common set of N nodes, its normalized form is the fraction of edges that have to be changed in one graph in order to convert it into the other (Donges et al. 2009a): More generally defined as a distance measure on the set of binary strings of length L (here L ¼ N 2 ), the Hamming distance is a metric (Hamming 1950).
To be able to compare spatially embedded networks of nodes representing differently sized grid cells, we apply the framework proposed by Heitzig et al. (2012), called node splitting invariance (n. s. i.). For example, while the degree k i only counts the number of nodes connected to i, by summing up the area-proportional weights of the connected nodes, we can construct a measure which accounts for the area connected to i: the extended neighborhood of i. Likewise, there is an n. s. i. version of the clustering coefficient: Here we have to sum over a þ jk , the entries of the extended adjacency matrix with a þ i6 ¼j ¼ a ij ; a þ ii ¼ 1. For details on why to use N þ and ða þ ij Þ to abide the concept of nodesplitting invariance, please refer to Heitzig et al. (2012). The n. s. i. clustering coefficient takes into account the area, represented by each neighbor of node i, and weights their contribution to the clustering coefficient accordingly.
To estimate how well the clustering fields of two given networks A and B match (in our case reference and model network), we determine their differences via the nodeweighted root-mean-square error (RMSE) of clustering coefficients where W ¼ P N i¼1 w i is the total node weight. This measure is just one example of how to compare the nonlinear properties of two networks. Analogously, one could compare other measures of higher order, such as betweenness or closeness. As a linear measure of mutual differences in the neighborhood structure, we introduce the node-weighted Hamming distance With C Ã and H Ã we now have two difference measures for spatially embedded networks, which account for differently weighted nodes, as in the case of nodes representing geographic areas of varying size.

Conventional area-weighted difference measures
To complete our methodology and relate to previous works of climate model evaluation, we apply area-weighted versions of two standard difference measures, the root-meansquare error E of the climatological mean field l of a variable and the logarithmic root-mean-square factor F (Golding 1998) of the respective standard deviation field r. We define the area-weighted versions of E and F as and Complex networks for climate model evaluation 1573 where A and B denote two distinct datasets to be compared. Altogether, the measures C Ã , H Ã , E Ã and F Ã are comparable in that they are equal to zero in case of perfect agreement and grow with disagreement. Yet they are complementary in that they are based on distinct features of the underlying time series, namely the mean and variance fields in case of E Ã and F Ã , and the correlation matrix in case of C Ã and H Ã .
4 Application: regional climate modeling over South America The climate of South America (SA) is very diverse and includes the humid Amazon rain forest in the centralnorthern lowlands, which is contrasted by semi-arid regions like the Sertão in northeastern Brazil, deserts like La Guajira in northern Columbia or the Atacama in northern Chile, and permanent ice fields in the south of Patagonia. The Amazon rain forest is also a hotspot of biodiversity and a major carbon sink, whose role as one of the key tipping elements in the global climate system has been well established (Lenton et al. 2008;Boulton et al. 2013). During austral summer (DJF), the South American Monsoon System is responsible for extensive moisture transport from the southward-shifted Intertropical Convergence Zone (ITCZ) via the Amazon basin and further south towards the extratropics (Vera et al. 2006). The channeling of the easterly trade winds by the Andes and the Brazilian Highlands, also called the South American Low-Level Jet (SALLJ, Marengo and Soares 2004), leads to high precipitation from the Altiplano Plateau to the La Plata basin or, via the South Atlantic Convergence Zone, in southeast SA and the adjacent South Atlantic (Carvalho et al. 2004).
In austral winter (JJA), the ITCZ is shifted northwards. The moisture transport via SALLJ is considerably lower and more moisture from the Atlantic is fed into the then active North American Monsoon System. A major influence on the weather in the southern part of SA is the formation and movement of extratropical cyclones, which during winter tend to be more frequent and of higher complexity (Mendes et al. 2009).
Another prominent influence on the South American climate is the El Niñ o phenomenon (Trenberth 1997), the appearance of a band of unusually warm ocean water in the East Pacific, off the coast of Peru. El Niño or, in a wider context, the El Niño Southern Oscillation (ENSO), occurs highly erratically on an interannual scale and, depending on its intensity, can impose drastic effects on the SA climate and the ecosystem in general, e.g. by disturbing oceanic food chains, which are sensitive to alterations in the water temperature (Stenseth et al. 2002). Reliable mechanisms for the predictability of ENSO are still highly sought-after in contemporary climate and ocean research (Schneider et al. 2003;Ludescher et al. 2013), and assessing the impacts of climate change on ENSO remains challenging, especially in the long run (Stevenson et al. 2012).
There have been multiple attempts on modeling the regional climate of SA, recently in a coordinated study following the CORDEX conventions (Solman et al. 2013) or with stronger focus on climate change impacts on Amazonia (Cook et al. 2012). Previous works include studies of the long term effects of climate change (Marengo et al. , 2011 and coupled RCM-vegetation modeling (Cook and Vizy 2008). While the agreement of the applied models is often quite high, all of the above studies focus on reproducing seasonal cycles or mean conditions over extended periods, along with variance analyses and observations on the frequency distributions of extremes. Fig. 4 Estimated probability density functions of the node-weighted Hamming distances between the model networks and the reference network. Time series from T2M in austral summer (DJF). The networks made from resampled reanalysis data (black) bear the closest resemblance to the reference network. STARS (blue) performs better than CCLM (red dashed line). The networks from CCLM bootstraps (red shaded area) give an impression of the uncertainty of the CCLM modeling, sometimes giving better and sometimes worse results than the actual run. The random models (SERN: yellow, configuration model: green, Erd} os-Rényi: pink) perform worse than the RCMs Being based on comparing the network structure, and, hence, spatial correlation patterns of model output and reanalysis data, our methodology is complementary to this established agenda.

Linear network comparison: H Ã
The first step in our analysis concerns the reproduction of the adjacency structure of the reference network by the RCMs and the random models. As an introduction we discuss the probability density functions (PDFs) of the node-weighted Hamming distance H Ã (Eq. 8; Fig. 4), derived from networks of T2M time series in austral summer.
As expected, the bootstrapped ERA-Interim data yields networks with the closest resemblance to the reference network and thus the smallest Hamming distance while the random models produce networks with less resemblance, SERN performing best, followed by the configuration model, and Erd} os-Rényi being worst. The latter is no surprise due to the complete randomness of the ER model, the expectation value for the unweighted Hamming distance being 2qð1 À qÞ ¼ 0:0198 (Donges et al. 2009a). The slightly better performance of the configuration model is rooted in the model's conservation of the degree distribution, giving each node the same degree as the corresponding node in the reference network, thus enhancing the probability of successfully reproducing links. SERN performs much better because it conserves an important link property that is shared by all networks considered here: The probability of finding a link between two nodes is the greater the closer they are to each other geographically. In comparison to ER and CM, this strongly reduces the randomness of link positioning and renders the SERN networks much more similar to the reference network. All random network models produce very narrow distributions due to the rather low link density of 1 % and the high number of 7,000 nodes (law of large numbers). For a better visualization, the cusps of these distributions are cut off in Fig. 4.
Additionally, we find that of the RCMs, STARS outperforms CCLM. There are remarkable differences in the shape of the distributions, those from bootstraps (ERA and CCLM) being wider than the distribution of the STARS ensemble. This can possibly be attributed to the fact that, due to the operating mode of STARS (cf. Sect. 2.1 or , many constraints are imposed on the selection of blocks of consecutive days during the resampling of temperature time series, thus lowering the variability between realizations and resulting in a narrower For the other variables (Fig. 5), there is no such clear difference in variability, presumably because these are only indirectly affected by the resampling procedure, via their climatological interrelation to temperature. We have left out the Hamming distance distributions of the random networks here, because their position hardly differs between variables.
Comparing the overall picture, we observe that Hamming distances are greatest for PREC and least for GEO500 across datasets and seasons. This indicates that, out of those variables considered in this study, the dynamics of precipitation are hardest and those of the 500 hPa geopotential are easiest to model. The special position of GEO500 comes as no surprise as it is the only upper level variable, i.e. undisturbed by orographic or other ground-based influences, and since its dynamics have been found relatively easy to model before (Steinhaeuser and Tsonis 2013). Also the outstanding complexity of precipitation dynamics is well-known (Huff and Shipp 1969;Matsoukas et al. 2000;Peters et al. 2001).
Comparing the individual distributions, we find that the resampling-based model STARS has a rather constant relative distance to the ERA bootstraps (the practical upper bound to model accuracy), which simply reflects the model's functional principle. In contrast, the performance of the dynamical model CCLM varies strongly between variables. Compared to STARS, it generates T2M, PREC, and SLP networks which are less similar and a GEO500 network which is more similar to the respective reanalysis reference. This reflects that model physics differences have a larger impact at the surface than at upper levels where in turn dynamical simulations bear greater similarity to their boundary forcings.
Performance differences between seasons are smaller than those between models. In austral winter (JJA, Fig. 5, right), the modeling of GEO500 by CCLM is slightly less accurate than in summer (still only 8 % of the STARS realizations perform better than the CCLM run). This might be attributed to a higher complexity of the extratropical cyclogenesis during winter (Mendes et al. 2009) and its relatively greater influence on the South American climate due to the JJA northward displacement of general circulation patterns.

Nonlinear network comparison: C Ã
The comparison of the higher-order structure of the networks, here represented by C Ã (Eq. 7; Fig. 6), confirms in Fig. 6 Node-weighted clustering RMSE of networks on modeled and bootstrapped data with respect to the reference network. All variables in austral summer (DJF, left) and winter (JJA, right), coloring as in Fig. 5. For the higher-order comparison, the seasonal differences are more pronounced and CCLM performs comparably to STARS in austral summer (except SLP), while STARS performs better in austral winter for all variables. ERA-bootstraps are always superior to both RCMs. The random network models perform worse and are omitted here parts the results of the linear comparison: The ERA bootstraps score best, followed by the RCMs and the random models. The latter are omitted in the figures, their ranking being the same as for the linear comparison, with the random networks' distributions lying even farther out than in Fig. 4. Apparently, reproducing the clustering structure is more challenging for a random model than reproducing the adjacency structure.
Another qualitative similarity is the extensive dominance of the statistical model, most notably in austral winter, where all of the STARS simulations are closer to the reference than the CCLM run. For the summer months, this dominance is equally pronounced only for SLP, but less pronounced otherwise. In case of T2M, PREC, and GEO500, 80, 49 and 90 % of the STARS ensemble perform better than CCLM, respectively.
Moreover, the clustering RMSE of the CCLM single run is often separated from the bootstrap distribution, the latter scoring worse, most prominently for T2M and PREC. Although visible in the linear difference measure H Ã as well (Fig. 5), this feature is more pronounced in the nonlinear case (Fig. 6), which implies that the higher-order network structure is more easily disturbed by the bootstrapping procedure than the adjacency structure, as measured by the Hamming distance. The complementarity of network-based and conventional difference measures is reflected by the models' ability to simulate SLP and GEO500 in DJF. One of the models may perform better according to both scores (STARS in panel M, CCLM in panel J) or prevail according to H Ã but not according to E Ã or F Ã , respectively (CCLM in panel I, STARS in panel N).
Finally, we observe a consistent difference in the rankings according to the mean-and the variance-based measure. While E Ã favors STARS in all cases but one (panel K), the result for F Ã is less clear with CCLM even outperforming STARS in both pressure variables. These findings are in line with the statistical model's presumably too low variability of H Ã values for T2M as discussed in Sect. 4.1.

Sensitivity to network construction parameters
We have presented extensive results for only one set of network construction parameters as stated in Sect. 3.2.1, namely l ¼ 7 days for the length of the sliding window and q ¼ 1 % for the link density of all networks involved. However, calculations were carried out for a wider range of these parameters. We found that varying l 2 ½3; 11 and q 2 ½0:5 %; 2 % did not alter the results qualitatively, which confirms the robustness of the demonstrated methodology.

Conclusions
In this study we have introduced a novel approach to climate model evaluation based on complex networks. To this end, we have defined two node-weighted difference measures H Ã and C Ã , which compare adjacency matrices and clustering coefficient fields, respectively. We applied our methodology to evaluate the performance of a statistical vs. a dynamical RCM in simulating the climate of South America.
We have evaluated daily means of 2 m temperature, precipitation, sea level pressure, and geopotential height at 500 hPa, comparing the respective model outputs to our ground truth, the forcing ERA-Interim data. For each variable, climate networks have been constructed based on cross-correlations of the time series, compared using H Ã and C Ã , and the findings have been related to the classic mean-and variance-based difference measures E Ã and F Ã .
For the linear network comparison (H Ã ), we have found that the statistical model STARS is better at reproducing the network structure of the temperature, precipitation, and pressure time series, while the dynamical model CCLM performed better for the geopotential. In the higher-order comparison (C Ã ), STARS is superior for all variables in austral winter (JJA), while CCLM scores almost comparably in DJF, except for sea level pressure.
While in most cases the conventional difference measures have been in agreement with H Ã , there were also cases in which the network structure was better reproduced by a model which was less favored by a conventional measure or vice versa, most notably in the pressure and geopotential variables. Although the construction of climate networks, representing statistical associations within climatological fields, takes more effort than applying rather simple measures like E Ã and F Ã , these complementary findings demonstrate the novelty and justification for our approach.
The finding of STARS being superior to CCLM for surface variables, not only according to traditional but also to the new network-based difference measures, demonstrates the physical consistency of the statistical model. However, it should be noted that the outcome of this study does not imply a general superiority of statistical to dynamical climate modeling. If, for instance, the reference networks had not been constructed from the driving ERA-Interim reanalysis but from independent observational data, the model ranking might have been different. A study on this matter is underway.
This work was meant to highlight the potential of the methodology. To demonstrate the robustness of our results concerning the comparison of RCM performances, an inclusion of further statistical and dynamical RCMs as well as applications to other regions are required. Such a study could be carried out e.g. in the CORDEX framework. Also the focus on ensemble runs with varying model parameters could be interesting, possibly revealing the influence of specific model components on the network structure of the output and in turn allowing for model improvements.
Further development of the methodology will include the application of additional difference measures, e.g. other metrics on adjacency matrices and the comparison of other network measures like betweenness or closeness to reveal more features of the higher-order network structure differences. Finally, local versions of network difference measures shall be developed in order to add spatial detail to the comparison.