Performance evaluation of global hydrological models in six large Pan-Arctic watersheds

Global Water Models (GWMs), which include Global Hydrological, Land Surface, and Dynamic Global Vegetation Models, present valuable tools for quantifying climate change impacts on hydrological processes in the data scarce high latitudes. Here we performed a systematic model performance evaluation in six major Pan-Arctic watersheds for different hydrological indicators (monthly and seasonal discharge, extremes, trends (or lack of), and snow water equivalent (SWE)) via a novel Aggregated Performance Index (API) that is based on commonly used statistical evaluation metrics. The machine learning Boruta feature selection algorithm was used to evaluate the explanatory power of the API attributes. Our results show that the majority of the nine GWMs included in the study exhibit considerable difficulties in realistically representing Pan-Arctic hydrological processes. Average APIdischarge (monthly and seasonal discharge) over nine GWMs is > 50% only in the Kolyma basin (55%), as low as 30% in the Yukon basin and averaged over all watersheds APIdischarge is 43%. WATERGAP2 and MATSIRO present the highest (APIdischarge > 55%) while ORCHIDEE and JULES-W1 the lowest (APIdischarge ≤ 25%) performing GWMs over all watersheds. For the high and low flows, average APIextreme is 35% and 26%, respectively, and over six GWMs APISWE is 57%. The Boruta algorithm suggests that using different observation-based climate data sets does not influence the total score of the APIs in all watersheds. Ultimately, only satisfactory to good performing GWMs that effectively represent cold-region hydrological processes (including snow-related processes, permafrost) should be included in multi-model climate change impact assessments in Pan-Arctic watersheds.


Introduction
The rapid environmental changes occurring in the Pan-Arctic have triggered increased attention from the scientific community. Such changes include observed decreasing extent and duration of snow cover (Pulliainen et al. 2020), permafrost thaw (Biskaborn et al. 2019), and related changes in soil active layer depth (Walvoord and Kurylyk 2016), increased melting rates of glaciers (Zemp et al. 2019), and changing partitioning of surface and groundwater (Walvoord and Striegl 2007), all of which affect the hydrological processes in Pan-Arctic watersheds. In addition, increasing discharge and subsequent freshwater transport to the Arctic Ocean have been documented (Ahmed et al. 2020), which impact bio-geophysical processes such as sea ice growth (Morison et al. 2012) and ocean circulation (Holliday et al. 2020). The observed changes, and more importantly their rate of change, have the potential for strong feedbacks to terrestrial ecosystems, the global climate system (McGuire et al. 2018;Post et al. 2019), and global freshwater circulation (Bring et al. 2016). Despite the increased scientific attention, our current understanding of the hydrologic cycle in the high latitudes and its linkages to other parts of the earth system still remains limited.
Pan-Arctic hydrological processes are largely controlled by the presence of permafrost, the strong climate seasonality, and the wide fluctuations in surface energy balance (Ge 2013). Annual peak discharge generally occurs following snowmelt, which presents the major hydrological event in Pan-Arctic watersheds, and is often associated with large-scale flooding (Bowling et al. 2003). Most of the snowmelt becomes overland flow as the ground is still frozen constraining infiltration. Hydrological processes in the Pan-Arctic are highly susceptible to climate change, particularly due to the freezing point threshold. To increase our understanding of Pan-Arctic hydrological processes, Global Water Models (GWMs), here including Global Hydrological Models (GHMs), Land Surface Models (LSMs), and Dynamic Global Vegetation model (DGVMs), could provide valuable tools for obtaining estimates of hydrological variables where data availability is poor both spatially and temporally. GWMs simulate the entire water cycle and make use of globally available datasets. Thereby, GWMs can complement the sparse observation records and support climate change impact assessments. A thorough performance evaluation is essential prior to applying models for climate change impact assessments in this region.
Previous model evaluation studies focusing on the Pan-Arctic differ from ours in terms of (i) the number and type of GWMs included, (ii) the spatial area/watersheds covered, (iii) the hydrological indicator(s) analyzed, and (iv) evaluation methods. Slater et al. (2007), for example, evaluated the performance of five LSMs for the period 1980-2001 across the Pan-Arctic drainage system including the Lena, Yenisei, Ob, and Mackenzie watersheds. Their results show that large differences in model performance exist across LSMs in terms of snow hydrological processes, water balance partitioning, discharge seasonality, and baseflow. Similarly, Andresen et al. (2019) found that LSMs tend to agree on decadal discharge trends but underestimate discharge volume when compared to gauge data across the major Arctic watersheds. Zaherpour et al. (2018) highlight the difficulty of GWMs in capturing the timing of the seasonal discharge cycle in northern regions effectively. In a multi-model evaluation study of daily runoff estimates, Beck et al. (2017) found that uncalibrated GWMs outperform, on average, uncalibrated LSMs in snow-dominated regions.
Global to continental scale multi-model climate change impact assessments are generally performed with GWMs disregarding model performance under historical conditions (e.g., Gosling et al. (2017)). A central tendency of the multi-model ensemble (mean or median) is often assumed as a good predictor due to large variations in performance of individual models and in their projections. Zaherpour et al. (2018) used a novel integrated evaluation method to show, however, that the ensemble mean fails to outperform best individual models for different hydrological indicators that represent mean and extreme discharge conditions. Therefore, using the ensemble mean and not carrying out a thorough model performance evaluation is not recommended. Krysanova et al. (2018) proposed guidelines consisting of 5 steps for effective evaluation of GHMs to be used prior to climate change impact assessments. Such a thorough model evaluation may suggest applying weighting coefficients to individual models in order to constrain the ensemble to the best performing members instead of using the ensemble mean approach (see Krysanova et al. (2020)). Thereby, confidence in projected impacts under climate change may potentially be increased.
The objective of our study is to contribute to the understanding of how GWMs, LSMs, and a DGVM perform in Pan-Arctic watersheds for different hydrological indicators, including monthly and seasonal discharge, extremes, trends (or lack of), and snow water equivalent (SWE), evaluated via a novel "Aggregated model Performance Index" (API). To reach this objective, we, firstly, systematically evaluated the performance of five global GHMs, three LSMs, and one DGVM using commonly used statistical evaluation metrics for six large watersheds in the Pan-Arctic based on the guidelines for GHM evaluation by Krysanova et al. (2018). After that, we assigned rating scores to each hydrological indicator based on thresholds defined for the statistical evaluation metrics. We calculated three APIs in total: API discharge , API extreme , and API SWE . The API combines the rating scores for every hydrological indicator in one index. We also applied the machine learning feature selection algorithm Boruta to evaluate the explanatory power of the API attributes (climate forcing, GWM, hydrological indicators, etc.). Our approach is easily interpretable and transferable to other model evaluations and inter-comparisons, and has a potential to deliver more robust multimodel climate change impact assessments.

Overview of study basins
The six largest watersheds located in the Pan-Arctic serve as a study area for the multi-model GWM performance evaluation: Kolyma, Lena, Yenisei, Ob, Mackenzie, und Yukon ( Fig. 1, Table 1). Watershed sizes range between 526,000 and 2,950,000 km 2 . The combined discharge from these watersheds is the single largest freshwater source to the Arctic Ocean (Yukon via the Bering Strait). Permafrost covers large parts of the studied watersheds (Fig. 1). Total permafrost coverage, which includes proportions of continuous, discontinuous, sporadic, and/or isolated permafrost, ranges between 34 (Ob) and 100% (Kolyma and Lena). Continuous permafrost covers only 3% in the Ob but the entire (100%) Kolyma watershed (Table 1, based on Brown et al. (1997)). In the northern, continuous permafrost zone, tundra vegetation dominates, while boreal forests are characteristic for the southern, mostly discontinuous permafrost zone. The climate ranges from polar in the high latitudes to subpolar and continental towards the lower latitudes. Arctic rivers are generally ice-covered for longer than 6 months of the year. Snow covers the Arctic landscapes for most of the year (e.g., 8 months in Arctic Alaska (end of September to May)) and contains a considerable amount of the total annual precipitation at the end-of-winter (Kane et al. 1991). Consequently, snow hydrological processes play an important role in the Arctic hydrological cycle. Population density is low in the study area (Kummu and Varis 2011).

Models and data
Model evaluation was based on measured discharge at 18 gaging stations (two to four gauging stations in each watershed: Fig. 1, Table 1) and estimates of SWE. Discharge measurements were retrieved from "The Global Streamflow Indices and Metadata Archive" (GSIM) Gudmundsson et al. 2018). Additionally, daily discharge data, used for the extreme discharge analysis, was provided by GRDC (Global Runoff Data Centre, 56068 Koblenz, Germany) at the outlet stations (highlighted in italics in Table 1). Estimates of total monthly Fig. 1 Overview of study area including watershed outlines, gauges, and permafrost extent and type (Brown et al. 1997). Watersheds (number) and gauging stations (letters) are detailed in Table 1 SWE were obtained from the remote sensing product GlobSnow-2 (Metsämäki et al. 2015) for the period 1980-2000. The SWE estimates were produced using a combination of passive microwave radiometer and ground-based weather station data.
Model performance is evaluated for nine GWMs (4 GHMs, 4 LSMs, 1 DGVM) that participated in the global water sector of ISIMIP2a (Gosling et al. 2019): the GHMs WaterGAP2, H08, MPI-HM, PCR-GLOBWB, the DGVM LPJmL, and the LSMs DBH, JULES-W1, MATSIRO, and ORCHIDEE (here all referred to as GWMs). The participating GWMs and their main characteristics are detailed in Table 2. The simulations are based on a common modeling protocol (ISIMIP2a 2018) which guarantees, as far as possible, consistent spatial (0.5°) and temporal model resolution as well as input and output datasets. All GWMs simulate the major global terrestrial hydrological processes, though using different algorithms and mathematical formulae (Table 2). Simulated daily discharge was available from all nine GWMs, and total monthly snow water equivalent (SWE) from six GWMs. MATSIRO, JULES-W1, and LPJmL represent permafrost temperatures and soil freeze and thaw processes that affect hydrological processes such as infiltration and water flow through permafrost. Three other GWMs (WaterGAP2, PCR-GLOBWB, MPI-HM) present permafrost coverage statically (fixed in space/time, by, e.g., reducing the maximum water holding capacity of the soil) without dynamic feedbacks/linkages to hydrology. We evaluated the simulations that do not consider the human influences on the water cycle, such as irrigation and dams. Apart from WaterGAP2, the GWMs were not calibrated. The calibration of WaterGAP2 solely focused on matching average long-term annual observed discharge by varying up to three parameters. Additional information can be found in the respective model description papers (Table 2) and Table 1 Study area details and gauging stations used (italicized ones represent the outlet/most downstream stations). Permafrost coverage in each watershed was calculated based on permafrost extent in Brown et al. (1997). Total permafrost coverage includes continuous, discontinuous, isolated, and sporadic permafrost. The locations (longitude and latitude of the gauging stations as represented in the models is displayed in Table S1 in the supplementary material). Watershed number and gauging station letter are in accordance with Fig. 1 Watersheds (numbered as in Fig. 1  Global Vegetation Model (DGVM)). Input climate variables include precipitation (P), mean air temperature (T), maximum air temperature (Tmax), minimum air temperature (Tmin), longwave downward radiation (LW), longwave net radiation (LWnet), shortwave downward radiation (SW), relative humidity (Q), surface pressure (SP), near-surface wind speed (W), snowfall rate (S), potential evapotranspiration (PET). For MPI-HM, potential evapotranspiration was computed during pre-processing based on LW, SW, T, W, SP, and Q.   (Table S2): Global Soil Wetness Project Phase 3 (GSWP3), Princeton, WATCH, and WFDEI. Müller Schmied et al. (2016) provided a more detailed description of the climate forcing datasets for hydrological studies. The GWM JULES-W1, however, provided simulation results for only three of the four climate forcing data (WATCH missing). In total, 35 model simulation combinations (4 forcing data sets for 8 GWMs and 3 forcing data sets for JULES-W1) were available for the hydrological model performance assessments.

Hydrological indicators
The hydrological indicators used in this study are detailed in Table 3. Monthly discharge, longterm mean monthly discharge (seasonal dynamics), and mean annual discharge were computed based on measured monthly and simulated daily discharge records. For 10 gauging stations, which include the outlet stations, the analysis period covers 30 years . For the remaining eight stations, the measured discharge record is shorter (between 20 and 29 years, Table S3). The calculation of the flow percentiles for high and low flows (based on daily measured and simulated discharge) is based on the daily 30-year record and was limited to the outlet gauging stations due to the data availability. Estimated (GlobSnow-2) and simulated total monthly SWE data were used to calculate long-term total monthly SWE .

Evaluating model performance
Our GWM performance evaluation approach for Pan-Arctic watersheds, based on guidelines provided in Krysanova et al. (2018), is summarized in Fig. 2. The model performance was evaluated for 14 different hydrological indicators (Table 3) at different locations within the watersheds in order to check internal consistency of the simulated hydrological processes. Table 3 Overview of the hydrological indicators used in this study and the statistical evaluation metrics applied (NSE Nash Sutcliffe Efficiency, PBIAS percent bias, SD standard deviation). Discharge related indicators were calculated for the time period 1971-2000 (or shorter, depending on data availability, see Table S3) and SWE for 1980-2000. The indicators monthly, seasonal, and annual discharge were evaluated at 18 gaging stations and the extremes at 6 gauging stations (outlets, Table 1). Seasonal SWE was evaluated at 4 points in each watershed (24 in total, locations defined in Table S4 (Table 4). A rating score of 1 is associated with good model performance, 0.5 with weak/satisfactory, and 0 with poor model performance. The statistical evaluation metrics used include percent bias (PBIAS), bias in standard deviation (bias in SD), and Nash and Sutcliffe Efficiency (NSE) (Nash and Sutcliffe 1970). The NSE (Eq. (1) in the supplementary material), a dimensionless model efficiency criterion, assesses overall model fit and is not very sensitive towards over-and underestimation (details in Krause et al. (2005)). Therefore, the monthly discharge performance evaluation was complemented by PBIAS (Eq. (2) in the supplementary material). The bias in SD (Eq. (3) in the supplementary material) assesses the standard deviation of the mean annual cycle between measured and simulated time series (MMD) and is therefore a suitable metric to evaluate model performance in terms of reproducing the seasonality (amplitude). The thresholds for the statistical evaluation metrics were initially oriented on widely used recommendations by Moriasi et al. (2007) and Moriasi et al. (2015) and by considering suggestions of Krysanova et al. (2018). In this study, the thresholds for the statistical performance were adjusted that means we made them less strict for GWMs. For example, the NSE and PBIAS thresholds in Moriasi et al. (2015) for good performance of monthly runoff in hydrological models are NSE ≥ 0.70 and PBIAS < ± 10, and for satisfactory performance, NSE > 0.55 and ± 10 ≤ PBIAS ≤ ± 15. In this study, we defined a good model performance of monthly runoff simulated by GWMs when NSE ≥ 0.5 and PBIAS is within ± 25%.  Fig. 2 Overview of study approach: a set of hydrological indicators were calculated based on observed (measured discharge, SWE from remote sensing product) and simulated discharge/SWE. Statistical evaluation metrics (NSE: Nash-Sutcliffe Efficiency, PBIAS: percent bias, and bias in SD (standard deviation)) are used to evaluate model performance for each hydrological indicator. Based on threshold values for each statistical evaluation metrics, rating scores are assigned for each climate forcing/model/gauging station/indicator for good, weak, and poor performance. The individual scores are aggregated to obtain an overall aggregated performance index (API). Aggregation is carried out separately for hydrological indicators related to monthly discharge, longterm mean monthly discharge, extremes (high and low flows), and SWE term mean monthly (seasonal) discharge. The rating scores were computed for each model simulation (nine GWMs forced by four climate datasets each) at 18 gaging stations and for four different metrics. An example of how the statistical evaluation metric NSE monthly is translated into a rating score is presented in Table S5 for the gauging station Kusur, Lena basin. In total, 2592 discharge rating scores were computed. For JULES-W1, we averaged over the statistical evaluation metrics (NSE monthly , PBIAS, NSE seasonal , BIAS in SD) of the three available climate forcing data sets to represent the missing WATCH-JULES-W1 for consistency. We then summed up the rating scores for each climate forcing (maximum score 4 for each model), for each gauging station within a watershed, and for all statistical evaluation criteria. For each watershed, between 288 and 576 rating scores, depending on the number of gauging stations (2-4), form the basis of the watershed specific API discharge . The rating scores were aggregated to 54 rating scores (9 rating scores for each watershed) which were then divided by the maximum possible score and transferred in % to get the API discharge for each model and watershed. An API discharge of 100% for one model means that for monthly discharge, NSE monthly is > 0.5 and PBIAS is within ± 25% and that for long-term mean monthly discharge NSE seasonal is > 0.7 and BIAS in standard deviation is within ± 25% at all gauging stations within a watershed. The API extreme was computed based on the statistical evaluation metric PBIAS for 10 percentile values (5 for low and 5 for high flow conditions) from the flow duration curve (Table 3), similarly as presented in Liersch et al. (2018). The percentiles were calculated based on daily measured and simulated discharge for a 30-year period (1971-2000) at the outlet stations. The magnitude of daily discharge that is exceeded 10%, 5%, 1%, 0.1%, and 0.01% and 90%, 95%, 99%, 99.9%, and 99.99% of the time corresponds to high flows and low flows, respectively. The assignment of rating scores was done by computing the PBIAS for each flow percentile individually. As a result, a total of 1080 scores for high and low flow were calculated (6 gauging stations (only outlets), 4 climate forcing datasets, 9 GWMs, 5 flow percentiles for high and low flow each, 1 statistical evaluation metric). For each watershed, 180 scores were aggregated to 54 model performance indices (9 for each watershed) for high and flow flows each.
The API SWE consists of the BIAS in SD and NSE seasonal between total monthly estimated (GlobSnow-2) and simulated SWE at four to five representative grid cells covering all cardinal directions in each watershed (Table S4). The location of the points is shown in Fig. S1. For SWE, rating scores were computed for 4 climate forcing datasets, 6 (out of 9) GWMs (Tables 2, 4 GHMs, 2 LSM), at 24 locations (4-5 locations in 6 watersheds), and 2 scores (NSE seasonal , BIAS in SD), totaling to 1152 scores. For each watershed, 192 to 240 scores (depending on the number of points) were aggregated to 36 model performance indices. Model Table 4 Rating scores and thresholds used for the statistical performance criteria. Discharge was analyzed in terms monthly (NSE monthly , PBIAS) and long-term mean monthly (seasonal dynamics, NSE seasonal , and BIAS in SD) temporal resolution. Snow water equivalent (SWE) was only evaluated for long-term mean monthly (seasonal dynamics, NSE seasonal , and bias in SD) temporal resolution. A rating score of 1 corresponds to a good, 0.5 a weak/satisfactory, and 0 to a poor performance. The values presented in the brackets and italics show the thresholds suggested by Moriasi et al. (2007) and (Moriasi et al. 2015) Rating scores NSE monthly NSE seasonal PBIAS and bias in SD analysis was restricted to the period 1980-2000, due to the data availability of the GlobSnow-2 product. The Boruta feature selection algorithm (Kursa et al. 2010) was used to estimate the relevance of each attribute to the total score of the APIs (API discharge , API extreme , API SWE ).
The attributes consisted of: climate forcing data (4) -GWMs (6 for API SWE , 9 for API discharge and API extreme ) statistical performance criteria (4 for API discharge , 1 for each percentile for API extreme , 2 for API SWE ) gauging station per watershed (2-4 for API discharge , 1 for API extreme )/SWE location (4-5)) For this purpose, we used the Boruta package in R. The analysis was carried out for each API and watershed separately.
In addition, the observed and simulated mean annual discharge time series were analyzed for possible trends (or lack of trend) using a simple linear regression analysis with a significance level of 0.05. Simulations for time periods without available measurements were excluded for consistency. The linear trend analysis is not part of the APIs, but a separate analysis step in accordance with the approach suggested by Krysanova et al. (2018).

Mean monthly discharge and seasonal dynamics
The performance of the GWMs regarding the statistical evaluation metrics NSE monthly (Fig. 3), PBIAS (Fig. 4), and NSE seasonal (Fig. S2) and BIAS in SD (Fig. S3) shows large differences across GWMs and climate forcing data set. When averaged over all climate forcing data and GWMs at all gauging stations, NSE monthly varies between 0.94 (WFDEI-WaterGAP2 at Igarka (Yenisei)) and − 28 (WATCH-LPJmL at Yukon (Eagle)), averaging to − 0.22. NSE seasonal averages to − 0.29 with a maximum of 0.98 (WATCH-MPI-HM at Hatyrik-Homo (Lena)) and a minimum of − 28 (WATCH-LPJmL at Eagle AK (Yukon)). Systematic under-/overestimation (PBIAS monthly discharge) varies between + 150% (WATCH-DBH at Hanti-Mansisk (Ob)) and − 87% (Princeton-ORCHIDEE at Nenana AK (Yukon)), averaging to 31%. The bias in SD averages to 50%, ranging from + 420% (WATCH-LPJmL at Hanti-Mansisk (Ob)) to − 99% (Princeton-ORCHIDEE at Pilot Point AK (Yukon)). Performance is, on average (over all statistical evaluation metrics, GWMs, and in all watersheds), not higher at outlet compared to upstream stations. Variability in discharge across GWMs is larger compared to the climate forcing data. No climate forcing data set consistently outperforms the other for all statistical metrics in all basins, though our analysis suggests that GWMs forced by GSWP3 show better results for bias in SD and PBIAS compared to when forced by the other climate data sets. GWMs forced by Princeton are more likely to perform poorer regarding PBIAS and NSE monthly .
Based on the assigned rating scores (Table 4) for each statistical evaluation metric, model performance regarding discharge (monthly and seasonal) was summarized for each watershed (Fig. 5(a)) and each GWM (Fig. 5(d)) via the API discharge . WaterGAP2 outperformed the other GWMs in all basins except in Kolyma. The API discharge of WaterGAP2 ranged between 38 (Kolyma) and 93% (Yukon) and averaging to 72% (Table S6). MATSIRO and MPI-HM also had an average API discharge above 50%, exceeding 60% in four basins. ORCHIDEE, JULES-W1, and the DGVM LPJmL have rather low average API discharge of 25%, 16%, and 32% respectively. For JULES-W1, API discharge was below 32% in all basins, averaging to 16% (Fig. 5(a)).
Considering that reaching a API discharge of 50% can be treated as an "acceptable model," 6 GWMs in Kolyma basin, 4 GWMs in Lena basin, 5 GWMs in Ob basin, 3 GWMs in Yenisei basin, 3 GWMs in Mackenzie basin, and 2 GWMs in Yukon basin meet the criterion. The Fig. 3 Model performance evaluated using the statistical evaluation metric "Nash-Suitcliffe Efficiency (NSE)" based on simulated and measured monthly discharge for each GWM forced by four observation-based climate datasets (GSWP3, Princeton, WATCH, WFDEI). Each row presents the results for one watershed (row 1: Ob; row 2: Yenisei; row 3: Lena; row 4: Kolyma; row 5: Yukon; row 6: Mackenzie) and each letter (a-r) refers to one gauging station from the outlet (left column) to the upstream basins (Table 1). The dotted lines at 0.3 and 0.5 present the thresholds for assigning rating scores. The y-axis was adjusted to only represent the range 0-1 average GWM performance is best for Kolyma basin (API discharge = 55%), followed by Lena, Ob, Yenisei, and Mackenzie (API discharge = 40%). In the Yukon watershed, API discharge is 30%. WaterGAP2 and MATSIRO demonstrated good or acceptable performance in five, MPI-HM in four, DBH in three, and H08 and PCR-GLOBWB in two basins. ORCHIDEE and LPJmL each performed well in only one basin, and all JULES-W1 results were below the acceptable level of 50% in all six basins (Fig. 5(b)). (2) based on simulated and measured monthly discharge for each GWM forced by four observation-based climate datasets (GSWP3, Princeton, WATCH, WFDEI). Each row presents the results for one watershed (row 1: Ob; row 2: Yenisei; row 3: Lena; row 4: Kolyma; row 5: Yukon; row 6: Mackenzie) and each letter (a-r) refers to one gauging station from the outlet (left column) to the upstream basins (Table 1). The dotted lines at ± 25 and ± 50 present the thresholds for assigning rating scores. The y-axis was adjusted to only represent the range − 100 to + 100% Figure 6 displays the observed and simulated mean seasonal discharge of the two best performing and two worst performing GWMs, based on the API, for each watershed. In the six watersheds, WaterGAP2 is four times among the best performing models, MATSIRO and MPI-HM twice, and DBH once. ORCHIDEE belongs to the poorest performing models in all watersheds, expect in Ob, followed by JULES-W1 (four times), DBH (twice), and LPJmL (once). The best performing models reproduce the seasonal dynamics satisfactorily, although the snow melt peak is, in the majority of the cases, underestimated and late summer discharge overestimated. The poorly performing GWMs do not reproduce the snowmelt peak neither in terms of timing (lag (DBH Mackenzie), lead (LPJmL in Ob)), nor magnitude (e.g., ORCHIDEE, JULES-W1, overestimation although timing is correct (DBH in Ob)). Consequently, the seasonal dynamic of the Pan-Arctic watersheds is not represented well by the GWMs as reflected in high absolute values of the BIAS in standard deviation (Fig. S3). Figure 6 also shows that the uncertainty caused by the choice of climate forcing datasets (shaded area around the mean) is highly variable across watersheds and GWMs.

Extremes
The API extreme aggregated for high and low flows, each including 5 percentiles, is summarized in Fig. 5(b, c, e, f). The API extreme is displayed separately for each percentile in Table 5 for high  Table S6 summarizes the underlying values for API discharge and Table S7 for API extreme and in Table S7 for low flow, each containing 270 values. Model performance is lower (average 35% for high and 26% for low flows) compared to mean discharge (43%, Fig. 5). The API extreme , on average, decreases from the less (Q 10 , Q 90 ) to the most (Q 0.01 , Q 99.99 ) extreme flow percentiles ( Table 5, Table S7). For Q 10 , for example, API extreme is > 50% for 36 out of 54 GWM and watershed combinations, while for Q 0.01 , it is only in 15 out of 54 cases. Similarly, for low flows, the number of cases that API extreme is > 50% reduces from 19 out of 54 (Q 90 ) to 12 out of 54 (Q 99.99 ). Among all GWMs, only MATSIRO has, on average, over all flow percentiles, an API extreme > 50% for both high and low flows. LPJmL reaches, consistently across all high flow percentiles, an API extreme of 100% and of > 50% in the Yenisei and Lena basin, respectively (Table 5). High and low flows are, on average over all high and low flow percentiles, best represented in the Ob watershed, with an API extreme > 50%. In all other basins, average API extreme ranges between 26 and 38% for high flows (Fig. 5(b)). Average API extreme for low flows ranges between 4 (Kolyma) and 51% (Ob) when categorized by watershed and between 0 (DBH) and 54% (MATSIRO) when categorized by GWM (Fig. 5(c, f)).

Trends mean annual discharge
Trends in measured mean annual discharge are found to be significant (p < 0.05) only at two stations: Igarka (Yenisei) and Sredne-Kolymsk (Kolyma) (Fig. S4). At Igarka, all simulations, except WFDEI-PCR-GLOBWB, agree with the measurements in simulating a negative trend in mean annual discharge despite difference in the magnitude (Fig. S4c). All simulated trends are also significant except for WFDEI-PCR-GLOBWB and LPJmL (for all climate data forcing sets) (Fig. S4 a). At Sredne-Kolymsk (Kolyma), all simulations, except WFDEI- Table 5 The API extreme for the high flow discharge percentiles (Q 10 , Q 5 , Q 1 , Q 0.1 , Q 0.01 ) for each GWM and watershed as visualized in Fig. 5. The darker blue color highlights an API > 50%, the lighter blue color an API = 50%, API < 50% are presented in gray. The yellow color highlights a performance in >5 0% averaged over all GWMs (per watershed, last column) and over all watersheds (per GWM, last row). The orange color highlights the overall average (over all watersheds and GWMs) PCR-GLOBWB, agree on a negative trend in mean annual discharge (Fig. S4d) but only 17 (out of 40, 42.5%) are also significant (Fig. S4b). At all other gauging stations, trends in measured mean annual discharge are not significant.

Snow water equivalent
The performance index regarding seasonal SWE (API SWE ) is displayed in Fig. S5 and the corresponding values in Table S8. Average API SWE is 57%. An API SWE > 50% is reached in 27 out of 36 cases. These numbers are higher compared to the averages for discharge and extremes, but it cannot be directly compared to the API discharge , as only six (compared to nine) models provided SWE output, only four grid cells are considered in each basin, and the analysis period differs slightly. GWMs reproduce SWE best in Mackenzie watershed (72%), followed by Lena (62%) and poorest in the Yukon basin (44%). All GWMs reach an API SWE ≥ 50%. The simulated seasonal dynamics of SWE is compared to the observations for each watershed in Fig. S6 to Fig. S11.

Boruta feature selection
For all APIs (API discharge , API extreme , API SWE ), the climatic forcing was consistently detected as not relevant by the Boruta algorithm across all watersheds. This implies that forcing the GWMs with four different (instead of only one) observation-based climate forcing data sets has a low relevance for the overall API score in this study. All other attributes, e.g., the GWMs and the statistical performance criteria, are confirmed relevant for the overall API score. For API SWE , other attributes, such as the statistical performance criteria, were, in some cases, also found unimportant in addition to the climate forcing data. The GWM is identified the most important attribute for API SWE in all watersheds except in Lena and Mackenzie, where the GWMs is, however, still among the three most important attributes (out of 10 in total). For the calculation of API SWE and API discharge in the Kolyma watershed, the data available to train the Boruta algorithm was likely not sufficient (for API SWE only 6 GWMs, for API discharge Kolyma only two gauging stations).

Discussion
The GWMs often have a considerable bias (mostly systematic underestimation) and difficulties in reproducing the seasonal discharge cycle when compared against observations in Pan-Arctic watersheds. Overall GWM performance, assessed for different hydrological indicators with several statistical evaluation metrics for up to four gauges in each watershed, ranges from satisfactory to poor. However, in some cases, API is larger than 70% (9 of 54 for the monthly and seasonal discharge, 10 of 54 for high flows and 8 of 54 for low flow, 3 out of 36 for SWE).
No GWM consistently outperforms the other models in all watersheds and for all indicators, and model performance, on average, does not increase with basin size. This is in line with other model inter-comparison studies (e.g., Slater et al. (2007)), where also no model was the best or worst performing when compared to a range of observations and in different watersheds across the Pan-Arctic. Our results, satisfactory to poor performance of GWMs, are also consistent with global studies that also include watersheds located in temperate and tropical climates (Krysanova et al. n.d.). In the study by Krysanova et al. (n.d.), the best (WaterGAP2 and MATSIRO) and poorest (e.g., LPJmL) performing GWMs match with our study while two GWMs, H08 and DBH, performed slightly better in the Artic compared to other climate zones.
We also demonstrate that the variability across the observation-based climate forcing data is smaller compared to that across GWMs. This is also confirmed by the feature selection using the Boruta algorithm. The large variability of performance across GWMs is most likely related to model structural differences and/or lack of physical process representation for some processes, difficulties to represent some processes with a relatively coarse resolution of 0.5°, and missing calibration (except WaterGAP2, which was calibrated) as well as no targeted model setup/parameterization focusing on Arctic hydrological processes.
Most GWMs struggle to simulate the snowmelt peak, the most important hydrological event in (sub) Arctic rivers, both in terms of absolute discharge amount and timing. This is directly linked to the GWMs rather simple representation of snow hydrological processes including the onset of snowmelt (isothermal phase change of the snowpack), the fate of snowmelt (infiltration into soils, refreezing over cold periods), snow compaction, and redistribution of snow on the landscape. Additionally, processes related to and affecting river routing, such as ice jams and dams, are highly complex and often are not considered or only included very simplistic in GWMs. Dams are not considered in the runs without human influences that we analyzed here. This likely explains the relatively poor model performance in the Ob (particularly at Hanti-Mansisk (Irtysch River)) and in the Yenisei watershed where the impact of dams on changes in the seasonal discharge has been documented (Adam et al. 2007). Concurrently, general errors in the forcing data, which are consistent across all datasets, such as snowfall underestimation (Beck et al. 2017;Hancock et al. 2014) and uncertainties in wind speed, amplify the rather poor simulation of snowmelt peak flow. Strong winds that are characteristic for Arctic tundra environments enhance sublimation and could therefore add to the general underestimation of the snowmelt peak by GWMs.
Except for many GWMs in the Ob watersheds (particularly at gauges Salekhard and Hanti-Mansisk) and for DBH and H08 in Mackenzie and Lena (gauge: Hatyrik-Homo), the GWMs have a tendency to underestimate measured monthly discharge in this region. This phenomena has already been documented by others, e.g., Andresen et al. (2019) and Lohmann et al. (2004). Lohmann et al. (2004) highlighted that measured discharge is underestimated by LSMs in areas with significant snowfall, and that snowmelt peak timing can be off by up to 4 months. Beck et al. (2017) and Hancock et al. (2014) attributed an early bias in spring snowmelt peak to precipitation underestimation that leads to insufficient snow accumulation and subsequently to too rapid snow melt. In our study, GWMs forced by Princeton underestimate, on average, discharge (and snowmelt peak) more significantly. The GWMs forced by WFDEI, WATCH, and GSWP3 perform better, as precipitation is corrected for snow undercatch and scaled to the monthly precipitation sums of Global Precipitation Climatology Centre (GPCC). GWM improvements, particularly, related to snow hydrological processes are, however, limited by sparse data availability and the challenges in measuring snow-related processes effectively over larger spatial and temporal scales.
Under historic climate conditions, the GHMs, on average, performed better than the LSMs in the Pan-Arctic watersheds, with the exception (in many cases) of MATSIRO. Beck et al. (2017) suggest that the differences in model performance are caused by the snow routines with the simple conceptual degree-day approach (GHMs) outperforming the physically based energy balance approach (LSMs). In our case, however, DBH performed reasonably well regarding SWE (Fig. S5), despite relying on the more complex energy balance approach. This suggests that the reasons for better/poorer performance are more complex. For example, the overall annual flow volume differs in many cases considerably between GWMs and observations as well as across GWMs. We also suggest to additionally evaluate the entire water balance as well as the sensitivity of water balance components to changes in the climatic drivers to gain a deeper understanding of the GWM structural differences in Arctic watersheds. Hattermann et al. (2017) have for example shown that the climate sensitivity of uncalibrated GWMs in the historical period is comparable to calibrated regional models. However, the differences in processes implementation (particularly permafrost) could make larger differences under climate change conditions. The consideration of permafrost temperatures and annual soil freeze/thaw processes in JULES-W1, MATSIRO, and LPJmL add additional complexity but is imperative for climate change impact assessments as increased surface and subsurface water interactions will have the potential to change the discharge regimes of highlatitude watersheds with large-scale consequences for land-atmosphere biochemical processes.
Model performances concerning monthly and long-term mean monthly discharge are poorer for the watersheds located in North America (Yukon followed by Mackenzie). These watersheds are characterized by a higher percentage of glacier coverages compared to the watersheds located in Asia. Glaciers, even with a low overall coverage, influence the discharge regime considerably, especially under drought/low rainfall conditions (Huss 2011). None of the GWMs evaluated here explicitly simulate glacier-related hydrological processes, which most likely also contributes to their weak or poor performance in these watersheds.
WaterGAP2 (calibrated only for the long-term average annual discharge) outperformed the other GWMs when analyzing hydrological indicators based on monthly and long-term mean monthly discharge, which underscores the importance of targeted calibration for the hydrological indicator of interest. However, our study showed, similarly to Zaherpour et al. (2018), that WaterGAP2's performance for extremes is only average. For SWE, the performance of WaterGAP2 (API SWE = 48%) is below the average (API SWE = 57%). MATSIRO outperforms the other GWMs for discharge extremes (API extreme = 51% for high and 54% for low flows), and is the second best performing model for discharge (API discharge = 58%). Zaherpour et al. (2018) highlighted the superior, physically based snow and soil scheme of MATSIRO. In our analysis, MATSIRO did not outperform the other GWMs regarding SWE. However, we have to admit that our SWE evaluation approach was rather simplistic, and may not represent the model performance accurately enough, since we compared the simulated SWE output to the GlobSnow-2 product at only four randomly chosen grid cells within the entire watershed. Additionally, GLobSnow-2, a remote sensing product, is known to inherit uncertainties (Metsämäki et al. 2015).
DGVMs are developed to explain the changes in vegetation dynamics and associated impacts on and linkages to the hydrological and carbon cycles, while LSMs are designed to simulate the exchange of water, carbon, and energy in General Circulation and Earth System Models. Therefore, hydrological processes per se and the interaction with snow and permafrost were not the main objectives for neither their development nor their setup/parameterization. This may explain a weak or poor performance of ORCHIDEE, LPJmL, and JULES-W1 for discharge and snow seasonality in many cases. Updated model versions of ORCHIDEE, such as reported in Wang et al. (2013) and Guimberteau et al. (2018), improved model performance regarding Artic hydrological processes considerably (such as snow depth, SWE, snow albedo, and snowmelt runoff) which highlights the potential of targeted model development to improve model performance independently of explicit parameter calibration.

Conclusions and outlook
Our results show that the majority of the nine GWMs included in the study exhibit considerable difficulties in realistically representing Pan-Arctic hydrological processes. The performance evaluation showed that no GWM outperformed other GWMs in all watersheds and for all hydrological indicators. This also implies that we could not identify any links between the model efficiency and model structure/parameterization. We also highlighted that no climate forcing dataset was better suited for all indicators and statistical evaluation metrics, though results based on Princeton showed a larger PBIAS on average. Thus, there is an urgent need to refine the representation of cold-region hydrological processes in GWMs to improve their performance, as changes in these processes will govern the fate of the Arctic. This includes direct process representation (snow, permafrost, glaciers, river routing accounting for ice cover/different flow velocities) and a more appropriate model setup and parameterization. The GHM WaterGAP2 outperformed the more complex LSMs for mean flow, while the LSM MATSIRO showed a better performance than other models for extremes. The GHMs often lack relevant cold-region processes (permafrost, glaciers), have simple snow schemes, and represent vegetation statically. Many LSMs, on the other hand, represent vegetation dynamically which is highly relevant under climate change as vegetation cover strongly influences the partitioning of precipitation into evapotranspiration and discharge.
Considering the relatively weak or poor performance of most GWMs in most watersheds under current climate conditions, a comprehensive model evaluation is recommended before conducting climate impact assessments. Only GWMs that show good or satisfactory evaluation results (i.e., API > 0.5) and that represent cold-region hydrological processes (snow hydrological processes, permafrost) effectively should be included in multi-model climate change impact assessments in Pan-Arctic watersheds. Models meeting these criteria include WaterGAP2 (though requiring a dynamic permafrost module), MATSIRO, MPI-HM for monthly/seasonal dynamics, MATSIRO and LPJmL for high flows, and MATSIRO for low flows. All other models require performance improvement and all except JULES-W1 the incorporation of dynamic permafrost. Tools, such as the API presented here, are promising to define weighting coefficients based on model performance, and to identify GWMs that potentially could be applied for impact assessment, and GWMs that should be excluded from the ensemble. Model exclusion should trigger an analysis about the model's shortcomings and result in model refinement. Therefore, a close collaboration between impact modelers and the model development teams is crucial. Ultimately, the model weighting approach could provide more trustworthy results of impact assessment and reduce large uncertainty ranges which are generally characteristic for multi-model climate change impact assessments. Further, the API makes the interpretation of model performance involving a large ensemble of participating GWMs transparent and easily understandable by a wide range of audience.
Our study can be extended by defining model weighting coefficients based on the API and comparing this approach to the traditional ensemble mean approach in climate change impact assessments. Another interesting study could be done by comparing GWM performances with and without including human influences, which could be also done using our suggested API approach. Other statistical evaluation metrics and additional hydrological indicators could be also included in the evaluation. Considering the rapid rate of environmental changes occurring in the Pan-Arctic, we urgently need to increase our understanding of the hydrological cycle and its linkages to other parts of the earth system, and GWMs are suitable tools to do so.